Closed: jmunroe closed this 2 weeks ago
I had to request a quota increase to accommodate the size of the second filestore, so let's see how long that takes to come back.
Quota increase approved
Because AWI-CIROH are running in their own GCP project, I do not have the right access permissions to SSH into the VM I created to mount the 2 filestores and copy the data over.
> Insufficient IAM permissions. The instance belongs to an external organization. You must be granted the roles/compute.osLoginExternalUser IAM role on the external organization to configure POSIX account information.
I worked around the above restriction by modifying our `deployer exec root-homes` command to mount the two filestores. The data copying has begun!
Community has revised their request to be 3TB.
I am having problems with my workaround modifying the `deployer exec root-homes` command to mount both filestores. The pod seems to be getting killed long before the transfer has completed.
> Community has revised their request to be 3TB.
I will fulfill this, not least because the file transfer has not been going smoothly so far.
I set resource limits on the pod as well as a node selector and toleration to force the pod onto the largest node AWI-CIROH has available. It seems that the file transfer is moving much quicker now.
I still haven't managed to get the rsync command to complete, probably due to running out of resources for the pod. Also seem to be getting close to filling the capacity of the new filestore so have to make decisions about deleting stuff. I'm not sure how to push this forward.
Hi @sgibson91 .
What is the current usage of data that we are trying to transfer?
I don't think you should be alone in this one. There is an interaction with AWI-CIROH that might need to happen -- if they are already using more than 3TB especially!
Is rsync the correct technology to use here? This sounds like an example of what Globus endpoints are designed for. (Globus is on my mind while I am at USRSE.)
Let's make sure this issue is discussed in our iteration planning meeting tomorrow. It definitely sounds like it is much more complicated than originally expected.
I have no idea what Globus is or how to use it, whereas the engineering team have extensively documented rsync in these situations.
I'm using a glob pattern to transfer. I'm not doing anything complicated or elegant to ignore files.
The problem is:
This issue is separate from the question of how much data we should be transferring, and of what.
My edited version of the `root_homes` function in the `infra_components.py` file under `deployer/commands/exec`:
```python
@exec_app.command()
def root_homes(
    cluster_name: str = typer.Argument(..., help="Name of cluster to operate on"),
    hub_name: str = typer.Argument(..., help="Name of hub to operate on"),
    extra_nfs_server: str = typer.Argument(
        None, help="IP address of an extra NFS server to mount"
    ),
    extra_nfs_base_path: str = typer.Argument(
        None, help="Path of the extra NFS share to mount"
    ),
    extra_nfs_mount_path: str = typer.Argument(
        None, help="Mount point for the extra NFS share"
    ),
):
    """
    Pop an interactive shell with the entire nfs file system of the given cluster mounted on /root-homes
    Optionally mount an extra NFS share if required (useful when migrating data to a new NFS server).
    """
    config_file_path = find_absolute_path_to_cluster_file(cluster_name)
    with open(config_file_path) as f:
        cluster = Cluster(yaml.load(f), config_file_path.parent)

    with cluster.auth():
        hubs = cluster.hubs
        hub = next((hub for hub in hubs if hub.spec["name"] == hub_name), None)
        if not hub:
            print_colour(f"Hub does not exist in {cluster_name} cluster")
            return

    server_ip = base_share_name = ""
    for values_file in hub.spec["helm_chart_values_files"]:
        if "secret" not in os.path.basename(values_file):
            values_file = config_file_path.parent.joinpath(values_file)
            config = yaml.load(values_file)

            if config.get("basehub", {}):
                config = config["basehub"]

            server_ip = config.get("nfs", {}).get("pv", {}).get("serverIP", server_ip)
            base_share_name = (
                config.get("nfs", {})
                .get("pv", {})
                .get("baseShareName", base_share_name)
            )

    pod_name = f"{cluster_name}-root-home-shell"
    volumes = [
        {
            "name": "root-homes",
            "nfs": {"server": server_ip, "path": base_share_name},
        },
        {
            "name": "root-homes-2",
            "nfs": {"server": "10.146.96.42", "path": "/homes/"},
        },
    ]
    volume_mounts = [
        {
            "name": "root-homes",
            "mountPath": "/root-homes",
        },
        {
            "name": "root-homes-2",
            "mountPath": "/root-homes-2",
        },
    ]

    if extra_nfs_server and extra_nfs_base_path and extra_nfs_mount_path:
        volumes.append(
            {
                "name": "extra-nfs",
                "nfs": {"server": extra_nfs_server, "path": extra_nfs_base_path},
            }
        )
        volume_mounts.append(
            {
                "name": "extra-nfs",
                "mountPath": extra_nfs_mount_path,
            }
        )

    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "terminationGracePeriodSeconds": 1,
            "automountServiceAccountToken": False,
            "volumes": volumes,
            "nodeSelector": {
                "node.kubernetes.io/instance-type": "n2-highmem-16",
            },
            "tolerations": [
                {
                    "key": "hub.jupyter.org_dedicated",
                    "operator": "Equal",
                    "value": "user",
                    "effect": "NoSchedule",
                }
            ],
            "containers": [
                {
                    "name": pod_name,
                    # Use ubuntu image so we get better gnu rm
                    "image": UBUNTU_IMAGE,
                    "stdin": True,
                    "stdinOnce": True,
                    "tty": True,
                    "volumeMounts": volume_mounts,
                    "resources": {
                        "requests": {"memory": "90G", "cpu": 12},
                        "limits": {"memory": "100G"},
                    },
                }
            ],
        },
    }

    cmd = [
        "kubectl",
        "-n",
        hub_name,
        "run",
        "--rm",  # Remove pod when we're done
        "-it",  # Give us a shell!
        "--overrides",
        json.dumps(pod),
        "--image",
        # Use ubuntu image so we get GNU rm and other tools
        # Should match what we have in our pod definition
        UBUNTU_IMAGE,
        pod_name,
        "--",
        "/bin/bash",
        "-l",
    ]

    with cluster.auth():
        subprocess.check_call(cmd)
```
Though this last is an assumption since the pod is cleaned up too quickly for me to see any logs.
I think sometimes you can see details about this even for a deleted pod, by doing `kubectl get event -n <namespace>` and looking. k8s Event resources are cleaned up after one hour though, so this only provides some additional time.
After that, it's possible to see details about this in cloud logs with a query like below (taken from a sample query in the Google docs):

```
resource.type="k8s_cluster" AND
log_id("events")
```
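For what it's worth, that query can be narrowed to the hub namespace and to pod-kill events. The field paths below follow the usual shape of GKE event log entries, and the namespace and reason values are my assumptions about what to filter on, not confirmed values from this incident:

```
resource.type="k8s_cluster" AND
log_id("events") AND
jsonPayload.involvedObject.namespace="prod" AND
jsonPayload.reason="Killing"
```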
I'll dig into this now
I got "no space left on device" now =/
The current awi-ciroh filestore use has grown again to 90% of the 9TiB, so my copying to the 3TiB filestore fails with "no space left on device", which makes sense. (A screenshot of usage from a few weeks back was attached.)
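Inside the shell pod, the remaining capacity on both filestores can be checked before retrying the copy. A standard df invocation, using the mount paths from the pod spec above:

```shell
# Show size, used, available, and use% for both NFS mounts.
# /root-homes is the old (9TiB) filestore, /root-homes-2 the new (3TiB) one.
df -h /root-homes /root-homes-2
```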
@jmunroe I think at this point we need a discussion with the community about what they actually want copying over if they want to stick with 3TB.
I agree ... I'm writing that email now.
Thank you @jmunroe!
I saw that we can accomplish something similar to what you did with hardcoded changes @sgibson91 by doing this, where the last three arguments specify an extra NFS server mount:

```shell
deployer exec root-homes awi-ciroh prod 10.146.96.42 "/homes/" "/root-homes-2"
```
That left the following changes in the k8s Pod manifest that I also used:
```python
"nodeSelector": {
    "cloud.google.com/gke-nodepool": "nb-nfs-copy-node",
},
"tolerations": [
    {
        "key": "hub.jupyter.org_dedicated",
        "operator": "Equal",
        "value": "user",
        "effect": "NoSchedule",
    }
],
# ...
# inside a container...
"resources": {
    "requests": {"memory": "27Gi", "cpu": 7},
    "limits": {"memory": "32Gi", "cpu": 8},
},
```
In awi-ciroh's terraform.tfvars, I added this:
```terraform
notebook_nodes = {
  # FIXME: To be deleted when NFS copy operation is done
  #
  # Added by Erik handling https://github.com/2i2c-org/infrastructure/issues/4844,
  # for use with Sarah's adjustment in https://github.com/2i2c-org/infrastructure/issues/4844#issuecomment-2416749438
  #
  "nfs-copy-node" : {
    min : 1,
    max : 1,
    machine_type : "n4-standard-8",
    disk_type : "hyperdisk-balanced",
    disk_size_gb : 4000,
  },
  # ...
```
There has been an update on the freshdesk ticket and they have cleaned up some data so we can proceed with the 3TB. I will pick this back up again and aim to complete before my AL starts at the end of Wednesday.
Transfer started
Argh, the pod still got deleted before completion!
I created the new nodepool and used the taints/tolerations/resources Erik documented above :/
I tried bumping the nfs-copy-node instance to an n4-standard-16 and the pod still got cancelled, with a 4TB boot disk and the following resources:

```python
"resources": {
    "requests": {"memory": "54Gi", "cpu": 15},
    "limits": {"memory": "64Gi", "cpu": 16},
},
```
When we first migrated the AWI-CIROH filestore, @yuvipanda was granted permission to SSH into VMs he created, so the migration could happen the traditional way. I've asked him who we should contact so that Erik, Georgiana and I can be given the same permissions. We can then do this job properly, rather than special-casing. This person's name may be Ben?
EDIT: From the freshdesk ticket, this person may well be Ben Lee.
I have responded on the ticket requesting that Georgiana, Erik and I have our permissions upgraded to match Yuvi's. That way we should all be able to ssh into a VM that 1) hopefully won't be subject to whatever is killing the k8s pod, and 2) the process can be monitored and completed by someone not me when I go on AL.
Permissions have been granted. I shall try with a VM.
While Ben said he had given me (us) permissions, I still can't ssh into a VM. I followed up in the freshdesk ticket last night, but no response as of yet.
I'm going to unassign myself from this one as there's no way I can make progress before my AL.
> 2) the process can be monitored and completed by someone not me
I actually don't think this would've been true anyway, as the process would've been running under my user on the VM. But at least it wouldn't have been on my local laptop!
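For what it's worth, a transfer detached from the login session would at least survive an SSH disconnect, even if it still runs under one person's user. A sketch, where `/mnt/old` and `/mnt/new` are hypothetical mount points for the two filestore shares on the VM (not paths from this thread):

```shell
# Run the copy detached from the terminal so it keeps going after logout.
# Output goes to a log file that anyone on the VM can tail to monitor progress.
nohup rsync -a --partial /mnt/old/ /mnt/new/ > /var/tmp/rsync.log 2>&1 &
echo "rsync running as PID $!"
```

Running the same thing inside tmux or screen would additionally let another engineer attach to the session and take over.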
I've picked this up today
This migration is now completed (I'll provide more info + next steps on Monday).
I've quickly let the community know. If they don't report any issues by Wednesday, we can decommission the old filestore on Thursday and call this done.
This one lingered a while - should we have a short wash-up/retro about where we got stuck, so we can learn from it?
we definitely should!
https://github.com/2i2c-org/meta/issues/1626 tracks the retrospective, and https://github.com/2i2c-org/meta/issues/1627 tracks decommissioning the old filestore. Am going to close this one now.
The Freshdesk ticket link
https://2i2c.freshdesk.com/a/tickets/2170
Ticket request type
Other
Ticket impact
🟨 Medium
Short ticket description
The filestore size for awi-ciroh has decreased from 9TB to 4.7TB. To avoid charges for unused storage, the community wants to update the filestore capacity to ~5TB~ 3TB.
(Optional) Investigation results
Since this requires creating a new filestore and transferring data over, plus coordinating with the community, I think it should be scheduled in the next engineering sprint.