2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

[Support] Reduce awi-ciroh Filestore to 3 TB #4844

Closed · jmunroe closed this 2 weeks ago

jmunroe commented 2 months ago

The Freshdesk ticket link

https://2i2c.freshdesk.com/a/tickets/2170

Ticket request type

Other

Ticket impact

🟨 Medium

Short ticket description

The filestore usage for awi-ciroh has decreased from 9TB to 4.7TB. To avoid paying for unused storage, the community wants to reduce the filestore capacity to ~5TB~ 3TB.

(Optional) Investigation results

Since this requires creating a new filestore, transferring the data over, and coordinating with the community, I think it should be scheduled in the next engineering sprint.

sgibson91 commented 2 months ago

I had to request a quota increase to accommodate the size of the second filestore, so let's see how long that takes to come back.

sgibson91 commented 2 months ago

Quota increase approved

sgibson91 commented 2 months ago

Because AWI-CIROH are running in their own GCP project, I do not have the right access permissions to SSH into the VM I created to mount the two filestores and copy the data over:

Insufficient IAM permissions. The instance belongs to an external organization. You must be granted the roles/compute.osLoginExternalUser IAM role on the external organization to configure POSIX account information.
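
For reference, that role can only be granted by an admin of the external organization; a sketch of the kind of grant involved, where the organization ID and member are placeholders rather than values from this ticket:

# Sketch only; ORGANIZATION_ID and USER_EMAIL are placeholders
gcloud organizations add-iam-policy-binding ORGANIZATION_ID \
  --member="user:USER_EMAIL" \
  --role="roles/compute.osLoginExternalUser"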

sgibson91 commented 2 months ago

I worked around the above restriction by modifying our deployer exec root-homes command to mount the two filestores. The data copying has begun!

jmunroe commented 2 months ago

Community has revised their request to be 3TB.

sgibson91 commented 2 months ago

I am having problems with my workaround modifying the deployer exec root-homes command to mount both filestores. The pod seems to be getting killed long before the transfer has completed.

sgibson91 commented 2 months ago

Community has revised their request to be 3TB.

I will fulfill this just because the file transfer has not been going smoothly so far.

sgibson91 commented 2 months ago

I set resource limits on the pod as well as a node selector and toleration to force the pod onto the largest node AWI-CIROH has available. It seems that the file transfer is moving much quicker now.

sgibson91 commented 1 month ago

I still haven't managed to get the rsync command to complete, probably because the pod keeps running out of resources. I also seem to be getting close to filling the capacity of the new filestore, so I'll have to make decisions about deleting things. I'm not sure how to push this forward.

jmunroe commented 1 month ago

Hi @sgibson91 .

What is the current usage of data that we are trying to transfer?

I don't think you should be alone on this one. There is an interaction with AWI-CIROH that might need to happen -- especially if they are already using more than 3TB!

Is rsync the correct technology to use here? This sounds like an example of what Globus endpoints are designed for. (Globus is on my mind while I am at USRSE.)

Let's make sure this issue is discussed in our iteration planning meeting tomorrow. It definitely sounds like it is much more complicated than originally expected.

sgibson91 commented 1 month ago

I have no idea what Globus is or how to use it, whereas the engineering team have extensively documented rsync in these situations.

I'm using a glob pattern to transfer. I'm not doing anything complicated or elegant to ignore files.
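
For context, a rough sketch of the kind of transfer this describes, run from the shell that deployer exec root-homes opens; the exact rsync flags, the glob, and which mount is source vs. destination are assumptions rather than a record of the actual command:

# Sketch; flags, glob, and copy direction are assumptions. The two mount
# points match the pod definition shared further down in this thread.
rsync -a --info=progress2 /root-homes/* /root-homes-2/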

sgibson91 commented 1 month ago

The problem is that the pod keeps getting killed before the transfer completes.

This issue is separate from how much data we should be transferring, and what.

sgibson91 commented 1 month ago

My edited version of the root_homes function in the infra_components.py file under deployer/commands/exec:

@exec_app.command()
def root_homes(
    cluster_name: str = typer.Argument(..., help="Name of cluster to operate on"),
    hub_name: str = typer.Argument(..., help="Name of hub to operate on"),
    extra_nfs_server: str = typer.Argument(
        None, help="IP address of an extra NFS server to mount"
    ),
    extra_nfs_base_path: str = typer.Argument(
        None, help="Path of the extra NFS share to mount"
    ),
    extra_nfs_mount_path: str = typer.Argument(
        None, help="Mount point for the extra NFS share"
    ),
):
    """
    Pop an interactive shell with the entire nfs file system of the given cluster mounted on /root-homes
    Optionally mount an extra NFS share if required (useful when migrating data to a new NFS server).
    """
    config_file_path = find_absolute_path_to_cluster_file(cluster_name)
    with open(config_file_path) as f:
        cluster = Cluster(yaml.load(f), config_file_path.parent)

    with cluster.auth():
        hubs = cluster.hubs
        hub = next((hub for hub in hubs if hub.spec["name"] == hub_name), None)
        if not hub:
            print_colour(f"Hub does not exist in {cluster_name} cluster")
            return

    server_ip = base_share_name = ""
    for values_file in hub.spec["helm_chart_values_files"]:
        if "secret" not in os.path.basename(values_file):
            values_file = config_file_path.parent.joinpath(values_file)
            config = yaml.load(values_file)

            if config.get("basehub", {}):
                config = config["basehub"]

            server_ip = config.get("nfs", {}).get("pv", {}).get("serverIP", server_ip)
            base_share_name = (
                config.get("nfs", {})
                .get("pv", {})
                .get("baseShareName", base_share_name)
            )

    pod_name = f"{cluster_name}-root-home-shell"
    volumes = [
        {
            "name": "root-homes",
            "nfs": {"server": server_ip, "path": base_share_name},
        },
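        # Second filestore for this one-off migration, hardcoded rather than made configurable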
        {
            "name": "root-homes-2",
            "nfs": {"server": "10.146.96.42", "path": "/homes/"},
        },
    ]
    volume_mounts = [
        {
            "name": "root-homes",
            "mountPath": "/root-homes",
        },
        {
            "name": "root-homes-2",
            "mountPath": "/root-homes-2",
        },
    ]

    if extra_nfs_server and extra_nfs_base_path and extra_nfs_mount_path:
        volumes.append(
            {
                "name": "extra-nfs",
                "nfs": {"server": extra_nfs_server, "path": extra_nfs_base_path},
            }
        )
        volume_mounts.append(
            {
                "name": "extra-nfs",
                "mountPath": extra_nfs_mount_path,
            }
        )

    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "terminationGracePeriodSeconds": 1,
            "automountServiceAccountToken": False,
            "volumes": volumes,
            "nodeSelector": {
                "node.kubernetes.io/instance-type": "n2-highmem-16",
            },
            "tolerations": [
                {
                    "key": "hub.jupyter.org_dedicated",
                    "operator": "Equal",
                    "value": "user",
                    "effect": "NoSchedule",
                }
            ],
            "containers": [
                {
                    "name": pod_name,
                    # Use ubuntu image so we get better gnu rm
                    "image": UBUNTU_IMAGE,
                    "stdin": True,
                    "stdinOnce": True,
                    "tty": True,
                    "volumeMounts": volume_mounts,
                    "resources": {
                        "requests": {"memory": "90G", "cpu": 12},
                        "limits": {"memory": "100G"},
                    },
                }
            ],
        },
    }

    cmd = [
        "kubectl",
        "-n",
        hub_name,
        "run",
        "--rm",  # Remove pod when we're done
        "-it",  # Give us a shell!
        "--overrides",
        json.dumps(pod),
        "--image",
        # Use ubuntu image so we get GNU rm and other tools
        # Should match what we have in our pod definition
        UBUNTU_IMAGE,
        pod_name,
        "--",
        "/bin/bash",
        "-l",
    ]

    with cluster.auth():
        subprocess.check_call(cmd)

consideRatio commented 1 month ago

Though this last is an assumption since the pod is cleaned up too quickly for me to see any logs.

I think sometimes you can see details about this even for a deleted pod, by doing kubectl get event -n <namespace> and looking. k8s Event resources are cleaned up after one hour though, so this only provides some additional time.
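
For example, something like this (the namespace is a placeholder; the pod name comes from the pod_name in the command above):

kubectl get events -n <namespace> --sort-by=.lastTimestamp | grep -i root-home-shell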

After that, it's possible to see details about this in cloud logs with a query like the one below (taken from a sample query in the Google Cloud docs):

resource.type="k8s_cluster" AND
log_id("events")

I'll dig into this now

consideRatio commented 1 month ago

I got "no space left on device" now =/

The current awi-ciroh filestore usage has grown again to 90% of the 9TiB, so my copy to the 3TiB filestore fails with "no space left on device", which makes sense. The graph below shows usage going back a few weeks:

[Image: awi-ciroh filestore usage over the past few weeks]

sgibson91 commented 1 month ago

@jmunroe I think at this point we need a discussion with the community about what they actually want copied over if they want to stick with 3TB.

jmunroe commented 1 month ago

I agree ... I'm writing that email now.

consideRatio commented 1 month ago

Thank you @jmunroe!

consideRatio commented 1 month ago

I saw that we can accomplish something similar to what you did with hardcoded changes, @sgibson91, by running the command below, where the last three arguments specify an extra NFS server mount.

deployer exec root-homes awi-ciroh prod 10.146.96.42 "/homes/" "/root-homes-2"

That left the following changes to the k8s Pod manifest, which I also used:

            "nodeSelector": {
                "cloud.google.com/gke-nodepool": "nb-nfs-copy-node",
            },
            "tolerations": [
                {
                    "key": "hub.jupyter.org_dedicated",
                    "operator": "Equal",
                    "value": "user",
                    "effect": "NoSchedule",
                }
            ],

            # ...
            # inside a container...

                    "resources": {
                        "requests": {"memory": "27Gi", "cpu": 7},
                        "limits": {"memory": "32Gi", "cpu": 8},
                    },

In awi-ciroh's terraform.tfvars, I added this:

notebook_nodes = {
  # FIXME: To be deleted when NFS copy operation is done
  #
  #        Added by Erik handling https://github.com/2i2c-org/infrastructure/issues/4844,
  #        for use with Sarah's adjustment in https://github.com/2i2c-org/infrastructure/issues/4844#issuecomment-2416749438
  #
  "nfs-copy-node" : {
    min : 1,
    max : 1,
    machine_type : "n4-standard-8",
    disk_type : "hyperdisk-balanced",
    disk_size_gb : 4000,
  },
  # ...
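
For completeness, applying a tfvars change like this follows the usual terraform workflow; a sketch, assuming the standard terraform/gcp layout and workspace/file names:

# Sketch; run from terraform/gcp, workspace and var-file names are assumptions
terraform workspace select awi-ciroh
terraform plan -var-file=projects/awi-ciroh.tfvars
terraform apply -var-file=projects/awi-ciroh.tfvars
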
sgibson91 commented 1 month ago

There has been an update on the freshdesk ticket: they have cleaned up some data, so we can proceed with the 3TB. I will pick this back up again and aim to complete it before my annual leave (AL) starts at the end of Wednesday.

sgibson91 commented 1 month ago

Transfer started

sgibson91 commented 1 month ago

Argh, the pod still got deleted before completion!

I created the new nodepool and used the taints/tolerations/resources Erik documented above :/

sgibson91 commented 1 month ago

I tried bumping the nfs-copy-node instance to an n4-standard-16, and the pod still got cancelled, even with a 4TB boot disk and the following resources:

"resources": {
                        "requests": {"memory": "54Gi", "cpu": 15},
                        "limits": {"memory": "64Gi", "cpu": 16},
                    },
sgibson91 commented 1 month ago

When we first migrated the AWI-CIROH filestore, @yuvipanda was granted permission to SSH into VMs he created so the migration could happen the traditional way. I've asked him who we should contact so that Erik, Georgiana, and I can be given the same permissions. We can then do this job properly, rather than special-casing. This person's name may be Ben?

EDIT: From the freshdesk ticket, this person may well be Ben Lee.

sgibson91 commented 1 month ago

I have responded on the ticket requesting that Georgiana, Erik and I have our permissions upgraded to match Yuvi's. That way we should all be able to SSH into a VM, so that 1) it hopefully won't be subject to whatever is killing the k8s pod, and 2) the process can be monitored and completed by someone not me when I go on AL.

sgibson91 commented 1 month ago

Permissions have been granted. I shall try with a VM.
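
For the record, with OS Login working this is roughly what connecting looks like; the instance name, zone, and project are placeholders:

# Sketch; all values are placeholders
gcloud compute ssh VM_NAME --zone=ZONE --project=PROJECT_ID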

sgibson91 commented 1 month ago

While Ben said he had given me (us) permissions, I still can't ssh into a VM. I followed up in the freshdesk ticket last night, but no response as of yet.

I'm going to unassign myself from this one as there's no way I can make progress before my AL.

2) the process can be monitored and completed by someone not me

I actually don't think this would've been true anyway, as the process would've been running under my user on the VM. But at least it wouldn't have been on my local laptop!

yuvipanda commented 1 month ago

I've picked this up today

yuvipanda commented 1 month ago

This migration is now completed (I'll provide more info + next steps on Monday).

I've quickly let the community know. If they don't report any issues by Wednesday, we can decommission the old filestore on Thursday and call this done.

aprilmj commented 3 weeks ago

This one lingered a while - should we have a short wash-up/retro about where we got stuck, so we can learn from it?

yuvipanda commented 3 weeks ago

we definitely should!

yuvipanda commented 2 weeks ago

https://github.com/2i2c-org/meta/issues/1626 tracks the retrospective, and https://github.com/2i2c-org/meta/issues/1627 tracks decommissioning the old filestore. I'm going to close this one now.