2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org

Migrate home directories of Pangeo users to the 2i2c Hub #653

Closed choldgraf closed 3 years ago

choldgraf commented 3 years ago

Description

There are many users that are currently on the old Pangeo JupyterHub (at https://us-central1-b.gcp.pangeo.io/). We should migrate their user home directories to the hub that we are deploying.

Value / benefit

This will minimize the disruption that these users feel when they migrate from one hub to the next.

Implementation details

We should understand whether we can simply point the old hub's user filesystem at our new hub, or whether we will have to move the contents instead. Update 2021-10-06: We will be copying contents from one filesystem to another.

We'll need to make sure that the new hub is "ready to go" when this happens, because the switch will effectively force many users onto the new hub since that's where their work will be. Update 2021-10-06: Sarah plans to make posts on the Pangeo Discourse detailing when the domain name switch will happen and when the last data migration before that date took place (a gap she will try to minimise to the best of her ability).

Since the new hub (https://pangeo.2i2c.cloud) has current users, at the hub database step of the move we will need to take care to merge the two databases rather than overwrite them, so no one's access is lost.

The old hub is at https://us-central1-b.gcp.pangeo.io/.

Tasks to complete

Updates

rabernat commented 3 years ago

This issue is a MUST HAVE for the migration of the Pangeo GCP production cluster.

sgibson91 commented 3 years ago

Docs we already have on moving user directories: https://pilot-hubs.2i2c.org/en/latest/howto/operate/move-hub.html

Since the new hub is using Google Filestore, I will need to mount that to a VM to be able to carry out the transfer. Instructions here: https://cloud.google.com/filestore/docs/creating-instances
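
For reference, the mount itself should be roughly along these lines, assuming an NFS client on a Debian/Ubuntu VM (the IP and share name are placeholders to read off the Cloud Console):

# Install an NFS client and mount the Filestore share on the VM
sudo apt-get update && sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/filestore
sudo mount <FILESTORE_IP>:/<SHARE_NAME> /mnt/filestore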

I'm keen to use rsync rather than scp, as I think it allows more elegant handling of when to overwrite files: https://linuxize.com/post/how-to-transfer-files-with-rsync-over-ssh/
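
Roughly what I have in mind, sketched with placeholder paths and key name; --update skips any file that is already newer on the receiving side:

# Copy home dirs over ssh without clobbering newer files on the target
rsync --archive --update --verbose \
    -e "ssh -i <PRIVATE_KEY>" \
    ubuntu@<SOURCE_VM_IP>:<SOURCE_PATH>/ <DEST_PATH>/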

This StackOverflow answer provides a Python class to merge SQLite db's https://stackoverflow.com/a/61954182
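
The same idea can also be sketched with the sqlite3 CLI; this is a simplification that assumes a users table and skips rows whose keys collide (a real merge would repeat this for every table):

# Pull rows from the old hub db into the new one, keeping existing rows
# where primary keys collide (db file names are placeholders)
sqlite3 jupyterhub_new.sqlite \
  "ATTACH 'jupyterhub_old.sqlite' AS old; \
   INSERT OR IGNORE INTO users SELECT * FROM old.users;"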

sgibson91 commented 3 years ago

@paigem there is a wrinkle with migrating the COESSING users' home directories. Since the COESSING hub uses Google for auth and the Pangeo hub uses GitHub, the paths won't match up even if I do transfer the data, because they'll contain emails rather than GitHub handles. Therefore, I strongly recommend they download their work locally and upload it to the new hub when the switch is made (both hubs will be available simultaneously for a short while), since there'll be no way for me to map emails to GitHub handles.

sgibson91 commented 3 years ago

@jhamman @scottyhq if either of you have the time to go through the GCP hub's NFS storage with me, I'd really appreciate it! It's not set up how I was expecting.

drwxr-xr-x 3 gke-efc6f9dd0553c8d21056 gke-efc6f9dd0553c8d21056 4096 Aug 16  2020 gke-efc6f9dd0553c8d21056
drwxr-xr-x 3 jhamman                  jhamman                  4096 Jun 26  2020 jhamman
drwx------ 10 ubuntu ubuntu 4096 Jul 26 07:59 <GITHUB_ID>

I could be looking in the wrong place though?

scottyhq commented 3 years ago

@scottyhq if either of you have the time to go through the GCP hub's NFS storage with me, I'd really appreciate it! It's not set up how I was expecting.

Sorry, but I have only worked on the AWS infrastructure, so I won't be able to help here.

rabernat commented 3 years ago

I am the one who executed the previous migration of home directories from the older cluster (ocean.pangeo.io) to the current one (https://us-central1-b.gcp.pangeo.io/). This was very hard because the old cluster used ORCID (via Globus) for auth, so we had to create a mapping between ORCID and GitHub username. Then we gzipped each user's home directory on one cluster and extracted it on the new cluster under a new username, one user at a time.

Some of the scripts I used to do this were archived here: https://gist.github.com/rabernat/c9b352de926756342e86da662a0eadf9

I believe that the script is telling us that the user homedirs should live in /mnt/nfs/uscentral1b/. However, I guess the absolute path depends on how the NFS volume is mounted.

Is this at all helpful?

sgibson91 commented 3 years ago

Thanks Ryan, I'm sure the scripts will come in handy, but I'm still struggling to find anything!

sgibson@homedir-manager-2:~$ sudo ls -al /mnt/nfs
total 8
drwxr-xr-x 2 root root 4096 Jun 26  2020 .
drwxr-xr-x 3 root root 4096 Jun 26  2020 ..
sgibson@homedir-manager-2:~$ find uscentral1b -type d
find: ‘uscentral1b’: No such file or directory
sgibson@homedir-manager-2:~$ sudo find uscentral1b -type d
find: ‘uscentral1b’: No such file or directory
sgibson@homedir-manager-2:~$

Update: A different find command

sgibson@homedir-manager-2:/$ sudo find . -type d -name uscentral1b
sgibson@homedir-manager-2:/$
rabernat commented 3 years ago

Here is how I look at the home directories

I don't know what the VM instance homedir-manager-2 is.

sgibson91 commented 3 years ago

Thanks @rabernat, the above worked. I wasn't aware the old cluster was using a filestore (which is good as that's what the new cluster is also using!). I've got the directories now :)

I don't know what the VM instance homedir-manager-2 is.

I actually think this is the VM for the last migration as it has ocean.pangeo.io/ under rpa 😂

sgibson91 commented 3 years ago

I have successfully mounted each NFS filestore to a VM in each Google Cloud project and found the locations of existing user directories and where they should be copied to. However, I am now stuck establishing an ssh connection between the two VMs.

Definitions:

What I did:

I also have logs from:

but I wasn't sure which parts of those logs would be safe to copy-paste into a public issue

sgibson91 commented 3 years ago

Things changed:

sgibson91 commented 3 years ago

I've narrowed this down to something being up with the ubuntu user that we're trying to ssh as (see step 3). I can ssh into the source VM from my local machine using the keys I generated as sgibson (having listed the public part in /home/sgibson/.ssh/authorized_keys), but not as ubuntu (also having listed the public part of the key under /home/ubuntu/.ssh/authorized_keys).

I had trouble with the su ubuntu command listed in our docs, since GCP VMs don't come configured with a root password, so I had to do something like sudo -s && su ubuntu, and I'm just not sure that has been set up properly.
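
For the record, I think the missing-root-password issue can be sidestepped by going through sudo directly rather than chaining su:

# Switch to the ubuntu user without needing a root password
sudo su - ubuntu
# or equivalently
sudo -i -u ubuntu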

However, the --archive [-a] option to rsync claims to preserve file attributes; I'm hoping that includes the UID. If so, maybe we can just forget the ubuntu user part altogether?
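
The relevant bits would then be something like the following, run on the receiving side (a sketch: rsync only preserves owner/group when the receiving end runs as root):

# --numeric-ids stops rsync remapping UIDs/GIDs by name on the target
sudo rsync --archive --numeric-ids <SOURCE> <DESTINATION>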

sgibson91 commented 3 years ago

I also double checked that I could ssh from the target VM to the source VM with the original ssh key as sgibson AND I CAN. But rsync then doesn't work; I get the same "Permission denied (publickey)" error. However, scp was successful in copying over a single file 🙌🏻 BUT the user and group were root rather than ubuntu, which is not what we're after.

Basically, I think I botched the whole su ubuntu part of the instructions here

sgibson91 commented 3 years ago

I think I needed to follow something like these instructions: https://www.digitalocean.com/community/tutorials/how-to-create-a-sudo-user-on-ubuntu-quickstart so I'll try that tomorrow.

We should definitely update our docs on this!

damianavila commented 3 years ago

@sgibson91, quick question: what are the permissions and ownership in the .ssh directories? In the past, I have experienced "Permission denied" issues when the ownership and the permissions were not the expected ones... For instance, for the files under /home/ubuntu/.ssh, I would expect ownership by the ubuntu:ubuntu user/group, the .ssh directory with chmod 700, public keys with chmod 644, and private ones with chmod 600, IIRC. From your description of the problem, it seems some ownership/permission issue is the underlying cause, IMHO.
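
In shorthand, what I would expect on that VM is something like this (the key file name is just an example):

# Expected ownership and permissions for the ubuntu user's ssh setup
sudo chown -R ubuntu:ubuntu /home/ubuntu/.ssh
chmod 700 /home/ubuntu/.ssh
chmod 644 /home/ubuntu/.ssh/<key>.pub
chmod 600 /home/ubuntu/.ssh/<key> /home/ubuntu/.ssh/authorized_keys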

sgibson91 commented 3 years ago

I would expect ownership by the ubuntu:ubuntu user/group

I agree, and my current suspicion is that it's because the ubuntu user/group doesn't exist and I'll need to set that up using the link in my previous comment.

sgibson91 commented 3 years ago

@sgibson91, quick question: what are the permissions and ownership in the .ssh directories? In the past, I have experienced "Permission denied" issues when the ownership and the permissions were not the expected ones... For instance, for the files under /home/ubuntu/.ssh, I would expect ownership by the ubuntu:ubuntu user/group, the .ssh directory with chmod 700, public keys with chmod 644, and private ones with chmod 600, IIRC. From your description of the problem, it seems some ownership/permission issue is the underlying cause, IMHO.

I tried your suggestion @damianavila, still with no luck 😭

ubuntu@pangeo-migration-vm:~$ ls -al
total 28
drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 11 12:30 .
drwxr-xr-x 4 root   root   4096 Oct  6 10:06 ..
-rw------- 1 ubuntu ubuntu 2003 Oct 11 14:44 .bash_history
-rw-r--r-- 1 ubuntu ubuntu  220 Feb 25  2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25  2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu  807 Feb 25  2020 .profile
drwx------ 2 ubuntu ubuntu 4096 Oct 11 14:01 .ssh
-rw-r--r-- 1 ubuntu ubuntu    0 Oct 11 11:47 .sudo_as_admin_successful
ubuntu@pangeo-migration-vm:~$ ls -al .ssh/
total 20
drwx------ 2 ubuntu ubuntu 4096 Oct 11 14:01 .
drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 11 12:30 ..
-rw------- 1 ubuntu ubuntu    0 Oct  6 10:05 authorized_keys
-rw-r--r-- 1 ubuntu ubuntu  222 Oct 11 11:58 known_hosts
-rw------- 1 ubuntu ubuntu 2610 Oct 14 13:47 nfs-transfer-key
-rw-r--r-- 1 ubuntu ubuntu  580 Oct 14 13:47 nfs-transfer-key.pub
ubuntu@pangeo-migration-vm:~$ chmod 700 ~/.ssh/nfs-transfer-key.pub
ubuntu@pangeo-migration-vm:~$ chmod 644 ~/.ssh/nfs-transfer-key
ubuntu@pangeo-migration-vm:~$ scp -r -p -i ~/.ssh/nfs-transfer-key ubuntu@104.154.182.94:/mnt/filestore/uscentral1b/aaronspring/Climpred_demo.ipynb /mnt/filestore/staging/aaronspring/
ubuntu@104.154.182.94: Permission denied (publickey).
ubuntu@pangeo-migration-vm:~$
damianavila commented 3 years ago

Well... chmod 700 should be on the .ssh directory, chmod 644 for public keys, and chmod 600 for private ones. I think you have those swapped in the output you pasted above (i.e. the nfs-transfer-key should be 600 instead of 644, and nfs-transfer-key.pub should be 644 rather than 700).

Btw, maybe we can jump in a video together? I have pinged you in Slack to find some time.

rabernat commented 3 years ago

I think a reasonable course of action would be to just exclude the $HOME/.ssh directory from the migration completely.

Rotating SSH keys periodically would be a wise choice anyway.
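
As a sketch, that would just be an extra flag on the transfer:

# Skip every user's .ssh directory during the copy
rsync --archive --exclude='.ssh/' <SOURCE>/ <DESTINATION>/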

Also, please do not clobber my home directory. I have the same username on both systems.

sgibson91 commented 3 years ago

I think a reasonable course of action would be to just exclude the $HOME/.ssh directory from the migration completely.

I am not trying to migrate the .ssh folder; I am trying to give the old VM a public ssh key from the new VM so that I can scp/rsync the home dirs across! At the minute I cannot transfer anything!

Also, please do not clobber my home directory. I have the same username on both systems.

Ideally, I would like to use rsync so the two are merged rather than overwritten, but I guess the only way for me to guarantee that your home directory is not clobbered would be for me to exclude it and for you to migrate it yourself?

rabernat commented 3 years ago

Oops, sorry for parachuting in with an irrelevant suggestion. I clearly misinterpreted the context.

I would like to use rsync so the 2 are merged rather than overwritten

This sounds perfect. So no special treatment needed. 👍

sgibson91 commented 3 years ago

Ok, I'm now at the stage where I've managed to migrate one user home dir, but it has not come across with the correct ownership: it migrated as ubuntu:root rather than ubuntu:ubuntu. I'm not sure if this is because I had to use sudo so that rsync had permission to create directories. I guess the worst case scenario here is that we run a recursive chown command over the filestore.
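
That worst case would be something like this, using the staging path from my earlier scp test (prod would be analogous):

# Reset ownership recursively across the share
sudo chown -R ubuntu:ubuntu /mnt/filestore/staging/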

sgibson91 commented 3 years ago

I opened https://github.com/2i2c-org/pilot-hubs/pull/753 to better document this process

sgibson91 commented 3 years ago

Update

So after a chat with @yuvipanda today, there are a couple of things I realised won't be strictly necessary for the migration, particularly for staging.

User home dirs on staging

There is a difference between how Pangeo currently uses NFS directories and how we at 2i2c have set them up. Currently, Pangeo has one single folder called uscentral1b containing all directories, regardless of whether a user logs in via the staging or prod hub. At 2i2c, we have configured separate subdirs for staging and prod, and these do not mirror each other. This is so that if staging is breached, users' files are not accessible. It also gives the engineers the freedom to break staging without risking users' files.

Given that the Pangeo home dirs are ~1TB of data, by not copying them into the staging subfolder we save ourselves from needlessly doubling the NFS size and from making users' files vulnerable to breaches on staging. This means that users' home dirs will not be available on staging after migration, but I think that's a fair expectation.

Hence the remaining to-do items for staging migration:

JupyterHub databases

I had wondered if I'd need to merge the two databases from the old and new JupyterHubs, but this is only critical if users have been added manually. Once https://github.com/2i2c-org/infrastructure/issues/733 has been processed, that won't be the case, as auth is being handled "outside" the hub (i.e. by GitHub). The JupyterHub db has been designed to be transient and able to reconstruct itself from previous states, hence I don't think we need to do anything with the hub db.

Remaining to-do items for prod migration:

sgibson91 commented 3 years ago

I am scheduling the next data migration for 5th November 2021, ready for the prod hub to go live on 8th November 2021.

sgibson91 commented 3 years ago

Kicked off the migration process into the prod folder on the filestore

sgibson91 commented 3 years ago

Migration completed!