Closed choldgraf closed 3 years ago
This issue is a MUST HAVE for the migration of the Pangeo GCP production cluster.
Docs we already have on moving user directories: https://pilot-hubs.2i2c.org/en/latest/howto/operate/move-hub.html
Since the new hub is using Google Filestore, I will need to mount that to a VM to be able to carry out the transfer. Instructions here: https://cloud.google.com/filestore/docs/creating-instances
I'm keen to use rsync rather than scp, as I think it handles the question of when to overwrite files more elegantly: https://linuxize.com/post/how-to-transfer-files-with-rsync-over-ssh/
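A minimal sketch of the overwrite handling referred to here, assuming rsync is installed. The helper name and the directories passed to it are hypothetical, not the real filestore layout. With --ignore-existing, rsync skips any file that already exists on the receiving side, so the two trees are merged rather than overwritten.

```shell
# Hypothetical helper: merge SRC into DEST without touching files that
# already exist in DEST. Both arguments are placeholder directories.
merge_dirs() {
  local src="$1" dest="$2"
  # -a: archive mode (recursive; keeps times, permissions, symlinks)
  # --ignore-existing: skip files that already exist on the receiving side
  # Trailing slashes mean "copy the contents of src into dest".
  rsync -a --ignore-existing "${src%/}/" "${dest%/}/"
}
```

Swapping --ignore-existing for --update would instead skip only those files that are newer on the receiving side.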
This StackOverflow answer provides a Python class to merge SQLite db's https://stackoverflow.com/a/61954182
@paigem there is a wrinkle with migrating the COESSING users' home directories. Since the COESSING hub uses Google for auth and the Pangeo hub uses GitHub, the paths won't match up even if I do transfer the data, because they'll contain the users' emails, not their GitHub handles. Therefore, I strongly recommend they download their work locally and upload it to the new hub when the switch is made (both hubs will be available simultaneously for a short while), since there'll be no way for me to map emails to GitHub handles.
@jhamman @scottyhq if either of you have the time to go through the GCP hub's NFS storage with me, I'd really appreciate it! It's not set up how I was expecting.
On the homedir-manager-2 VM, under /home there are some user dirs but mostly gke-<SOME_HASH> dirs, e.g.:
drwxr-xr-x 3 gke-efc6f9dd0553c8d21056 gke-efc6f9dd0553c8d21056 4096 Aug 16 2020 gke-efc6f9dd0553c8d21056
drwxr-xr-x 3 jhamman jhamman 4096 Jun 26 2020 jhamman
drwx------ 10 ubuntu ubuntu 4096 Jul 26 07:59 <GITHUB_ID>
I could be looking in the wrong place though?
@scottyhq if either of you have the time to go through the GCP hub's nfs storage with me, I'd really appreciate it! It's not setup how I was expecting.
Sorry, but I have only worked on the AWS infrastructure, so I won't be able to help here.
I am the one who executed the previous migration of home directories from the older cluster (ocean.pangeo.io) to the current one (https://us-central1-b.gcp.pangeo.io/). This was very hard because the old cluster used ORCID (via Globus) for auth, so we had to create a mapping between ORCID and GitHub user name. Then we gzipped each user's home directory from one cluster and extracted it to the new cluster under a new username one at a time.
Some of the scripts I used to do this were archived here: https://gist.github.com/rabernat/c9b352de926756342e86da662a0eadf9
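The approach described above (map old usernames to new ones, then archive each home directory and unpack it under the new name) could be sketched roughly as follows. This is not the code from the gist; the function names, the mapping-file format ("old_name new_name" per line) and the directory arguments are all assumptions made for illustration.

```shell
# Hypothetical sketch: copy one home directory under a new username by
# streaming a gzipped tar archive from the old location to the new one.
copy_home_as() {
  local src_root="$1" old_name="$2" dest_root="$3" new_name="$4"
  mkdir -p "$dest_root/$new_name"
  # Archive the contents of the old home dir and unpack them under the new name.
  tar -C "$src_root/$old_name" -czf - . | tar -C "$dest_root/$new_name" -xzf -
}

# Assumed mapping file format: one "old_name new_name" pair per line.
migrate_from_mapping() {
  local mapping="$1" src_root="$2" dest_root="$3"
  local old_name new_name
  # The "|| [ -n ... ]" part handles a final line without a trailing newline.
  while read -r old_name new_name || [ -n "$old_name" ]; do
    [ -n "$old_name" ] || continue   # skip blank lines
    copy_home_as "$src_root" "$old_name" "$dest_root" "$new_name"
  done < "$mapping"
}
```

Processing users one at a time, as described, keeps a failure for one user from affecting the others already copied.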
I believe that the script is telling us that the user homedirs should live in /mnt/nfs/uscentral1b/. However, I guess the absolute path depends on how the NFS volume is mounted.
Is this at all helpful?
Thanks Ryan, I'm sure the scripts will come in handy, but I'm still struggling to find anything!
sgibson@homedir-manager-2:~$ sudo ls -al /mnt/nfs
total 8
drwxr-xr-x 2 root root 4096 Jun 26 2020 .
drwxr-xr-x 3 root root 4096 Jun 26 2020 ..
sgibson@homedir-manager-2:~$ find uscentral1b -type d
find: ‘uscentral1b’: No such file or directory
sgibson@homedir-manager-2:~$ sudo find uscentral1b -type d
find: ‘uscentral1b’: No such file or directory
sgibson@homedir-manager-2:~$
Update: A different find command
sgibson@homedir-manager-2:/$ sudo find . -type d -name uscentral1b
sgibson@homedir-manager-2:/$
Here is how I look at the home directories
10.126.142.50:/home
sudo apt-get -y update &&
sudo apt-get install nfs-common
sudo mkdir -p /mnt/filestore
sudo mount 10.126.142.50:/home /mnt/filestore
cd /mnt/filestore/uscentral1b
ls # -> all the directories are there
I don't know what the VM instance homedir-manager-2 is.
Thanks @rabernat, the above worked. I wasn't aware the old cluster was using a filestore (which is good as that's what the new cluster is also using!). I've got the directories now :)
I don't know what the VM instance homedir-manager-2 is.
I actually think this is the VM for the last migration as it has ocean.pangeo.io/ under rpa 😂
I have successfully mounted each NFS filestore to a VM in each Google Cloud project and found the locations of existing user directories and where they should be copied to. However, I am now stuck establishing an ssh connection between the two VMs.
Definitions:
pangeo-integration-te GCP project
pangeo-181919 GCP project
What I did:
sudo -s and set a root password. This changed my prompt from sgibson@target-vm to ubuntu@target-vm
ssh-keygen -f nfs-transfer-key. I made sure the output files were in ~/.ssh/
nfs-transfer-key.pub from the target VM to the source VM
rsync -azvhP ubuntu@SOURCE-VM-IP:/mnt/filestore/uscentral1b/ /mnt/filestore/staging/
I also tried this command too:
scp -p -r -i ~/.ssh/nfs-transfer-key ubuntu@SOURCE-VM-IP:/mnt/filestore/uscentral1b/ /mnt/filestore/staging/
I tried both with and without sudo
Permission denied (publickey).
I also have logs from:
ssh -vvv ... from the target VM to the source VM; and
tail -f /var/log/auth.log
but wasn't sure what from those logs would be safe to copy-paste into a public issue
/home/ubuntu/.ssh/authorized_keys on the source VM should be a file, not a directory (however, this did not resolve the above issue)
I've narrowed this down to something being up with the ubuntu user that we're trying to ssh as (see step 3). I can ssh into the source VM from my local machine using the keys I generated as sgibson (having listed the public part in /home/sgibson/.ssh/authorized_keys) but not as ubuntu (also having listed the public part of the key under /home/ubuntu/.ssh/authorized_keys).
I had trouble with the su ubuntu command listed in our docs since GCP VMs don't come configured with a root password, so I had to do something like sudo -s && su ubuntu, and I'm just not sure that has been set up properly.
However, the --archive [-a] option to rsync claims to preserve attributes of files. I'm hoping that means the UID? If so, maybe we can just forget the ubuntu user part?
I also double checked that I could ssh from target VM to source VM with the original ssh key as sgibson AND I CAN. But rsync then doesn't work, I get the same "Permission denied (publickey)" error. However, scp was successful in copying over a single file 🙌🏻 BUT the user and group were root rather than ubuntu, which is not what I think we're after.
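One difference between the rsync and scp commands quoted earlier in the thread is that scp is passed the key with -i while rsync is not. Whether that explains the error here is not established, but rsync can be told which key to use through its -e option. A sketch, with a hypothetical helper name and placeholder arguments:

```shell
# Hypothetical helper: run rsync over ssh with an explicit private key.
# KEY, SRC and DEST are placeholders, e.g. ~/.ssh/nfs-transfer-key,
# ubuntu@SOURCE-VM-IP:/some/dir/ and /some/local/dir/.
rsync_with_key() {
  local key="$1" src="$2" dest="$3"
  # -e sets the remote shell command; "ssh -i KEY" makes ssh offer that key.
  # Note: a key path containing spaces would need extra quoting here.
  # ${RSYNC:-rsync} allows the command to be swapped out when testing.
  ${RSYNC:-rsync} -azvhP -e "ssh -i $key" "$src" "$dest"
}
```

Setting RSYNC="echo rsync" prints the command that would be run instead of running it, which is a cheap way to check the arguments first.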
Basically, I think I botched the whole su ubuntu part of the instructions here
I think I needed to follow something like these instructions: https://www.digitalocean.com/community/tutorials/how-to-create-a-sudo-user-on-ubuntu-quickstart so I'll try that tomorrow
We should definitely update our docs on this!
@sgibson91, quick question, what are the permissions and ownership in the .ssh directories?
In the past, I have experienced "Permission denied" issues when the ownership and the permissions were not the expected ones... For instance, for the files under /home/ubuntu/.ssh, I would expect ownership by the ubuntu:ubuntu user/group, the .ssh directory with chmod 700, public keys with chmod 644, and private ones with chmod 600, IIRC.
From your description of the problem, it seems some ownership/permission issue is the underlying cause, IMHO.
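The permissions suggested above could be applied with something like the following sketch. The helper name is hypothetical, and it assumes public keys are the files ending in .pub; everything else in the directory is treated as private.

```shell
# Hypothetical helper: apply the permissions described above to an .ssh dir.
fix_ssh_perms() {
  local ssh_dir="$1"
  local f
  chmod 700 "$ssh_dir"             # directory: owner only
  for f in "$ssh_dir"/*; do
    [ -f "$f" ] || continue        # skip anything that is not a regular file
    case "$f" in
      *.pub) chmod 644 "$f" ;;     # public keys: world-readable
      *)     chmod 600 "$f" ;;     # private keys, authorized_keys, etc.
    esac
  done
}
```

Ownership is a separate matter: setting it to ubuntu:ubuntu would need chown, which normally requires root when the target user is not the one running the command.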
I would expect ownership by the ubuntu:ubuntu user/group
I agree, and my current suspicion is that it's because the ubuntu user/group doesn't exist and I'll need to set that up using the link in my previous comment.
@sgibson91, quick question, what are the permissions and ownership in the .ssh directories? In the past, I have experienced "Permission denied" issues when the ownership and the permissions were not the expected ones... For instance, for the files under /home/ubuntu/.ssh, I would expect ownership by the ubuntu:ubuntu user/group, the .ssh directory with chmod 700, public keys with chmod 644, and private ones with chmod 600, IIRC. From your description of the problem, it seems some ownership/permission issue is the underlying cause, IMHO.
I tried your suggestion @damianavila, still with no luck 😭
ubuntu@pangeo-migration-vm:~$ ls -al
total 28
drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 11 12:30 .
drwxr-xr-x 4 root root 4096 Oct 6 10:06 ..
-rw------- 1 ubuntu ubuntu 2003 Oct 11 14:44 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
drwx------ 2 ubuntu ubuntu 4096 Oct 11 14:01 .ssh
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 11 11:47 .sudo_as_admin_successful
ubuntu@pangeo-migration-vm:~$ ls -al .ssh/
total 20
drwx------ 2 ubuntu ubuntu 4096 Oct 11 14:01 .
drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 11 12:30 ..
-rw------- 1 ubuntu ubuntu 0 Oct 6 10:05 authorized_keys
-rw-r--r-- 1 ubuntu ubuntu 222 Oct 11 11:58 known_hosts
-rw------- 1 ubuntu ubuntu 2610 Oct 14 13:47 nfs-transfer-key
-rw-r--r-- 1 ubuntu ubuntu 580 Oct 14 13:47 nfs-transfer-key.pub
ubuntu@pangeo-migration-vm:~$ chmod 700 ~/.ssh/nfs-transfer-key.pub
ubuntu@pangeo-migration-vm:~$ chmod 644 ~/.ssh/nfs-transfer-key
ubuntu@pangeo-migration-vm:~$ scp -r -p -i ~/.ssh/nfs-transfer-key ubuntu@104.154.182.94:/mnt/filestore/uscentral1b/aaronspring/Climpred_demo.ipynb /mnt/filestore/staging/aaronspring/
ubuntu@104.154.182.94: Permission denied (publickey).
ubuntu@pangeo-migration-vm:~$
Well... chmod 700 should be in the .ssh directory, chmod 644 for public keys, and chmod 600 for private ones. I think you have things different from the output you pasted above (ie. the nfs-transfer-key should be 600 instead of 644).
Btw, maybe we can jump in a video together? I have pinged you in Slack to find some time.
I think a reasonable course of action would be to just exclude the $HOME/.ssh directory from the migration completely.
Rotating SSH keys periodically would be a wise choice anyway.
Also, please do not clobber my home directory. I have the same username on both systems.
I think a reasonable course of action would be to just exclude the $HOME/.ssh directory from the migration completely.
I am not trying to migrate the .ssh folder, I am trying to give the old VM a public ssh key from the new VM so that I can scp/rsync the home dirs across! At the minute I cannot transfer anything!
Also, please do not clobber my home directory. I have the same username on both systems.
Ideally, I would like to use rsync so the 2 are merged rather than overwritten but I guess the only way for me to guarantee that your home directory is not clobbered would be for me to exclude it and for you to migrate it yourself?
Oops, sorry for parachuting in with an irrelevant suggestion. I clearly misinterpreted the context.
I would like to use rsync so the 2 are merged rather than overwritten
This sounds perfect. So no special treatment needed. 👍
Ok, I'm now at the stage where I've managed to migrate 1 user home dir, but it has not migrated with the correct ownership. It has migrated with ownership ubuntu:root rather than ubuntu:ubuntu. I'm not sure if this is because I had to use sudo so rsync had permissions to create directories. I guess the worst case scenario here is that we run a recursive chown command over the filestore.
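The fallback mentioned here, a recursive chown, might look like the sketch below. The helper name is hypothetical and the owner and path arguments are placeholders.

```shell
# Hypothetical helper: recursively reset ownership under a directory.
# Changing ownership to a different user normally requires root (e.g. sudo).
reset_ownership() {
  local owner="$1" target="$2"
  # -R: recurse into subdirectories
  # -h: change symlinks themselves rather than the files they point to
  chown -R -h "$owner" "$target"
}
```

For the case described it would be called as root with an owner of ubuntu:ubuntu and the migrated directory as the target. On ~1TB of small files this walk could take a while, so it may be worth trying on a single user's directory first.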
I opened https://github.com/2i2c-org/pilot-hubs/pull/753 to better document this process
So after a chat with @yuvipanda today, there are a couple of things I realised won't be strictly necessary for the migration, particularly for staging.
There is a difference between how Pangeo currently uses NFS directories and how we at 2i2c have set them up. Currently, Pangeo has one single folder called uscentral1b containing all directories, regardless of whether a user logs in via the staging or prod hub. At 2i2c, we have configured separate subdirs for staging and prod, and these do not mirror each other. This is so that if staging is breached, users' files are not accessible. It also gives the engineers the freedom to break staging without risking users' files.
Given that the Pangeo home dirs are ~1TB of data, by not copying them into the staging subfolder we save ourselves from needlessly doubling the NFS size or making users' files vulnerable to breaches on staging. This means that users' home dirs will not be available on staging after migration, but I think that's a fair expectation.
Hence the remaining to-do items for staging migration:
staging.pangeo.2i2c.cloud at staging...pangeo.io
I had wondered if I'd need to merge the two databases from the old and new JupyterHubs, but this is only critical if users have been added manually. Once https://github.com/2i2c-org/infrastructure/issues/733 has been processed, that won't be the case as auth is being handled "outside" the hub (i.e. by GitHub). The JupyterHub db has been designed to be transient and able to reconstruct itself from previous states, hence I don't think we need to do anything with the hub db.
Remaining to-do items for prod migration:
prod subdir of NFS filestore
pangeo.2i2c.cloud at prod...pangeo.io
I am scheduling the next data migration for 5th November 2021 ready for the prod hub to go live on 8th November 2021
Kicked off the migration process into the prod folder on the filestore
Migration completed!
Description
There are many users that are currently on the old Pangeo JupyterHub (at https://us-central1-b.gcp.pangeo.io/). We should migrate their user home directories to the hub that we are deploying.
Value / benefit
This will minimize the disruption that these users feel when they migrate from one hub to the next.
Implementation details
We should understand whether we need to simply point the old hub's user filesystem to our new hub, or if we will have to move those filesystems instead. Update 2021-10-06: We will be copying contents from one filesystem to another.
We'll need to make sure that the new hub is "ready to go" when this happens, because it will force many users to use the new hub since that's where their work will be. Update 2021-10-06: Sarah plans to make posts in the Pangeo Discourse which will detail when the domain name switch will happen and when the last data migration happened before that date (a gap she will try to minimise to the best of her ability).
Since the new hub (https://pangeo.2i2c.cloud) has current users, at the hub database step of the move we will need to take care to merge the two databases rather than overwrite them, so no one's access is lost.
The old hub is at https://us-central1-b.gcp.pangeo.io/.
Tasks to complete
Investigate how to merge SQLite databases so we don't overwrite any of the new hub's user data. Not required. See https://github.com/2i2c-org/infrastructure/issues/653#issuecomment-946497123
Updates