det-lab / jupyterhub-deploy-kubernetes-jetstream

CDMS JupyterHub deployment on XSEDE Jetstream

Offsite backup of user data #44

Closed · zonca closed this 3 years ago

zonca commented 3 years ago

from @pibion:

I'm wondering if there's a way to access the existing storage volumes, and I'm also wondering if there's a way to set a permanent backup with e.g. the Open Storage Network (CDMS has an allocation there now). Adding @glass-ships and @thathayhaykid as they might be interested in thinking about this.

zonca commented 3 years ago

@pibion we are thinking about just the user data, right? Not the large 500GB volume, right?

pibion commented 3 years ago

@zonca correct, just the user data. We could definitely put strong caps on that. It would be nice to give people 5 GB home spaces, but we could likely go lower if we think that might be a problem.

zonca commented 3 years ago

ok, I tested the OpenStack volume backup service https://docs.openstack.org/cinder/train/admin/blockstorage-volume-backups.html, but it is not available on Jetstream:

> openstack volume backup create 274a6044-72bd-4538-b01c-2ffe10966f43
Service cinder-backup could not be found.

Let me first ask the Jetstream team if they have recommendations.

zonca commented 3 years ago

@pibion do you have some details on what the Open Storage Network is offering you? Object store / NFS server / something else?

pibion commented 3 years ago

They offer an object store.

zonca commented 3 years ago

@pibion S3 or Swift API for the object store?

zonca commented 3 years ago

ok, it seems like doing this at the OpenStack level is not possible without cinder-backup, so I'll investigate Kubernetes solutions, maybe starting from https://www.fairwinds.com/blog/gemini-automate-backups-of-persistentvolumes-in-kubernetes. I'll do some research and report back. In the meantime it would be useful to have more information about the Open Storage Network.

pibion commented 3 years ago

@zonca the Open Storage Network uses the S3 API.

zonca commented 3 years ago

Gemini requires VolumeSnapshot functionality that we don't currently have installed in the cluster. Stash (https://github.com/stashed/stash) looks interesting: it can save to an S3 endpoint. I think I could run a test in a development deployment. @pibion, can you please add the credentials for OSN to the https://github.com/pibion/jupyterhub-deploy-kubernetes-jetstream-secrets repo?
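
For reference, Stash reaches an S3-compatible store through a Repository object backed by a credentials Secret. A minimal sketch of the shape, where the names, bucket, and endpoint are assumptions rather than the actual configuration:

apiVersion: stash.appscode.com/v1alpha1
kind: Repository
metadata:
  name: osn-repo                         # assumed name, reused in the sketches below
  namespace: jhub
spec:
  backend:
    s3:
      endpoint: https://<osn-endpoint>   # placeholder for the OSN S3 endpoint
      bucket: <bucket>                   # placeholder bucket name
      prefix: jupyterhub-backups         # assumed prefix
    storageSecretName: osn-secret        # Secret holding AWS_ACCESS_KEY_ID,
                                         # AWS_SECRET_ACCESS_KEY and RESTIC_PASSWORD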

pibion commented 3 years ago

@zonca credentials are added!

zonca commented 3 years ago

ok, I tested the aws command line client with OSN and it works fine. I added a preconfigured config file and updated the instructions.
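
For anyone reproducing this, the test boils down to pointing the standard aws client at the OSN endpoint. A sketch, with the profile name, bucket, and endpoint as placeholders:

# ~/.aws/credentials
[osn]
aws_access_key_id = <OSN access key>
aws_secret_access_key = <OSN secret key>

# list the bucket contents through the OSN endpoint
aws s3 ls s3://<bucket>/ --profile osn --endpoint-url https://<osn-endpoint>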

zonca commented 3 years ago

I requested a free license for Stash so I can try it out

zonca commented 3 years ago

Still working on Stash; I cannot get it to work. These systems seem very complicated and targeted at larger deployments. I'll try another search, looking specifically for something simpler.

zonca commented 3 years ago

ok, in the end I got Stash to work. It is quite nice: it backs up just the content of the user volumes, encrypted, to the object store, and the data can be restored back into Kubernetes using the same service.

I wrote a tutorial about it:

https://zonca.dev/2021/04/jetstream-backup-kubernetes-volumes-object-store.html

Now I will leave the daily backup service running just for my volume for a few days and see how it behaves.
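
For context, the daily backup service amounts to one Stash BackupConfiguration per volume, roughly shaped like this (a sketch; the exact manifests are in the tutorial above, and the repository name is the assumed one from earlier):

apiVersion: stash.appscode.com/v1beta1
kind: BackupConfiguration
metadata:
  name: test-backup
  namespace: jhub
spec:
  repository:
    name: osn-repo           # the Repository pointing at OSN
  schedule: "0 8 * * *"      # daily, standard cron syntax
  task:
    name: pvc-backup         # Stash task that backs up a PersistentVolumeClaim
  target:
    ref:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: claim-zonca      # the user's home volume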

Next we can set up other users and scale this up gradually.

pibion commented 3 years ago

@zonca fantastic! If we want to practice wiping a user volume and restoring, I can volunteer.

zonca commented 3 years ago

I tested already on mine and it worked fine. Next week, if the daily backups on my volume keep working, I'd like to test on 4/5 users. Do you have a preference, or should I just pick you and 4 other users randomly, just to prove it is working?

pibion commented 3 years ago

Right now we only have four regular users, and none of them have work that isn't version controlled. So go ahead and use all of us.

We'll be running some tutorials within our collaboration, so I'm expecting that a few more folks will be signing on. There are a couple who stopped using the platform because they were afraid of data loss, so once this test is complete I can reach out to them and see if the platform has what they need to return.

zonca commented 3 years ago

I have pibion, ramirece, and zkromer.

zonca commented 3 years ago

ok, the nightly backup of my volume worked fine. Now I have 4 nightly backups, spaced 10 minutes apart:

kj get backupconfiguration
NAME              TASK         SCHEDULE     PAUSED   AGE
backup-pibion     pvc-backup   10 8 * * *   false    8m38s
backup-ramirece   pvc-backup   20 8 * * *   false    74s
backup-zkromer    pvc-backup   30 8 * * *   false    21s
test-backup       pvc-backup   0 8 * * *    false    3d17h
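
The 10-minute spacing keeps the jobs from all hitting the object store at once. One way to stamp out the per-user configurations is a small loop like this (a sketch; the template file name is an assumption):

# create one BackupConfiguration per user, offsetting the cron
# minute by 10 for each (backup_template.yaml is hypothetical)
MINUTE=10
for USER in pibion ramirece zkromer; do
    sed -e "s/USERNAME/$USER/g" -e "s/MINUTE/$MINUTE/g" \
        backup_template.yaml | kubectl -n jhub create -f -
    MINUTE=$((MINUTE + 10))
done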

Let's leave them running for a week, then do a restore test on the volume of @pibion

zonca commented 3 years ago

Most backups worked:

NAME                         INVOKER-TYPE          INVOKER-NAME      PHASE       AGE
backup-pibion-1619251806     BackupConfiguration   backup-pibion     Succeeded   2d14h
backup-pibion-1619338211     BackupConfiguration   backup-pibion     Succeeded   38h
backup-pibion-1619424604     BackupConfiguration   backup-pibion     Succeeded   14h
backup-ramirece-1619252409   BackupConfiguration   backup-ramirece   Succeeded   2d14h
backup-ramirece-1619338803   BackupConfiguration   backup-ramirece   Succeeded   38h
backup-ramirece-1619425208   BackupConfiguration   backup-ramirece   Succeeded   14h
backup-zkromer-1619253002    BackupConfiguration   backup-zkromer    Succeeded   2d14h
backup-zkromer-1619339407    BackupConfiguration   backup-zkromer    Succeeded   38h
backup-zkromer-1619425804    BackupConfiguration   backup-zkromer    Skipped     14h
test-backup-1619251205       BackupConfiguration   test-backup       Succeeded   2d15h
test-backup-1619337607       BackupConfiguration   test-backup       Succeeded   39h
test-backup-1619424011       BackupConfiguration   test-backup       Succeeded   15h

@zkromerUCD's volume has a problem, probably due to an issue on Jetstream that we haven't figured out yet: https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/issues/40#issuecomment-827202593

@zkromerUCD please do not use JupyterHub for a couple of days while we debug; then I'll set up a workaround script.

zonca commented 3 years ago

ok @zkromerUCD, I set up the workaround script, so your volume should be fine. Please let me know if you have issues logging in.

zonca commented 3 years ago

ok, backups are still running fine:

NAME                         INVOKER-TYPE          INVOKER-NAME      PHASE       AGE
backup-pibion-1620461406     BackupConfiguration   backup-pibion     Succeeded   2d10h
backup-pibion-1620547803     BackupConfiguration   backup-pibion     Succeeded   34h
backup-pibion-1620634204     BackupConfiguration   backup-pibion     Succeeded   10h
backup-ramirece-1620462009   BackupConfiguration   backup-ramirece   Succeeded   2d10h
backup-ramirece-1620548407   BackupConfiguration   backup-ramirece   Succeeded   34h
backup-ramirece-1620634807   BackupConfiguration   backup-ramirece   Succeeded   10h
backup-zkromer-1620462603    BackupConfiguration   backup-zkromer    Succeeded   2d10h
backup-zkromer-1620549010    BackupConfiguration   backup-zkromer    Succeeded   34h
backup-zkromer-1620635403    BackupConfiguration   backup-zkromer    Succeeded   10h
test-backup-1620460802       BackupConfiguration   test-backup       Succeeded   2d11h
test-backup-1620547211       BackupConfiguration   test-backup       Succeeded   35h
test-backup-1620633610       BackupConfiguration   test-backup       Succeeded   11h

I think we can do a restore test on another volume instead of mine.

@pibion is it ok if I drop your volume then try to restore from backup?

zonca commented 3 years ago

for the long-term retention policy, I would do:

keepLast: 6
keepWeekly: 7
keepMonthly: 12

so the last 6 daily backups, the last 7 weekly ones, and the last 12 monthly ones. restic deduplicates automatically, storing just 1 copy of each file, so multiple backups of the same files are only counted once (pretty nice!)
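
In Stash this block sits under the BackupConfiguration spec and needs a name. Roughly (the policy name is assumed, and prune: true makes restic actually delete unreferenced data from the repository):

spec:
  retentionPolicy:
    name: keep-last-weekly-monthly   # assumed name
    keepLast: 6
    keepWeekly: 7
    keepMonthly: 12
    prune: true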

I also added 2 more sections to the tutorial: one about using the restic command to access the backups directly, and one about automating the configuration for multiple users (the blog is still building, it will be ready in a bit):

https://zonca.dev/2021/04/jetstream-backup-kubernetes-volumes-object-store.html
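
The direct restic access mentioned above looks roughly like this (endpoint, bucket, and prefix are placeholders; the password is the RESTIC_PASSWORD stored in the Kubernetes secret):

export AWS_ACCESS_KEY_ID=<OSN access key>
export AWS_SECRET_ACCESS_KEY=<OSN secret key>
export RESTIC_PASSWORD=<repository password>

# list all snapshots in the Stash repository
restic -r s3:https://<osn-endpoint>/<bucket>/<prefix> snapshots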

zonca commented 3 years ago

Still working on this.

zonca commented 3 years ago

ok, I implemented a workaround, tested it, and updated the tutorial: https://github.com/zonca/zonca-blog/commit/0a7fe8100bca850e943124d81fb02d1cd8d9f24d

now let's have the machinery run for a few days and check again.

I also saw that other users are logging in, so I am backing them up too:

pibion
ramirece
zkromer
zonca
gardn701
humerbenjamin
mbaiocchi

zonca commented 3 years ago

ok, the system itself worked fine, but 2 volumes (pibion and ramirece) have issues attaching/reattaching; I asked Jetstream for support.

zonca commented 3 years ago

ok, it has been working reliably for a week, so it seems fine.

Next I'll test again dropping a volume and restoring from backup.

zonca commented 3 years ago

ok, testing the restore functionality

I have some files in my volume:

[screenshot: the files in my volume]

I added a new file not backed up to make sure I am really recovering from a backup.

I delete the whole Kubernetes volume:

> kj delete pvc claim-zonca
persistentvolumeclaim "claim-zonca" deleted

I log out and back in, and Kubernetes creates another empty volume for me:

[screenshot: the new empty volume]

to be continued...

zonca commented 3 years ago

Restore (by the Kubernetes admin)

backups are all stored together, so first we need to tag them by username (see the blog post above for more details):

. restic_tag_usernames.sh
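
Per snapshot, the script boils down to restic's tag subcommand; for one of the snapshots below it is essentially (the repository URL is a placeholder, and the real mapping logic is in the blog post):

# attach the username as a restic tag to a given snapshot ID
restic -r s3:https://<osn-endpoint>/<bucket>/<prefix> tag --add zonca 91277a2b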

so we now have backups tagged by username:

210dc6bb  2021-07-06 01:01:17  host-0      pibion                                                            /stash-data
ffdacc6c  2021-07-06 01:10:22  host-0      ramirece                                                          /stash-data
04878c6c  2021-07-06 01:20:25  host-0      zkromer                                                           /stash-data
91277a2b  2021-07-06 01:30:19  host-0      zonca                                                             /stash-data
bd1f369f  2021-07-06 01:40:23  host-0      gardn701                                                          /stash-data
918dd498  2021-07-06 01:50:17  host-0      humerbenjamin                                                     /stash-data
2b5c8ce5  2021-07-07 01:00:21  host-0      pibion                                                            /stash-data
1a6d40ca  2021-07-07 01:10:19  host-0      ramirece                                                          /stash-data
727f8586  2021-07-07 01:20:24  host-0      zkromer                                                           /stash-data
240b3319  2021-07-07 01:30:18  host-0      zonca                                                             /stash-data
141dd130  2021-07-07 01:40:22  host-0      gardn701                                                          /stash-data
54c2b6a2  2021-07-07 01:50:17  host-0      humerbenjamin                                                     /stash-data

We can identify the ID of the backup we want to restore from (in our case 91277a2b), write that into stash_restore.yaml, and give it a unique name.
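
stash_restore.yaml is a Stash RestoreSession; a sketch of its shape (the repository name is the assumed one from earlier, the snapshot ID and names are the ones from this test):

apiVersion: stash.appscode.com/v1beta1
kind: RestoreSession
metadata:
  name: restorezonca
  namespace: jhub
spec:
  repository:
    name: osn-repo             # the Repository pointing at OSN
  task:
    name: pvc-restore          # Stash task that restores into a PersistentVolumeClaim
  target:
    ref:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: claim-zonca        # the freshly recreated empty volume
  rules:
    - snapshots: ["91277a2b"]  # snapshot ID picked from the tagged listing above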

> kubectl -n jhub create -f stash_restore.yaml
restoresession.stash.appscode.com/restorezonca created

Here we go: the old files are copied back into the volume, so new files are not overwritten. I have configured restic to back up just the content, not a snapshot of the whole volume:

[screenshot: the restored files in the volume]

ok, I consider this completed. It took some time!

@pibion

zonca commented 3 years ago

also updated the tutorial with the latest findings: https://zonca.dev/2021/04/jetstream-backup-kubernetes-volumes-object-store.html