@pibion we are thinking about backing up just the user data, right? Not the large 500 GB volume?
@zonca correct, just the user data. We could definitely put strong caps on that. It would be nice to give people 5 GB home spaces, but we could likely go lower if we think that might be a problem.
ok, I tested the OpenStack volume backup service https://docs.openstack.org/cinder/train/admin/blockstorage-volume-backups.html, but it is not available on Jetstream:
> openstack volume backup create 274a6044-72bd-4538-b01c-2ffe10966f43
Service cinder-backup could not be found.
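For reference, a hedged way to double-check which cinder services are deployed (this usually needs admin credentials, so it may not work for regular tenant users):
> openstack volume service list
If the backup service were available, a cinder-backup row would appear in that list.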
Let me first ask the Jetstream team if they have recommendations.
@pibion do you have details on what the Open Storage Network is offering you? Object store / NFS server / something else?
They offer an object store.
@pibion S3 or SWIFT API for the object store?
ok, it seems like doing this at the OpenStack level is not possible since we don't have cinder-backup. I'll investigate Kubernetes solutions, starting maybe from this: https://www.fairwinds.com/blog/gemini-automate-backups-of-persistentvolumes-in-kubernetes. I'll do some research and report back.
In the meantime it would be useful to have more information about the Open Storage Network.
@zonca the Open Storage Network uses the S3 API.
Gemini requires VolumeSnapshot functionality, which we don't currently have installed in the cluster. Stash looks interesting (https://github.com/stashed/stash); it can save into an S3 endpoint.
I think I could do a test in a development deployment. @pibion, can you please add the credentials for OSN to the https://github.com/pibion/jupyterhub-deploy-kubernetes-jetstream-secrets repo?
@zonca credentials are added!
ok, I tested the aws command line client with OSN and it works fine. I added a preconfigured config file and updated the instructions.
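For the record, a minimal hedged example of the kind of call that works against OSN; the endpoint and bucket below are placeholders, not the real values:
> aws s3 ls s3://my-osn-bucket --endpoint-url https://OSN-ENDPOINT
The only difference from standard S3 usage is pointing --endpoint-url at the OSN gateway instead of AWS.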
I requested a free license for Stash so I can try it out.
still working on Stash, cannot get it to work. It seems those systems are very complicated and targeted at larger deployments. I think I'll try another search looking specifically for something simpler.
ok, in the end I got Stash to work. It is quite nice: it backs up just the content of the user volumes, encrypted, to object store, and it can restore it back to Kubernetes using the same service.
I wrote a tutorial about it:
https://zonca.dev/2021/04/jetstream-backup-kubernetes-volumes-object-store.html
Now I will leave the daily backup service running just for my volume for a few days and see how it behaves.
Next we can set up other users and scale this up gradually.
@zonca fantastic! If we want to practice wiping a user volume and restoring, I can volunteer.
I already tested on mine and it worked fine. Next week, if the daily backups on my volume look good, I'd like to test on 4-5 users. Do you have a preference, or should I just pick you and 4 other users randomly to prove it is working?
Right now we only have four regular users, and none of these have work that isn't version controlled. So go ahead and use all of us.
We'll be running some tutorials within our collaboration, so I'm expecting that a few more folks will be signing on. There are a couple who stopped using the platform because they were afraid of data loss, so once this test is complete I can reach out to them and see if the platform has what they need to return.
I have pibion, ramirece and zkromer.
ok, nightly backup of my volume worked fine. Now I have 4 nightly backups, spaced 10 minutes apart:
> kj get backupconfiguration
NAME              TASK         SCHEDULE     PAUSED   AGE
backup-pibion     pvc-backup   10 8 * * *   false    8m38s
backup-ramirece   pvc-backup   20 8 * * *   false    74s
backup-zkromer    pvc-backup   30 8 * * *   false    21s
test-backup       pvc-backup   0 8 * * *    false    3d17h
Let's leave them running for a week, then do a restore test on @pibion's volume.
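For reference, each per-user configuration boils down to something like the sketch below (kj above is presumably an alias for kubectl -n jhub). The repository, secret and claim names here are illustrative, not necessarily the real ones; see the tutorial above for the actual files. All users appear to share a single restic repository on OSN:

apiVersion: stash.appscode.com/v1alpha1
kind: Repository
metadata:
  name: osn-backup-repo
  namespace: jhub
spec:
  backend:
    s3:
      endpoint: https://OSN-ENDPOINT   # placeholder
      bucket: my-osn-bucket            # placeholder
      prefix: jhub-backups
    # Secret holding AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
    # and the restic encryption password (RESTIC_PASSWORD)
    storageSecretName: osn-credentials
---
apiVersion: stash.appscode.com/v1beta1
kind: BackupConfiguration
metadata:
  name: backup-pibion
  namespace: jhub
spec:
  repository:
    name: osn-backup-repo
  schedule: "10 8 * * *"        # matches the SCHEDULE column above
  task:
    name: pvc-backup            # Stash's predefined PVC backup task
  target:
    ref:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: claim-pibion        # the user's JupyterHub home volume
  retentionPolicy:
    name: keep-last-5
    keepLast: 5
    prune: true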
Most backups worked:
NAME                         INVOKER-TYPE          INVOKER-NAME      PHASE       AGE
backup-pibion-1619251806     BackupConfiguration   backup-pibion     Succeeded   2d14h
backup-pibion-1619338211     BackupConfiguration   backup-pibion     Succeeded   38h
backup-pibion-1619424604     BackupConfiguration   backup-pibion     Succeeded   14h
backup-ramirece-1619252409   BackupConfiguration   backup-ramirece   Succeeded   2d14h
backup-ramirece-1619338803   BackupConfiguration   backup-ramirece   Succeeded   38h
backup-ramirece-1619425208   BackupConfiguration   backup-ramirece   Succeeded   14h
backup-zkromer-1619253002    BackupConfiguration   backup-zkromer    Succeeded   2d14h
backup-zkromer-1619339407    BackupConfiguration   backup-zkromer    Succeeded   38h
backup-zkromer-1619425804    BackupConfiguration   backup-zkromer    Skipped     14h
test-backup-1619251205       BackupConfiguration   test-backup       Succeeded   2d15h
test-backup-1619337607       BackupConfiguration   test-backup       Succeeded   39h
test-backup-1619424011       BackupConfiguration   test-backup       Succeeded   15h
@zkromerUCD's volume has a problem, probably due to an issue on Jetstream we haven't figured out yet: https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream/issues/40#issuecomment-827202593
@zkromerUCD please do not use JupyterHub for a couple of days while we are debugging, then I'll set up a workaround script.
ok @zkromerUCD, I set up the workaround script so your volume should be fine, please let me know if you have issues logging in.
ok, backups are still running fine:
NAME                         INVOKER-TYPE          INVOKER-NAME      PHASE       AGE
backup-pibion-1620461406     BackupConfiguration   backup-pibion     Succeeded   2d10h
backup-pibion-1620547803     BackupConfiguration   backup-pibion     Succeeded   34h
backup-pibion-1620634204     BackupConfiguration   backup-pibion     Succeeded   10h
backup-ramirece-1620462009   BackupConfiguration   backup-ramirece   Succeeded   2d10h
backup-ramirece-1620548407   BackupConfiguration   backup-ramirece   Succeeded   34h
backup-ramirece-1620634807   BackupConfiguration   backup-ramirece   Succeeded   10h
backup-zkromer-1620462603    BackupConfiguration   backup-zkromer    Succeeded   2d10h
backup-zkromer-1620549010    BackupConfiguration   backup-zkromer    Succeeded   34h
backup-zkromer-1620635403    BackupConfiguration   backup-zkromer    Succeeded   10h
test-backup-1620460802       BackupConfiguration   test-backup       Succeeded   2d11h
test-backup-1620547211       BackupConfiguration   test-backup       Succeeded   35h
test-backup-1620633610       BackupConfiguration   test-backup       Succeeded   11h
I think we can do a restore test on another volume instead of mine.
@pibion is it ok if I drop your volume then try to restore from backup?
for the long-term retention policy, I would do:
keepLast: 6
keepWeekly: 7
keepMonthly: 12
so the last 6 daily backups, the last 7 weekly and the last 12 monthly ones. The backup automatically stores just 1 copy of each file, so multiple backups of the same files are only counted once (pretty nice!)
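In a Stash BackupConfiguration this would be expressed roughly as below; the name field is illustrative, and prune: true additionally deletes data no longer referenced by any kept snapshot:

retentionPolicy:
  name: keep-daily-weekly-monthly
  keepLast: 6
  keepWeekly: 7
  keepMonthly: 12
  prune: true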
I also added 2 more sections to the tutorial, one about using the restic command to access the backups, and one about automating the configuration for multiple users (the blog is still building, will be ready in a bit):
https://zonca.dev/2021/04/jetstream-backup-kubernetes-volumes-object-store.html
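As a hedged pointer for the first of those sections: restic can talk to the repository directly from any machine, given the credentials; the endpoint, bucket and prefix below are placeholders matching the illustrative Repository sketch above:
> export AWS_ACCESS_KEY_ID=<OSN access key>
> export AWS_SECRET_ACCESS_KEY=<OSN secret key>
> export RESTIC_REPOSITORY=s3:https://OSN-ENDPOINT/my-osn-bucket/jhub-backups
> export RESTIC_PASSWORD=<encryption password from the Stash secret>
> restic snapshots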
still working on this.
ok, I implemented a workaround, tested it and updated the tutorial: https://github.com/zonca/zonca-blog/commit/0a7fe8100bca850e943124d81fb02d1cd8d9f24d
now let's have the machinery run for a few days and check again.
I also saw other users are logging in, so I am backing them up:
pibion
ramirece
zkromer
zonca
gardn701
humerbenjamin
mbaiocchi
ok, the system itself worked fine, but I have 2 volumes, pibion and ramirece, with issues in attaching/reattaching; I asked Jetstream for support.
ok, it has been working reliably for a week, it seems fine.
Next I'll test again dropping a volume and restoring from backup.
ok, testing the restore functionality.
I have some files in my volume, and I added a new file that is not backed up, to make sure I am really recovering from a backup.
I delete the whole Kubernetes volume:
> kj delete pvc claim-zonca
persistentvolumeclaim "claim-zonca" deleted
I log out and back in, and Kubernetes creates another empty volume for me:
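A hedged way to confirm the fresh volume exists (again assuming the kj alias for kubectl -n jhub):
> kj get pvc claim-zonca
which should now show a claim only a few seconds old.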
to be continued...
backups are all stored together, so first we need to tag them by username (see the blog post above for more details):
> . restic_tag_usernames.sh
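The script itself is in the blog post; at its core it presumably boils down to one restic tag call per snapshot, e.g.:
> restic tag --set zonca 91277a2b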
so we now have backups tagged by username:
210dc6bb  2021-07-06 01:01:17  host-0  pibion         /stash-data
ffdacc6c  2021-07-06 01:10:22  host-0  ramirece       /stash-data
04878c6c  2021-07-06 01:20:25  host-0  zkromer        /stash-data
91277a2b  2021-07-06 01:30:19  host-0  zonca          /stash-data
bd1f369f  2021-07-06 01:40:23  host-0  gardn701       /stash-data
918dd498  2021-07-06 01:50:17  host-0  humerbenjamin  /stash-data
2b5c8ce5  2021-07-07 01:00:21  host-0  pibion         /stash-data
1a6d40ca  2021-07-07 01:10:19  host-0  ramirece       /stash-data
727f8586  2021-07-07 01:20:24  host-0  zkromer        /stash-data
240b3319  2021-07-07 01:30:18  host-0  zonca          /stash-data
141dd130  2021-07-07 01:40:22  host-0  gardn701       /stash-data
54c2b6a2  2021-07-07 01:50:17  host-0  humerbenjamin  /stash-data
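With the tags in place, a single user's snapshots can also be listed directly with standard restic filtering:
> restic snapshots --tag zonca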
We can identify the ID of the backup we want to restore from, in our case 91277a2b, write that into stash_restore.yaml, and give it a unique name.
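For reference, a hedged sketch of what stash_restore.yaml plausibly contains, following the Stash v1beta1 RestoreSession layout; the repository name is the illustrative one used earlier:

apiVersion: stash.appscode.com/v1beta1
kind: RestoreSession
metadata:
  name: restorezonca           # the unique name mentioned above
  namespace: jhub
spec:
  repository:
    name: osn-backup-repo
  task:
    name: pvc-restore          # Stash's predefined PVC restore task
  target:
    ref:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: claim-zonca        # the freshly recreated volume
  rules:
    - snapshots: ["91277a2b"]  # the backup ID identified above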
> kubectl -n jhub create -f stash_restore.yaml
restoresession.stash.appscode.com/restorezonca created
Here we go, the old files are copied into the volume, so we are not overwriting new files. I have configured restic so it backs up just the content; it doesn't take a snapshot of the whole volume.
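As a side note, files can also be pulled straight out of the repository with plain restic, bypassing Kubernetes entirely (hedged; the target directory is arbitrary):
> restic restore 91277a2b --target /tmp/restore-zonca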
ok, I consider this completed, it took some time!!
@pibion I also updated the tutorial with the latest findings: https://zonca.dev/2021/04/jetstream-backup-kubernetes-volumes-object-store.html
from @pibion: