berkeley-dsep-infra / jupyterhub-k8s

[Deprecated] Data 8's deployment of JupyterHub on Kubernetes
Apache License 2.0

Automatically backup student disks #142

Closed · jiefugong closed this 7 years ago

jiefugong commented 7 years ago

Automatically creates snapshots of GCE disks and clears old snapshots

jiefugong commented 7 years ago

@yuvipanda would you mind taking a look at this PR? https://github.com/data-8/infrastructure/issues/13#issuecomment-286578194

to-do:

SaladRaider commented 7 years ago

@yuvipanda what do you think of the backup script so far?

yuvipanda commented 7 years ago

Great and quick work! I haven't had time to look at it yet, but one question is - how will this be invoked? Each node also has a 100G base disk attached to it that we do not want to backup - only the disks attached to the persistent volume claims in the namespaces we care about. So I was thinking we'd look at the PVC objects using the kubernetes API client and then derive the google cloud disk names from there before snapshotting. How does filtering by name work?
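
A minimal sketch of the PVC-based approach described above, assuming the official kubernetes Python client; the function name and structure are illustrative, not code from this PR:

```python
from kubernetes import client, config

def gce_disk_names_for_namespace(namespace):
    """Return the GCE disk names backing every bound PVC in `namespace`."""
    config.load_kube_config()  # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    disk_names = []
    for pvc in v1.list_namespaced_persistent_volume_claim(namespace).items:
        if not pvc.spec.volume_name:
            continue  # skip PVCs that are not bound to a PV yet
        pv = v1.read_persistent_volume(pvc.spec.volume_name)
        if pv.spec.gce_persistent_disk:  # only GCE-backed volumes
            disk_names.append(pv.spec.gce_persistent_disk.pd_name)
    return disk_names
```

This sidesteps name filtering entirely: only disks bound to PVCs in the namespaces we care about ever get snapshotted, so the 100G node base disks are never touched.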

Also, have you tested this in -dev?

Thank you for the quick work! I can look at the code for style and design afterwards, but since snapshotting doesn't have many downsides (unlike autoscaling!), we can also just deploy this and then do code review afterwards.


jiefugong commented 7 years ago

hi @yuvipanda, thank you for the quick comment. to describe what the script does right now: it is more or less an organized and glorified version of this one-liner:

```
gcloud compute disks list | grep gke-prod-49ca6b0d-dyna-pvc | awk '{ print $1; }' | xargs -L1 gcloud compute disks snapshot
```

however, i realize this is probably not the best way to go about doing things. i wasn't sure how quickly this needed to be enabled, so i just wanted to get it working first. if you'd like, i'd assume the next step would be to use the kubernetes client, as you mentioned, to look at the disks associated with each notebook PVC?

tonyyanga commented 7 years ago

Update: sorry I think I misunderstood some parts about the client. Just deleted my comment earlier.

Maybe we could do better with error handling? Right now the Google Cloud API client raises a googleapiclient.errors.HttpError whenever an API call does not get an HTTP 2xx response. You may want to log such exceptions before re-raising them, in case you end up running this from cron.
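
Something along these lines, as a sketch; `snapshot_disk` and the snapshot naming scheme are hypothetical, only the HttpError handling pattern is the point:

```python
import logging

from googleapiclient.errors import HttpError

logger = logging.getLogger(__name__)

def snapshot_disk(compute, project, zone, disk):
    """Snapshot one GCE disk, logging any API failure before re-raising."""
    try:
        return compute.disks().createSnapshot(
            project=project, zone=zone, disk=disk,
            body={'name': '{}-snapshot'.format(disk)},  # illustrative name
        ).execute()
    except HttpError:
        # cron swallows tracebacks unless you log them somewhere durable
        logger.exception('snapshotting disk %s failed', disk)
        raise
```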

jiefugong commented 7 years ago

@yuvipanda could you review this at your leisure? Some notes:

  1. Added the Kubernetes client, and the backup script now does the following:

    • Gets a list of disk names to snapshot based on their pod's type (this can easily be changed to namespace, or whatever else you would like to filter the pods by)
    • Passes those disk names to backup-disks.py, where all the persistent disks belonging to the project are collected
    • Filters those disks to match the names passed by Kubernetes, then snapshots them
    • Clears snapshots older than 2 days (let me know if I should filter snapshots to only clear them in a certain way; a sketch of this cleanup step follows the list)
  2. There might be some repeated code between this and the autoscaler script, but I think we can use this first and later extract the shared code (such as what the Kubernetes client provides) into a separate folder or something
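
A rough sketch of the cleanup step from the list above, assuming the googleapiclient compute client; the 2-day cutoff mirrors the description, and the function name is made up:

```python
from datetime import datetime, timedelta, timezone

import dateutil.parser  # python-dateutil

def clear_old_snapshots(compute, project, max_age_days=2):
    """Delete project snapshots created more than `max_age_days` ago."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    request = compute.snapshots().list(project=project)
    while request is not None:  # walk every page of the listing
        response = request.execute()
        for snapshot in response.get('items', []):
            created = dateutil.parser.parse(snapshot['creationTimestamp'])
            if created < cutoff:
                compute.snapshots().delete(
                    project=project, snapshot=snapshot['name']).execute()
        request = compute.snapshots().list_next(request, response)
```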

yuvipanda commented 7 years ago

I tried running this, and got:

2017-04-03 22:42:14,679 INFO Filtered 229 disks out of 1869 total that are eligible for snapshotting

This seems to imply that only 229 disks are being snapshotted, while more than a thousand should be. Is that right?

Also, can we add command-line params to toggle each behavior individually and to set the cluster explicitly, so an invocation can look like backup-disks.py --cluster=prod --backup --delete?
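
The argparse wiring for that invocation might look like the following; the flag names come from the example above, everything else is an assumption:

```python
import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        description='Snapshot GCE disks and prune old snapshots')
    parser.add_argument('--cluster', required=True,
                        help='cluster to operate on, e.g. prod or dev')
    parser.add_argument('--backup', action='store_true',
                        help='snapshot all eligible disks')
    parser.add_argument('--delete', action='store_true',
                        help='delete snapshots older than the retention window')
    return parser.parse_args()
```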

Otherwise looks great :D

jiefugong commented 7 years ago

Hey @yuvipanda, thank you for the comments :) I have gone ahead and added the command-line arguments you requested and changed a few more things. I have also added the ability to replace a PV's underlying GCE disk with a new one created later from a snapshot; this is done by shelling out with popen for now, because the Python client has some issues with patching.
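
A sketch of what shelling out for that patch could look like, using subprocess (which wraps Popen) and `kubectl patch`; the function and its arguments are hypothetical, not the PR's actual code:

```python
import json
import subprocess

def repoint_pv_disk(pv_name, new_disk_name):
    """Point a PersistentVolume at a different GCE disk via `kubectl patch`."""
    patch = {'spec': {'gcePersistentDisk': {'pdName': new_disk_name}}}
    # strategic merge patch; raises CalledProcessError if kubectl fails
    subprocess.check_call(
        ['kubectl', 'patch', 'pv', pv_name, '--patch', json.dumps(patch)])
```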

As for your question, the only thing left to finish this PR, in my opinion, is deciding how to specify which disks (of all that belong to the project) are eligible for snapshotting in the first place. Right now I have specified that only disks for pods of type notebook are snapshotted, but I am not sure this is the best way to do it. There was also something @SaladRaider mentioned about only making disks that begin with the prefix gke-prod-49ca6b0d-dyna-pvc snapshottable? Please let me know what criteria to filter these disks on and I will be more than happy to update the script :)

jiefugong commented 7 years ago

@yuvipanda, to update my understanding of your previous comment: from what @SaladRaider told me, only the underlying disks of certain pod types (hub, notebook) should be eligible for snapshots.

if you agree, how would you like to label these pods? or is there some specific way you would like me to filter disks?

yuvipanda commented 7 years ago

Sorry, been swamped!

The general strategy I think should be:

  1. Take as input a namespace
  2. Find all the PVCs in that namespace
  3. Back them all up.

That should be good enough for us, I think
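
Tying that strategy to the earlier sketches, the driver could be as small as this; `gce_disk_names_for_namespace` and `snapshot_disk` are the hypothetical helpers from the previous comments, and the project/zone defaults are placeholders (this also assumes all the disks live in a single zone):

```python
import googleapiclient.discovery

def backup_namespace(namespace, project='example-project', zone='us-central1-a'):
    """Snapshot every PVC-backed GCE disk in `namespace`."""
    compute = googleapiclient.discovery.build('compute', 'v1')
    for disk in gce_disk_names_for_namespace(namespace):
        snapshot_disk(compute, project, zone, disk)
```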

jiefugong commented 7 years ago

@yuvipanda done. let me know what you think; i have tested this, and it works and reflects your new comments. i will write some more documentation soon, but in terms of functionality this looks good to go to me

yuvipanda commented 7 years ago

@jiefugong cool! Do you wanna meet up for an hour sometime this week to finish this and get it deployed?

jiefugong commented 7 years ago

@yuvipanda absolutely! would you like to arrange on slack a time to meet or otherwise let me know what times work best for you this week? :) i am free most mornings or later in the afternoon.

jiefugong commented 7 years ago

@yuvipanda you free sometime this upcoming week? i am pretty flexible, would love to get this deployed.

yuvipanda commented 7 years ago

Hey! Yes, monday afternoon? say, 2pm? can you send a calendar invite to yuvipanda@berkeley.edu?


jiefugong commented 7 years ago

Hi @yuvipanda, sorry for the super late response. Would tomorrow at 3 PM work instead? I've got class until then but would still love to meet up. If not please let me know what other times you're free this week and I'll do my best to accommodate :)

jiefugong commented 7 years ago

@yuvipanda I think we're ready for merge now; I've incorporated all the advice you gave this afternoon. Will add Slack support soon, lmk if I missed anything!

yuvipanda commented 7 years ago

\o/