berkeley-dsep-infra / jupyterhub-k8s

[Deprecated] Data 8's deployment of JupyterHub on Kubernetes
Apache License 2.0

Autoscaling for nodes (working with Google Cloud Platform) #117

Closed tonyyanga closed 7 years ago

tonyyanga commented 7 years ago

Closes #74

Summary

Provide auto-scaling without killing user pods, based on Kubernetes' Unschedulable flag on nodes. See scale/README.md for the expected scaling behavior.
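
For reference, cordoning a node boils down to a PATCH that sets spec.unschedulable on the node. A minimal sketch using requests against kubectl proxy (the proxy address and node name below are illustrative; the real logic lives in scale.py):

import json

import requests

PROXY = "http://localhost:8001"          # default kubectl proxy address
NODE = "gke-prod-default-pool-abc123"    # hypothetical node name

# Mark the node unschedulable: the scheduler stops placing new pods on it,
# while pods already running there keep running until they are culled.
resp = requests.patch(
    PROXY + "/api/v1/nodes/" + NODE,
    headers={"Content-Type": "application/strategic-merge-patch+json"},
    data=json.dumps({"spec": {"unschedulable": True}}),
)
resp.raise_for_status()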

How to test

See README.md for directions. Please make sure kubectl proxy is running before you run scale.py.

Comments requested

How do you want the scaling script to be run? Some options:

  1. Event-driven - run as a pod that receives HTTP requests from cull, hub, etc.
  2. On the provisioner - with cron, or a Python script that re-runs itself periodically

How should the parameters (min utilization, max utilization, etc.) be set?

Is it safe to kill a node that has cull running on it? Any general proposals for other critical services that should be protected?

What tests do you want to run with it? We have run some tests, but they are far from comprehensive.

yuvipanda commented 7 years ago

Awesome! I'll take a more detailed look tomorrow :)

Early feedback:

  1. We should use https://github.com/kubernetes-incubator/client-python to talk to the Kubernetes API rather than requests + json directly (see the sketch after this list). Kubespawner should also be ported to use it eventually.
  2. Have you considered using https://cloud.google.com/compute/docs/tutorials/python-guide for talking to the google cloud API, rather than calling the gcloud command line tool? I think that'll be much nicer in the long term...
  3. How do you think this could be extended in the future to other providers (Azure, etc) easily? Using classes rather than just functions might be a great fit here!
  4. Is there a reason we can't just run this script inside a Kubernetes cluster? Especially with (1) and (2), this would make authentication easier. We could even use a CronJob (https://kubernetes.io/docs/user-guide/cron-jobs/)?
  5. Parameters should ideally be set as environment variables directly. https://12factor.net/ (specifically https://12factor.net/config) has more info. If we do (4) we can use ConfigMaps to pass config into the script via environment variables.
  6. I think cull and statsd are ok to be killed, hub and proxy aren't.
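
Re (1), a minimal sketch of what the official client could look like here (my assumption, not code from this PR; the package installs as kubernetes and this assumes a working kubeconfig):

from kubernetes import client, config

config.load_kube_config()   # local kubeconfig; inside a pod, load_incluster_config() instead
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    # spec.unschedulable is None/False for schedulable nodes
    print(node.metadata.name, bool(node.spec.unschedulable))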

This is otherwise awesome, thank you for getting a move on this!

If it works fine we can also just start using it soon, and make changes to it slowly. I do want us to do it the 'right way' though, since both we and others will be using this for a long time :)

tonyyanga commented 7 years ago

Thank you, Yuvi, for your feedback!

  1. I am not sure the Python client is a good idea, because it does not seem to be designed for async/coroutine requests. It does provide callback-style async IO, but I am not sure it could easily work with Tornado's coroutine style.

  2. I definitely agree that Google Cloud API calls are much better than the gcloud command line. We could migrate to that in the future. Besides being lazy, we are also uncertain about how to deal with authentication, because it has not been decided how we are going to deploy the script.

  3. It was my personal decision not to use an OOP-style class in gcloud_update.py. From my perspective, classes should be introduced when there is data to associate with functions, which does not seem to be the case in the script so far. We might see the need once we start using the Google Cloud APIs, and it won't be difficult to switch.

  4. Reasons for not running this version of the script inside a container: 4a. We would need to replace kubectl proxy with the recommended authentication setup for "Accessing the API from a Pod", which involves service accounts; I prefer to do that once it is certain we need such complexity (see https://kubernetes.io/docs/user-guide/accessing-the-cluster/#accessing-the-api-from-a-pod for details, and the rough sketch after this list). 4b. It is unclear how to deal with Google Cloud authentication credentials. 4c. It is unclear whether populate.bash should just be run inside a container.

  5. Makes sense. Honestly, I might prefer command-line arguments.

  6. I will fix that soon, if you are sure that diagnostic data will not be lost when statsd is killed.
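
Regarding 4a, if/when we do run it in-cluster, my rough (untested) sketch of the service-account path with the Python client would be:

from kubernetes import client, config

try:
    # Inside the cluster: uses the pod's service account token and CA cert
    # mounted under /var/run/secrets/kubernetes.io/serviceaccount/.
    config.load_incluster_config()
except config.ConfigException:
    # Local development fallback: use the regular kubeconfig instead.
    config.load_kube_config()

v1 = client.CoreV1Api()
print(len(v1.list_node().items), "nodes visible")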

Additional point regarding populate.bash: @yuvipanda it would be great if you could incorporate a Python version of it into the autoscaler's gcloud_update.py.

Thanks.

yuvipanda commented 7 years ago

Thanks for the response!

  1. I'm confused about where async code is being used right now - I don't see any yields. requests, the current library being used, is also not async. I'm also not sure why this code needs to be async?
  2. Sensible, and I agree. But, see point 3.
  3. While I agree that encapsulation is a good line to draw for 'class or functions', I think it is the wrong one in this case. Specifically, we'll want this to work on Azure too before the end of the semester - and defining an interface that can be implemented by various 'providers' is a decent use case for a class in Python (see the sketch after this list). Doing this without classes usually means a lot of 'ifs' scattered around the code.
  4. I really really think we should be using the Kubernetes API client for this - it does all of this automatically for us already. I do want this to run on the k8s cluster - one of the advantages of this year's deployment over last year's is that everything is managed and run in one way (Kubernetes), and I'd like very much to keep it this way. If we have to run this elsewhere, then we either end up with infrastructure that isn't 100% reproducible, or we have to use ansible / puppet / some-other-thing in addition to Kubernetes to provision and run this script reproducibly in some other way.
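
To make (3) concrete, a rough sketch of the kind of provider interface I mean (all names are made up, and the gcloud commands here just mirror what the current script shells out to):

import abc
import subprocess


class Provider(abc.ABC):
    """Hypothetical interface each cloud provider would implement."""

    @abc.abstractmethod
    def resize(self, target_size):
        """Resize the cluster's node pool to target_size nodes."""

    @abc.abstractmethod
    def delete_node(self, node_name):
        """Delete the underlying VM of an already-drained node."""


class GoogleCloudProvider(Provider):
    """Sketch only: shells out to gcloud, like the current script does."""

    def __init__(self, group, zone):
        self.group = group
        self.zone = zone

    def resize(self, target_size):
        subprocess.check_call([
            "gcloud", "compute", "instance-groups", "managed", "resize",
            self.group, "--zone", self.zone, "--size", str(target_size),
        ])

    def delete_node(self, node_name):
        subprocess.check_call([
            "gcloud", "compute", "instance-groups", "managed",
            "delete-instances", self.group, "--zone", self.zone,
            "--instances", node_name,
        ])

An AzureProvider would then implement the same two methods, and the scaling logic would not need provider-specific ifs.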

For populate.bash, it is something we should replace with an in-cluster mechanism soon anyway - a DaemonSet that pulls all required images onto every node. I'll work on that this week so we won't have to run populate.bash anymore - you can leave it out of your autoscaler.

I just woke up, so I'll actually read through the code in a few hours :)

tonyyanga commented 7 years ago

Thank you Yuvi!

  1. We have not yet figured out whether the scale function should be called in an event-driven way or run by cron (or similar) at a fixed interval, as I noted in the original pull request. If it needs to be event-driven, I might introduce Tornado or another async web framework. Async won't be needed if it is eventually called by cron.

  2. I agree with your point. I think it is better to write the 'interfaces' after we understand enough about other providers, including Azure, so that we don't have to modify them frequently. We could do that later this semester.

  3. I do personally think that such a service should run as a pod like cull (and be event-driven, honestly). We can try to figure out the authentication parts and make it work as a pod in the near future.

Thank you, we will take populate.bash out. :)

yuvipanda commented 7 years ago

I just realized - when scaling 'up', I think we should first look for nodes marked unschedulable and re-mark them as schedulable before adding more nodes. You could otherwise end up in a situation where you are using a lot more nodes than necessary simply because some are marked unschedulable.

So the scale-up logic should check whether any nodes are marked 'unschedulable' before creating new nodes, and if there are, it should just mark them as schedulable instead...
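
Roughly something like this (a sketch; function and parameter names are illustrative, not the actual scale.py code, and provider stands for whatever resizes the node pool):

from kubernetes import client


def scale_up(v1, needed, provider):
    nodes = v1.list_node().items
    cordoned = [n for n in nodes if n.spec.unschedulable]

    # Step 1: re-enable cordoned nodes first - instant and free.
    for node in cordoned[:needed]:
        body = client.V1Node(spec=client.V1NodeSpec(unschedulable=False))
        v1.patch_node(node.metadata.name, body)

    # Step 2: only if that was not enough, ask the cloud provider for more VMs.
    still_needed = needed - min(needed, len(cordoned))
    if still_needed > 0:
        provider.resize(len(nodes) + still_needed)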

tonyyanga commented 7 years ago

@yuvipanda I believe that the current version will first try to mark nodes schedulable to satisfy the demand. It will only scale up nodes if it cannot satisfy the demand that way.

tonyyanga commented 7 years ago

@yuvipanda I feel that I have added enough debug logs to track down the specific actions taken by the autoscaler, and I have addressed the other immediate concerns.

yuvipanda commented 7 years ago

Awesome. I'll take another look later today :)

tonyyanga commented 7 years ago

@yuvipanda

I did not remove the one-line functions because the Kubernetes Python client describes a node in a different way. The Python client uses a V1Node class, while parsing the JSON response from raw REST API calls produces a dict. These one-line functions provide an abstraction so the logic code does not change when switching between node.metadata.name and node["metadata"]["name"].
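
For example, an accessor along these lines (illustrative; the actual helpers live in utils.py):

def get_node_name(node):
    # Works for both a V1Node from the Python client and a plain dict
    # parsed from a raw REST API response.
    if isinstance(node, dict):
        return node["metadata"]["name"]
    return node.metadata.name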

yuvipanda commented 7 years ago

@tonyyanga but you're no longer using node["metadata"]["name"] anymore, right? so why keep them?

tonyyanga commented 7 years ago

@yuvipanda

In the future we might decide to switch back to native REST API calls. The abstraction provided by those one-line functions keeps the code outside utils.py unchanged when switching between the Kubernetes Python client's node.metadata.name and raw JSON's node["metadata"]["name"].

Two possible cases exist:

  1. Switching to native REST API calls for async IO. This will be necessary if the autoscaler receives external calls.
  2. When the documentation of the Python client is too vague. Despite your support for the Python client, I have to say that its documentation is less clear than that of Kubernetes' REST APIs.

A quick example of 2: V1Node patch_node(name, body, pretty=pretty) is called when marking a node schedulable or unschedulable. It is documented at the link below: https://github.com/kubernetes-incubator/client-python/blob/master/kubernetes/docs/CoreV1Api.md#patch_node

The documentation is incorrect. First, the kubernetes.client.UnversionedPatch() in the example code simply causes an error. Second, a valid body turns out to be a V1Node object, which cannot be inferred from the documentation. Third, and probably most important, it offers no choice between application/json-patch+json, application/merge-patch+json, and application/strategic-merge-patch+json.

Please see https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#patch-operations for details about different PATCH strategies.
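
For completeness, the call shape that does work looks roughly like this (a sketch; the node name is made up, and which PATCH strategy the client ends up sending is exactly what the documentation leaves unspecified):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Cordon a node: only the field being changed is set on the partial V1Node body.
body = client.V1Node(spec=client.V1NodeSpec(unschedulable=True))
v1.patch_node("gke-prod-default-pool-abc123", body)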

yuvipanda commented 7 years ago

While I agree with the documentation issue, I don't think switching back to writing our own client implementation is something we would/should ever do. Consider the amount of authentication code you'd need to write :) And if for some reason we want this to be a web application, we can always use flask+uwsgi or something along those lines. In general, code shouldn't be async unless it needs to be, and when it does you need to change a ton of fundamental things anyway. The same goes for switching back to calling the API directly.

For the documentation issue, we should be good open source citizens and send them issues and patches :D

Keeping extra code around for possible future cases is a violation of the very useful YAGNI principle (https://martinfowler.com/bliki/Yagni.html). It's the same reason I'm ok with us not using classes or abstracting over multiple providers right now (since we do not need it yet). I do think we should keep our code as simple and small as possible, with just the right amount (not more and not less!) of indirection.

yuvipanda commented 7 years ago

Haven't looked at the code again (sorry!). Should we do a test run of this on prod this week? I'll be happy if we manage to downsize it just once, even.

tonyyanga commented 7 years ago

Sorry, I have been busy too. I am free between 1PM and 4PM tomorrow and can show up at BIDS to do the test together (in case anything bad happens :) ). Let me know when and where would be good for you.

yuvipanda commented 7 years ago

I have to be at a clinic all day tomorrow - how about Thursday?

tonyyanga commented 7 years ago

How about Thursday afternoon at 1:30PM? Maybe we could chat on Slack.

yuvipanda commented 7 years ago

I think we should also not use 'current-context' - which is what it does right now. 'dev', 'prod' or 'playground' must be explicitly set as a command-line parameter.

Doing things on the wrong cluster has already deleted everyone's homework once (I did it accidentally; thankfully this was before classes started). So we should not rely on 'current-context' at all, anywhere.
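
Something along these lines, i.e. with no default at all (the flag name and choices are just a suggestion):

import argparse

from kubernetes import config

parser = argparse.ArgumentParser(description="autoscaler (sketch)")
parser.add_argument(
    "--context",
    required=True,
    choices=["dev", "prod", "playground"],
    help="kubeconfig context to operate on; deliberately has no default",
)
args = parser.parse_args()

# Load the named context instead of whatever current-context happens to point at.
config.load_kube_config(context=args.context)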

jiefugong commented 7 years ago

Scaling up and down now successfully uses the Google Cloud Python client. Maybe you could look over some of the code for this, @yuvipanda? :)

tonyyanga commented 7 years ago

@yuvipanda

I have addressed the most urgent concerns we discussed on Wednesday.

It now uses the Google Cloud Python client and includes several safety measures against admin mistakes.

I think the pull request can be merged now, with future improvements tracked using issues.

yuvipanda commented 7 years ago

The hub and proxy pods now have a label:

hub:
  labels:
    hub.jupyter.org/autoscaler-critical: "true"

proxy:
  labels:
    hub.jupyter.org/autoscaler-critical: "true"

This is set in supercommon.yaml in the config repo
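
So the autoscaler can find the nodes it must never touch with a label selector, roughly like this (a sketch, not the autoscaler's actual code):

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Any node hosting a pod with this label must not be cordoned or deleted.
critical = v1.list_pod_for_all_namespaces(
    label_selector="hub.jupyter.org/autoscaler-critical=true"
)
protected_nodes = {p.spec.node_name for p in critical.items if p.spec.node_name}
print(protected_nodes)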

yuvipanda commented 7 years ago

Heya! I am back and working now again, so let's get this going sometime this week?

tonyyanga commented 7 years ago

Sure. If you need me to show up in BIDS, how about Tuesday after 4PM?

yuvipanda commented 7 years ago

@tonyyanga sure! let's do that!

tonyyanga commented 7 years ago

@yuvipanda Cool. I will get to BIDS around 4:10PM. See you then!

yuvipanda commented 7 years ago

I sent a calendar invite, hopefully to the correct email :)

tonyyanga commented 7 years ago

Hey, Yuvi,

I made a few more bug fixes and updated the README to reflect our updated scaling logic. So far it works nicely on the prod cluster, without actually executing the commands.

I will try to actually run scale actions tonight or tomorrow morning to make sure it works as claimed, or you can run it if you know the lab schedule better.

I will check tonight and see what other steps we need to take, e.g. changing labels.

Thank you!

tonyyanga commented 7 years ago

@yuvipanda

Here are two things that I think could be done:

  1. Not all student pods are labeled with hub.jupyter.org/autoscaler-critical: "true". This is at least the case with the stat28 namespace on the prod cluster.

  2. Pods of our Jupyter deployment (in other words, non-student pods) that can be killed need to be labeled in a different way, so that the autoscaler knows that the nodes hosting them can be cordoned or shut down.

yuvipanda commented 7 years ago

@tonyyanga for labeling, I can tomorrow:

  1. label proxy and hub pods with hub.jupyter.org/autoscaler-type: "critical"
  2. label user pods with hub.jupyter.org/autoscaler-type: "ephemeral"

That sounds like a better labeling scheme to me than just bools. Does that sound ok or would it be too complex to implement in the scaler?

Another alternative might be:

  1. label proxy and hub pods with hub.jupyter.org/autoscaler-critical: "always"
  2. label user pods with hub.jupyter.org/autoscaler-critical: "ephemeral"

I'll also make sure we do a deploy to stat28 so it has the labels.

tonyyanga commented 7 years ago

@yuvipanda

For the current autoscaler, only the key of a label (rather than its value) is supported when classifying pods. Specifically, we should label pods like cull that can be arbitrarily killed with a different label key, such as hub.jupyter.org/autoscaler-omit: "true".

If you find it more straightforward to use the value of a specified key, I can also make changes accordingly (see the sketch below).
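
In other words, the difference between the two schemes is just which per-pod check the autoscaler runs (a sketch):

def is_critical_by_key(pod):
    # Current behaviour: the presence of the label key is all that matters.
    labels = pod.metadata.labels or {}
    return "hub.jupyter.org/autoscaler-critical" in labels


def is_critical_by_value(pod):
    # Alternative scheme from the previous comment: inspect the value of one key.
    labels = pod.metadata.labels or {}
    return labels.get("hub.jupyter.org/autoscaler-type") == "critical"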

tonyyanga commented 7 years ago

@yuvipanda

Also, I realized that the puller-pod will not be killed by cull. It will stay there with a PodCompleted status. Either we can omit these pods or we need to clean them up somehow.

I am working on a bug with shutting down empty nodes.

yuvipanda commented 7 years ago

@tonyyanga don't worry about the puller-pod; I'm working on making those clean themselves up properly.

yuvipanda commented 7 years ago

We ran this together and everything seems ok \o/

The current plan is for @tonyyanga to run this manually 2 times a day for the next week, and then we can look at running it automatically.

Awesome work, @tonyyanga and @jiefugong :D