estafette / estafette-k8s-hpa-scaler

Kubernetes controller to set minimum replicas from a Prometheus query on annotated HorizontalPodAutoscalers to avoid collapsing deployments in case of errors
https://helm.estafette.io
MIT License

Don't do pod count based scaling for an app if a deployment is happening at the moment #6

Closed markvincze closed 5 years ago

markvincze commented 5 years ago

Attempt at fixing #5

markvincze commented 5 years ago

I realized now that this will also make the pod-based scaling not happen if we have an active canary deployment (because that's also in an additional ReplicaSet).
But I think that's not a huge issue.

JorritSalverda commented 5 years ago

Shouldn't we just change getMinPodCountBasedOnCurrentPodCount to calculate the value not based on the current pod count but on the current min value? And only apply the ratio to limit the speed of shrinking if the current desired min value is smaller than the one stored last in the HPA?

markvincze commented 5 years ago

I'm not sure how that would work.

What I don't see is when the minCount would ever increase in that case.

Let's say at night the traffic goes down, and we have 10 pods, and the minCount is 9.
During the day the traffic slowly increases, and by peak time we have 20 pods.
But if we calculated the new minCount based on the current minCount, then it'd just stay 9, right? So if the traffic suddenly drops, then nothing would prevent the HPA from scaling down to 9 pods in one go.

Sorry if I'm missing something.

JorritSalverda commented 5 years ago

The controller sets minReplicas based on a Prometheus query; this query shouldn't result in a higher value during a deployment if you ask me, right?

And if the desired value of minReplicas (coming out of the query) is smaller than the current value, we should use the value of the hpa-scaler-scale-down-max-ratio annotation (if set) to slow down the decrease of minReplicas.

I guess it currently uses the actual number of pods instead of the current value of minReplicas to make it independent of the interval of the controller?

An example with increasing traffic

Current minReplicas: 25
Desired minReplicas: 30

We can immediately set minReplicas to 30.

An example with decreasing traffic

Current minReplicas: 25
Desired minReplicas: 10
hpa-scaler-scale-down-max-ratio: 0.9

Since the desired minReplicas is smaller than the current minReplicas, we take the scale-down ratio into account:

We set minReplicas to 0.9 * 25 = 22.5

Does this make sense? Or am I missing something important?
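
For illustration, a minimal sketch (in Go, with made-up names, not the controller's actual code) of the logic described above: increases are applied immediately, while decreases are limited by the scale-down ratio applied to the current minReplicas.

```go
package main

import (
	"fmt"
	"math"
)

// limitedMinReplicas returns the new minReplicas given the value desired by the
// Prometheus query, the HPA's current minReplicas, and the
// hpa-scaler-scale-down-max-ratio value.
func limitedMinReplicas(desired, current int32, scaleDownMaxRatio float64) int32 {
	if desired >= current {
		// Increases (and no-ops) are applied immediately.
		return desired
	}
	// Decreases are limited: never drop below ratio * current in one iteration.
	// Rounding up here is an assumption; the real controller may round differently.
	floor := int32(math.Ceil(float64(current) * scaleDownMaxRatio))
	if desired < floor {
		return floor
	}
	return desired
}

func main() {
	fmt.Println(limitedMinReplicas(30, 25, 0.9)) // 30: increase applied immediately
	fmt.Println(limitedMinReplicas(10, 25, 0.9)) // 23: limited to ceil(0.9 * 25)
}
```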

markvincze commented 5 years ago

Ah, clear, now I see what you mean.

The thing is that I'd like to make it possible to use hpa-scaler-scale-down-max-ratio independently of (and even without) the Prometheus-based scaling, because making the scale-down gradual can be useful even (or especially?) in situations where we can't use the Prometheus-based scaling.

JorritSalverda commented 5 years ago

If Prometheus is unavailable we probably shouldn't modify anything, right?

markvincze commented 5 years ago

The way I intend hpa-scaler-scale-down-max-ratio to be used (and that's how we use it now) is that the deployment is scaled with an HPA based on the CPU load, and then we constantly adjust the minPods based on the hpa-scaler-scale-down-max-ratio and the current pod count. Thereby we prevent sudden downscaling.

So this doesn't need a Prometheus query at all, but it's prone to the issue described in #5
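
For reference, a minimal sketch (hypothetical names, not the actual implementation) of this pod-count-based usage: minReplicas trails the live pod count, damped by the ratio, so the CPU-based HPA can't collapse the deployment in one step.

```go
package main

import (
	"fmt"
	"math"
)

// minReplicasFromPodCount pins the HPA's minReplicas just below the number of
// pods currently running, so a sudden traffic drop can only shrink the
// deployment by a fraction (1 - ratio) per controller iteration.
func minReplicasFromPodCount(currentPods int32, scaleDownMaxRatio float64) int32 {
	// Rounding down is an assumption made for this sketch.
	return int32(math.Floor(float64(currentPods) * scaleDownMaxRatio))
}

func main() {
	// With 20 running pods and a ratio of 0.9, minReplicas is kept at 18,
	// regardless of what the CPU-based HPA would otherwise decide.
	fmt.Println(minReplicasFromPodCount(20, 0.9))
}
```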

JorritSalverda commented 5 years ago

Ah, I didn't understand that you want to be able to run it without Prometheus, because I originally created it just for that. But it makes sense to slow down the shrinking of deployments even without using Prometheus queries. Perhaps you can check whether a Prometheus query is set in the annotation. If it isn't, limit scale-down speed based on the actual pods; otherwise base it on the current minReplicas in the HPA.
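
A rough sketch of that suggested branching; the annotation key and function names here are assumptions for illustration, not necessarily the controller's actual identifiers.

```go
package main

import (
	"fmt"
	"math"
)

// nextMinReplicas branches on whether a Prometheus query is configured:
// without a query, scale-down is damped from the actual pod count;
// with a query, it is damped from the HPA's current minReplicas.
func nextMinReplicas(annotations map[string]string, currentPods, currentMin, queryResult int32, ratio float64) int32 {
	if _, hasQuery := annotations["estafette.io/hpa-scaler-prometheus-query"]; !hasQuery {
		// No Prometheus query: limit scale-down speed based on the actual pod count.
		return int32(math.Floor(float64(currentPods) * ratio))
	}
	if queryResult >= currentMin {
		return queryResult
	}
	// Prometheus-driven scale-down: never drop below ratio * current minReplicas.
	floor := int32(math.Ceil(float64(currentMin) * ratio))
	if queryResult < floor {
		return floor
	}
	return queryResult
}

func main() {
	withoutQuery := map[string]string{}
	withQuery := map[string]string{"estafette.io/hpa-scaler-prometheus-query": "sum(rate(http_requests_total[5m]))"}
	fmt.Println(nextMinReplicas(withoutQuery, 20, 15, 0, 0.9)) // 18: based on the actual pod count
	fmt.Println(nextMinReplicas(withQuery, 20, 25, 10, 0.9))   // 23: based on current minReplicas
}
```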

It's just that figuring out whether a deployment is running with this brute-force approach seems prone to error as we grow the number of deployments. Since the Kubernetes API doesn't do paging, it might take too long to respond or use up too much memory, if that makes sense.

I looked for alternatives to figure out if a deployment was running, but couldn't find anything useful.
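
For context, a rough sketch of the brute-force check being discussed, using client-go (a recent version, where List takes a context). The heuristic and names are assumptions, not the controller's actual code: treat a Deployment as mid-rollout (or running a canary) when more than one of its ReplicaSets still has desired replicas greater than zero.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func deploymentsInProgress(ctx context.Context, client kubernetes.Interface) (map[string]bool, error) {
	// Listing every ReplicaSet in the cluster is the expensive step discussed in this thread.
	replicaSets, err := client.AppsV1().ReplicaSets(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}

	// Count the ReplicaSets with desired replicas > 0 per owning Deployment.
	activeReplicaSets := map[string]int{}
	for _, rs := range replicaSets.Items {
		if rs.Spec.Replicas == nil || *rs.Spec.Replicas == 0 {
			continue
		}
		for _, owner := range rs.OwnerReferences {
			if owner.Kind == "Deployment" {
				activeReplicaSets[rs.Namespace+"/"+owner.Name]++
			}
		}
	}

	// More than one active ReplicaSet suggests a rollout (or canary) in progress.
	inProgress := map[string]bool{}
	for deployment, count := range activeReplicaSets {
		if count > 1 {
			inProgress[deployment] = true
		}
	}
	return inProgress, nil
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	deployments, err := deploymentsInProgress(context.Background(), client)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d deployments appear to be mid-rollout\n", len(deployments))
}
```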

markvincze commented 5 years ago

"If it isn't, limit scale-down speed based on the actual pods"

I think that's what we're doing now, and that's causing the issue.

"Since the Kubernetes API doesn't do paging, it might take too long to respond or use up too much memory, if that makes sense."

I tested this now: in our prod cluster with 2878 ReplicaSets it took between 4 and 6 seconds to retrieve them (although this is a client running on my machine and using kubectl proxy, so with an in-cluster client it might be faster), and the memory usage increased by about 46 MB when the ReplicaSets were retrieved.
So you're right, this is significantly heavier than retrieving the HPAs (that takes ~1 second and uses only ~3 MB of memory for the 327 HPAs we have on prod).

I'm not sure what the best workaround would be at this point; I'll also try to think about whether I can come up with some alternative approach.