Closed buckleyGI closed 1 year ago
@buckleyGI Thank you for opening up the issue!
> If we start up the pods manually, say 5 per minute, the system comes up in less than 10 minutes (!).
Thanks for the detailed description of your issue. I think I roughly understand your situation. It seems that you want to limit the number of pods that are starting up (i.e. pods whose liveness/startup probe has not yet succeeded) in the cluster.
> Is this where kube-throttler can help us? I understand that kube-throttler takes CPU utilization into account, but its goal is to be more efficient, not to throttle pod creation per se, right?
To be precise, kube-throttler does NOT take actual CPU utilization into account; it only considers the resource requests of Pods.
> Would we be abusing kube-throttler for our use case, or is it a good fit, do you think?
I don't think this is its sweet-spot use case, but kube-throttler can limit the number of pods that are starting up in the cluster.
Here is my idea.

First, define a `ClusterThrottle` resource that throttles the number of pods carrying a `starting-up` label:
```yaml
# This ClusterThrottle resource guarantees:
#
# The number of scheduled and non-terminated pods
# with the 'starting-up' label keeps at most 5 in the cluster.
#
apiVersion: schedule.k8s.everpeace.github.com/v1alpha1
kind: ClusterThrottle
metadata:
  name: starting-up-pods
spec:
  throttlerName: kube-throttler
  selector:
    selectorTerms:
    - namespaceSelector:
        # all the namespaces excluding kube-system
        matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: ["kube-system"]
      podSelector:
        matchExpressions:
        - key: starting-up
          operator: Exists
  threshold:
    resourceCounts:
      pod: 5
    # # You can also throttle pods by total resource requests
    # resourceRequests:
    #   cpu: "30"
    #   memory: "30Gi"
  # If you want to schedule the limit, please consider using
  # 'temporaryThresholdOverrides'.
  # See: https://github.com/everpeace/kube-throttler#temporary-threshold-overrides
  #
  # threshold:
  #   # large limit in the normal case
  #   resourceCounts:
  #     pod: 100
  # temporaryThresholdOverrides:
  # # this can reduce the limit during a given period
  # # (e.g. the 30 minutes after the scheduled blackout period ends)
  # - begin: 2023-01-01T07:00:00+09:00
  #   end: 2023-01-01T07:30:00+09:00
  #   threshold:
  #     resourceCounts:
  #       pod: 5
  # ...
```
Second, you would need two components that manage the `starting-up` label:

1. a mutating webhook that adds the `starting-up` label to new pods, and
2. a controller that removes the `starting-up` label from pods once they have started up (e.g. once the startup probe succeeds).
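To sketch the first component, a `MutatingWebhookConfiguration` that routes every Pod CREATE through a labeling webhook could look roughly like this. The names, namespace, and `/mutate` path are placeholders (nothing here is provided by kube-throttler), and the webhook service itself would still have to return a JSONPatch that adds the `starting-up` label:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: starting-up-labeler            # placeholder name
webhooks:
- name: starting-up-labeler.example.com  # placeholder webhook name
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Ignore                # don't block pod creation if the webhook is down
  clientConfig:
    service:
      name: starting-up-labeler        # placeholder service running your webhook
      namespace: default
      path: /mutate
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
```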
> If not, it is time for me to learn Go and the Scheduling Framework, as it can't be too hard to install a throttle for pod creation on a node where x pods can be created in a sliding window of x minutes. Am I on the right track? Thank you.
I also have an alternative idea. Perhaps, "Pod Scheduling Readiness" (just released in v1.26) will help.
You would also need two components:

1. a mutating webhook that adds a `schedulingGate` to new pods, and
2. a controller that removes the `schedulingGate` from pending pods (at whatever rate you choose).

Note: In this case, a controller decides the order in which pods start up, whereas usually kube-scheduler decides the order in which pods are scheduled. I don't think this is strictly a pro or a con, but you can decide which way suits your use case.
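For reference, a gated Pod would look roughly like this (the gate name `example.com/starting-up` is just a placeholder). kube-scheduler keeps such a pod in the `SchedulingGated` state until the controller removes the gate:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-xyz
spec:
  # The pod stays unscheduled (SchedulingGated) until all
  # schedulingGates are removed (Kubernetes v1.26+).
  schedulingGates:
  - name: example.com/starting-up   # placeholder gate name
  containers:
  - name: app
    image: my-app:latest            # placeholder image
```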
I edited the title to align with the issue contents 🙇‍♂️
I'm closing this due to long inactivity.
I came across kube-throttler while looking for a solution to our pressing problem (still not solved) of bringing back a great number of pods after a blackout period.
Let me try to explain our situation. At night we scale all deployments to 0 replicas (storing the original value in an annotation) to reduce costs. In the morning the replicas are restored to their original values. This is where the problem starts: the pods overburden the node, whose CPU plateaus at 100%. The warmup probes start to fail and, if they pass, the liveness probes start restarting the pods. Too many pods are trying to get up at the same time... After about 2 hours the system finally stabilizes. If we start up the pods manually, say 5 per minute, the system comes up in less than 10 minutes (!). In other domains this is called the thundering herd problem or stampede problem, and we haven't found a fix for it in Kubernetes yet.
Is this where kube-throttler can help us? I understand that kube-throttler takes CPU utilization into account, but its goal is to be more efficient, not to throttle pod creation per se, right? Would we be abusing kube-throttler for our use case, or is it a good fit, do you think?
If not, it is time for me to learn Go and the Scheduling Framework, as it can't be too hard to install a throttle for pod creation on a node where x pods can be created in a sliding window of x minutes. Am I on the right track? Thank you.