thesuperzapper opened 3 years ago
Interesting approach 👀
Can't wait to see this in action. Please let us know once this is available.
@NitinKeshavB I agree, I am sorry it's taken so long!
I actually have a mostly working prototype, but I have paused work on it until I can get the first release of deployKF (a new open-source ML Platform for Kubernetes, which will include Airflow) out the door.
After that, it is top of my list!
Ping :D would this support scale to 0 by any chance?
Hi @thesuperzapper, I'm very interested in this feature as well, and I see that you recently added a new Kubernetes proposal related to `controller.kubernetes.io/pod-deletion-cost`. I don't fully grasp the details, but will that change this approach as well? Perhaps more pertinently, will the implementation of this approach depend on the implementation of that Kubernetes proposal?
I prototyped a "KEDA Airflow autoscaler", and it doesn't work as well as I expected.
The autoscaler is deployed as a new endpoint in Airflow's API, which KEDA (metrics-api) calls every 10 seconds.
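For context, here is a minimal sketch of the kind of metadata-database query such an endpoint might run; the function name, and counting only QUEUED/RUNNING task instances, are my own assumptions, not the actual prototype:

```python
# Hypothetical sketch of the query behind such a metrics endpoint: count the
# task instances that currently occupy (or are about to occupy) a worker slot.
from airflow.models import TaskInstance
from airflow.utils.session import provide_session
from airflow.utils.state import State


@provide_session
def occupied_task_slots(session=None) -> int:
    """Number of task instances KEDA should treat as 'load' on the celery workers."""
    return (
        session.query(TaskInstance)
        .filter(TaskInstance.state.in_([State.QUEUED, State.RUNNING]))
        .count()
    )
```

KEDA's metrics-api scaler then compares the returned value against its configured target to pick a replica count.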
Problem: sometimes a task gets picked up by an empty worker right after the database is queried (before the worker gets a `TERM` signal), which leads to task eviction. This is especially true with short-running dynamic tasks.
Some thoughts / learnings:

- A safer downscale flow would be: cancel consumer -> determine replicas again (to prevent a race condition) -> annotate -> patch the replicas.

That said, I'm going to give this flow a try, with the 'simple'/'safe' downscale logic (sketched below).
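To make that flow concrete, here is a rough sketch under assumptions of my own: a Celery `app` handle, the official `kubernetes` Python client, and a caller that supplies the candidate worker Pods plus a function to recompute the target replicas.

```python
# Rough sketch of the safer downscale flow: cancel consumers first, re-check
# the load, annotate the victims as cheap to delete, then patch the replicas.
# `victims` (worker Pod names) and `recompute_target` are supplied by the caller.
from typing import Callable, List

from kubernetes import client, config


def safe_scale_down(app, victims: List[str], recompute_target: Callable[[], int],
                    namespace: str = "airflow", deployment: str = "airflow-worker"):
    config.load_incluster_config()
    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    # 1. stop the candidate workers from consuming new tasks
    #    (celery worker names are usually "celery@<pod-name>")
    app.control.cancel_consumer(
        "default", destination=[f"celery@{pod}" for pod in victims]
    )

    # 2. determine the target replicas again, in case tasks slipped in meanwhile
    target = recompute_target()

    # 3. mark the victims as the cheapest Pods to delete
    for pod in victims:
        core.patch_namespaced_pod(
            pod, namespace,
            {"metadata": {"annotations": {"controller.kubernetes.io/pod-deletion-cost": "-1000"}}},
        )

    # 4. finally patch the replica count on the worker Deployment
    apps.patch_namespaced_deployment_scale(
        deployment, namespace, {"spec": {"replicas": target}}
    )
```

The ordering matters: consumers are cancelled before the final load check, so a task that sneaks in between the check and the `TERM` signal can only land on a worker that is not being removed.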
The chart currently supports primitive autoscaling for celery workers, using HorizontalPodAutoscalers with memory metrics. But this is very flawed, as there is not necessarily a link between RAM usage and the number of pending tasks, meaning you could end up in a situation where your workers don't scale up despite having pending tasks.
We can make a task-aware autoscaler that will scale up the number of celery workers when there are not enough task slots, and scale down when there are too many.
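As a tiny worked example of what "task-aware" could mean (the 80% target load factor and 16 slots per worker are just assumptions for illustration):

```python
# Worked example: derive the target worker count from the occupied task slots.
# WORKER_CONCURRENCY and TARGET_LOAD_FACTOR are illustrative assumptions.
import math

WORKER_CONCURRENCY = 16   # task slots per celery worker
TARGET_LOAD_FACTOR = 0.8  # aim to keep workers ~80% busy


def target_replicas(occupied_slots: int, minimum: int = 1, maximum: int = 16) -> int:
    """e.g. 60 occupied slots -> ceil(60 / (16 * 0.8)) = 5 workers."""
    desired = math.ceil(occupied_slots / (WORKER_CONCURRENCY * TARGET_LOAD_FACTOR))
    return max(minimum, min(maximum, desired))
```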
In the past, scaling down was dangerous to use with airflow workers, as Kubernetes had no way to influence which Pods were removed, meaning Kubernetes would often remove a busy worker even when there were workers doing nothing.
As of Kubernetes 1.22, there is a beta annotation for Pods managed by ReplicaSets called `controller.kubernetes.io/pod-deletion-cost`, which tells Kubernetes how "expensive" killing a particular Pod is when decreasing the `replicas` count.

Our Celery Worker Autoscaler can perform the following loop:

1. remove any existing `controller.kubernetes.io/pod-deletion-cost` annotations
   - also run the `app.control.add_consumer()` command, so each worker resumes picking up new airflow tasks
2. calculate the target `replicas` for the current task load:
   - if the `load factor` of workers is above `A` for `B` time --> increase `replicas` to meet the target `load factor`
   - if the `load factor` of workers is below `X` for `Y` time --> decrease `replicas` to meet the target `load factor`
   - the `load factor` is the number of available task slots which are consumed
   - only change the scale once every `A` seconds (to prevent a yo-yo effect), (perhaps have separate limits for down and up to allow faster upscaling)
   - respect the `minimum` and `maximum` replicas configs
   - remember to account for the `AIRFLOW__CELERY_KUBERNETES_EXECUTOR__KUBERNETES_QUEUE` queue (tasks in that queue are not run by the celery workers)
3. if `replicas` are going to be decreased by `N` (see the cost sketch after this list):
   - calculate a `pod-deletion-cost` for each worker Pod, and rank the Pods in ascending order of cost
     - the `pod-deletion-cost` is the `number of running tasks`, weighted by the `total running time` of each task (so long-running tasks are not needlessly evicted); specifically, we want smaller numbers of long-running tasks to be weighted higher than larger numbers of short-running tasks
   - mark the `N` worker Pods with the lowest cost using the `controller.kubernetes.io/pod-deletion-cost` annotation, as Kubernetes will remove these Pods first when the `replicas` count is decreased
   - run the `app.control.cancel_consumer(...)` command against each marked worker, so it does not pick up new airflow tasks after being "marked" for deletion
   - scale the `replicas` down by `N`
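As one way of picturing the cost calculation in step 3, here is a small sketch; the weighting (sum of task runtimes) and the input shape are my own assumptions. The list above describes annotating only the `N` cheapest Pods, while this sketch annotates every worker with its cost, which has the same effect since Kubernetes removes the lowest-cost Pods first:

```python
# Illustrative cost calculation + annotation for step 3. The weighting (sum of
# task runtimes) and the `running_task_start_times` input shape are assumptions.
import time
from typing import Dict, List

from kubernetes import client, config


def deletion_cost(task_start_times: List[float]) -> int:
    """Higher cost = more expensive to delete: zero for an idle worker, and
    each running task contributes its elapsed runtime in seconds."""
    now = time.time()
    return int(sum(now - started for started in task_start_times))


def annotate_worker_costs(running_task_start_times: Dict[str, List[float]],
                          namespace: str = "airflow") -> None:
    """`running_task_start_times` maps worker Pod name -> start times of its running tasks."""
    config.load_incluster_config()
    core = client.CoreV1Api()
    for pod, start_times in running_task_start_times.items():
        core.patch_namespaced_pod(
            pod, namespace,
            {"metadata": {"annotations": {
                "controller.kubernetes.io/pod-deletion-cost": str(deletion_cost(start_times))
            }}},
        )
```

With every worker carrying a cost annotation, decreasing `replicas` will make Kubernetes remove the cheapest (most idle) workers first.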
Important changes to make this work:
- `controller.kubernetes.io/pod-deletion-cost` is only for Pods in ReplicaSets
- `controller.kubernetes.io/pod-deletion-cost` is alpha in `1.21` and beta in `1.22`; for older Kubernetes versions, we can let users use the CloneSet from the CNCF project called OpenKruise (instead of `Deployment`), as they have back-ported the `controller.kubernetes.io/pod-deletion-cost` annotation.
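For the OpenKruise path, the scale patch would target the `CloneSet` custom resource rather than a `Deployment`; a minimal sketch, assuming the usual `apps.kruise.io/v1alpha1` group/version and the official `kubernetes` Python client:

```python
# Sketch: patch the replica count of an OpenKruise CloneSet via the scale
# subresource (group/version/plural are the standard OpenKruise values).
from kubernetes import client, config


def patch_cloneset_replicas(target: int, name: str = "airflow-worker",
                            namespace: str = "airflow") -> None:
    config.load_incluster_config()
    client.CustomObjectsApi().patch_namespaced_custom_object_scale(
        group="apps.kruise.io",
        version="v1alpha1",
        namespace=namespace,
        plural="clonesets",
        name=name,
        body={"spec": {"replicas": target}},
    )
```

The same function shape works for a `Deployment` on newer clusters via `AppsV1Api.patch_namespaced_deployment_scale`.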