Open naveenrb98 opened 1 year ago
My guess is that the daskautoscaler_adapt timer is asking the scheduler what it wants to do and is trying to scale the worker group. However, daskworkergroup_replica_update is locked for a long time for some reason.
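To make that guess concrete, here is a toy sketch of the suspected behaviour. It is not the operator's code and does not model kopf's real locking: the `asyncio.Lock`, the function names, and the timings are all assumptions chosen for illustration. It only shows how a periodic timer can keep issuing scale requests that are never applied while another handler holds a lock.

```python
import asyncio


async def simulate(ticks: int = 3, hold: float = 0.2, interval: float = 0.05):
    """Toy model: a timer requests scaling while an update handler holds a lock."""
    lock = asyncio.Lock()  # hypothetical stand-in for a per-resource handler lock
    requests = []          # scale requests issued by the timer
    applied = []           # scale requests that were actually applied

    async def long_running_handler():
        # Holds the lock for a long time, like a stuck replica update.
        async with lock:
            await asyncio.sleep(hold)

    async def adapt_timer():
        for i in range(ticks):
            requests.append(i)
            if not lock.locked():
                # Only applied when the lock is free.
                async with lock:
                    applied.append(i)
            await asyncio.sleep(interval)

    handler = asyncio.create_task(long_running_handler())
    await asyncio.sleep(0)  # let the handler acquire the lock first
    await adapt_timer()
    await handler
    return requests, applied


requests, applied = asyncio.run(simulate())
print(f"requested: {requests}, applied: {applied}")
```

With these assumed timings every timer tick falls inside the window where the lock is held, so requests are logged but none take effect, which would match the symptom of repeated scale requests with no change in worker count.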
@jmif would you mind having a look at this? I expect it is related to the recent changes in #649.
When using a Dask cluster on Kubernetes with adaptive scaling, I noticed two issues.
The first is that scale-up or scale-down requests are issued repeatedly but no scaling happens as a result, even though cluster scaling permission has been given to the Kubernetes cluster. This image shows the issue: a scale-down request was fired every 5 seconds, but no scale-down happened. The same goes for scale-up.
The second issue I came across concerns kopf.
This is the function that gets called for adaptive scaling, which scales the replicas.
While this is happening and the cluster is trying to scale to the maximum possible number of workers, another function, mentioned below, steps in and issues a scale-down command, so the cluster never scales the way we want.
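The conflict described in the second issue can be sketched as a toy simulation. This is only an illustration under assumed names: `simulate_conflict`, the handler labels, and the replica numbers are hypothetical and are not taken from the operator. It shows why two handlers that write the same replica count in alternation never let the cluster settle at the desired size.

```python
def simulate_conflict(desired: int, start: int, rounds: int):
    """Toy model of two handlers overwriting the same replica count."""
    replicas = start
    history = []
    for _ in range(rounds):
        # Handler A (adaptive scaling): scale to what the scheduler asks for.
        replicas = desired
        history.append(("adapt", replicas))
        # Handler B (hypothetical second handler): issues a scale-down
        # back to the old value, undoing handler A's change.
        replicas = start
        history.append(("other", replicas))
    return replicas, history


final, history = simulate_conflict(desired=10, start=2, rounds=3)
print(final)    # replica count after the last round
print(history)  # alternating writes from the two handlers
```

In this model the worker count ends each round back at the starting value, no matter how many times the adaptive handler asks for more workers.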