Open timblakely opened 1 year ago
Thanks for raising such a well-written issue. We are currently working through some challenges around adaptive autoscaling. See #633 #649 #648. This seems like a different challenge we can look to resolve once the other adaptive changes land.
Totally agree that the controller should look for pending pods and remove them before asking the scheduler for candidates to remove.
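Roughly the shape I have in mind (just a sketch, not the controller's actual code; the label selector and helper name here are illustrative):

```python
# Sketch only: prefer deleting worker pods that haven't started yet before
# asking the scheduler for retirement candidates. Label names are assumed.
from kubernetes_asyncio import client, config


async def pending_workers_to_delete(namespace, cluster_name, n_to_close):
    await config.load_kube_config()  # or in-cluster config inside the controller
    async with client.ApiClient() as api:
        v1 = client.CoreV1Api(api)
        pods = await v1.list_namespaced_pod(
            namespace,
            label_selector=f"dask.org/cluster-name={cluster_name},dask.org/component=worker",
        )
    # Pods still in phase "Pending" (covers Unschedulable and ContainerCreating)
    # hold no tasks, so they are safe to remove first.
    pending = [p.metadata.name for p in pods.items if p.status.phase == "Pending"]
    return pending[:n_to_close]
```

Anything left over after deleting pending pods would then go through the scheduler as it does today.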
The HTTP API warning you mention is expected; the HTTP API isn't enabled by default in distributed. The implementation is there, ready for the day it gets turned on upstream.
I'm also experiencing this using 2023.3.2
It appears that the latest Operator doesn't take workers that are still starting up into account when scaling down, and instead scales down workers that are already happily running...?
I've got a workload that uses GPUs, which means the containers are sometimes pretty large (>10GB) due to all the various CUDA/JAX/CUDNN support libraries. This means it can sometimes take quite a while for the nodes to download and start the image, leaving various pods in a state of `Unschedulable` (if the node pool is still warming up) or `ContainerCreating` (while they download the image). If the autoscaler is triggered during this time and decides it needs to scale down the workers, instead of scaling down the pending pods it seems to shut down existing workers first (?). Here's a scenario:

1. The cluster and worker group start up (`gpu-ephemeral-scheduler` and `gpu-ephemeral-default-worker-d57c128a53`, respectively).
2. Worker `d57c128a53` begins to chew through tasks.
3. The autoscaler scales the worker group up to 37 workers:

   ```
   [2023-02-24 00:27:40,960] kopf.objects [INFO ] [default/gpu-ephemeral-default] Scaled worker group gpu-ephemeral-default up to 37 workers.
   ```

4. Most of the new pods are `Unschedulable` due to the node pool still warming up (GPU nodes are expensive to keep running ;)
5. The autoscaler then decides to scale back down to 25 workers:

   ```
   [2023-02-24 00:27:45,361] kopf.objects [DEBUG ] [default/gpu-ephemeral-default] Updating diff: (('change', ('spec', 'worker', 'replicas'), 37, 25),)
   [2023-02-24 00:27:45,645] kopf.objects [INFO ] [default/gpu-ephemeral-default] Scaling gpu-ephemeral-default failed via the HTTP API, falling back to the Dask RPC
   ```

6. Despite the only running worker being `d57c128a53` - all others are still `Unschedulable` or `ContainerCreating` - the Operator decides to close it:

   ```
   [2023-02-24 00:27:45,660] kopf.objects [INFO ] [default/gpu-ephemeral-default] Workers to close: ('gpu-ephemeral-default-worker-d57c128a53',)
   2023-02-24 00:27:45,695 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://10.4.0.32:38703', name: gpu-ephemeral-default-worker-d57c128a53, status: closing, memory: 209, processing: 2>
   ```
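My guess at what's happening: pods stuck in `Unschedulable`/`ContainerCreating` never actually connect to the scheduler, so when the Operator falls back to the Dask RPC and asks for retirement candidates, the only name the scheduler can return is the running worker. Something like this reproduces the answer it gives (a rough sketch; the exact call the Operator makes may differ):

```python
from dask.distributed import Client

# Placeholder address; in-cluster this would be the gpu-ephemeral-scheduler service.
client = Client("tcp://gpu-ephemeral-scheduler:8786")

# Ask the scheduler which workers it would retire to get down to 25 workers.
# Pending pods never connected, so only the running worker can be suggested.
to_close = client.sync(client.scheduler.workers_to_close, target=25, attribute="name")
print(to_close)  # e.g. ['gpu-ephemeral-default-worker-d57c128a53']
```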
Ideally the Operator would choose to terminate those workers/pods whose status is either "Unschedulable" or "ContainerCreating" before terminating "Running" pods. I can confirm that fixing the worker pool size and disabling adaptation doesn't show this behavior.
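For completeness, this is roughly how the cluster is being driven (the name and image below are placeholders; the real worker spec also requests a GPU, which is why the image is so large). Swapping the `adapt()` call for a fixed `scale()` is what makes the problem go away:

```python
from dask_kubernetes.operator import KubeCluster

# Placeholder name/image standing in for the real GPU/CUDA/JAX setup.
cluster = KubeCluster(name="gpu-ephemeral", image="ghcr.io/dask/dask:2023.3.2")

# Adaptive mode: this is where a running worker gets retired while the newly
# requested pods are still Unschedulable/ContainerCreating.
cluster.adapt(minimum=1, maximum=40)

# Fixed mode: pinning the worker count instead avoids the behavior entirely.
# cluster.scale(25)
```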