dask / dask-kubernetes

Native Kubernetes integration for Dask
https://kubernetes.dask.org
BSD 3-Clause "New" or "Revised" License

Autoscaler closes workers as others are starting/scaling up #659

Open timblakely opened 1 year ago

timblakely commented 1 year ago

It appears that the latest Operator doesn't take workers that are still starting up into account when scaling down, and instead seems to scale down workers that are already running happily...?

I've got a workload that uses GPUs, which means the container images are sometimes pretty large (>10GB) due to all the various CUDA/JAX/CUDNN support libraries. As a result, it can take quite a while for the nodes to download the image and start up, leaving various pods in an Unschedulable state (if the node pool is still warming up) or ContainerCreating (while they download the image). If the autoscaler is triggered during this time and decides it needs to scale down the workers, instead of scaling down the pending pods it seems to shut down existing, running workers first (?). Here's a scenario:

  1. Launch the cluster, scaling the Dask worker pool to a fixed 1 worker.
  2. 1x Scheduler and 1x worker are created (gpu-ephemeral-scheduler and gpu-ephemeral-default-worker-d57c128a53, respectively)
  3. Set worker pool to scale from 1 to 60 workers.
  4. Launch graph with ~4k tasks
  5. Worker d57c128a53 begins to chew through tasks.
  6. Operator scales worker pool up to 37 workers: [2023-02-24 00:27:40,960] kopf.objects [INFO ] [default/gpu-ephemeral-default] Scaled worker group gpu-ephemeral-default up to 37 workers.
  7. Various pods are unschedulable due to the node pool still warming up (GPU nodes are expensive to keep running ;) [screenshot]
  8. The initial worker makes quick work of the first ~400 tasks, since the backing data source is empty in those regions and the tasks are effectively skipped
  9. The worker finally hits some "hard" tasks and begins real processing, slowing down the rate at which the graph is completed (which IIUC is the metric used for autoscaling)
  10. Since the graph seems to be processed at a decent clip, the Operator begins scaling down the cluster to 25 workers: [2023-02-24 00:27:45,361] kopf.objects [DEBUG ] [default/gpu-ephemeral-default] Updating diff: (('change', ('spec', 'worker', 'replicas'), 37, 25),)
    • Note: not sure if it's related, but I occasionally see that the HTTP API calls fail and fall back to Dask RPCs: [2023-02-24 00:27:45,645] kopf.objects [INFO ] [default/gpu-ephemeral-default] Scaling gpu-ephemeral-default failed via the HTTP API, falling back to the Dask RPC
  11. Since the only Running worker is d57c128a53 (all the others are still "Unschedulable" or "ContainerCreating"), the Operator decides to close it: [2023-02-24 00:27:45,660] kopf.objects [INFO ] [default/gpu-ephemeral-default] Workers to close: ('gpu-ephemeral-default-worker-d57c128a53',)
  12. The scheduler notices the worker is gone: 2023-02-24 00:27:45,695 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://10.4.0.32:38703', name: gpu-ephemeral-default-worker-d57c128a53, status: closing, memory: 209, processing: 2>
  13. This in turn causes the scheduler to think all workers are dead and abandon the pipeline:
2023-02-24 00:27:45,787 - distributed.scheduler - INFO - Lost all workers
2023-02-24 00:27:45,793 - distributed.scheduler - INFO - Remove client Client-08e43dd0-b3da-11ed-b911-4201c0a82289
2023-02-24 00:27:45,794 - distributed.scheduler - INFO - Close client connection: Client-08e43dd0-b3da-11ed-b911-4201c0a82289
2023-02-24 00:27:45,978 - distributed._signals - INFO - Received signal SIGTERM (15)
2023-02-24 00:27:45,978 - distributed.scheduler - INFO - Scheduler closing...
2023-02-24 00:27:45,979 - distributed.scheduler - INFO - Scheduler closing all comms
2023-02-24 00:27:45,979 - distributed.scheduler - INFO - Stopped scheduler at 'tcp://10.4.4.12:8786'
2023-02-24 00:27:45,980 - distributed.scheduler - INFO - End scheduler

Ideally the Operator would terminate those workers/pods whose status is either "Unschedulable" or "ContainerCreating" before terminating "Running" pods. I can confirm that fixing the worker pool size and disabling adaptive scaling avoids this behavior.
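
For reference, the cluster setup in steps 1-4 looks roughly like the sketch below; the name, image, and GPU resource limits are placeholders rather than my exact configuration.

    # Minimal sketch of the setup in steps 1-4; name, image, and resource
    # limits are placeholders, not my exact configuration.
    from dask.distributed import Client
    from dask_kubernetes.operator import KubeCluster

    cluster = KubeCluster(
        name="gpu-ephemeral",
        image="my-registry/gpu-worker:latest",  # large CUDA/JAX/CUDNN image
        resources={"limits": {"nvidia.com/gpu": "1"}},
    )
    cluster.scale(1)                      # step 1: a single fixed worker
    cluster.adapt(minimum=1, maximum=60)  # step 3: enable adaptive scaling

    client = Client(cluster)
    # step 4: submit the ~4k-task graph via this client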

jacobtomlinson commented 1 year ago

Thanks for raising such a well-written issue. We are currently working through some challenges around adaptive autoscaling. See #633 #649 #648. This seems like a different challenge we can look to resolve once the other adaptive changes land.

Totally agree that the controller should look for pending pods and remove them before asking the scheduler for candidates to remove.
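
Roughly, the kind of check I have in mind is sketched below. This is not the operator's actual implementation, and the worker-group label selector is an assumption on my part.

    # Sketch only, not the operator's implementation; the label selector used
    # to find a worker group's pods is an assumption.
    from kubernetes import client, config

    def pending_worker_pods(namespace: str, worker_group: str) -> list[str]:
        """Names of worker pods not yet Running, i.e. still Unschedulable or
        ContainerCreating, both of which report phase 'Pending'."""
        config.load_incluster_config()
        v1 = client.CoreV1Api()
        pods = v1.list_namespaced_pod(
            namespace,
            label_selector=f"dask.org/workergroup-name={worker_group}",
        )
        return [p.metadata.name for p in pods.items if p.status.phase == "Pending"]

    # When scaling down by N, the controller could delete up to N of these pods
    # first and only ask the scheduler to retire running workers for the rest.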

The HTTP API warning you mention is expected; the HTTP API isn't enabled by default in distributed. The implementation is there, ready for the day it gets turned on upstream.
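
If you do want to experiment with it, my understanding is that it can be switched on through distributed's config, along the lines of the sketch below; treat the exact route module name as an assumption.

    # Sketch only: enable the optional scheduler HTTP API by adding its route
    # module to distributed's config. The module path
    # "distributed.http.scheduler.api" is an assumption on my part.
    import dask
    import distributed  # noqa: F401  (registers distributed's config defaults)

    routes = dask.config.get("distributed.scheduler.http.routes")
    if "distributed.http.scheduler.api" not in routes:
        routes.append("distributed.http.scheduler.api")
        dask.config.set({"distributed.scheduler.http.routes": routes})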

BitTheByte commented 1 year ago

I'm also experiencing this using 2023.3.2