chrisroat opened this issue 4 years ago
Typically, work stealing would handle this type of situation: https://distributed.dask.org/en/latest/work-stealing.html Do you have it disabled?
Sometimes, if the tasks take too long, it can take the scheduler a while to discover that stealing is worthwhile. Some groups have had success setting default task durations in the config: https://github.com/dask/distributed/blob/08d334e2e18bd977752eeab87e2c09272a2ac829/distributed/distributed.yaml#L27-L30
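For reference, a minimal sketch of that config, assuming the `distributed.scheduler.default-task-durations` key from the linked distributed.yaml; `my-long-task` is a hypothetical task prefix standing in for whatever your task names are:

```yaml
# ~/.config/dask/distributed.yaml (sketch, not a drop-in file)
distributed:
  scheduler:
    default-task-durations:
      my-long-task: 1h   # hypothetical prefix; hint that these tasks run ~1 hour
```

Telling the scheduler up front that these tasks are long-running should make both stealing and adaptive scaling decisions kick in sooner, since it no longer has to wait to observe the first completions.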
Thanks for the tip on the task duration settings.
I have not disabled work stealing. It might be an interplay with the autoscaling: the workers that would do the stealing are never created. I will look into specifying long task durations; perhaps that will entice the autoscaler to add more workers.
I have a graph that contains 300 tasks, each taking ~1 hour. If I start up a cluster with >300 nodes before submitting the graph, the work gets distributed nicely and the tasks run in parallel. If I cut it too close to 300 (especially if the nodes are preemptible), a few nodes end up with multiple tasks. This isn't too big a deal - I give it a few extra nodes, and sometimes it runs for 2 or 3 hours.
On the other hand, if I use a fresh (initial size 0) autoscaling dask cluster and set it to scale from 0 to 350, a lot of my long tasks pile up on individual workers. This is because K8s can take time to add nodes, and the graph starts processing as soon as 1 or 2 nodes are up. The scheduler does not yet know about the coming workers, so a few workers get assigned many tasks. The autoscaling in this case stops around ~30 nodes.
Is my best solution to avoid autoscaling with fresh clusters and graphs like this?
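One possible workaround, sketched below under the assumption that the dask-gateway `GatewayCluster.scale()` / `get_client()` and distributed `Client.wait_for_workers()` APIs are available in these versions: request the full worker count up front and block until most workers have registered, rather than adapting from 0, so the scheduler knows about the workers before it assigns the long tasks.

```python
# Sketch: avoid the "tasks pile up on the first 1-2 workers" problem by
# scaling to a fixed size and waiting before submitting the graph.
# Requires a running dask-gateway deployment; not runnable standalone.
from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(350)  # request the full worker count immediately

client = cluster.get_client()
client.wait_for_workers(n_workers=300)  # block until enough workers join

# ... now submit the 300-task graph ...
```

The trade-off is that you pay for idle workers during the ramp-up, but the long tasks start spread across the fleet instead of queued on the first few workers.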
I'm using dask/distributed 2.19, and dask-gateway 0.7.1.