Open guillaumeeb opened 2 years ago
Hi there, any updates on this issue? Would it be easier for maintainers if I proposed something and we discussed the changes?
I hoped to get some feedback first, but I understand this is not a high priority and people here have many things to do, so I'm happy to give it a try in the coming weeks.
Hello, any updates on this issue (@guillaumeeb did you receive any answer on this)?
Hi there, we ran into an issue in dask-jobqueue that is explained here: https://github.com/dask/dask-jobqueue/issues/498.
To sum up: when we use adaptive mode with a `Cluster` object that starts several worker processes in each `Job`, and with the minimum number of workers set to 0, adaptive goes into an endless loop of starting and stopping jobs before the workers ever connect to the `Scheduler` once tasks are submitted.
I'm thinking this bug is also present in other `SpecCluster` implementations that allow for "grouped" workers in one `ProcessInterface`.
This can be reproduced (without any job queuing system, but with dask-jobqueue) with the following snippet:
I've narrowed the problem to two places:
For a simple fix, we could modify the code in one or both places. We could drop the rule that sets the target to 1 when there are no workers (but it is probably there for a reason), or we could avoid stopping workers that have not connected yet as long as no workers have arrived. These are just suggestions; I'm really curious to hear any other proposals you might have.
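To make the loop and the second suggestion easier to discuss, here is a minimal toy model in plain Python. It is not the original reproducer and not distributed's actual code: the function `simulate`, the `protect_pending` flag, and the rules inside (a target of 1 while no worker is connected, jobs requested in whole groups, a scale-down that cancels every pending job) are all simplifying assumptions made for illustration.

```python
def simulate(workers_per_job, steps, protect_pending=False):
    """Toy model of the adaptive loop (hypothetical, for illustration only)."""
    pending = 0    # workers requested through jobs but not yet connected
    connected = 0  # stays 0 here: jobs never get enough time to start
    history = []
    for _ in range(steps):
        # Assumption: with tasks waiting and no workers, the target is 1.
        target = 1 if connected == 0 else connected
        planned = pending + connected
        if planned < target:
            # Scale up: workers only come in whole jobs (ceiling division).
            jobs = -(-(target - planned) // workers_per_job)
            pending += jobs * workers_per_job
            history.append("up")
        elif planned > target and not (protect_pending and connected == 0):
            # Scale down: assumption that removing surplus pending workers
            # cancels the whole job they belong to.
            pending = 0
            history.append("down")
        else:
            history.append("hold")
    return history


if __name__ == "__main__":
    print(simulate(workers_per_job=4, steps=6))
    print(simulate(workers_per_job=4, steps=6, protect_pending=True))
    print(simulate(workers_per_job=1, steps=4))
```

In this sketch, one worker per job settles after the first scale-up, grouped workers alternate between scaling up and down, and leaving pending workers alone while nothing has connected stops the oscillation. Whether the real code paths behave the same way would need to be checked against the two places mentioned above.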
Environment: