KatharineShapcott closed this issue 3 years ago
Sorry that was a lot of suggestions! I'd be happy to test out cluster.adapt() if you think it would be useful :)
Hey Katharine!
Thanks for reporting this! Unfortunately, the current setup is pretty rigid with respect to active workers (the code only checks for active workers at startup, then it just assumes everything is fine). So yes, I'd really appreciate it if you would be willing to test `cluster.adapt()` - as you pointed out, I think that might be the better way to go for large worker counts (it has been a little finicky w/SLURM in the past, but in the meantime dask as well as dask_jobqueue got pretty substantial updates, so it's definitely worth checking out again).
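For reference, a minimal sketch of what an adaptive setup could look like with dask_jobqueue (the resource values below are placeholders, not our actual configuration):

```python
# Sketch: adaptive scaling with dask_jobqueue on SLURM.
# All resource values are placeholders.
from dask_jobqueue import SLURMCluster
from distributed import Client

cluster = SLURMCluster(
    cores=8,              # cores per SLURM job
    memory="32GB",        # memory per SLURM job
    walltime="01:00:00",  # walltime per SLURM job
)

# Rather than requesting a fixed worker count via cluster.scale(n),
# let the scheduler grow/shrink the pool within these bounds based
# on the pending workload; jobs simply wait in the SLURM queue
# until resources become available.
cluster.adapt(minimum=1, maximum=100)

client = Client(cluster)
```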
RE: timeouts - I totally agree. If you don't mind, I'll go ahead and split this issue up into several bug reports/feature requests.
Hi Stefan, it looks to me like the code doesn't recognize when the number of workers changes due to a timeout (see the sketch below). At least the printouts don't change. Nothing crashed, though, so maybe it doesn't matter?
~PS I think 180 seconds might be a bit too long for the timeout, maybe we can switch to 60s? Because it happens every time if you have more than a few hundred jobs. Or we could display these options after 60s and, if nothing happens for another 120s, automatically continue?~ see #13 ~Or, as in slurmfun, jobs would all go in the queue and run when resources became available. Maybe we should be using `cluster.adapt()` instead?~ see #14
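For the worker-count point above, a minimal sketch of how the live count could be re-queried at runtime instead of only at startup (assuming an already-connected `distributed.Client` named `client`; the printout is illustrative):

```python
# Sketch: ask the scheduler which workers are connected right now,
# rather than relying on the count taken at startup.
n_workers = len(client.scheduler_info()["workers"])
print(f"{n_workers} workers currently connected")
```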