`SubprocessCluster` and `SSHCluster` hang indefinitely if `distributed` log level is `WARN` or higher

dask / distributed

A distributed task scheduler for Dask

https://distributed.dask.org

BSD 3-Clause "New" or "Revised" License


`SubprocessCluster` and `SSHCluster` hang indefinitely if `distributed` log level is `WARN` or higher #8393

Open hendrikmakait opened 11 months ago

hendrikmakait commented 11 months ago

Both clusters rely on the scheduler address being logged to stderr. The address is logged at INFO level, so setting the log level to WARN or higher suppresses that message and causes the clusters to hang indefinitely while waiting for it.
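To illustrate the failure mode, here is a minimal stdlib-only sketch. It does not use the actual `distributed` code: the logger name `demo.scheduler`, the message format `Scheduler at: ...`, and the helper names are assumptions made for the example. It shows that a reader scanning log output for the address finds nothing once the level is above INFO.

```python
import io
import logging
import re
from typing import Optional

# Hypothetical message format, chosen only for this illustration.
ADDRESS_RE = re.compile(r"Scheduler at:\s*(\S+)")


def captured_log_output(level: int) -> str:
    """Log the address at INFO on a logger set to `level`; return what was emitted."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.scheduler.{level}")  # hypothetical name
    logger.propagate = False
    logger.setLevel(level)
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.info("Scheduler at: tcp://127.0.0.1:8786")
    logger.removeHandler(handler)
    return stream.getvalue()


def find_address(text: str) -> Optional[str]:
    """Return the address found in the log text, or None if it never appeared."""
    match = ADDRESS_RE.search(text)
    return match.group(1) if match else None
```

With the level at `logging.INFO` the address is found; with `logging.WARNING` the INFO record is dropped before it reaches the handler, so a caller that loops until the address appears would wait forever.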

Related: #8392

hendrikmakait commented 11 months ago

One possible solution for the SubprocessCluster would be to write the address to a temporary file (e.g., /tmp/dask/<pid>) and read it from there.

hendrikmakait commented 11 months ago

TIL: There's the scheduler_file option, which we should be able to leverage for this.
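A rough sketch of what reading the address from a file instead of from stderr could look like. This is not the code from any PR: it assumes the scheduler file is JSON containing an `"address"` key, and the helper name `wait_for_scheduler_address` and its parameters are made up for illustration. Unlike scanning logs, it does not depend on the log level, and it fails with a timeout instead of hanging.

```python
import json
import time
from pathlib import Path


def wait_for_scheduler_address(path, timeout=5.0, interval=0.05):
    """Poll `path` until it holds JSON with an "address" key, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            info = json.loads(Path(path).read_text())
            return info["address"]
        except (FileNotFoundError, json.JSONDecodeError, KeyError, TypeError):
            # File not written yet, only partially written, or missing the key.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No scheduler address in {path} after {timeout}s")
        time.sleep(interval)
```

The parent process would pass the same path to the scheduler subprocess and then call this helper, so startup no longer depends on what the subprocess logs.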

jomey commented 7 months ago

Curious if there is any known workaround for SSHCluster?

I am trying to deploy an SSHCluster from a JupyterLab environment, and the cell output gets very verbose with many workers.

hendrikmakait commented 7 months ago

@jomey, if you are up for the challenge, I suppose you could use #8398 as a blueprint for the changes necessary to the SSHCluster. IIRC, the code of the SSHCluster looks very similar to what I fixed in that PR. (I'm currently out on PTO, so I won't be able to have a closer look at this for a few weeks.)

jomey commented 7 months ago

Taking the first step and trying to set up my local machine (Ubuntu 22.04 LTS). I have installed and configured a local SSH server that accepts key-less login. Then I set up the environment according to the test.yaml.

With that, I cannot get the current SSH tests to pass. All failures come back with the same message: RuntimeError: Cluster failed to start: Worker failed to start

Any insights on how to set up a dev environment for this? (Permission issues?)

hendrikmakait commented 7 months ago

Maybe @jacobtomlinson or @jrbourbeau are able to help?

jomey commented 7 months ago

Found the reason for a few of the test failures. I had an older (forgotten) dask.yaml under my user in .config/dask/, which also set the log levels to critical (exactly this issue, which I ran into on a different machine). After renaming that entire config folder, the only tests still failing are the ones with old_ in their names (3 total). Not sure if that is of concern.

This makes me wonder whether a new issue should be opened to make the tests more resilient against local user configs that are not part of the repository.
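One way this could be approached, sketched under assumptions: as far as I know, Dask reads its user config directory from the `DASK_CONFIG` environment variable and treats other `DASK_`-prefixed variables as config overrides, so a test could launch subprocesses with a cleaned environment. The helper name `isolated_dask_env` is hypothetical, and this is not how the test suite currently works.

```python
import os
from typing import Dict, Mapping, Optional


def isolated_dask_env(
    config_dir: str, base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment that ignores the user's Dask config.

    Drops every DASK_* variable (assumed to act as config overrides) and
    points DASK_CONFIG at `config_dir`, which should be an empty directory.
    """
    source = os.environ if base_env is None else base_env
    env = {k: v for k, v in source.items() if not k.startswith("DASK_")}
    env["DASK_CONFIG"] = config_dir
    return env
```

A test fixture could create an empty temporary directory, build the environment with this helper, and pass it as `env=` when starting scheduler or worker subprocesses, so a stray `~/.config/dask/dask.yaml` would not change log levels during the test run.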