dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

starting dask.distributed.Client with default settings results in endless restarting of workers #7034

Open radioflyer28 opened 2 years ago

radioflyer28 commented 2 years ago

What happened:

What you expected to happen: for the client to start

Minimal Complete Verifiable Example:

This fails:

from dask.distributed import Client
client = Client(processes=True)

This starts:

from dask.distributed import Client
client = Client(processes=False)

Anything else we need to know?:
Running from Jupyter notebook

Environment:

Error output on failing example:

2022-09-14 09:17:43,926 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,939 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,949 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,962 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,969 - distributed.nanny - WARNING - Restarting worker

Traceback (most recent call last):
  File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\nanny.py", line 822, in _wait_until_connected
    msg = self.init_result_q.get_nowait()
  File "c:\Users\myuser\Miniconda3\envs\sim\lib\multiprocessing\queues.py", line 135, in get_nowait
    return self.get(False)
  File "c:\Users\myuser\Miniconda3\envs\sim\lib\multiprocessing\queues.py", line 116, in get
    raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\utils.py", line 799, in wrapper
    return await func(*args, **kwargs)
  File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\nanny.py", line 539, in _on_worker_exit
    await self.instantiate()
  File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\nanny.py", line 438, in instantiate
    result = await self.process.start()
  File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\nanny.py", line 695, in start
    msg = await self._wait_until_connected(uid)
  File "c:\Users\myuser\Miniconda3\envs\sim\lib\site-packages\distributed\nanny.py", line 824, in _wait_until_connected
    await asyncio.sleep(self._init_msg_interval)
  File "c:\Users\myuser\Miniconda3\envs\sim\lib\asyncio\tasks.py", line 652, in sleep
    return await future
asyncio.exceptions.CancelledError
...
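
For anyone who wants to check a captured log for the same symptom, here is a rough sketch that counts the "Restarting worker" warnings and flags a burst of them. The function names, the threshold of 5 restarts, and the 1-second window are my own assumptions for illustration, not anything defined by distributed. The warning lines do not say which worker restarted, so this only measures the overall restart rate.

```python
import re
from datetime import datetime, timedelta

# Matches the nanny warning in the format shown in the log above.
RESTART_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})"
    r" - distributed\.nanny - WARNING - Restarting worker"
)


def restart_times(log_text):
    """Return the timestamps of all 'Restarting worker' warnings."""
    times = []
    for line in log_text.splitlines():
        match = RESTART_LINE.match(line.strip())
        if match:
            times.append(
                datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            )
    return times


def looks_like_restart_loop(log_text, min_restarts=5,
                            window=timedelta(seconds=1)):
    """True if min_restarts warnings fall within any span of length window.

    Both thresholds are arbitrary illustration values (assumptions).
    """
    times = sorted(restart_times(log_text))
    for i in range(len(times) - min_restarts + 1):
        if times[i + min_restarts - 1] - times[i] <= window:
            return True
    return False


# The five warning lines from the error output above.
SAMPLE = """\
2022-09-14 09:17:43,926 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,939 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,949 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,962 - distributed.nanny - WARNING - Restarting worker
2022-09-14 09:17:43,969 - distributed.nanny - WARNING - Restarting worker
"""

if __name__ == "__main__":
    found = restart_times(SAMPLE)
    print(len(found), "restarts within", found[-1] - found[0])
    print("restart loop suspected:", looks_like_restart_loop(SAMPLE))
```

On the sample above, all five warnings land within 43 milliseconds of each other, which is far too fast for workers that actually came up and then failed later.
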
radioflyer28 commented 2 years ago

A little update: I progressively rolled Dask back from 2022.9.0 to 2021.3.0, testing each version, and saw the same issue every time. That shouldn't happen, since 2021.3.0 worked for me before.

Next, to rule out any environment oddities, I reverted to my old conda environment (via a YAML backup) with Python 3.8.8 and Dask 2021.3.0, and I am still seeing the same restarting-worker behavior. This time I also ran it with the normal interpreter (not Jupyter), and it made no difference. A bit perplexing...

GhassanGuessous commented 2 years ago

This is happening for me too, but only in Zeppelin (Jupyter works fine).

Specifying processes=True (which is the default) for LocalCluster or Client makes the paragraph hang forever in Zeppelin (see the examples below):

%python
from dask.distributed import Client
client = Client()
client

or

%python
from dask.distributed import LocalCluster
cluster = LocalCluster()
cluster

From the Dask documentation, the client "will check your local Dask config and environment variables to see if connection information has been specified. If not it will create an instance of LocalCluster and use that."

So the problem seems to be in LocalCluster when the processes parameter is set to True. Could you please advise why this only happens in Zeppelin? We are really stuck. Thank you in advance.

radioflyer28 commented 2 years ago

After combing through old issues, I believe my issue is a duplicate of this one (https://github.com/dask/distributed/issues/5574). It's kind of amazing that this issue still exists after all these years; it seems to be a very low priority for Microsoft. This is the issue in VSCode's repo (https://github.com/microsoft/vscode-jupyter/issues/2962).

My problem went away once I stopped using the VSCode Python Interactive Window (I switched back to Atom and its Hydrogen plugin). Of course, running the script from the shell is ok too.