dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Jupyter server won't start from SpecCluster #6886

Open bnaul opened 2 years ago

bnaul commented 2 years ago

What happened: Testing out the new jupyter flag on distributed 2022.8, I ran into the following error:

Minimal Complete Verifiable Example:

from distributed import SpecCluster, Scheduler
SpecCluster(
    scheduler={"cls": Scheduler, "options": {'protocol': 'tcp://', 'interface': None, 'host': '0.0.0.0', 'port': 0, 'dashboard_address': ':8787', 'jupyter': True}}
)

Output:

[I 2022-08-15 15:22:14.752 ServerApp] jupyterlab | extension was successfully linked.
[I 2022-08-15 15:22:15.019 ServerApp] notebook_shim | extension was successfully linked.
[W 2022-08-15 15:22:15.036 ServerApp] All authentication is disabled.  Anyone who can connect to this server will be able to run code.
2022-08-15 15:22:15,038 - distributed.deploy.spec - WARNING - Cluster closed without starting up
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/distributed/deploy/spec.py", line 308, in _start
    self.scheduler = cls(**self.scheduler_spec.get("options", {}))
  File "/usr/local/lib/python3.10/site-packages/distributed/scheduler.py", line 3034, in __init__
    j.initialize(
  File "/usr/local/lib/python3.10/site-packages/traitlets/config/application.py", line 110, in inner
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/jupyter_server/serverapp.py", line 2466, in initialize
    self.init_signal()
  File "/usr/local/lib/python3.10/site-packages/jupyter_server/serverapp.py", line 2103, in init_signal
    signal.signal(signal.SIGINT, self._handle_sigint)
  File "/usr/local/lib/python3.10/signal.py", line 56, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread of the main interpreter

Anything else we need to know?: I am guessing this is because SpecCluster is calling Scheduler() from an async block?

Environment:

mrocklin commented 2 years ago

cc @zsailer @Carreau in case they have thoughts

mrocklin commented 2 years ago

For reference, when running locally we often create a Scheduler in a thread. Would it be possible to not register signal handlers if we're not running from the main thread?
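A minimal sketch of the kind of guard described above, using only the standard library. The helper name `register_sigint_if_main_thread` is hypothetical and is not part of jupyter_server or distributed; it only illustrates skipping signal registration when the caller is not the main thread.

```python
import signal
import threading


def register_sigint_if_main_thread(handler):
    """Install `handler` for SIGINT only when called from the main thread.

    Hypothetical helper: returns True if the handler was installed,
    False if registration was skipped.
    """
    if threading.current_thread() is not threading.main_thread():
        # signal.signal() would raise ValueError here, so do nothing.
        return False
    signal.signal(signal.SIGINT, handler)
    return True


def call_from_worker_thread():
    """Call the guard from a non-main thread and return its result."""
    results = []
    worker = threading.Thread(
        target=lambda: results.append(
            register_sigint_if_main_thread(signal.default_int_handler)
        )
    )
    worker.start()
    worker.join()
    return results[0]
```

Called from a worker thread, the guard returns False instead of raising, which is the behavior being asked about for the Scheduler-in-a-thread case.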

mrocklin commented 2 years ago

@bnaul I'm able to reproduce this error. It's not that it's calling Scheduler from an async block, it's that it's calling it in a separate thread. The interactive Cluster and Client objects create a new thread where they run Dask things. This keeps the main thread open for user interactions.
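The underlying restriction can be shown without Dask or Jupyter. The following standard-library sketch (helper names are illustrative only) tries to register a SIGINT handler from a worker thread and captures the resulting `ValueError`, which is the same error that appears at the bottom of the traceback above.

```python
import signal
import threading


def try_register_sigint():
    """Try to install a SIGINT handler; return the error text, or None on success."""
    try:
        previous = signal.signal(signal.SIGINT, signal.default_int_handler)
    except ValueError as exc:
        return str(exc)
    # Put the earlier handler back if it was one installed from Python.
    if previous is not None:
        signal.signal(signal.SIGINT, previous)
    return None


def run_in_thread(func):
    """Run func() in a new thread and return its result."""
    box = {}

    def target():
        box["value"] = func()

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    return box["value"]


# In a worker thread the registration fails with a ValueError message.
thread_error = run_in_thread(try_register_sigint)
print(thread_error)
```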

This case didn't come up in testing because it's a little strange to use the interactive Cluster objects in a situation where you would want a Jupyter notebook. In that case you already clearly have access to the machine where things are running.

Assuming that you're running KubeCluster I suspect that there is an option that runs the Scheduler in a remote pod. In that case I suspect that you wouldn't run into an issue.

bnaul commented 2 years ago

Thanks @mrocklin, that makes sense. I am doing something unusual (maybe ill-advised?) here: I'm using Helm to manage this scheduler but still want adaptivity, so I'm calling KubeCluster(deploy_mode="local") in an entrypoint script, which, as you point out, is presumably what causes the scheduler to be created in a separate thread.

It does sound like this is still a real issue, but I will defer to you on whether it's worth investigating or just shouldn't be supported.

Carreau commented 2 years ago

I think you should be able to do:

import threading
from jupyter_server.serverapp import ServerApp

if threading.current_thread() is not threading.main_thread():
    ServerApp.init_signal = lambda self: None
    ServerApp._restore_sigint_handler = lambda self: None

Then none of the signal handlers will be set up when you call ServerApp.initialize.
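The patch above replaces the methods on the class for the rest of the process. If that is a concern, one possible variation is to scope the patch with a context manager and restore the originals afterwards. This is only a sketch: `noop_methods_off_main_thread` is a hypothetical helper, and `FakeApp` is a stand-in used so the example runs without jupyter_server installed. Whether the same approach behaves well with the real ServerApp is an assumption that has not been verified here.

```python
import contextlib
import threading

_MISSING = object()


@contextlib.contextmanager
def noop_methods_off_main_thread(cls, names):
    """Temporarily replace the named methods of cls with no-ops.

    Does nothing when entered from the main thread. Yields True if the
    methods were patched, False otherwise. Hypothetical helper.
    """
    if threading.current_thread() is threading.main_thread():
        yield False
        return
    # Remember what the class itself defined so it can be restored.
    saved = {name: cls.__dict__.get(name, _MISSING) for name in names}
    for name in names:
        setattr(cls, name, lambda self, *args, **kwargs: None)
    try:
        yield True
    finally:
        for name, original in saved.items():
            if original is _MISSING:
                delattr(cls, name)
            else:
                setattr(cls, name, original)


class FakeApp:
    """Stand-in for ServerApp (assumption, for illustration only)."""

    def init_signal(self):
        raise ValueError("signal only works in main thread of the main interpreter")

    def initialize(self):
        self.init_signal()
        return "initialized"


def start_in_thread():
    """Initialize FakeApp in a worker thread with init_signal patched out."""
    outcome = {}

    def target():
        with noop_methods_off_main_thread(FakeApp, ["init_signal"]) as patched:
            outcome["patched"] = patched
            outcome["result"] = FakeApp().initialize()

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    return outcome


outcome = start_in_thread()
```

Note that the class is still patched for every thread while the `with` block is active, so this narrows the window of the change rather than isolating it to one thread.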