Rolling this back resolves the issue: https://github.com/dask/distributed/pull/7631
cc @milesgranger @jacobtomlinson for visibility
I did a little more digging and the crash happens somewhere in here, specifically in the template format call: https://github.com/dask/distributed/blob/f1023572623830e12fd162af02eff4d73bf6c6a1/distributed/utils.py#L1252-L1254
Replacing it so that it returns a constant string avoids the crash.
I've figured out what is going on here, but I don't know how to fix it in dask. I have the following environment variable set:
DASK_DISTRIBUTED__DASHBOARD__LINK='{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status'
This is somehow making its way over to the SSHCluster (I'm assuming via dask config serialization).
The issue is that those environment variables (JUPYTERHUB_EXTERNAL_BASE_URL, JUPYTERHUB_SERVICE_PREFIX) are not available in the SSH session, since they are set in my profile, so the template.format call is failing:
KeyError: 'JUPYTERHUB_EXTERNAL_BASE_URL'
I understand how to get the correct scheduler link manually. I'd prefer that this situation not cause the scheduler to crash, and that it instead fall back on its old behavior if the link can't be crafted.
PS. These errors are not being propagated back to the process that started the cluster, which has made debugging this much harder.
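A minimal sketch of that failure mode, assuming the dashboard link template is formatted against the remote process's environment (the values below are illustrative, not the actual dask internals):

import os

# The link template configured on the client via DASK_DISTRIBUTED__DASHBOARD__LINK.
template = "{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"

# On the SSH host the JupyterHub variables are not set, so substituting the
# remote environment into the template raises the KeyError shown above.
try:
    link = template.format(**os.environ, port=8787)
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'JUPYTERHUB_EXTERNAL_BASE_URL'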
Thanks for taking the time to dig into this. It sounds like there are two things going on here.
The first is that when DASK_DISTRIBUTED__DASHBOARD__LINK has been misconfigured, SSHCluster crashes for hard-to-understand reasons. The root of that problem is that you're not seeing helpful error messages, making debugging it a pain. The fix for this would be to explore why tracebacks aren't making it back from the remote process.
The other part is whether we can make the failure mode less aggressive when DASK_DISTRIBUTED__DASHBOARD__LINK is misconfigured. I'm not sure silently falling back to the default behaviour would be best, though, as it will likely mask the problem and make it hard to debug. Perhaps a better route would be to catch the exception and log an error saying that formatting failed but continue onwards.
Perhaps a better route would be to catch the exception and log an error saying that formatting failed but continue onwards.
Indeed this would be the best solution.
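A rough sketch of that approach; the helper name and the fallback link below are illustrative, not the actual dask code:

import logging
import os

logger = logging.getLogger(__name__)


def format_dashboard_link_safely(template: str, host: str, port: int) -> str:
    # Try the configured template first, substituting environment variables
    # plus the host/port fields.
    try:
        return template.format(**{**os.environ, "host": host, "port": port})
    except (KeyError, IndexError, ValueError) as exc:
        # Log loudly so the misconfiguration is visible, but keep the
        # scheduler running with a plain fallback link.
        logger.warning(
            "Failed to format dashboard link template %r (%s); "
            "falling back to the default link", template, exc,
        )
        return f"http://{host}:{port}/status"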
The first is that when DASK_DISTRIBUTED__DASHBOARD__LINK has been misconfigured, SSHCluster crashes for hard-to-understand reasons. The root of that problem is that you're not seeing helpful error messages, making debugging it a pain. The fix for this would be to explore why tracebacks aren't making it back from the remote process.
When I initially read this, I didn't totally understand what you meant by "misconfigured" here. As I understand it, the problem is that the link includes an environment variable that exists only on the host and not on the cluster.
Thus, these would be incorrect...
export DASK_DISTRIBUTED__DASHBOARD__LINK="{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"
# or
export DASK_DISTRIBUTED__DASHBOARD__LINK="{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"
...and this would be correct (quoted from https://github.com/dask/dask-labextension/issues/109#issuecomment-580151835):
export DASK_DISTRIBUTED__DASHBOARD__LINK="proxy/{port}/status"
However, the correct link doesn't work.
Suppose I have a JupyterHub deployment and I access my notebook server at:
https://jupyterhub.example.com/user/matt.plough/my-named-server/lab
Setting my DASK_DISTRIBUTED__DASHBOARD__LINK="proxy/{port}/status" results in the browser creating the following link in the output of a cell that displays client:
https://jupyterhub.example.com/user/matt.plough/my-named-server/files/proxy/8787/status?_xsrf=[some token]
This is incorrect due to the inclusion of /files, something that does not occur when JUPYTERHUB_SERVICE_PREFIX is part of the DASK_DISTRIBUTED__DASHBOARD__LINK variable.
The recommendation in the Dask documentation of /user/<user>/proxy/8787/status cannot accommodate named servers, and is not flexible enough to deal with standard servers and named servers on the same box. Using JUPYTERHUB_SERVICE_PREFIX eliminates all of these problems.
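For concreteness, JupyterHub sets JUPYTERHUB_SERVICE_PREFIX per server, so a template that includes it resolves correctly for both default and named servers (the usernames and prefixes below are illustrative):

# JUPYTERHUB_SERVICE_PREFIX as exported by JupyterHub, e.g.:
#   default server: /user/matt.plough/
#   named server:   /user/matt.plough/my-named-server/
template = "{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"

for prefix in ("/user/matt.plough/", "/user/matt.plough/my-named-server/"):
    print(template.format(JUPYTERHUB_SERVICE_PREFIX=prefix, port=8787))
# /user/matt.plough/proxy/8787/status
# /user/matt.plough/my-named-server/proxy/8787/status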
How should users configure the DASK_DISTRIBUTED__DASHBOARD__LINK variable when using a JupyterHub proxy?
How should users configure the DASK_DISTRIBUTED__DASHBOARD__LINK variable when using a JupyterHub proxy?
I think this question is separate from the bug highlighted here. Could you open a new issue for this?
Good idea, and done - see https://github.com/dask/distributed/issues/7736.
Describe the issue: Attempting to use SSHCluster does not work in 2023.3.2 because the scheduler exits early with an exit code of 1.
When rolling back to 2023.3.1 the scheduler starts successfully:
Minimal Complete Verifiable Example:
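A minimal sketch of the kind of setup that triggers this, assuming passwordless SSH access to the hosts and the dashboard link variable set as above (hostnames are placeholders):

from dask.distributed import Client, SSHCluster

# The first host runs the scheduler, the rest run workers. With 2023.3.2 and a
# dashboard link template referencing variables missing on the remote hosts,
# the scheduler process exits with code 1.
cluster = SSHCluster(["localhost", "localhost"])
client = Client(cluster)
print(client)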
Anything else we need to know?: Full repro here:
Environment: