Rolling this back resolves the issue: https://github.com/dask/distributed/pull/7631
cc @milesgranger @jacobtomlinson for visibility
I did a little more digging and the crash happens somewhere in here, specifically in the template format call: https://github.com/dask/distributed/blob/f1023572623830e12fd162af02eff4d73bf6c6a1/distributed/utils.py#L1252-L1254
Replacing it so that it returns a constant string avoids the crash.
I've figured out what is going on here, but I don't know how to fix it in dask. I have the following environment variable set:
DASK_DISTRIBUTED__DASHBOARD__LINK='{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status'
This is somehow making its way over to the SSHCluster (I'm assuming via dask config serialization).
The issue is that those environment variables (JUPYTERHUB_EXTERNAL_BASE_URL, JUPYTERHUB_SERVICE_PREFIX) are not available in the SSH session, since they are set in my profile, so the template.format call is failing:
KeyError: 'JUPYTERHUB_EXTERNAL_BASE_URL'
I understand how to get the correct scheduler link manually. I'd prefer that this situation not cause the scheduler to crash, and that it instead fall back on its old behavior if the link can't be crafted.
PS. These errors are not being propagated back to the process that started the cluster, which has made debugging this much harder.
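A minimal sketch of that failure mode, assuming the dashboard link template is formatted against the remote process's environment (the values below are illustrative, not the actual dask internals):

import os

# The link template configured on the client via DASK_DISTRIBUTED__DASHBOARD__LINK.
template = "{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"

# On the SSH host the JupyterHub variables are not set, so substituting the
# remote environment into the template raises the KeyError shown above.
try:
    link = template.format(**os.environ, port=8787)
except KeyError as exc:
    print(f"KeyError: {exc}")  # KeyError: 'JUPYTERHUB_EXTERNAL_BASE_URL'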
Thanks for taking the time to dig into this. It sounds like there are two things going on here.
The first is that when DASK_DISTRIBUTED__DASHBOARD__LINK has been misconfigured, SSHCluster crashes for hard-to-understand reasons. The root of that problem is that you're not seeing helpful error messages, making debugging it a pain. The fix for this would be to explore why tracebacks aren't making it back from the remote process.
The other part is whether we can make the failure mode less aggressive when DASK_DISTRIBUTED__DASHBOARD__LINK is misconfigured. I'm not sure silently falling back to the default behaviour would be best, though, as it will likely mask the problem and make it hard to debug. Perhaps a better route would be to catch the exception and log an error saying that formatting failed but continue onwards.
Perhaps a better route would be to catch the exception and log an error saying that formatting failed but continue onwards.
Indeed this would be the best solution.
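A rough sketch of that approach; the helper name and the fallback link below are illustrative, not the actual dask code:

import logging
import os

logger = logging.getLogger(__name__)


def format_dashboard_link_safely(template: str, host: str, port: int) -> str:
    # Try the configured template first, substituting environment variables
    # plus the host/port fields.
    try:
        return template.format(**{**os.environ, "host": host, "port": port})
    except (KeyError, IndexError, ValueError) as exc:
        # Log loudly so the misconfiguration is visible, but keep the
        # scheduler running with a plain fallback link.
        logger.warning(
            "Failed to format dashboard link template %r (%s); "
            "falling back to the default link", template, exc,
        )
        return f"http://{host}:{port}/status"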
The first is that when DASK_DISTRIBUTED__DASHBOARD__LINK has been misconfigured, SSHCluster crashes for hard-to-understand reasons. The root of that problem is that you're not seeing helpful error messages, making debugging it a pain. The fix for this would be to explore why tracebacks aren't making it back from the remote process.
When I initially read this, I didn't totally understand what you meant by "misconfigured" here. As I understand it, the problem is that the link includes an environment variable that exists only on the host and not on the cluster.
Thus, these would be incorrect...
export DASK_DISTRIBUTED__DASHBOARD__LINK="{JUPYTERHUB_EXTERNAL_BASE_URL}{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"
# or
export DASK_DISTRIBUTED__DASHBOARD__LINK="{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"
...and this would be correct (quoted from https://github.com/dask/dask-labextension/issues/109#issuecomment-580151835):
export DASK_DISTRIBUTED__DASHBOARD__LINK="proxy/{port}/status"
However, the correct link doesn't work.
Suppose I have a JupyterHub deployment and I access my notebook server at:
https://jupyterhub.example.com/user/matt.plough/my-named-server/lab
Setting my DASK_DISTRIBUTED__DASHBOARD__LINK="proxy/{port}/status" results in the browser creating the following link in the output of a cell that displays client:
https://jupyterhub.example.com/user/matt.plough/my-named-server/files/proxy/8787/status?_xsrf=[some token]
This is incorrect due to the inclusion of /files, something that does not occur when JUPYTERHUB_SERVICE_PREFIX is part of the DASK_DISTRIBUTED__DASHBOARD__LINK variable.
The recommendation in the Dask documentation of /user/<user>/proxy/8787/status cannot accommodate named servers, and is not flexible enough to deal with standard servers and named servers on the same box. Using JUPYTERHUB_SERVICE_PREFIX eliminates all of these problems.
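For concreteness, JupyterHub sets JUPYTERHUB_SERVICE_PREFIX per server, so a template that includes it resolves correctly for both default and named servers (the usernames and prefixes below are illustrative):

# JUPYTERHUB_SERVICE_PREFIX as exported by JupyterHub, e.g.:
#   default server: /user/matt.plough/
#   named server:   /user/matt.plough/my-named-server/
template = "{JUPYTERHUB_SERVICE_PREFIX}proxy/{port}/status"

for prefix in ("/user/matt.plough/", "/user/matt.plough/my-named-server/"):
    print(template.format(JUPYTERHUB_SERVICE_PREFIX=prefix, port=8787))
# /user/matt.plough/proxy/8787/status
# /user/matt.plough/my-named-server/proxy/8787/status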
How should users configure the DASK_DISTRIBUTED__DASHBOARD__LINK variable when using a JupyterHub proxy?
How should users configure the DASK_DISTRIBUTED__DASHBOARD__LINK variable when using a JupyterHub proxy?
I think this question is separate from the bug highlighted here. Could you open a new issue for this?
Good idea, and done - see https://github.com/dask/distributed/issues/7736.
Describe the issue: Attempting to use SSHCluster does not work in 2023.3.2 because the scheduler exits early with an exit code of 1.
When rolling back to 2023.3.1 the scheduler starts successfully:
Minimal Complete Verifiable Example:
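A minimal sketch of the kind of setup that triggers this, assuming passwordless SSH access to the hosts and the dashboard link variable set as above (hostnames are placeholders):

from dask.distributed import Client, SSHCluster

# The first host runs the scheduler, the rest run workers. With 2023.3.2 and a
# dashboard link template referencing variables missing on the remote hosts,
# the scheduler process exits with code 1.
cluster = SSHCluster(["localhost", "localhost"])
client = Client(cluster)
print(client)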
Anything else we need to know?: Full repro here:
Environment: