dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

self.connection.run("cmd /c ver") causing problems #5411

Open morikplay opened 3 years ago

morikplay commented 3 years ago

What happened: SSHCluster() fails to launch on a Windows Server 2019 system. Turning on debug logs shows it fails at line 179 (for the scheduler), and when that gets 'fixed' (by me), it fails at line 98 (for the worker).

<< providing asyncssh.set_debug_level(2) detail log here>>
[INFO] 2021-10-12 18:20:35,479 logging.py:82 [conn=2, chan=1] Requesting new SSH session
[INFO] 2021-10-12 18:20:35,481 logging.py:82 [conn=2, chan=1]   Command: cmd /c ver
...
<< providing asyncssh.set_debug_level(3) detail log here>>
distributed.deploy.ssh - INFO - File "..\lib\runpy.py", line 188, in _run_module_as_main
distributed.deploy.ssh - INFO - mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
distributed.deploy.ssh - INFO - File "..\lib\runpy.py", line 111, in _get_module_details
distributed.deploy.ssh - INFO - __import__(pkg_name)
distributed.deploy.ssh - INFO - File "..\lib\site-packages\distributed\__init__.py", line 24, in <module>
distributed.deploy.ssh - INFO - from .deploy import Adaptive, LocalCluster, SpecCluster, SSHCluster
distributed.deploy.ssh - INFO - File "..\lib\site-packages\distributed\deploy\__init__.py", line 7, in <module>
distributed.deploy.ssh - INFO - from .ssh import SSHCluster
distributed.deploy.ssh - INFO - File "..\lib\site-packages\distributed\deploy\ssh.py", line 98
distributed.deploy.ssh - INFO - result = await self.connection.run(cmd /c "ver")
distributed.deploy.ssh - INFO - ^
distributed.deploy.ssh - INFO - SyntaxError: invalid syntax
...
[INFO] 2021-10-12 18:20:35,502 logging.py:82 [conn=2, chan=1] Received exit status 1
[INFO] 2021-10-12 18:20:35,503 logging.py:82 [conn=2, chan=1] Received channel close
[INFO] 2021-10-12 18:20:35,505 logging.py:82 [conn=2, chan=1] Channel closed
...
...
Exception: Scheduler failed to set DASK_INTERNAL_INHERIT_CONFIG variable 
        `result = await self.connection.run("cmd /c ver")`

What you expected to happen: The specified scheduler and workers should launch via SSHCluster().

Minimal Complete Verifiable Example: Reproduced this error on multiple Windows 2019 server systems.
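
A launch along the following lines would exercise the failing code path. This is only a sketch: the host name, the worker count, and the helper names `build_ssh_cluster_kwargs` and `launch` are placeholders for illustration, not the reporter's actual setup. `known_hosts=None` disables host key checking and is meant for testing only.

```python
def build_ssh_cluster_kwargs(host, n_workers=1):
    """Build keyword arguments for SSHCluster (hypothetical helper).

    The first entry in ``hosts`` is used for the scheduler,
    the remaining entries for workers.
    """
    return {
        "hosts": [host] * (n_workers + 1),
        # Passed through to asyncssh; None skips host key verification.
        "connect_options": {"known_hosts": None},
    }


def launch(host, n_workers=1):
    """Start a cluster over SSH; requires dask, distributed and asyncssh."""
    # Imported lazily so the helper above can be used without Dask installed.
    from dask.distributed import SSHCluster

    return SSHCluster(**build_ssh_cluster_kwargs(host, n_workers))
```

Calling `launch("localhost")` on an affected Windows Server 2019 machine with an SSH server running would be expected to hit the `cmd /c ver` probe shown in the logs above.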

Changing lines 98 and 179 to the following lets the desired Dask scheduler and workers start successfully:
`result = await self.connection.run("ver")`

Anything else we need to know?: However, this change introduces an additional issue: subsequent conda env changes fail (due to a size mismatch), and version-mismatch warnings/errors start popping up because the appropriate env doesn't load correctly. Consequently, I can't view the dashboard via bokeh and such.

Environment: OS: Windows Server 2019 v1809; Python: 3.9.7; Dask Distributed: 2021.9.1; asyncssh: 2.7.1; Install method: conda

jrbourbeau commented 3 years ago

Thanks for raising an issue @morikplay. In the current main branch of distributed, "cmd /c ver" is being passed to self.connection.run

https://github.com/dask/distributed/blob/a1b67b84226d3053517dffcb6f0c8fe8821eb8fb/distributed/deploy/ssh.py#L104

In the logs you posted, cmd /c "ver" is used instead. I'm wondering if this is where the invalid syntax is being introduced. You mentioned "when that gets 'fixed' (by me)"; are you using a patched version of distributed?

morikplay commented 3 years ago

ahh... thank you for looking into the issue @jrbourbeau. cmd /c "ver" is just my typo whilst cleaning up the logs (for posting here). cmd /c ver is what is passed to self.connection.run(), and that is what causes result code -1 (which then fails cluster establishment). Changing the call to result = await self.connection.run("ver") fixes the issue but causes other issues with environment imports and such.

I used the distributed version that is available via conda. Exporting the env shows the following version: distributed=2021.9.1=py39hcbf5309_0

jrbourbeau commented 3 years ago

Ah, I see -- thanks for clarifying @morikplay. Unfortunately I'm not familiar with Windows and don't have access to a machine to test things out. Perhaps @abduhbm has thoughts on how the current situation might be improved?

abduhbm commented 3 years ago

As suggested here: https://github.com/PowerShell/Win32-OpenSSH/issues/1373, changing cmd /c ver to cmd.exe /c ver should fix the issue on Windows Server 2019.

@morikplay Can you please try this change from your side?
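
The suggested change can be sketched roughly as follows. This is a minimal illustration, not the actual code in distributed/deploy/ssh.py: the helper name `detect_env_prefix`, the `uname` probe for POSIX hosts, and the `FakeWindowsConnection` stand-in are assumptions made for the example. Only the `cmd.exe /c ver` probe and the `set DASK_INTERNAL_INHERIT_CONFIG=...` command are taken from this thread.

```python
import asyncio
from types import SimpleNamespace


async def detect_env_prefix(connection, config_blob):
    """Return a shell prefix that sets DASK_INTERNAL_INHERIT_CONFIG.

    ``connection`` only needs an async ``run(command)`` method whose result
    has an ``exit_status`` attribute, as an asyncssh connection does.
    """
    # Assumed POSIX probe: succeeds on Linux/macOS hosts.
    result = await connection.run("uname")
    if result.exit_status == 0:
        return f'env DASK_INTERNAL_INHERIT_CONFIG="{config_blob}"'

    # Windows probe, using cmd.exe rather than cmd as proposed above.
    result = await connection.run("cmd.exe /c ver")
    if result.exit_status == 0:
        return f"set DASK_INTERNAL_INHERIT_CONFIG={config_blob} &&"

    raise RuntimeError("Failed to set DASK_INTERNAL_INHERIT_CONFIG variable")


class FakeWindowsConnection:
    """Hypothetical stand-in for an SSH connection to a Windows host.

    Only commands starting with "cmd.exe " succeed; everything else exits 1.
    """

    def __init__(self):
        self.commands = []

    async def run(self, command):
        self.commands.append(command)
        status = 0 if command.startswith("cmd.exe ") else 1
        return SimpleNamespace(exit_status=status, stdout="")


if __name__ == "__main__":
    conn = FakeWindowsConnection()
    print(asyncio.run(detect_env_prefix(conn, "<XYZ..>")))
```

With a real connection, the same function could be awaited on the object returned by `asyncssh.connect(...)`; the fake connection is only there so the probing order can be checked without a Windows host.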

morikplay commented 3 years ago

Proposed change of cmd.exe /c ver works for both scenarios: ssh'ing locally and ssh'ing remotely!

[INFO] 2021-10-25 09:15:27,016 logging.py:82 [conn=0, chan=1] Requesting new SSH session
[INFO] 2021-10-25 09:15:27,018 logging.py:82 [conn=0, chan=1]   Command: cmd.exe /c ver
[INFO] 2021-10-25 09:15:27,037 logging.py:82 [conn=0, chan=1] Received exit status 0
[INFO] 2021-10-25 09:15:27,038 logging.py:82 [conn=0, chan=1] Received channel close
[INFO] 2021-10-25 09:15:27,039 logging.py:82 [conn=0, chan=1] Channel closed
[DEBUG] 2021-10-25 09:15:27,041 logging.py:82 [conn=0, chan=2] Set write buffer limits: low-water=16384, high-water=65536
[INFO] 2021-10-25 09:15:27,042 logging.py:82 [conn=0, chan=2] Requesting new SSH session
[INFO] 2021-10-25 09:15:27,043 logging.py:82 [conn=0, chan=2]   Command: set DASK_INTERNAL_INHERIT_CONFIG=<XYZ..>

abduhbm commented 3 years ago

Thanks @morikplay ! I will create a PR for this.