determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

VS Code integration does not work on Windows #7726

Closed BingxingDong closed 8 months ago

BingxingDong commented 1 year ago

Describe your question

I ran the following command on the remote host to start a shell, and it printed an ssh connection command:


root@VM-2-24-ubuntu:/mnt/beegfs/determined/determined/examples/nlp/bert_glue_pytorch# det shell start --config-file shell_config.yaml --show-ssh-command
CLI version 0.23.3 is less than master version 0.24.0. Consider upgrading the CLI.
Launched shell (id: da3debea-c87a-464d-95bb-b0d098e317a6).
shell (id: da3debea-c87a-464d-95bb-b0d098e317a6) is ready.
ssh -o "ProxyCommand=/usr/bin/python3 -m determined.cli.tunnel localhost:8080 %h" -o StrictHostKeyChecking=no -tt -o IdentitiesOnly=yes -i /root/.cache/determined/shell/da3debea-c87a-464d-95bb-b0d098e317a6/key root@da3debea-c87a-464d-95bb-b0d098e317a6
Warning: Permanently added 'da3debea-c87a-464d-95bb-b0d098e317a6' (RSA) to the list of known hosts.


But when I try to connect from my local VSCode, I get an error:


[19:36:27.194] Log Level: 2
[19:36:27.203] SSH Resolver called for "ssh-remote+da3debea-c87a-464d-95bb-b0d098e317a6", attempt 1
[19:36:27.204] "remote.SSH.useLocalServer": false
[19:36:27.204] "remote.SSH.showLoginTerminal": false
[19:36:27.204] "remote.SSH.remotePlatform": {"10.26.2.106":"linux","10.48.2.2":"linux","10.26.2.108":"linux","10.48.2.200":"linux","10.48.2.7":"linux","10.26.2.61":"linux"}
[19:36:27.204] "remote.SSH.path": undefined
[19:36:27.204] "remote.SSH.configFile": undefined
[19:36:27.204] "remote.SSH.useFlock": true
[19:36:27.204] "remote.SSH.lockfilesInTmp": false
[19:36:27.204] "remote.SSH.localServerDownload": auto
[19:36:27.205] "remote.SSH.remoteServerListenOnSocket": false
[19:36:27.205] "remote.SSH.showLoginTerminal": false
[19:36:27.205] "remote.SSH.defaultExtensions": []
[19:36:27.205] "remote.SSH.loglevel": 2
[19:36:27.205] "remote.SSH.enableDynamicForwarding": true
[19:36:27.205] "remote.SSH.enableRemoteCommand": false
[19:36:27.205] "remote.SSH.serverPickPortsFromRange": {}
[19:36:27.205] "remote.SSH.serverInstallPath": {}
[19:36:27.213] VS Code version: 1.79.0-insider
[19:36:27.213] Remote-SSH version: remote-ssh@0.102.0
[19:36:27.213] win32 x64
[19:36:27.216] SSH Resolver called for host: da3debea-c87a-464d-95bb-b0d098e317a6
[19:36:27.216] Setting up SSH remote "da3debea-c87a-464d-95bb-b0d098e317a6"
[19:36:27.219] Using commit id "b380da4ef1ee00e224a15c1d4d9793e27c2b6302" and quality "insider" for server
[19:36:27.222] Install and start server if needed
[19:36:30.244] Checking ssh with "C:\Windows\system32\ssh.exe -V"
[19:36:30.248] Got error from ssh: spawn C:\Windows\system32\ssh.exe ENOENT
[19:36:30.248] Checking ssh with "C:\Windows\ssh.exe -V"
[19:36:30.250] Got error from ssh: spawn C:\Windows\ssh.exe ENOENT
[19:36:30.250] Checking ssh with "C:\Windows\System32\Wbem\ssh.exe -V"
[19:36:30.252] Got error from ssh: spawn C:\Windows\System32\Wbem\ssh.exe ENOENT
[19:36:30.252] Checking ssh with "C:\Windows\System32\WindowsPowerShell\v1.0\ssh.exe -V"
[19:36:30.254] Got error from ssh: spawn C:\Windows\System32\WindowsPowerShell\v1.0\ssh.exe ENOENT
[19:36:30.255] Checking ssh with "C:\Windows\System32\OpenSSH\ssh.exe -V"
[19:36:30.287] > OpenSSH_for_Windows_8.1p1, LibreSSL 3.0.2
[19:36:30.289] Running script with connection command: "C:\Windows\System32\OpenSSH\ssh.exe" -T -D 60194 "da3debea-c87a-464d-95bb-b0d098e317a6" bash
[19:36:30.292] Terminal shell path: C:\Windows\System32\cmd.exe
[19:36:30.479] > ]0;C:\Windows\System32\cmd.exe
[19:36:30.479] Got some output, clearing connection timeout
[19:36:30.487] > CreateProcessW failed error:2
posix_spawnp: No such file or directory
The process tried to write to a nonexistent pipe.
[19:36:31.865] "install" terminal command done
[19:36:31.865] Install terminal quit with output: The process tried to write to a nonexistent pipe.
[19:36:31.865] Received install output: The process tried to write to a nonexistent pipe.
[19:36:31.866] Failed to parse remote port from server output
[19:36:31.867] Resolver error: Error:
    at m.Create (c:\Users\bingxing_dong.vscode-insiders\extensions\ms-vscode-remote.remote-ssh-0.102.0\out\extension.js:1:584145)
    at t.handleInstallOutput (c:\Users\bingxing_dong.vscode-insiders\extensions\ms-vscode-remote.remote-ssh-0.102.0\out\extension.js:1:582705)
    at t.tryInstall (c:\Users\bingxing_dong.vscode-insiders\extensions\ms-vscode-remote.remote-ssh-0.102.0\out\extension.js:1:681881)
    at async c:\Users\bingxing_dong.vscode-insiders\extensions\ms-vscode-remote.remote-ssh-0.102.0\out\extension.js:1:644110
    at async t.withShowDetailsEvent (c:\Users\bingxing_dong.vscode-insiders\extensions\ms-vscode-remote.remote-ssh-0.102.0\out\extension.js:1:647428)
    at async t.resolve (c:\Users\bingxing_dong.vscode-insiders\extensions\ms-vscode-remote.remote-ssh-0.102.0\out\extension.js:1:645160)
    at async c:\Users\bingxing_dong.vscode-insiders\extensions\ms-vscode-remote.remote-ssh-0.102.0\out\extension.js:1:720916
[19:36:31.872] ------


What is going on here? How can I connect from my local VSCode?

ioga commented 1 year ago

thank you for the report. we were able to repro this issue on Windows, and will investigate it.

BingxingDong commented 1 year ago

Has this problem been solved? Thank you!

ioga commented 1 year ago

we do not have a solution and cannot provide an ETA at this time, sorry.

BingxingDong commented 1 year ago

Oh, thank you! But I still have a question. I used the "det shell start" command to start a shell on the master node, then I used the "det shell show_ssh_command 25bd41bd-c550-430a-a43a-4f36002ecde4" command to show an ssh connection command:


bingxing_dong@VM-2-24-ubuntu:/mnt/beegfs/determined/determined/examples/nlp/bert_glue_pytorch$ det shell show_ssh_command 25bd41bd-c550-430a-a43a-4f36002ecde4
CLI version 0.23.3 is less than master version 0.24.0. Consider upgrading the CLI.
ssh -o "ProxyCommand=/usr/bin/python3 -m determined.cli.tunnel localhost:8080 %h" -o StrictHostKeyChecking=no -tt -o IdentitiesOnly=yes -i /home/bingxing_dong/.cache/determined/shell/25bd41bd-c550-430a-a43a-4f36002ecde4/key root@25bd41bd-c550-430a-a43a-4f36002ecde4


But I found that I can only use this ssh connection command on the master node to connect to the Docker container on the child node. When I run the same command on another machine, it reports an error:


bingxing_dong@VM-2-16-ubuntu:~$ ssh -o "ProxyCommand=/usr/bin/python3 -m determined.cli.tunnel localhost:8080 %h" -o StrictHostKeyChecking=no -tt -o IdentitiesOnly=yes -i /home/bingxing_dong/.cache/determined/shell/25bd41bd-c550-430a-a43a-4f36002ecde4/key root@25bd41bd-c550-430a-a43a-4f36002ecde4
Warning: Identity file /home/bingxing_dong/.cache/determined/shell/25bd41bd-c550-430a-a43a-4f36002ecde4/key not accessible: No such file or directory.
disconnecting websocket
Exception in thread Thread-2 (copy_from_websocket):
kex_exchange_identification: Connection closed by remote host
Traceback (most recent call last):
Connection closed by UNKNOWN port 65535
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner


So my question is: can this ssh connection command only be run on the master node? Can other machines, such as my local VSCode or other child nodes of the cluster, use it?

BingxingDong commented 1 year ago

After I copied the contents of the /home/bingxing_dong/.cache/determined/shell/25bd41bd-c550-430a-a43a-4f36002ecde4/key file from the master node to the same path on the other child node and ran the above ssh connection command again, the error looks like this:


bingxing_dong@VM-2-16-ubuntu:~$ ssh -o "ProxyCommand=/usr/bin/python3 -m determined.cli.tunnel localhost:8080 %h" -o StrictHostKeyChecking=no -tt -o IdentitiesOnly=yes -i /home/bingxing_dong/.cache/determined/shell/25bd41bd-c550-430a-a43a-4f36002ecde4/key root@25bd41bd-c550-430a-a43a-4f36002ecde4
disconnecting websocket
Exception in thread Thread-2 (copy_from_websocket):
Traceback (most recent call last):
kex_exchange_identification: Connection closed by remote host
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Connection closed by UNKNOWN port 65535


ioga commented 1 year ago

So my question is: can this ssh connection command only be run on the master node? Can other machines, such as my local VSCode or other child nodes of the cluster, use it?

After I copied the contents of the /home/bingxing_dong/.cache/determined/shell/25bd41bd-c550-430a-a43a-4f36002ecde4/key file from the master node to the same path on the other child node and ran the above ssh connection command again, the error looks like this:

you don't have to copy the ssh command and the underlying key directly between the nodes. you can use det shell open <shell id> on this other machine to connect to the same shell.

most likely the reason for the failure you're seeing when you copy the ssh command is that, as you can see, the proxy command includes the master server address which happens to be localhost:8080. when you run it directly on another node, the master is no longer at localhost, and the connection fails.
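To illustrate the point about the master address, here is a minimal Python sketch that rewrites the address inside a copied ssh command. The function name `retarget_ssh_command` and the address `master.example.com:8080` are hypothetical, not part of Determined. The sketch also assumes that the key file is available on the other machine and that the Determined CLI there can authenticate against the master; neither is confirmed in this thread.

```python
def retarget_ssh_command(ssh_command: str, old_master: str, new_master: str) -> str:
    """Return ssh_command with the master address in the ProxyCommand replaced.

    Hypothetical helper for illustration only; it does plain string
    substitution on the argument passed to determined.cli.tunnel.
    """
    marker = f"determined.cli.tunnel {old_master} "
    if marker not in ssh_command:
        raise ValueError(f"master address {old_master!r} not found in ProxyCommand")
    return ssh_command.replace(marker, f"determined.cli.tunnel {new_master} ", 1)


if __name__ == "__main__":
    copied = (
        'ssh -o "ProxyCommand=/usr/bin/python3 -m determined.cli.tunnel localhost:8080 %h" '
        "-o StrictHostKeyChecking=no -tt -o IdentitiesOnly=yes "
        "-i /home/bingxing_dong/.cache/determined/shell/25bd41bd-c550-430a-a43a-4f36002ecde4/key "
        "root@25bd41bd-c550-430a-a43a-4f36002ecde4"
    )
    # master.example.com:8080 is a placeholder for an address at which
    # the other machine can actually reach the master.
    print(retarget_ssh_command(copied, "localhost:8080", "master.example.com:8080"))
```

In practice, running det shell open <shell id> on the other machine avoids this manual rewriting altogether.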

BingxingDong commented 1 year ago

  1. Thank you! Your reply is very useful. I should use det shell open <shell id> on the other machine to connect to the same shell.
  2. But I still cannot connect to the Docker container on the child node when I use the same ssh command from my local VSCode. Looking forward to your solution!

BingxingDong commented 1 year ago

  1. Thank you! Your reply is very useful. I should use det shell open <shell id> on the other machine to connect to the same shell.
  2. But I still cannot connect to the Docker container on the child node when I use the same ssh command with my local VSCode. Looking forward to your solution!

MikhailKardash commented 10 months ago

Regarding the VSCode integration, I root-caused the issue to be that VSCode uses CMD and Windows OpenSSH under the hood. You can see that in your logs actually:

Terminal shell path: C:\Windows\System32\cmd.exe

To work around this, you need to do 2 things:

  1. Add wsl.exe to your ProxyCommand, so something like: ProxyCommand=C:\Windows\System32\wsl.exe /usr/bin/python3 -m determined.cli.tunnel localhost:8080 %h
  2. Move your key out of WSL and into the Windows filesystem, for example with cp /home/<username>/.cache/determined/shell/<your_shell_id>/key /mnt/c/path/to/your/key, and make sure your IdentityFile is changed in the config: IdentityFile C:\path\to\your\key
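
The two steps above can be sketched as a small Python helper that prints an SSH config entry for the Windows side. This is an illustration under assumptions, not Determined's own tooling: the function name `windows_ssh_config_entry` is hypothetical, the option values are taken from the ssh command printed earlier in this thread, and `localhost:8080` only works if the master is reachable at that address from inside WSL.

```python
# Path to wsl.exe as given in the workaround above.
WSL_EXE = r"C:\Windows\System32\wsl.exe"


def windows_ssh_config_entry(
    shell_id: str,
    windows_key_path: str,
    master: str = "localhost:8080",
    wsl_python: str = "/usr/bin/python3",
) -> str:
    """Build an SSH config entry whose ProxyCommand runs the tunnel via WSL.

    Hypothetical helper: it only formats text and does not touch any files.
    """
    proxy_command = f"{WSL_EXE} {wsl_python} -m determined.cli.tunnel {master} %h"
    lines = [
        f"Host {shell_id}",
        "    User root",
        f"    ProxyCommand {proxy_command}",
        # The key must live on the Windows filesystem, not inside WSL.
        f"    IdentityFile {windows_key_path}",
        "    IdentitiesOnly yes",
        "    StrictHostKeyChecking no",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholders as used in the steps above; substitute your own values.
    print(windows_ssh_config_entry("<your_shell_id>", r"C:\path\to\your\key"))
```

The printed entry would go into the SSH config file that VSCode's Remote-SSH extension reads.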

I opened a PR to add this to our documentation.

SUNXQ0407 commented 8 months ago

thank you for the report. we were able to repro this issue on Windows, and will investigate it.

Hello, I have a similar problem, with one small difference. I'm curious why I can connect over ssh using MobaXterm, but not using VSCode.

[10:22:14.756] Showing password prompt
[10:22:27.524] Got password response
[10:22:27.525] "install" wrote data to terminal: "****"
[10:22:27.556] >
[10:22:27.572] > exec request failed on channel 2
[10:22:27.586] > The process tried to write to a nonexistent pipe.
[10:22:28.931] "install" terminal command done
[10:22:28.932] Install terminal quit with output: The process tried to write to a nonexistent pipe.
[10:22:28.932] Received install output: The process tried to write to a nonexistent pipe.
[10:22:28.933] Failed to parse remote port from server output

MikhailKardash commented 8 months ago

Added a warning to the CLI ssh command. show-ssh-command should be run in a Windows shell instead of WSL if the user intends to use VSCode. There's also a bug in our fix PR, which we are working on addressing.