`SSHRemoteIO` on windows hangs

mih commented 10 months ago

This is an interim conclusion from https://github.com/datalad/datalad-ria/pull/58

A test tries to run init_remote() for an ora remote, not even expecting it to work, but error out due to insufficiently meet preconditions. But it just hangs.

This may be due to the fact that the target connection is via SSH and the target port is ignored by the implementation of SSHRemoteIO from core. but the same code passes on mac and linux.

This could be investigated by deploying a matching ssh config that declares the port.

mih commented 10 months ago

https://github.com/datalad/datalad-ria/pull/69 narrows this down to SSHRemoteIO

mih commented 9 months ago

Debugging log on windows:

Plain ssh works fine

c:\projects\datalad-ria>ssh -i C:\DLTMP\datalad_tests_id_rsa ssh://sshuser@datalad-test-ria:2222
Linux 1ba80b1234c2 4.19.27-linuxkit #1 SMP Sun Mar 10 18:51:44 UTC 2019 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu Sep 14 06:45:43 2023 from 10.0.0.40
sshuser@1ba80b1234c2:~$ exit
logout
Connection to datalad-test-ria closed.

With SSHRemoteIO not

>>> import datalad
>>> datalad.cfg.set('datalad.log.level', 'debug', scope='global')
>>> datalad.cfg.set('datalad.ssh.identityfile', 'C:\DLTMP\datalad_tests_id_rsa', scope='global')
>>> import datalad.distributed.ora_remote as ora
>>> ssh=ora.SSHRemoteIO('ssh://sshuser@datalad-test-ria:2222')

<hangs>

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python39-x64\lib\site-packages\datalad\distributed\ora_remote.py", line 351, in __init__
    line = self.shell.stdout.readline()

With SSHManager

>>> import datalad
>>> datalad.cfg.get('datalad.ssh.multiplex-connections')
>>> from datalad.support.sshconnector import SSHManager as sshman
>>> sshman
<class 'datalad.support.sshconnector.NoMultiplexSSHManager'>
>>> url = 'ssh://sshuser@datalad-test-ria:2222'
>>> sshman._prep_connection_args(url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: _prep_connection_args() missing 1 required positional argument: 'url'
>>> sshman._prep_connection_args(url)sm
KeyboardInterrupt
>>> sm = sshman()
>>> sm._prep_connection_args(url)
(URL(hostname='datalad-test-ria', netloc='sshuser@datalad-test-ria:2222', port=2222, scheme='ssh', username='sshuser'), 'C:\\DLTMP\\datalad_tests_id_rsa')
>>> sm.get_connection(url)
NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>)
>>> con = sm.get_connection(url)
>>> con.ssh_executable
'C:\\Windows\\System32\\OpenSSH\\ssh.exe'
>>> con._ssh_open_args
['-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa']
>>> con._ssh_args
[]
>>> con('uname')
[DEBUG] NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>) is used to run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname']
[DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] with status 0
('Linux\n', '')
>>>
<hangs>

and it can be repeatedly hung:

>>> import datalad
>>> from datalad.support.sshconnector import SSHManager as sshman
>>> sm = sshman()
>>> url = 'ssh://sshuser@datalad-test-ria:2222'
>>> con = sm.get_connection(url)
>>> con('uname')
[DEBUG] NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>) is used to run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname']
[DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] with status 0
('Linux\n', '')
>>>
<hangs>

What seems important is that the function returns (the return value is there), and there is a new prompt, but afterwards it "hangs", as if the interpreter is not accepting new inputs. once I kill the Python process, I can reuse the terminal session with a new Python session without issues, and I can make it hang at that exact same spot.

I can also make it hang, if I run that exact same command directly via the runner:

>>> con.runner.run(['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'])
[DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=NoCapture) (cwd=None)
Linux
>>>

<hangs>

and also without any DataLad at all

>>> import subprocess
>>> subprocess.run(['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'])

Linux
CompletedProcess(args=['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', ....
>>>  

<hangs>

But it needs the python session, outside (but in the same terminal session), it runs fine, any number of times:

c:\projects\datalad-ria>C:\Windows\System32\OpenSSH\ssh.exe -p 2222 -i C:\DLTMP\datalad_tests_id_rsa sshuser@datalad-test-ria uname
Linux

c:\projects\datalad-ria>
c:\projects\datalad-ria>C:\Windows\System32\OpenSSH\ssh.exe -p 2222 -i C:\DLTMP\datalad_tests_id_rsa sshuser@datalad-test-ria uname
Linux

c:\projects\datalad-ria>

mih commented 9 months ago

subprocess docs contain the following bits that may need to be checked

If close_fds is true, all file descriptors except 0, 1 and 2 will be closed before the child process is executed. Otherwise when close_fds is false, file descriptors obey their inheritable flag as described in Inheritance of File Descriptors.

On Windows, if close_fds is true then no handles will be inherited by the child process unless explicitly passed in the handle_list element of STARTUPINFO.lpAttributeList, or by standard handle redirection.

Changed in version 3.2: The default for close_fds was changed from False to what is described above.

Changed in version 3.7: On Windows the default for close_fds was changed from False to True when redirecting the standard handles. It’s now possible to set close_fds to True when redirecting the standard handles.

adswa commented 9 months ago

I tried a few more things:

versions

(py39) C:\Users\adina>where ssh
C:\Windows\System32\OpenSSH\ssh.exe

(py39) C:\Users\adina>ssh -V
OpenSSH_for_Windows_8.1p1, LibreSSL 3.0.2

(py39) C:\Users\adina>python
Python 3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> process = subprocess.Popen(["ssh", "-i",  r"C:\Users\adina\.ssh\id_ed25519_juseless", "juseless.inm7.de", "uname"], shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>>> result = process.stdout.readlines()
[hangs]

>>> subprocess.run(r"ssh -i C:\Users\adina\.ssh\id_ed25519_juseless juseless.inm7.de uname", close_fds=False)
Enter passphrase for key 'C:\Users\adina\.ssh\id_ed25519_juseless':
Linux
CompletedProcess(args='ssh -i C:\\Users\\adina\\.ssh\\id_ed25519_juseless juseless.inm7.de uname', returncode=0)
>>>
[hangs]

subprocess.run(r"ssh -i C:\Users\adina\.ssh\id_ed25519_juseless juseless.inm7.de uname", close_fds=True)
Linux
CompletedProcess(args='ssh -i C:\\Users\\adina\\.ssh\\id_ed25519_juseless juseless.inm7.de uname', returncode=0)
>>>
[hangs]

adswa commented 9 months ago

Maybe some of these Windows specific subprocess configurations could be relevant: https://docs.python.org/3/library/subprocess.html#windows-popen-helpers

adswa commented 9 months ago

I found a Gist that made things work: https://gist.github.com/josephcoombe/3a234721fc5a6885ca4f91e3a27860f4.

(it needs an additional encoding="utf-8" in the Popen() call, but otherwise works). I will investigate closer

adswa commented 9 months ago

I don't understand subprocess well, so I'm just sharing a few observations. This is how I can get subprocess.run() to run without hanging.

import os, platform, subprocess
USER = "adina"
COMMAND = "uname -a"
HOST = "juseless.inm7.de"
system32 = os.path.join(os.environ['SystemRoot'], 'SysNative' if platform.architecture()[0] == '32bit' else 'System32')
ssh_path = os.path.join(system32, 'OpenSSH/ssh.exe')
PRIVATE_KEY_LOCATION = "C:/Users/adina/.ssh/id_ed25519_juseless"
subprocess.run([ssh_path, '-i', PRIVATE_KEY_LOCATION, "{}@{}".format(USER, HOST), COMMAND], stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
CompletedProcess(args=['C:\\Windows\\System32\\OpenSSH/ssh.exe', '-i', 'C:/Users/adina/.ssh/id_ed25519_juseless', 'adina@juseless.inm7.de', 'uname -a'], returncode=0, stdout=b'Linux juseless 6.1.0-10-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.37-1 (2023-07-03) x86_64 GNU/Linux\n', stderr=b'')

print("No hanging!")

No hanging!

Edit: The hanging reappears if I don't set stdout=subprocess.PIPE, and also if I don't set stdin=subprocess.PIPE

Edit: The complex ssh_path construction also does not seem to be necessary, it works also with a plain "ssh".

Edit:

I have tested that the call

subprocess.run(["ssh", '-i', PRIVATE_KEY_LOCATION, "{}@{}".format(USER, HOST), COMMAND], stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE)

works (in CMD) if:

[x] I don't have Git installed
[x] I have Git installed and configured to use external SSH (i.e., the one shipped with Windows)
[x] I have Git installed and configured to use bundled SSH (i.e., the one shipped within Git)

christian-monch commented 9 months ago

I am still not sure what the root cause of the problem is.

On Windows 10, in cmd.exe and powershell.exe in an interactive python sessions, the following will end up with a hanging python interpreter:

>>> import subprocess
>>> subprocess.run(['ssh', 'unix-machine', 'uname'])
>>> print('after run')

(Weirdly enough, no other python interpreter started in the same powershell session (after killing the hanging python process) is accepting interactive input).

If I put the commands into a file, e.g. hangtest.py and execute them via python hangtest.py, the interpreter prints after run and exits.
If I execute the command with python -i hangtest.py, the interpreter prints after run, and hangs.
If I add stdin=subprocess.DEVNULL or stdin=subprocess.PIPE, to the run-parameters, the interpreter prints after run and does not hang.

Interpretation

Without overriding stdin, the subprocess, i.e. ssh, and python share the same file pointer. It seems that stdin is configured in a way that unexpected by the interpreter and messes with python's way to read from sys.stdin. For example, running the following program would hang after printing the prompt enter something>:

import subprocess
subprocess.run(['ssh', 'unix-machine', 'uname'])
print('after run')
print(input('enter something> '))

But this program will print out individual characters read from sys.stdin:

import sys
import subprocess
subprocess.run(['ssh', 'unix-machine', 'uname'])
print('after run')
while True:
    print(read(sys.stdin))

Not sure about the conclusion yet.

[Edit] It might well be that ssh configures the stdin descriptor in a way that is unexpected by python.

[Edit 2] Until the real cause of the described problems is found, it seems that, on Windows, we should never let ssh touch a stdin-file descriptor that is also used by the python interpreter itself. I.e. always specify stdin=PIPE or stdin=DEVNUL or similar and feed input into the resulting ssh-input descriptor ourselves.

mih commented 9 months ago

This all sounds like we understand how to avoid the problem. What I do not understand is what a fix could look like.

We need a fix that is SSH-specific. But the changes in handling described here seem only possible deep in the Runner code. Runner.run() docs say

stdin : file-like, bytes, Queue, or None
          If stdin is a file-like, it will be directly used as stdin for the
          subprocess. The caller is responsible for writing to it and closing it.
          If stdin is a bytes, it will be fed to stdin of the subprocess.
          If all data is written, stdin will be closed.
          If stdin is a Queue, all elements (bytes) put into the Queue will
          be passed to stdin until None is read from the queue. If None is read,
          stdin of the subprocess is closed.

Looking further, ThreadedRunner docs clarify

If stdin is None, nothing will be sent to stdin of the subprocess. More precisely, `subprocess.Popen` will be called with `stdin=None`.

That seems to imply this can only be fixed by supporting an additional argument type/value for stdin in the Runner code hierarchy, which is then used by a select code path for ssh client execution on windows (only).

Ping @christian-monch

christian-monch commented 9 months ago

This all sounds like we understand how to avoid the problem. What I do not understand is what a fix could look like.

We need a fix that is SSH-specific. But the changes in handling described here seem only possible deep in the Runner code. [...] That seems to imply this can only be fixed by supporting an additional argument type/value for stdin in the Runner code hierarchy, which is then used by a select code path for ssh client execution on windows (only). [...]

I think it can and should be fixed by calling the runner differently when executing ssh in Windows. The current runner code supports the "solution" outlined above (btw: the implementation of SSHRemoteIO in datalad-core does not use only the runner for ssh-subprocesses, but also direct ssh invocation). I will dig into a solution on my Windows machine.

[...] supporting an additional argument type/value for stdin in the Runner code hierarchy [...]

Although this might not be required for the solution of this problem, it might be a good idea to support DEVNULL. A workaround would be to pass an empty byte-array, i.e. stdin=b''.

I am on it.

christian-monch commented 9 months ago

Isolating the interpreters stdin-descriptor from ssh does indeed solve the problem in Windows. For example, the hanging code from above:

>>> import datalad
>>> from datalad.support.sshconnector import SSHManager as sshman
>>> sm = sshman()
>>> url = 'ssh://sshuser@datalad-test-ria:2222'
>>> con = sm.get_connection(url)
>>> con('uname')
[DEBUG] NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>) is used to run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname']
[DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] with status 0
('Linux\n', '')
>>>
<hangs>

Will not hang, with the line:

con('uname')

is replaced with

con('uname', stdin=b'')

I am looking into the code-paths that lead to ssh touching the interpreter's stdin-stream to see whether we can prevent that in the new ria-remote implementation, without any changes in datalad-core

mih commented 9 months ago

Before this is closed, we must file a companion issue in datalad-core.

mih commented 9 months ago

Reopening: Some breakage is fixed, but SSHRemoteIO is still not functional. This is only the case with #69

datalad / datalad-ria

`SSHRemoteIO` on windows hangs #68

Interpretation