Closed mih closed 9 months ago
https://github.com/datalad/datalad-ria/pull/69 narrows this down to SSHRemoteIO
Debugging log on windows:
Plain ssh works fine
c:\projects\datalad-ria>ssh -i C:\DLTMP\datalad_tests_id_rsa ssh://sshuser@datalad-test-ria:2222
Linux 1ba80b1234c2 4.19.27-linuxkit #1 SMP Sun Mar 10 18:51:44 UTC 2019 x86_64
The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.
Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Thu Sep 14 06:45:43 2023 from 10.0.0.40
sshuser@1ba80b1234c2:~$ exit
logout
Connection to datalad-test-ria closed.
With SSHRemoteIO
not
>>> import datalad
>>> datalad.cfg.set('datalad.log.level', 'debug', scope='global')
>>> datalad.cfg.set('datalad.ssh.identityfile', 'C:\DLTMP\datalad_tests_id_rsa', scope='global')
>>> import datalad.distributed.ora_remote as ora
>>> ssh=ora.SSHRemoteIO('ssh://sshuser@datalad-test-ria:2222')
<hangs>
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python39-x64\lib\site-packages\datalad\distributed\ora_remote.py", line 351, in __init__
line = self.shell.stdout.readline()
With SSHManager
>>> import datalad
>>> datalad.cfg.get('datalad.ssh.multiplex-connections')
>>> from datalad.support.sshconnector import SSHManager as sshman
>>> sshman
<class 'datalad.support.sshconnector.NoMultiplexSSHManager'>
>>> url = 'ssh://sshuser@datalad-test-ria:2222'
>>> sshman._prep_connection_args(url)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: _prep_connection_args() missing 1 required positional argument: 'url'
>>> sshman._prep_connection_args(url)sm
KeyboardInterrupt
>>> sm = sshman()
>>> sm._prep_connection_args(url)
(URL(hostname='datalad-test-ria', netloc='sshuser@datalad-test-ria:2222', port=2222, scheme='ssh', username='sshuser'), 'C:\\DLTMP\\datalad_tests_id_rsa')
>>> sm.get_connection(url)
NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>)
>>> con = sm.get_connection(url)
>>> con.ssh_executable
'C:\\Windows\\System32\\OpenSSH\\ssh.exe'
>>> con._ssh_open_args
['-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa']
>>> con._ssh_args
[]
>>> con('uname')
[DEBUG] NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>) is used to run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname']
[DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] with status 0
('Linux\n', '')
>>>
<hangs>
and it can be repeatedly hung:
>>> import datalad
>>> from datalad.support.sshconnector import SSHManager as sshman
>>> sm = sshman()
>>> url = 'ssh://sshuser@datalad-test-ria:2222'
>>> con = sm.get_connection(url)
>>> con('uname')
[DEBUG] NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>) is used to run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname']
[DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=StdOutErrCapture) (cwd=None)
[DEBUG] Finished ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] with status 0
('Linux\n', '')
>>>
<hangs>
What seems important is that the function returns (the return value is there), and there is a new prompt, but afterwards it "hangs", as if the interpreter is not accepting new inputs. once I kill the Python process, I can reuse the terminal session with a new Python session without issues, and I can make it hang at that exact same spot.
I can also make it hang, if I run that exact same command directly via the runner:
>>> con.runner.run(['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'])
[DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=NoCapture) (cwd=None)
Linux
>>>
<hangs>
and also without any DataLad at all
>>> import subprocess
>>> subprocess.run(['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'])
Linux
CompletedProcess(args=['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', ....
>>>
<hangs>
But it needs the python session, outside (but in the same terminal session), it runs fine, any number of times:
c:\projects\datalad-ria>C:\Windows\System32\OpenSSH\ssh.exe -p 2222 -i C:\DLTMP\datalad_tests_id_rsa sshuser@datalad-test-ria uname
Linux
c:\projects\datalad-ria>
c:\projects\datalad-ria>C:\Windows\System32\OpenSSH\ssh.exe -p 2222 -i C:\DLTMP\datalad_tests_id_rsa sshuser@datalad-test-ria uname
Linux
c:\projects\datalad-ria>
subprocess
docs contain the following bits that may need to be checked
If close_fds is true, all file descriptors except 0, 1 and 2 will be closed before the child process is executed. Otherwise when close_fds is false, file descriptors obey their inheritable flag as described in Inheritance of File Descriptors.
On Windows, if close_fds is true then no handles will be inherited by the child process unless explicitly passed in the handle_list element of STARTUPINFO.lpAttributeList, or by standard handle redirection.
Changed in version 3.2: The default for close_fds was changed from False to what is described above.
Changed in version 3.7: On Windows the default for close_fds was changed from False to True when redirecting the standard handles. It’s now possible to set close_fds to True when redirecting the standard handles.
I tried a few more things:
versions
(py39) C:\Users\adina>where ssh
C:\Windows\System32\OpenSSH\ssh.exe
(py39) C:\Users\adina>ssh -V
OpenSSH_for_Windows_8.1p1, LibreSSL 3.0.2
(py39) C:\Users\adina>python
Python 3.9.0 (tags/v3.9.0:9cf6752, Oct 5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import subprocess
>>> process = subprocess.Popen(["ssh", "-i", r"C:\Users\adina\.ssh\id_ed25519_juseless", "juseless.inm7.de", "uname"], shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
>>> result = process.stdout.readlines()
[hangs]
>>> subprocess.run(r"ssh -i C:\Users\adina\.ssh\id_ed25519_juseless juseless.inm7.de uname", close_fds=False)
Enter passphrase for key 'C:\Users\adina\.ssh\id_ed25519_juseless':
Linux
CompletedProcess(args='ssh -i C:\\Users\\adina\\.ssh\\id_ed25519_juseless juseless.inm7.de uname', returncode=0)
>>>
[hangs]
subprocess.run(r"ssh -i C:\Users\adina\.ssh\id_ed25519_juseless juseless.inm7.de uname", close_fds=True)
Linux
CompletedProcess(args='ssh -i C:\\Users\\adina\\.ssh\\id_ed25519_juseless juseless.inm7.de uname', returncode=0)
>>>
[hangs]
Maybe some of these Windows specific subprocess configurations could be relevant: https://docs.python.org/3/library/subprocess.html#windows-popen-helpers
I found a Gist that made things work: https://gist.github.com/josephcoombe/3a234721fc5a6885ca4f91e3a27860f4.
(it needs an additional encoding="utf-8"
in the Popen() call, but otherwise works). I will investigate closer
I don't understand subprocess well, so I'm just sharing a few observations. This is how I can get subprocess.run() to run without hanging.
import os, platform, subprocess
USER = "adina"
COMMAND = "uname -a"
HOST = "juseless.inm7.de"
system32 = os.path.join(os.environ['SystemRoot'], 'SysNative' if platform.architecture()[0] == '32bit' else 'System32')
ssh_path = os.path.join(system32, 'OpenSSH/ssh.exe')
PRIVATE_KEY_LOCATION = "C:/Users/adina/.ssh/id_ed25519_juseless"
subprocess.run([ssh_path, '-i', PRIVATE_KEY_LOCATION, "{}@{}".format(USER, HOST), COMMAND], stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
CompletedProcess(args=['C:\\Windows\\System32\\OpenSSH/ssh.exe', '-i', 'C:/Users/adina/.ssh/id_ed25519_juseless', 'adina@juseless.inm7.de', 'uname -a'], returncode=0, stdout=b'Linux juseless 6.1.0-10-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.37-1 (2023-07-03) x86_64 GNU/Linux\n', stderr=b'')
print("No hanging!")
No hanging!
Edit: The hanging reappears if I don't set stdout=subprocess.PIPE, and also if I don't set stdin=subprocess.PIPE
Edit: The complex ssh_path construction also does not seem to be necessary, it works also with a plain "ssh".
Edit:
I have tested that the call
subprocess.run(["ssh", '-i', PRIVATE_KEY_LOCATION, "{}@{}".format(USER, HOST), COMMAND], stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE)
works (in CMD) if:
I am still not sure what the root cause of the problem is.
On Windows 10, in cmd.exe
and powershell.exe
in an interactive python sessions, the following will end up with a hanging python interpreter:
>>> import subprocess
>>> subprocess.run(['ssh', 'unix-machine', 'uname'])
>>> print('after run')
(Weirdly enough, no other python interpreter started in the same powershell session (after killing the hanging python process) is accepting interactive input).
hangtest.py
and execute them via python hangtest.py
, the interpreter prints after run
and exits.python -i hangtest.py
, the interpreter prints after run
, and hangs.stdin=subprocess.DEVNULL
or stdin=subprocess.PIPE
, to the run
-parameters, the interpreter prints after run
and does not hang.Without overriding stdin
, the subprocess, i.e. ssh
, and python
share the same file pointer. It seems that stdin
is configured in a way that unexpected by the interpreter and messes with python
's way to read from sys.stdin
. For example, running the following program would hang after printing the prompt enter something>
:
import subprocess
subprocess.run(['ssh', 'unix-machine', 'uname'])
print('after run')
print(input('enter something> '))
But this program will print out individual characters read from sys.stdin
:
import sys
import subprocess
subprocess.run(['ssh', 'unix-machine', 'uname'])
print('after run')
while True:
print(read(sys.stdin))
Not sure about the conclusion yet.
[Edit]
It might well be that ssh
configures the stdin descriptor in a way that is unexpected by python.
[Edit 2]
Until the real cause of the described problems is found, it seems that, on Windows, we should never let ssh
touch a stdin
-file descriptor that is also used by the python
interpreter itself. I.e. always specify stdin=PIPE
or stdin=DEVNUL
or similar and feed input into the resulting ssh-input descriptor ourselves.
This all sounds like we understand how to avoid the problem. What I do not understand is what a fix could look like.
We need a fix that is SSH-specific. But the changes in handling described here seem only possible deep in the Runner code. Runner.run()
docs say
stdin : file-like, bytes, Queue, or None
If stdin is a file-like, it will be directly used as stdin for the
subprocess. The caller is responsible for writing to it and closing it.
If stdin is a bytes, it will be fed to stdin of the subprocess.
If all data is written, stdin will be closed.
If stdin is a Queue, all elements (bytes) put into the Queue will
be passed to stdin until None is read from the queue. If None is read,
stdin of the subprocess is closed.
Looking further, ThreadedRunner
docs clarify
If stdin is None, nothing will be sent to stdin of the subprocess. More precisely, `subprocess.Popen` will be called with `stdin=None`.
That seems to imply this can only be fixed by supporting an additional argument type/value for stdin
in the Runner code hierarchy, which is then used by a select code path for ssh client execution on windows (only).
Ping @christian-monch
This all sounds like we understand how to avoid the problem. What I do not understand is what a fix could look like.
We need a fix that is SSH-specific. But the changes in handling described here seem only possible deep in the Runner code. [...] That seems to imply this can only be fixed by supporting an additional argument type/value for
stdin
in the Runner code hierarchy, which is then used by a select code path for ssh client execution on windows (only). [...]
I think it can and should be fixed by calling the runner differently when executing ssh
in Windows. The current runner code supports the "solution" outlined above (btw: the implementation of SSHRemoteIO
in datalad-core does not use only the runner for ssh
-subprocesses, but also direct ssh invocation). I will dig into a solution on my Windows machine.
[...] supporting an additional argument type/value for stdin in the Runner code hierarchy [...]
Although this might not be required for the solution of this problem, it might be a good idea to support DEVNULL
. A workaround would be to pass an empty byte-array, i.e. stdin=b''
.
I am on it.
Isolating the interpreters stdin-descriptor from ssh
does indeed solve the problem in Windows. For example, the hanging code from above:
>>> import datalad >>> from datalad.support.sshconnector import SSHManager as sshman >>> sm = sshman() >>> url = 'ssh://sshuser@datalad-test-ria:2222' >>> con = sm.get_connection(url) >>> con('uname') [DEBUG] NoMultiplexSSHConnection(sshri=<<SSHRI(hostname++47 chars++er')>>) is used to run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] [DEBUG] Run ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] (protocol_class=StdOutErrCapture) (cwd=None) [DEBUG] Finished ['C:\\Windows\\System32\\OpenSSH\\ssh.exe', '-p', '2222', '-i', 'C:\\DLTMP\\datalad_tests_id_rsa', 'sshuser@datalad-test-ria', 'uname'] with status 0 ('Linux\n', '') >>> <hangs>
Will not hang, with the line:
con('uname')
is replaced with
con('uname', stdin=b'')
I am looking into the code-paths that lead to ssh touching the interpreter's stdin-stream to see whether we can prevent that in the new ria-remote implementation, without any changes in datalad-core
Before this is closed, we must file a companion issue in datalad-core.
Reopening: Some breakage is fixed, but SSHRemoteIO
is still not functional. This is only the case with #69
This is an interim conclusion from https://github.com/datalad/datalad-ria/pull/58
A test tries to run
init_remote()
for anora
remote, not even expecting it to work, but error out due to insufficiently meet preconditions. But it just hangs.This may be due to the fact that the target connection is via SSH and the target port is ignored by the implementation of
SSHRemoteIO
from core. but the same code passes on mac and linux.This could be investigated by deploying a matching ssh config that declares the port.