datalad / datalad

Keep code, data, containers under control with git and git-annex
http://datalad.org
Other
526 stars 110 forks source link

ssh: we are "holding" stderr of the ssh mux process forbidding a clean exit if piped #5465

Open yarikoptic opened 3 years ago

yarikoptic commented 3 years ago

dedicated extract from https://github.com/datalad/datalad/issues/5460#issuecomment-790923286

repoducer demonstrating the issue is a stalling script like

$> cat ./chronic-halt
#!/bin/bash

set -eu
ps auxw | grep chronic-[s]tall | awk '{print $2;}' | xargs -r kill
rm -f $HOME/.cache/datalad/sockets/chronic-stall
(
pstree -p $$
echo "starting ssh mux"
ssh -fN -o ControlMaster=auto -o ControlPersist=15m -o ControlPath=$HOME/.cache/datalad/sockets/chronic-stall datasets.datalad.org
echo "done, exiting"
pstree -p $$
) 2>&1 | tee /tmp/chronic-stall.log

in interaction with #5464 we ending up with a stalled process. Since it is interaction with #5464 and we do take care about closing all the ssh mux processes we started, this issue should not manifest itself. But I still wonder if we should somehow take care about not dragging those pipes of ssh -fN with us (e.g. if we decide to change policy for those ssh processes to outlast our datalad process life time)

edit 1: adding 2>/dev/null is sufficient to resolve the stall. So it seems to be about the process inheriting stderr. edit 2: adding 2>&1 isn't sufficient. Redirecting 2 to a file - sufficient edit 3: added cleanup and used $HOME

yarikoptic commented 3 years ago

couldn't the WitlessRunner with Capture protocol be used here instead of a plain Popen?

yarikoptic commented 3 years ago

@mih @christian-monch @bpoldrack @jwodder - any ideas on the best ways to address this one?

mih commented 3 years ago

I would support a move to a runner-based execution. That it is not done already is just a long ignored relic.

yarikoptic commented 2 years ago

Ha! I was about to email

following question/email to openssh-unix-dev mindrot.org ``` Dear OpenSSH developers, Originally posted/troubleshooted on our DataLad project issue tracker: https://github.com/datalad/datalad/issues/5465 In DataLad we establish "control socket" for subsequent connections (primarily by git or git-annex) via ssh to a remote host. The problem came up from CRON jobs which used chronic command (by Joey Hess, shipped within moreutils in Debian) around our datalad command invocation. I think we had our datalad process stalling (forgot detail, sorry). I have troubleshooted it to the fact that the process/shell with ssh -fN invocation never exits if ssh's stderr is not redirected to a file. Here is a "reproducer" which would never exit: #!/bin/bash set -eu # to clean up from previous run ps auxw | grep chronic-[s]tall | awk '{print $2;}' | xargs -r kill rm -f $HOME/.cache/datalad/sockets/chronic-stall ( pstree -p $$ echo "starting ssh mux" ssh -fN -o ControlMaster=auto -o ControlPersist=15m -o ControlPath=$HOME/.cache/datalad/sockets/chronic-stall datasets.datalad.org #2>/tmp/chronic-stderr-out #2>/dev/null #2>&1 echo "done, exiting" pstree -p $$ ) 2>&1 | tee /tmp/chronic-stall.log The commented out redirection to a file (or /dev/null) mitigates the situation. I wonder if that is something expected and I just don't see through my lack of knowledge, and thus we just need to mitigate on DataLad side (take care about explicitly killing that ssh process?) or something to address on openssh side? My ssh from debian: $> apt-cache policy openssh-client openssh-client: Installed: 1:8.4p1-6 Candidate: 1:8.7p1-2 Version table: 1:8.7p1-2 900 900 http://deb.debian.org/debian bookworm/main amd64 Packages 600 http://http.debian.net/debian sid/main amd64 Packages *** 1:8.4p1-6 100 100 /var/lib/dpkg/status 1:8.4p1-5 100 100 http://deb.debian.org/debian bullseye/main amd64 Packages ```

but then saw that my openssh-client outdated, so upgraded to 1:8.7p1-2 and my reproducer no longer reproduces the situation!!! At the end of the run of the reproducer I still have

$> ps auxw | grep chronic-[s]tall
yoh      2269666  0.0  0.0   9876  2976 ?        Ss   11:04   0:00 ssh -fN -o ControlMaster=auto -o ControlPersist=15m -o ControlPath=/home/yoh/.cache/datalad/sockets/chronic-stall datasets.datalad.org

that ssh process running the background (I guess as expected since nothing said to kill it). Looking through Debian changelog (which seems to include upstream changes) found in 1:8.4p1-1 (so can't be relevant since I had 1:8.4p1-6, right?)

    - ssh(1): close stdin/out/error when forking after authentication
      completes ("ssh -f ...").

which sounds like the which could have been relevant... looking at changelog.gz which seems to be a git log dump found

commit 396d32f3a1a16e54df2a76b2a9b237868580dcbe
Author: djm@openbsd.org <djm@openbsd.org>
Date:   Sat Oct 3 09:22:26 2020 +0000

    upstream: There are lots of place where we want to redirect stdin,

    stdout and/or stderr to /dev/null. Factor all these out to a single
    stdfd_devnull() function that allows selection of which of these to redirect.
    ok markus@

    OpenBSD-Commit-ID: 3033ba5a4c47cacfd5def020d42cabc52fad3099

which in the https://github.com/openssh/openssh-portable

$> git describe --contains 396d32f3a1a16e54df2a76b2a9b237868580dcbe
V_8_5_P1~265

so if that is the one -- most likely the issue was fixed in 8.5p1 . Unfortunately https://snapshot.debian.org/binary/openssh-client/ does not have 8.5 or 8.6 releases. So I will just assume that it is fixed in 8.5p1 ;)

Unfortunately drogon where I encountered issue originally is still on stretch. I guess I should look into upgrading and hoping for a backport of 8.7 for bullseye (Debian stable) to come in some near future.

Meanwhile I still wonder if we should do something about this and mitigate on our end somehow