datalad / datalad-ria

Adds functionality for RIA stores to DataLad
http://datalad.org

"Hanging" processes due to stale SSH socket #8

Open mih opened 4 years ago

mih commented 4 years ago

Just observed a datalad siblings call that would not return and produced no output. The cause was an SSH master socket that, for whatever reason, was still active (but potentially or likely invalid). git annex info is called internally and was waiting patiently for feedback from SSH that never came.

Not sure how to deal with this, but it might be the source of a range of "hangs forever" observations.
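One way to tell whether a control socket is stale would be to ask ssh itself, with a timeout so that the check cannot hang as well. A minimal sketch, assuming OpenSSH's `-O check` control command; the helper names `build_check_cmd` and `master_is_alive` and the 5-second default are made up for illustration and are not part of DataLad:

```python
import subprocess


def build_check_cmd(socket_path, host):
    """Build the OpenSSH command that asks a master process whether it is alive."""
    # -S selects the control socket, -O check sends the "check" control command
    return ["ssh", "-S", str(socket_path), "-O", "check", host]


def master_is_alive(socket_path, host, timeout=5.0, run=subprocess.run):
    """Return True only if the master behind socket_path answers within timeout.

    `run` is injectable (hypothetical design choice) so the logic can be
    exercised without a real ssh binary or server.
    """
    try:
        result = run(
            build_check_cmd(socket_path, host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # no answer in time, or ssh could not be started: treat as stale
        return False
    return result.returncode == 0
```

A caller could run such a check before handing the socket to git annex and remove or ignore the socket if it returns False, rather than waiting indefinitely.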

bpoldrack commented 4 years ago

FTR: In the context of the mentioned PR in git-annex-ria-remote, I figured out that the hang wasn't during the default fetch that siblings-configure does, but during annex-enableremote. Moreover, that annex call seems to refer to a socket that doesn't exist (anymore) but is still listed in SSHManager._connections. At the same time, another socket exists that is still open, and I currently have no idea where it comes from, since I'm not aware of another connection being required (all same host, user, etc.).

So, there's something fishy in connection sharing here. Looking into it.

bpoldrack commented 4 years ago

Update: There are (of course) several issues here.

However, one thing is pretty clear by now: SSHConnection's _opened_by_us doesn't really work, since it can only account for calls to its own open() method. If, however, we call annex (which is set up to share connections with "us") and annex opens the socket, we never close it, which explains the "stale socket" observation.
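To illustrate the bookkeeping gap, here is a toy model. It is not DataLad's actual SSHConnection code: `ToyConnection`, `external_tool_opens` and `demo` are hypothetical names, and a plain file stands in for the control socket. It only shows why a flag set exclusively in our own open() cannot know about a socket that a child process created.

```python
import os
import tempfile


class ToyConnection:
    """Toy stand-in for a shared SSH connection; a plain file plays the socket."""

    def __init__(self, ctrl_path):
        self.ctrl_path = ctrl_path
        self._opened_by_us = False

    def open(self):
        # only claim ownership if the socket did not exist before our call
        if not os.path.exists(self.ctrl_path):
            with open(self.ctrl_path, "w"):
                pass  # "start the master"
            self._opened_by_us = True

    def close(self):
        # only clean up what our own open() created
        if self._opened_by_us and os.path.exists(self.ctrl_path):
            os.remove(self.ctrl_path)
            self._opened_by_us = False


def external_tool_opens(ctrl_path):
    """Simulate a child process (e.g. annex) creating the shared socket itself."""
    with open(ctrl_path, "w"):
        pass


def demo():
    """Return True if the socket survives close(), i.e. it was left stale."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ctrl.sock")
        conn = ToyConnection(path)
        external_tool_opens(path)  # socket appears without conn.open() creating it
        conn.open()                # finds it already there, flag stays False
        conn.close()               # so nothing is removed
        return os.path.exists(path)
```

In this model, demo() reports that the socket is still present after close(), which mirrors the leftover master socket described above.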

bpoldrack commented 4 years ago

The second issue, from my encounter in that RIA PR, is not clear yet. That enableremote is supposed to establish a second socket to the same target. The two sockets differ because their hashes differ, which in turn happens because use_bundled_annex is True for one and False for the other (not yet sure why). This works just fine locally, but on my box I authenticate to datalad-test by password, not by key. So on Travis it might just be waiting for user input, which would imply that it uses the key in one case but not in the other, suggesting some confusion within ssh itself.
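A small sketch of why a differing flag leads to a second socket, assuming the socket name is derived from a hash over the connection settings. The function `ctrl_socket_name`, the key layout and the use of SHA-256 are assumptions for illustration only, not the scheme DataLad actually uses:

```python
import hashlib


def ctrl_socket_name(host, user, port, use_bundled_annex):
    """Hypothetical: name a control socket after a hash of the connection settings."""
    # every setting that goes into the key changes the resulting name
    key = f"{user}@{host}:{port}|bundled={use_bundled_annex}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:8]


# same host, user and port, but the flag differs between the two callers
name_a = ctrl_socket_name("datalad-test", "me", 22, use_bundled_annex=True)
name_b = ctrl_socket_name("datalad-test", "me", 22, use_bundled_annex=False)
sockets_are_shared = name_a == name_b
```

Under this assumption the two callers end up with different socket names, so the second one has to authenticate on its own instead of reusing the existing master.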

yarikoptic commented 2 years ago

Is this issue something you still ran into any time recently? Note that git-annex has also had a number of fixes since then to address "hanging" of various actions.