Unexpected disconnect behavior leading to exit code 1 since 2.11.0

MrEasy commented 7 months ago

Version

2.11.0, 2.12.0, 2.12.1

Bug description

Hi team,

After a recent update we encounter a change in behavior with regard to the disconnect procedure.

Last version with expected result: 2.10.0 Versions with unexpected results: 2.11.0, 2.12.0, 2.12.1 (latest release)

Starting with 2.11.0 the following issue can be seen: When connecting to the SSH server and issuing a command, in about 50% of all cases, the disconnect is not gracefully, but gives a normal exit with code 0, but then followed by another "unexpectedly closed" result, leading to an exit status of 1. This is in our case problematic, since we interpret the exit code as having issued a successful command.

Environment: Windows (meh) SSH-client: PuTTY/plink, tested versions 0.70, 0.80 (latest release)

The following example plink call shows the issue in about 50% of all executions (alternative with private-key auth as well):

PS C:\xxx\> plink.exe -batch -l admin -P 8122 -noagent -pwfile "C:\xxx\pwd.txt" -v localhost "ls #execute some command"; Write-Host "`r`nLAST EXIT CODE: "$LASTEXITCODE

Looking up host "localhost" for SSH connection
Connecting to ::1 port 8122
Connected to ::1
We claim version: SSH-2.0-PuTTY_Release_0.80
Remote version: SSH-2.0-APACHE-SSHD-2.11.0
Using SSH protocol version 2
Doing ECDH key exchange with curve nistp256, using hash SHA-256 (unaccelerated)
Host key fingerprint is:
ssh-rsa 2048 SHA256:<sha here>
Initialised AES-256 SDCTR (AES-NI accelerated) outbound encryption
Initialised HMAC-SHA-256 (unaccelerated) outbound MAC algorithm
Initialised AES-256 SDCTR (AES-NI accelerated) inbound encryption
Initialised HMAC-SHA-256 (unaccelerated) inbound MAC algorithm
Using username "admin".
Attempting keyboard-interactive authentication
Keyboard-interactive authentication prompts from server:
Access granted
EOpening main session channel
nd of keyboard-interactive prompts from server
Opened main channel
Started a shell/command
<command output here>
Server sent command exit status 0
Server unexpectedly closed network connection
FATAL ERROR: Server unexpectedly closed network connection

LAST EXIT CODE:  1

Full output of -sshrawlog attached: sshraw.log

After setting the sshd logs to TRACE we could see this:

close(Nio2Session[local=/127.0.0.1:8122, remote=/127.0.0.1:51820]) Closing gracefully 
doCloseImmediately(Nio2Session[local=/127.0.0.1:8122, remote=/127.0.0.1:51820]) socket=sun.nio.ch.WindowsAsynchronousSocketChannelImpl[closed] closed

handleReadCycleFailure(Nio2Session[local=/127.0.0.1:8122, remote=/127.0.0.1:51818]) IOException after 413392003 nanos at read cycle=14: The specified network name is no longer available

As if it closed the connection already but then tries to do something with it again. Also put to log-output below.

If we execute the command long enough to get a successful response (only exit code 0, without it being followed by an exit code 1 immediatly), we see this in the debug logs:

org.apache.sshd.common.io.nio2.Nio2Session  Thread-18070  [org.apache.sshd.osgi:2.11.0]         |  |  |  |  close(Nio2Session[local=/[0:0:0:0:0:0:0:1]:8122, remote=/[0:0:0:0:0:0:0:1]:59243]) Closing gracefully

Actual behavior

In about 50% of executions, the disconnect fails to execute gracefully and exit code 1 is the result.

Expected behavior

Graceful disconnect with exit code 0.

Relevant log output

close(Nio2Session[local=/127.0.0.1:8122, remote=/127.0.0.1:51820]) Closing gracefully 
doCloseImmediately(Nio2Session[local=/127.0.0.1:8122, remote=/127.0.0.1:51820]) socket=sun.nio.ch.WindowsAsynchronousSocketChannelImpl[closed] closed

handleReadCycleFailure(Nio2Session[local=/127.0.0.1:8122, remote=/127.0.0.1:51818]) IOException after 413392003 nanos at read cycle=14: The specified network name is no longer available

Other information

No response

tomaswolf commented 7 months ago

Can you provide a full server-side log at DEBUG or TRACE level including timestamps and everything from the moment the sever sent the EOF? (I.e., of the EOF message, exit code message, and everything after that.)

From just these log excerpts it's not possible to even guess what might going on.

MrEasy commented 7 months ago

Can you provide a full server-side log at DEBUG or TRACE level (...)

Here comes the problem with that: As soon as debug logging is active, the problem is not reproducible anymore (at least on lots of tries). So it seems to be a race-condition. Will try to reproduce it and then provide such a log.

For what it's worth, here is a log without the problem arising (exit code 0) with the Closing gracefully: sshdserver-exit-code-0.log

MrEasy commented 7 months ago

(...) Will try to reproduce it and then provide such a log. (...)

Attaching a log with problem occuring. To note again: Very less likely to occur with debug logging active.

Doing a diff between the problematic and successful ones shows the following (left=exit-code-0, right=exit-code-1). The order of the 2 unregisterChannel and the 2 close calls differ.

Log: sshdserver-exit-code-1.log

tomaswolf commented 7 months ago

From the original report it already appears that the server closes the network connection before the final SSH_MSG_CHANNEL_CLOSE handshake. However, I'm missing a number of other expected log lines in these logs. Are these filtered somehow? (Possibly through the logging config, which might set DEBUG or TRACE only for certain paths?) If so, please provide unfiltered logs.

In both logs (successful or not) it is in any case very strange to see a session closing immediately after having sent the SSH_MSG_CHANNEL_CLOSE message. That should not happen. It looks as if the session gets shut down before the channel has been fully closed. Did you re-configure CoreModuleProperties.CHANNEL_CLOSE_TIMEOUT?

MrEasy commented 7 months ago

Struggled a bit with the logger config, so log files contain only org.apache.sshd.server.session so far. Will try to include the rest as well.

Did you re-configure CoreModuleProperties.CHANNEL_CLOSE_TIMEOUT?

No, CHANNEL_CLOSE_TIMEOUT is not set and should default to 5 sec.

MrEasy commented 7 months ago

Let it run in a loop until occurence - with activated trace logs for all sshd-packages, it took 674 iterations to hit it.

Attaching log, including last successful commands and finishing with the problematic ones. There are 2 exceptions showing-up at the end, especially org.apache.sshd.common.SshException: Write attempt on closing session: SSH_MSG_CHANNEL_CLOSE

logs-exit-1.zip

Test-script (just for info): plink-test.txt

tomaswolf commented 7 months ago

Thank you. That is useful and gives me a starting point to analyze this deeper. The log confirms my initial suspicion: it looks as if closing the channel by mistake also closes the session. Plus there's a race between this closing of the session (which is too early) and sending the CLOSE message for the channel. If the session closes earlier, the client doesn't get the CLOSE message and complains, plus we get that "write on closing session" exception on the server. If sending the CLOSE is earlier, the client might be happy, especially if it manages to send back its own CLOSE before the network connection goes down.

I'll try to come up with some unit test, and I'll have to dig deeper to figure out why the session is closed at all at that point. It should not be closed yet.

Not sure the race and bug are new since 2.10.0; but it's possible that changes made since then make an old bug surface now more frequently.

MrEasy commented 7 months ago

Thank you!

Not knowing the code in detail, but maybe this change in 2.11.0 somehow could be related (just a wild guess based on changes there) https://github.com/apache/mina-sshd/issues/410

tomaswolf commented 7 months ago

The issue is Windows-specific. The exception message "The specified network name is no longer available" comes from the Windows Java NIO2 implementation.
The exception indicates that the client unexpectedly closed the session. Thus the server's read waiting on the client's SSH_MSG_CHANNEL_CLOSE is interrupted, which gives this exception. Then the server-side session is closed.
I am not able to reproduce this error with plink 0.77 or 0.80 running against an Apache MINA SSHD 2.11.0 server.
I do not see any bug or race in the channel closing code in Apache MINA SSHD (server part). I do see two (minor) bugs in our client-side code, but they cannot and do not affect the server-side handling of shells/commands.

Can you share a minimal server implementation with which you can reproduce this problem?

MrEasy commented 7 months ago

Hi, Created a simple example server code and can confirm, that I am also not able to reproduce the issue with that.

The SSH-server which shows the issue is the one that Apache Karaf starts-up for its client-shell. You see I oriented on their way of starting it up, code, difference between my simple test and Karaf-integrated SSH-server are the implementations for authentication etc. so issue could also lay in the combination with those. Will look into that a bit more.

tomaswolf commented 7 months ago

We may be getting somewhere now... I find the ShellFactoryImpl interesting.

This destroy() method kills the SSH session. (Immediate close; which is what we see in your logs.) I suspect this is the culprit: if this closing of the session occurs before the client has sent back its SSH_MSG_CHANNEL_CLOSE reply, then putty won't be happy.

Now destroy() is normally being called when the ChannelSession has been closed (the SSH_MSG_CHANNEL_CLOSE request/reply dance having been done). But I also see it being passed to the sessionFactory in line 106/107. I don't know when that runs. If that closes the SSH session before the SSH_MSG_CHANNEL_CLOSE exchange has completed, then putty will complain.

So let me modify my point (2) from above: it's not the client that unexpectedly closes the session, it's the server itself via this destroy().

MrEasy commented 7 months ago

Ah, remembering this change https://github.com/apache/karaf/pull/1427 which was added after upgrade to sshd 2.5.1. Maybe that should not be done anymore in combination with the recent versions.

tomaswolf commented 7 months ago

Line 122 itself is not a problem. I don't know why it was added or whether it is still necessary; I don't remember any bug report about that Karaf issue to Apache MINA SSHD. In any case, when destroy(ChannelSession channel) is called from Apache MINA SSHD, the SSH_MSG_CHANNEL_CLOSE exchange has been fully done. So unless you know it's no longer needed and why I would recommend leaving line 122 as is.

I'm worried about line 107, which passes this destroy() method as a "closeCallback". Without digging what where why and on which thread calls that closeCallback when, I would suspect that it gets invoked by Karaf while Apache MINA SSHD is still waiting for the client's SSH_MSG_CHANNEL_CLOSE to arrive. I would try passing null as a callback to SessionFactory.create() and leave the rest to the Apache MINA SSHD framework.

MrEasy commented 7 months ago

Thanks for this hint!

I was testing with having the closeCallback null in line 107, but that did not change anything for me.

While looking around however, I came across closing of the session in ShellCommand and was wondering if there was accidentally the field named session closed instead of local variable also named session. Made this changes as a test: code and test works perfectly now - no wrong exit codes anymore. Now I'm not entirely sure this is the issue, maybe I just created a resource leak, but asked JB Onofré to share his thoughts on this.

tomaswolf commented 7 months ago

I guess we can close this issue then? I think it's clear by now that this is a race condition in the way Apache Karaf closes its shell/session, not some bug in Apache MINA SSHD.

MrEasy commented 7 months ago

Agree with that - thank you for your time.

apache / mina-sshd