Closed MrEasy closed 7 months ago
Can you provide a full server-side log at DEBUG or TRACE level including timestamps and everything from the moment the server sent the EOF? (I.e., of the EOF message, exit code message, and everything after that.)
From just these log excerpts it's not possible to even guess what might be going on.
Can you provide a full server-side log at DEBUG or TRACE level (...)
Here comes the problem with that: As soon as debug logging is active, the problem is no longer reproducible (at least not within many tries). So it seems to be a race condition. Will try to reproduce it and then provide such a log.
For what it's worth, here is a log without the problem arising (exit code 0) with the Closing gracefully:
sshdserver-exit-code-0.log
(...) Will try to reproduce it and then provide such a log. (...)
Attaching a log with the problem occurring. To note again: it is much less likely to occur with debug logging active.
Doing a diff between the problematic and successful ones shows the following (left=exit-code-0, right=exit-code-1). The order of the 2 unregisterChannel and the 2 close calls differs.
From the original report it already appears that the server closes the network connection before the final SSH_MSG_CHANNEL_CLOSE handshake. However, I'm missing a number of other expected log lines in these logs. Are these filtered somehow? (Possibly through the logging config, which might set DEBUG or TRACE only for certain paths?) If so, please provide unfiltered logs.
In both logs (successful or not) it is in any case very strange to see a session closing immediately after having sent the SSH_MSG_CHANNEL_CLOSE message. That should not happen. It looks as if the session gets shut down before the channel has been fully closed. Did you re-configure CoreModuleProperties.CHANNEL_CLOSE_TIMEOUT?
Struggled a bit with the logger config, so the log files contain only org.apache.sshd.server.session so far.
Will try to include the rest as well.
Did you re-configure CoreModuleProperties.CHANNEL_CLOSE_TIMEOUT?
No, CHANNEL_CLOSE_TIMEOUT is not set and should default to 5 sec.
Let it run in a loop until occurrence - with trace logs activated for all sshd packages, it took 674 iterations to hit it.
Attaching log, including last successful commands and finishing with the problematic ones.
There are 2 exceptions showing up at the end, notably org.apache.sshd.common.SshException: Write attempt on closing session: SSH_MSG_CHANNEL_CLOSE
Test-script (just for info): plink-test.txt
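The attached script is not reproduced here. As a rough sketch of the "run in a loop until occurrence" approach, assuming the client command line is passed in as program arguments, something like the following could be used. The class and method names are made up for illustration, and the iteration cap of 10,000 is arbitrary.

```java
import java.io.IOException;
import java.util.function.IntSupplier;

final class RepeatUntilFailure {

    /**
     * Runs the attempt repeatedly and returns the 1-based iteration at which
     * it first returned a non-zero exit code, or -1 if every attempt returned 0.
     */
    public static int firstFailure(IntSupplier attempt, int maxIterations) {
        for (int i = 1; i <= maxIterations; i++) {
            if (attempt.getAsInt() != 0) {
                return i;
            }
        }
        return -1;
    }

    /** Starts an external command, waits for it to finish, and returns its exit code. */
    public static int runCommand(String... command) {
        try {
            Process process = new ProcessBuilder(command).inheritIO().start();
            return process.waitFor();
        } catch (IOException e) {
            throw new IllegalStateException("Could not start command", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for command", e);
        }
    }

    public static void main(String[] args) {
        // args is the complete client command line (for example the plink call).
        if (args.length == 0) {
            System.err.println("usage: RepeatUntilFailure <command> [args...]");
            return;
        }
        int iteration = firstFailure(() -> runCommand(args), 10_000);
        System.out.println(iteration < 0
                ? "no failure seen"
                : "first non-zero exit code at iteration " + iteration);
    }
}
```

Keeping the loop separate from the process launch makes it possible to exercise the loop with a fake attempt before pointing it at a real server.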
Thank you. That is useful and gives me a starting point to analyze this deeper. The log confirms my initial suspicion: it looks as if closing the channel by mistake also closes the session. Plus there's a race between this closing of the session (which is too early) and sending the CLOSE message for the channel. If the session closes earlier, the client doesn't get the CLOSE message and complains, plus we get that "write on closing session" exception on the server. If sending the CLOSE is earlier, the client might be happy, especially if it manages to send back its own CLOSE before the network connection goes down.
I'll try to come up with some unit test, and I'll have to dig deeper to figure out why the session is closed at all at that point. It should not be closed yet.
Not sure the race and bug are new since 2.10.0; but it's possible that changes made since then make an old bug surface now more frequently.
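To make the two outcomes of that race concrete, here is a toy model. It is not MINA SSHD code: ToySession is a hypothetical stand-in, and the two interleavings are replayed sequentially instead of using real threads. Only the exception text is borrowed from the log above.

```java
import java.util.ArrayList;
import java.util.List;

final class CloseRaceSketch {

    /** Toy stand-in for an SSH session; NOT a MINA SSHD class. */
    static final class ToySession {
        private boolean closing;
        final List<String> sentToClient = new ArrayList<>();

        /** Records the message as sent, unless the session is already closing. */
        void write(String message) {
            if (closing) {
                throw new IllegalStateException("Write attempt on closing session: " + message);
            }
            sentToClient.add(message);
        }

        void closeImmediately() {
            closing = true;
        }
    }

    /** Replays the two steps in the given order and reports what would be observed. */
    public static String run(boolean sessionClosesFirst) {
        ToySession session = new ToySession();
        Runnable sendChannelClose = () -> session.write("SSH_MSG_CHANNEL_CLOSE");
        Runnable closeSession = session::closeImmediately;
        try {
            if (sessionClosesFirst) {
                closeSession.run();      // the premature session close wins the race
                sendChannelClose.run();  // the channel CLOSE can no longer be written
            } else {
                sendChannelClose.run();  // the channel CLOSE goes out first
                closeSession.run();
            }
        } catch (IllegalStateException e) {
            return "server error: " + e.getMessage();
        }
        return "client received: " + session.sentToClient;
    }

    public static void main(String[] args) {
        System.out.println(run(false));
        System.out.println(run(true));
    }
}
```

In the first ordering the client at least sees the CLOSE message. In the second it never does, which matches the "unexpectedly closed" symptom together with the server-side exception.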
Thank you!
Not knowing the code in detail, but maybe this change in 2.11.0 somehow could be related (just a wild guess based on changes there) https://github.com/apache/mina-sshd/issues/410
Can you share a minimal server implementation with which you can reproduce this problem?
Hi, I created a simple example server and can confirm that I am also unable to reproduce the issue with it.
The SSH server which shows the issue is the one that Apache Karaf starts up for its client shell. As you can see, I followed their way of starting it up (code). The difference between my simple test and the Karaf-integrated SSH server lies in the implementations for authentication etc., so the issue could also lie in the combination with those. Will look into that a bit more.
We may be getting somewhere now... I find the ShellFactoryImpl interesting.
This destroy() method kills the SSH session. (Immediate close; which is what we see in your logs.) I suspect this is the culprit: if this closing of the session occurs before the client has sent back its SSH_MSG_CHANNEL_CLOSE reply, then putty won't be happy.
Now destroy() is normally being called when the ChannelSession has been closed (the SSH_MSG_CHANNEL_CLOSE request/reply dance having been done). But I also see it being passed to the sessionFactory in line 106/107. I don't know when that runs. If that closes the SSH session before the SSH_MSG_CHANNEL_CLOSE exchange has completed, then putty will complain.
So let me modify my point (2) from above: it's not the client that unexpectedly closes the session, it's the server itself via this destroy().
Ah, that reminds me of this change https://github.com/apache/karaf/pull/1427 which was added after the upgrade to sshd 2.5.1. Maybe that should no longer be done in combination with the recent versions.
Line 122 itself is not a problem. I don't know why it was added or whether it is still necessary; I don't remember any bug report about that Karaf issue to Apache MINA SSHD. In any case, when destroy(ChannelSession channel) is called from Apache MINA SSHD, the SSH_MSG_CHANNEL_CLOSE exchange has been fully done. So unless you know it's no longer needed and why, I would recommend leaving line 122 as is.
I'm worried about line 107, which passes this destroy() method as a "closeCallback". Without digging into what calls that closeCallback, where, why, when, and on which thread, I would suspect that it gets invoked by Karaf while Apache MINA SSHD is still waiting for the client's SSH_MSG_CHANNEL_CLOSE to arrive. I would try passing null as a callback to SessionFactory.create() and leave the rest to the Apache MINA SSHD framework.
Thanks for this hint!
I tested with the closeCallback set to null in line 107, but that did not change anything for me.
While looking around, however, I came across the closing of the session in ShellCommand and was wondering whether the field named session was accidentally closed instead of the local variable also named session. Made this change as a test: code and test works perfectly now - no wrong exit codes anymore.
Now I'm not entirely sure this is the issue, maybe I just created a resource leak, but asked JB Onofré to share his thoughts on this.
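For readers unfamiliar with how such a mix-up can happen in Java, here is a purely illustrative sketch. It is not Karaf's ShellCommand code; all class and method names are hypothetical. A local variable declared inside a try block is out of scope in the finally block, so the same name there silently resolves to the field.

```java
final class ShadowingSketch {

    /** Toy closeable resource; hypothetical, not a Karaf or MINA SSHD type. */
    static final class ToySession implements AutoCloseable {
        final String name;
        boolean closed;

        ToySession(String name) {
            this.name = name;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    static final class ToyCommand {
        private final ToySession session; // field: the long-lived outer session
        ToySession lastLocal;             // kept only so the demo can inspect the local session

        ToyCommand(ToySession outer) {
            this.session = outer;
        }

        /** Buggy variant: the finally block closes the field, not the local. */
        void runBuggy() {
            try {
                ToySession session = new ToySession("local"); // shadows the field inside try
                lastLocal = session;
            } finally {
                session.close(); // local is out of scope here, so this is this.session
            }
        }

        /** Fixed variant: the local is declared before try, so finally closes the local. */
        void runFixed() {
            ToySession session = new ToySession("local");
            lastLocal = session;
            try {
                // ... work with the local session ...
            } finally {
                session.close();
            }
        }
    }

    public static void main(String[] args) {
        ToySession outer = new ToySession("outer");
        ToyCommand command = new ToyCommand(outer);
        command.runBuggy();
        System.out.println("buggy: outer closed=" + outer.closed
                + ", local closed=" + command.lastLocal.closed);
    }
}
```

The buggy variant compiles without warnings, which is why this kind of mistake is easy to miss in review. It also leaves the local resource open, so a fix should make sure the local one is still closed.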
I guess we can close this issue then? I think it's clear by now that this is a race condition in the way Apache Karaf closes its shell/session, not some bug in Apache MINA SSHD.
Agree with that - thank you for your time.
Version
2.11.0, 2.12.0, 2.12.1
Bug description
Hi team,
After a recent update we encounter a change in behavior with regard to the disconnect procedure.
Last version with expected result: 2.10.0 Versions with unexpected results: 2.11.0, 2.12.0, 2.12.1 (latest release)
Starting with 2.11.0 the following issue can be seen: When connecting to the SSH server and issuing a command, in about 50% of all cases the disconnect is not graceful. The client first reports a normal exit with code 0, but this is followed by another "unexpectedly closed" result, leading to an exit status of 1. This is problematic in our case, since we rely on the exit code to decide whether the command was issued successfully.
Environment: Windows (meh) SSH-client: PuTTY/plink, tested versions 0.70, 0.80 (latest release)
The following example plink call shows the issue in about 50% of all executions (alternative with private-key auth as well):
Full output of -sshrawlog attached: sshraw.log
After setting the sshd logs to TRACE we could see this:
It looks as if it has already closed the connection but then tries to do something with it again. This is also included in the log output below.
If we execute the command long enough to get a successful response (only exit code 0, without it being followed by an exit code 1 immediately), we see this in the debug logs:
Actual behavior
In about 50% of executions, the disconnect fails to execute gracefully and exit code 1 is the result.
Expected behavior
Graceful disconnect with exit code 0.
Relevant log output
Other information
No response