Open harmic opened 3 years ago
Isn't FromRawFd primarily marked unsafe because you could end up with two objects believing that they own the FD, and thus end up closing the underlying FD twice, possibly closing some other resource that reused the FD number in between?
I need to track down the source code, but I assume that your call to TcpStream::shutdown prevents TcpStream's Drop from closing the FD a second time. The problem here is that you're probably still closing the FD twice, because you took the value with if let Some(tcp) = self.tcp.take(), and tcp will go out of scope at the bottom of the loop, be dropped, and close the FD.
Is there any reason you need to manually call TcpStream::shutdown? Isn't the stream already dropped at the bottom of the loop? If you do need to call it manually because Drop doesn't work, you might be able to use mem::forget or mem::ManuallyDrop.
https://doc.rust-lang.org/std/mem/fn.forget.html
Good find though, this definitely seems like it could be an issue.
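To illustrate the ManuallyDrop suggestion above, here is a minimal, Unix-only sketch. The helper name shutdown_borrowed is hypothetical and is not part of ssh2-rs. It wraps the raw FD in a temporary TcpStream only long enough to call shutdown, and ManuallyDrop keeps that temporary from closing the FD when it goes out of scope. It assumes the FD really is a socket; for any other descriptor the shutdown call returns an error.

```rust
use std::io;
use std::mem::ManuallyDrop;
use std::net::{Shutdown, TcpStream};
use std::os::unix::io::{AsRawFd, FromRawFd};

/// Shut down both directions of the socket behind `sock` without closing
/// its file descriptor. `shutdown_borrowed` is a hypothetical helper name.
pub fn shutdown_borrowed<S: AsRawFd + ?Sized>(sock: &S) -> io::Result<()> {
    let fd = sock.as_raw_fd();
    // The temporary TcpStream does not own `fd`. ManuallyDrop suppresses its
    // Drop, so the descriptor is closed only once, by its real owner.
    let borrowed = ManuallyDrop::new(unsafe { TcpStream::from_raw_fd(fd) });
    borrowed.shutdown(Shutdown::Both)
}
```

Because the bound is `AsRawFd + ?Sized`, the same helper accepts a `&dyn AsRawFd`, which is the shape of the tcp field discussed in this issue.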
@harmic Ok, after tracking down where std::net::TcpStream bottoms out, it does indeed simply close the file descriptor when it's dropped.
So, I feel pretty confident that all you need to do is call self.tcp.take() to drop the TcpStream and close the FD.
I also read through the libssh2 C code and confirmed that this issue in ssh2-rs will leak memory if libssh2_session_free returns LIBSSH2_ERROR_EAGAIN and isn't retried. However, if you look at session_free in libssh2, the only non-zero error code it returns is LIBSSH2_ERROR_EAGAIN. So, I think this is only an issue for non-blocking implementations. Can you let me know if you agree?
Since ssh2-rs does not provide a non-blocking implementation, I think we will need to expose a wrapper around libssh2_session_free that non-blocking clients can use to properly free the session. The Drop implementation can check to see if the free method was ever called (we can store a boolean) and skip calling free if so. If free hasn't been manually called, we can call it and perform a "dirty" shutdown if it returns LIBSSH2_ERROR_EAGAIN. Of course, once the free method is manually called no other method calls on Session can succeed, so I suppose every method will need to be guarded. The best way to handle this is probably to wrap SessionInner in an Option:
    #[derive(Clone)]
    pub struct Session {
        inner: Arc<Mutex<Option<SessionInner>>>,
    }
I'll work up a PR implementing this idea unless you already have something started.
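As a rough sketch of how the Option-wrapped design above might behave, the following uses a mock SessionInner with no real libssh2 handle. The names SessionError, try_free, new_mock, guarded_call and the dirty_shutdowns counter are all hypothetical stand-ins, and try_free only models libssh2_session_free returning LIBSSH2_ERROR_EAGAIN a fixed number of times. It is not the ssh2-rs implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

#[derive(Debug, PartialEq)]
pub enum SessionError {
    /// Models LIBSSH2_ERROR_EAGAIN: call free again once the socket is ready.
    WouldBlock,
    /// The session was already freed explicitly.
    Freed,
}

/// Mock of SessionInner (hypothetical fields, no real libssh2 handle).
struct SessionInner {
    eagain_left: u32,                  // times the mock free reports EAGAIN
    freed: bool,                       // the boolean mentioned above
    dirty_shutdowns: Arc<AtomicUsize>, // counts fallback cleanups in Drop
}

impl SessionInner {
    /// Stand-in for a wrapper around libssh2_session_free.
    fn try_free(&mut self) -> Result<(), SessionError> {
        if self.eagain_left > 0 {
            self.eagain_left -= 1;
            return Err(SessionError::WouldBlock);
        }
        self.freed = true;
        Ok(())
    }
}

impl Drop for SessionInner {
    fn drop(&mut self) {
        // Skip if free() already succeeded; otherwise try once and fall back
        // to a "dirty" shutdown if the free would block.
        if !self.freed && self.try_free().is_err() {
            self.dirty_shutdowns.fetch_add(1, Ordering::SeqCst);
        }
    }
}

#[derive(Clone)]
pub struct Session {
    inner: Arc<Mutex<Option<SessionInner>>>,
}

impl Session {
    pub fn new_mock(eagain_left: u32, dirty_shutdowns: Arc<AtomicUsize>) -> Self {
        let inner = SessionInner { eagain_left, freed: false, dirty_shutdowns };
        Session { inner: Arc::new(Mutex::new(Some(inner))) }
    }

    /// Example of a guarded method: fails once the session has been freed.
    pub fn guarded_call(&self) -> Result<(), SessionError> {
        let guard = self.inner.lock().unwrap();
        if guard.is_some() { Ok(()) } else { Err(SessionError::Freed) }
    }

    /// Explicit free for non-blocking callers; safe to retry on WouldBlock.
    pub fn free(&self) -> Result<(), SessionError> {
        let mut guard = self.inner.lock().unwrap();
        if let Some(inner) = guard.as_mut() {
            inner.try_free()?; // on WouldBlock the session stays in place
        }
        *guard = None; // drops SessionInner; its Drop sees freed == true
        Ok(())
    }
}
```

A caller would retry free() whenever it returns WouldBlock, and only the Drop fallback ever takes the dirty path.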
Thanks for taking a look at this!
> So, I think this is only an issue for non-blocking implementations. Can you let me know if you agree?
Actually no - in blocking mode, if you set a timeout, then LIBSSH2_ERROR_TIMEOUT can also be returned. The path which returns this is the code expanded by the macro BLOCK_ADJUST, which in turn calls _libssh2_wait_socket - which can return the timeout error.
My strong recommendation is to always set a timeout in blocking mode, because if you don't, some kinds of network problems will cause it to hang indefinitely.
> Is there any reason you need to manually call TcpStream::shutdown?
My recollection is that just closing the FD without calling shutdown did not solve the problem. Unfortunately I did not take good enough notes of everything I tried so I can't swear to that. I will test it again and report back. I must admit that it does not really make sense - you would think closing the FD would be enough.
I agree more needs to be done to facilitate async wrappers. I had the idea to provide an into_inner() method that consumes the session and returns the underlying libssh2 handle such that the async wrapper could then deal with de-allocating it. Your idea is probably less error prone though.
I have not started on any PR yet.
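For comparison, the into_inner() idea above could look roughly like the sketch below. OwnedSession, its raw field (a plain integer standing in for the libssh2 session pointer) and the FREE_CALLS counter are hypothetical and exist only to show the mechanics: into_inner wraps self in ManuallyDrop so that Drop, and with it the free call, never runs for a session whose handle has been handed to the caller.

```rust
use std::mem::ManuallyDrop;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts calls to the mock free routine (illustration only).
pub static FREE_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical session type; `raw` stands in for the libssh2 handle.
pub struct OwnedSession {
    raw: usize,
}

impl OwnedSession {
    pub fn new(raw: usize) -> Self {
        OwnedSession { raw }
    }

    /// Consume the session and return the raw handle without freeing it.
    /// The caller becomes responsible for de-allocating the handle.
    pub fn into_inner(self) -> usize {
        let this = ManuallyDrop::new(self); // Drop will not run for `this`
        this.raw
    }
}

impl Drop for OwnedSession {
    fn drop(&mut self) {
        // A real implementation would call libssh2_session_free here.
        FREE_CALLS.fetch_add(1, Ordering::SeqCst);
    }
}
```

The trade-off noted above still applies: once the handle leaves the wrapper, nothing stops the caller from leaking it or freeing it twice.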
> Actually no - in blocking mode, if you set a timeout, then LIBSSH2_ERROR_TIMEOUT can also be returned. The path which returns this is the code expanded by macro BLOCK_ADJUST which in turn calls _libssh2_wait_socket - which can return the timeout error.
Ah, I see it now... I missed that the first time I read through the code.
On Unix, TcpStream::shutdown calls shutdown(2). The main difference between calling shutdown(fd, SHUT_RDWR) and just closing the file descriptor is that shutdown will close the socket channels even if there are other file descriptors referring to the socket. I can't imagine that you duped the FD or forked though, so I think calling shutdown should be equivalent to closing the FD.
If you're able to reproduce an issue with closing the FD and not calling shutdown that would be a very interesting data point. Let me know if you get a chance to try it out :)
Thinking about it some more: if you close the FD, then it is available to be re-used straight away. Meanwhile libssh2 is still using it for the session it is trying to shut down.
In my application I am making connections to many many other machines, I can imagine that the FD could be re-used straight away.
Calling shutdown leaves that FD still open, but returning errors to libssh2 whenever it tries to read or write to it.
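The FD reuse described above is easy to observe on Unix, where open returns the lowest unused descriptor number. A small sketch follows; the function name open_close_open is made up for illustration, and /dev/null is used only because it is always available to open.

```rust
use std::fs::File;
use std::io;
use std::os::unix::io::{AsRawFd, RawFd};

/// Open a file, close it, open another, and return both descriptor numbers.
pub fn open_close_open() -> io::Result<(RawFd, RawFd)> {
    let first = File::open("/dev/null")?;
    let first_fd = first.as_raw_fd();
    drop(first); // closes the descriptor, making its number available again
    let second = File::open("/dev/null")?;
    Ok((first_fd, second.as_raw_fd()))
}
```

When nothing else in the process opens a descriptor in between, both numbers are the same, which is how a library still holding the old number can end up talking to an unrelated connection.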
Ah, yes - that could definitely cause issues. This is really quite tricky since libssh2 combines freeing memory with IO to close out the session cleanly.
I was trying to find other wrapper libs to see how they handle this. The ParallelSSH Python lib (which wraps libssh2) calls the disconnect method below when the Python object holding the session is deleted:
    def disconnect(self):
        """Attempt to disconnect session.

        Any errors on calling disconnect are suppressed by this function.
        """
        self._keepalive_greenlet = None
        if self.session is not None:
            try:
                self._disconnect_eagain()
            except Exception:
                pass
            self.session = None
        self.sock = None
        if isinstance(self._proxy_client, SSHClient):
            self._proxy_client.disconnect()
Later, when the actual libssh2 session object is freed, it calls session_free.
It looks like they're closing the file descriptor, not shutting it down and keeping it open until after the session_free call. Nonetheless, the idea is very similar to your original proposal.
Interestingly, even some of the official libssh2 examples don't seem to do anything special to handle libssh2_session_free returning LIBSSH2_ERROR_TIMEOUT.
I think I'll reach out over the libssh2 mailing list to see if any of the maintainers have official advice on how to handle this and report back.
Looking at the libssh2 example you linked, they call libssh2_session_disconnect first, then after that no longer returns EAGAIN they call libssh2_session_free. In that scenario libssh2_session_free probably won't return EAGAIN (based on a quick scan of the libssh2 source).
I guess that is another possible approach for asynchronous shutdown: ssh2-rs does expose disconnect, so an async user could call that first and be reasonably assured that drop won't block. The downside is that it makes it quite brittle to changes in libssh2, as it relies on the assumption that libssh2_session_free won't block.
And you still have the problem that if the link has gone, you need to time out, and then shut down the underlying socket, in order to get libssh2 to complete the disconnection.
I looked at the source for the Perl libssh2 wrapper (which I have used for years) - it also does not handle this correctly :)
Hmm, maybe I am misreading the source, but why do you think libssh2_session_free probably won't return EAGAIN if libssh2_session_disconnect succeeds first? It seems like all disconnect does is send a disconnect message to the server - it doesn't actually send any messages to close open channels, so free will still send those and potentially block.
There's a bit of chatter on the mailing list and GitHub that seems to suggest libssh2_session_free returning EAGAIN is a common source of memory leaks. A few examples:
There are also two PRs that attempted to add a new method - libssh2_set_socket_disconnected - to "tell the library that no further interaction with the remote server should be attempted." Both seem to have stalled out due to lack of bandwidth to update and review. Perhaps it's worth trying to push this forward in libssh2.
Hmmm. You might be right. I was going by this statement in the SSH spec in relation to SSH_MSG_DISCONNECT:
The sender MUST NOT send or receive any data after this message, and the recipient MUST NOT accept any data after receiving this message.
But it looks like you are right - libssh2 sends SSH_MSG_DISCONNECT but does not seem to do anything to prevent further interaction after that.
Based on that, my guess is that libssh2 expects the user to close all open channels before calling disconnect and then free. If no data is supposed to be sent after a disconnect, it probably is appropriate to shut down the socket between disconnect and free as well, in case there are any lingering channels that haven't been closed yet (since freeing the session would perform IO for the channels).
I quickly read through session_free in libssh2 and I don't see any way that it could return EAGAIN if all channels (including listeners) have already been closed - there doesn't seem to be any other IO baked in.
I'm using ssh2-rs in an async project (via async-ssh2), and I have stumbled across a problem with the way sessions are freed.
The Drop impl for SessionInner just calls libssh2_session_free and disregards the return value. This is a problem in both blocking and nonblocking modes:
- In nonblocking mode, libssh2_session_free is likely to return LIBSSH2_ERROR_EAGAIN and need to be called again when the socket is ready.
- In blocking mode, it can return LIBSSH2_ERROR_TIMEOUT (and in fact if the caller has not set a timeout, it may hang indefinitely if there is a network problem).
In both cases, if you don't repeatedly call it until you get a 0 returned, then the resources are not released.
In my experience so far, in case of network disturbance it can be pretty difficult to get libssh2 to properly release the session. The only thing that has worked for me so far is this:
So, if there is a timeout when calling libssh2_session_free, we shut down the socket and try again. The next time around libssh2 realizes the socket is shut down and finishes the cleanup.
The biggest problem with this is that since tcp is Box<dyn AsRawFd>, we can't call shutdown on it directly; all we can do is get the FD from it and then create a new TcpStream around it, which is not sound, because then there are two TcpStreams with the same FD. In my testing this has not caused an issue yet, but the documentation explicitly states not to do that. Also, theoretically the FD might be associated with some other stream type that implements AsRawFd (e.g. UnixStream).
The above is also not good from an async point of view, because the drop fn blocks the thread for potentially a long time. To make it work better in an async context some more work could be required, but that probably should be for another issue.
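The retry-then-shut-down flow described above can be sketched independently of any socket type. This is not the code from this issue: free_until_done and its closure parameters are hypothetical, and the two constants are the values I believe libssh2.h defines for these error codes. The free call is retried until it returns 0; on EAGAIN the caller waits for the socket, and on a timeout the caller shuts the socket down so the next attempt can finish.

```rust
// Error codes as I understand them from libssh2.h (treat as an assumption).
pub const LIBSSH2_ERROR_TIMEOUT: i32 = -9;
pub const LIBSSH2_ERROR_EAGAIN: i32 = -37;

/// Call `free` until it returns 0 or `max_attempts` is reached.
/// Returns the number of attempts on success, or the last error code.
/// `free` is always called at least once.
pub fn free_until_done(
    mut free: impl FnMut() -> i32,   // stand-in for libssh2_session_free
    mut wait_ready: impl FnMut(),    // e.g. wait for socket readiness
    mut shutdown_socket: impl FnMut(), // e.g. shut down the TCP stream
    max_attempts: usize,
) -> Result<usize, i32> {
    let mut attempts = 0;
    loop {
        attempts += 1;
        match free() {
            0 => return Ok(attempts),
            LIBSSH2_ERROR_EAGAIN if attempts < max_attempts => wait_ready(),
            LIBSSH2_ERROR_TIMEOUT if attempts < max_attempts => shutdown_socket(),
            other => return Err(other), // unknown error or out of attempts
        }
    }
}
```

Bounding the attempts keeps a Drop implementation from spinning forever, at the cost of possibly leaking the session when the limit is hit.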
Finally - whatever approach is used here needs to be replicated for Windows also.
I'm happy to make a PR, if someone can suggest how to avoid the unsoundness described above.