Leaking LIBSSH2_SESSION objects

alexcrichton / ssh2-rs

Rust bindings for libssh2

https://docs.rs/ssh2

Apache License 2.0

486 stars 147 forks source link

Leaking LIBSSH2_SESSION objects #220

Open harmic opened 3 years ago

harmic commented 3 years ago

I'm using ssh2-rs in an async project (via async-ssh2), and I have stumbled across a problem with the way sessions are freed.

The Drop impl for SessionInner just calls libssh2_session_free and disregards the return value. This is a problem in both blocking and nonblocking modes:

In non-blocking mode, libssh2_session_free is likely to return LIBSSH2_ERROR_EAGAIN and need to be called again when the socket is ready
In blocking mode, if the caller has set a timeout then it may return LIBSSH2_ERROR_TIMEOUT (and in fact if the caller has not set a timeout, it may hang indefinitely if there is a network problem).

In both cases, if you don't repeatedly call it until you get a 0 returned, then the resources are not released.

In my experience so far, in case of network disturbance it can be pretty difficult to get libsshb2 to properly release the session. The only thing that has worked for me so far is this:

impl Drop for SessionInner {
    fn drop(&mut self) {
        self.set_blocking(true);
        loop {
            let rc = unsafe { raw::libssh2_session_free(self.raw) };
            if rc == 0 {
                break;
            } else {
                // Timeout - drop the transport,
                // which will hopefully cause the socket to close so we can try again
                if let Some(tcp) = self.tcp.take() {
                    let tcp = unsafe { std::net::TcpStream::from_raw_fd(tcp.as_raw_fd()) };
                    let _ = tcp.shutdown(std::net::Shutdown::Both);
                }
            }
        }
    }
}

So, if there is a timeout when calling libssh2_session_free, we shut down the socket and try again. The next time around libssh2 realizes the socket is shut down and finishes the cleanup.

This biggest problem with this is that since tcp is Box<dyn AsRawFd>, we can't call shutdown on it directly, all we can do is get the FD from it, and then create a new TcpStream around it - which is not sound, because then there are two TcpStreams with the same FD. In my testing this has not caused an issue yet, but the documentation explicitly states not to do that. Also, theoretically the the FD might be associated with some other stream type that implements AsRawFd, (eg. UnixStream).

The above is also not good from an async point of view, because the drop fn blocks the thread for potentially a long time. To make it work better in an async context some more work could be required, but that probably should be for another issue.

Finally - whatever approach is used here needs to be replicated for Windows also.

I'm happy to make a PR, if someone can suggest how to avoid the unsoundness described above.

emesterhazy commented 3 years ago

Isn't FromRawFd is primarily marked unsafe because you could end up with two objects believing that they own the FD, and thus end up closing the underlying FD twice, possibly closing some other resource that re-used the FD number in between?

I need to track down the source code, but I assume that your call to TcpStream::shutdown prevents TcpStream's Drop from closing the FD a second time. The problem here is that you're probably still closing the FD twice because you took the value with if let Some(tcp) = self.tcp.take(), and tcp will go out of scope at the bottom of the loop, be dropped, and close the FD.

Is there any reason you need to manually call TcpStream::shutdown? Isn't the stream already dropped at the bottom of the loop? If you do need to call it manually because Drop doesn't work, you might be able to use mem::forget or mem::ManuallyDrop.

https://doc.rust-lang.org/std/mem/fn.forget.html

Good find though, this definitely seems like it could be an issue.

emesterhazy commented 3 years ago

@harmic Ok, after tracking down where std::net::TcpStream bottoms out, it does indeed simply close the file descriptor when it's dropped.

So, I feel pretty confident that all you need to do is call self.tcp.take() to drop the TcpStream and close the FD.

I also read through the libssh2 C code and confirmed that this issue in ssh2-rs will leak memory if libssh2_session_free returns LIBSSH2_ERROR_EAGAIN and isn't retried. However, if you look at session_free in libssh2, the only non-zero error code it returns is LIBSSH2_ERROR_EAGAIN. So, I think this is only an issue for non-blocking implementations. Can you let me know if you agree?

Since ssh2-rs does not provide a non-blocking implementation, I think we will need to expose a wrapper around libssh2_session_free that non-blocking clients can use to properly free the session. The Drop implementation can check to see if the free method was ever called (we can store a boolean) and skip calling free if so. If free hasn't been manually called we can call it and perform a "dirty" shutdown if it returns LIBSSH2_ERROR_EAGAIN. Of course, once the free method is manually called no other method calls on Session can succeed, so I suppose every method will need to be guarded. The best way to handle this is probably to wrap SessionInner in an Option:

#[derive(Clone)]
pub struct Session {
    inner: Arc<Mutex<Option<SessionInner>>>,
}

I'll work up a PR implementing this idea unless you already have something started.

harmic commented 3 years ago

Thanks for taking a look at this!

So, I think this is only an issue for non-blocking implementations. Can you let me know if you agree?

Actually no - in blocking mode, if you set a timeout, then LIBSSH2_ERROR_TIMEOUT can also be returned. The path which returns this is the code expanded by macro BLOCK_ADJUST which in turn calls _libssh2_wait_socket - which can return the timeout error.

My strong recommendation is to always set a timeout in blocking mode, because if you don't, some kinds of network problems will cause it to hang indefinitely.

Is there any reason you need to manually call TcpStream::shutdown?

My recollection is that just closing the FD without calling shutdown did not solve the problem. Unfortunately I did not take good enough notes of everything I tried so I can't swear to that. I will test it again and report back. I must admit that it does not really make sense - you would think closing the FD would be enough.

I agree more needs to be done to facilitate async wrappers. I had the idea to provide an into_inner() method that consumes the session and returns the underlying libssh2 handle such that the async wrapper could then deal with de-allocating it. Your idea is probably less error prone though.

I have not started on any PR yet.

emesterhazy commented 3 years ago

Actually no - in blocking mode, if you set a timeout, then LIBSSH2_ERROR_TIMEOUT can also be returned. The path which returns this is the code expanded by macro BLOCK_ADJUST which in turn calls _libssh2_wait_socket - which can return the timeout error.

Ah, I see it now... I missed that the first time I read through the code.

On Unix TcpStream::shutdown calls shutdown(2). The main difference between calling shutdown(fd, SHUT_RDWR) and just closing the file descriptor is that shutdown will close the socket channels even if there are other file descriptors referring to the socket. I can't imagine that you duped the FD or forked though, so I think calling shutdown should be equivalent to closing the FD.

If you're able to reproduce an issue with closing the FD and not calling shutdown that would be a very interesting data point. Let me know if you get a chance to try it out :)

harmic commented 3 years ago

Thinking about it some more: if you close the FD, then it is available to be re-used straight away. Meanwhile libssh2 is still using it for the session it is trying to shut down.

In my application I am making connections to many many other machines, I can imagine that the FD could be re-used straight away.

Calling shutdown leaves that FD still open, but returning errors to libssh2 whenever it tries to read or write to it.

emesterhazy commented 3 years ago

Ah, yes - that could definitely cause issues. This is really quite tricky since libssh2 combines freeing memory with IO to close out the session cleanly.

I was trying to find other wrapper libs to see how they handle this. The ParallelSSH python lib (which wraps libssh2) calls the disconnect method below when the python object holding the session is deleted:

    def disconnect(self):
        """Attempt to disconnect session.

        Any errors on calling disconnect are suppressed by this function.
        """
        self._keepalive_greenlet = None
        if self.session is not None:
            try:
                self._disconnect_eagain()
            except Exception:
                pass
            self.session = None
        self.sock = None
        if isinstance(self._proxy_client, SSHClient):
            self._proxy_client.disconnect()

Later when the actual libssh2 session object is freed it calls session_free.

It looks like they're closing the file descriptor, not shutting it down and keeping it open until after the session_free call. Nonetheless, the idea is very similar to your original proposal.

Interestingly, even some of the official libssh2 examples don't seem to do anything special to handle libssh2_session_free returning LIBSSH2_ERROR_TIMEOUT.

I think I'll reach out over the libssh2 mailing list to see if any of the maintainers have official advice on how to handle this and report back.

harmic commented 3 years ago

Looking at the libssh2 example you linked, they call libssh2_session_disconnect first, then after that no longer returns EAGAIN they call libssh2_session_free. In that scenario libssh2_session_free probably won't return EAGAIN (based on a quick scan of the libssh2 source).

I guess that is another possible approach for asynchronous shutdown: ssh2-rs does expose disconnect so an async user could call that first and be reasonably assured that drop won't block. The down side is that it makes the it quite brittle to changes in libssh2 as it is an assumption that libssh2_session_free won't block.

And you still have the problem that if the link has gone, you need to time out, and then shut down the underlying socket, in order to get libssh2 to complete the disconnection.

I looked at the source for the perl libssh2 wrapper (which I have used for years) - it also does not handle this correctly :)

emesterhazy commented 3 years ago

Hmm, maybe I am misreading the source, but why do you think libssh2_session_free probably won't return EAGAIN if libssh2_session_disconnect succeeds first? It seems like all disconnect does is send a disconnect message to the server - it doesn't actually send any messages to close open channels, so free will still send those and potentially block.

There's a bit of chatter on the mailing list and github that seems to suggest libssh2_session_free returning EAGAIN is a common source of memory leaks. A few examples:

There are also two PRs that attempted to add a new method - libssh2_set_socket_disconnected to "tell the library that no further interaction with the remote server should be attempted." Both seem to have stalled out due to lack of bandwidth to update and review. Perhaps it's worth trying to push this forward in libssh.

harmic commented 3 years ago

Hmmm. You might be right. I was going by this statement in the ssh spec in relation to SSH_MSG_DISCONNECT:

The sender MUST NOT send or receive any data after this message, and the recipient MUST NOT accept any data after receiving this message.

But it looks like you are right - libssh2 sends SSH_MSG_DISCONNECT but does not seem to do anything to prevent further interaction after that.

emesterhazy commented 3 years ago

Based on that my guess is that libssh2 expects the user to close all open channels before calling disconnect and then free. If no data is supposed to be sent after a disconnect, it probably is appropriate to shut down the socket between disconnect and free as well in case there are any lingering channels that haven't been closed yet (since freeing the session would perform IO for the channels).

I quickly read through session_free in libssh2 and I don't see any way that it could return EAGAIN if all channels (including listeners) have already been closed - there doesn't seem to be any other IO baked in.