awestlake87 / pyo3-asyncio


Can `pyo3_asyncio::tokio::future_into_py` return values cause memory leaks? #84

Closed decathorpe closed 1 year ago

decathorpe commented 1 year ago

I'm unable to reproduce this reliably (though I have published my code with which I attempted to do so, and linked it below).

It looks like values returned from futures passed to future_into_py might not always get their refcount decremented correctly, which leads to them never being garbage collected (and hence their memory being "leaked", since it is no longer reachable from either the Rust or the Python side).

My project implements a TCP server in user space with Rust and provides Python bindings for it, and this read method seems to leak memory (apparently because the bytes object yielded by the Future is never garbage collected, even after I no longer hold any references to it on either the Rust side or the Python side):

#[pymethods]
impl TcpStream {
    /// Read up to `n` bytes from the TCP stream.
    ///
    /// If the connection was closed, this returns an empty `bytes` object.
    fn read<'p>(&self, py: Python<'p>, n: u32) -> PyResult<&'p PyAny> {
        let (tx, rx) = oneshot::channel();

        self.event_tx
            .send(TransportCommand::ReadData(self.connection_id, n, tx)) // send request for data to the network stack
            .map_err(event_queue_unavailable)?;

        pyo3_asyncio::tokio::future_into_py(py, async move {
            let data = rx.await.map_err(connection_closed)?; // wait until network stack sends requested data
            let bytes: Py<PyBytes> = Python::with_gil(|py| PyBytes::new(py, &data).into_py(py));
            Ok(bytes)
        })
    }

    // other methods
}

Called from the Python side simply as:

# rw = TcpStream(...)
data = await rw.read(4096)

The bytes object returned by the Future surprisingly has a refcount of 2, as calling sys.getrefcount(data) returns 3 (the code for reproducing this issue, linked below, confirms this). Since this project provides a long-running server that handles TCP connections, I'm able to feed it arbitrary amounts of data over the network, and apparently none of the bytes objects returned by awaiting Futures are ever freed, so it eats up a few gigabytes of memory pretty fast.

When I tried to make a simple project to reproduce this issue, I managed to leak arbitrary amounts of memory, but as soon as the Python asyncio runtime yielded (for example, when I added an await asyncio.sleep(10) statement so I'd have more time to look at memory usage), the garbage collector immediately kicked in and started to clean things up.

Is it possible that pyo3_asyncio keeps hold of an additional reference to the Future's returned value, which might not consistently be decremented - or at least not early enough, or not in all circumstances?

šŸŒ Environment

šŸ’„ Reproducing

I've only managed to partially reproduce this issue here: https://github.com/decathorpe/pyo3-asyncio-memory-leak

The project can be built with maturin and run with python3 test.py. However, for reasons I don't understand, the garbage collector does sporadically kick in at runtime in this simplified project, and the GC definitely runs after asyncio.run() returns - whereas GC never seems to reclaim the bytes objects returned from Futures in my real-world project ...

The affected function seems to be the TcpStream.read method here: https://github.com/decathorpe/mitmproxy_wireguard/blob/main/src/python/tcp_stream.rs#L37-L49

The bytes objects returned by it don't seem to get garbage collected / deallocated, increasing memory usage by X bytes every time X bytes are read from a TCP socket.

A quick back-of-the-napkin calculation shows that after reading ~2 GB of data from the TCP socket (by awaiting the Future returned by the method pasted above), memory usage grows to ~2.6 GB, with it remaining completely stable even after letting the server idle (even if GC should definitely be able to deallocate these 2 GB at this point).

awestlake87 commented 1 year ago

Has this resulted in any swapping or problems at runtime aside from high memory usage? I'm not an expert on GC and I don't know much about Python's GC in particular, but I know some garbage collectors don't necessarily clean up resources eagerly. Python might decide that ~2 GB of memory usage is fine on your server and wait a while longer before taking action. If your computer starts swapping or crashing because of high memory usage, that's a definite indication of a memory leak.

The example you brought up where the bytes object has a refcount of 2 might not necessarily be an issue. The future is spawned on the tokio event loop and it's possible that tokio has not yet cleaned up the task / join handle that contains the result. You might be able to check and see if that refcount goes down after some time, possibly with Python's weakrefs.
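
For reference, one way to watch that count from the Rust side (rather than via weakrefs) is PyO3's Py::get_refcnt. Below is a minimal sketch only, assuming the same read() body and names (rx, connection_closed) as the snippet above, plus the log crate the project already uses:

pyo3_asyncio::tokio::future_into_py(py, async move {
    let data = rx.await.map_err(connection_closed)?;
    let bytes: Py<PyBytes> = Python::with_gil(|py| PyBytes::new(py, &data).into_py(py));

    // refcount as seen from Rust right after creation; the same call can be
    // repeated later, anywhere the GIL is re-acquired, to see if it ever drops
    Python::with_gil(|py| log::debug!("bytes refcount: {}", bytes.get_refcnt(py)));

    Ok(bytes)
})

(On the Python side, sys.getrefcount(data) will always report one more than this, since it holds a temporary reference to its argument.)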

decathorpe commented 1 year ago

I haven't seen any swapping - yet. But I'll try to run it for longer and feed it more data, to see if things eventually get freed once there's enough memory pressure ... Still, the behaviour is kind of unexpected, given that equivalent code using Python's built-in TCP server doesn't show similar RAM usage growth at all.

awestlake87 commented 1 year ago

Ok, it's also possible that Tokio is holding on to some of these references for longer than expected since spawned tasks also have some rules around cleanup. That is something that we can probably mitigate in pyo3-asyncio if we're able to narrow it down to that. It might be worth swapping the runtime out for async-std to see if you see the same behavior.

IIRC there's an async-std <-> tokio compatibility layer that should let you keep exactly the same async code as you have for tokio, in case switching outright would require a lot of changes. It might not be a perfect test that way, though.
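
For illustration, the suggested runtime swap applied to the read() body quoted earlier would look roughly like the sketch below. This assumes pyo3-asyncio is built with its async-std runtime feature enabled and that the oneshot channel being awaited is runtime-agnostic (tokio::sync::oneshot and futures::channel::oneshot both work without a running tokio reactor):

// same body as before, but the future is scheduled via the async-std runtime
pyo3_asyncio::async_std::future_into_py(py, async move {
    let data = rx.await.map_err(connection_closed)?;
    let bytes: Py<PyBytes> = Python::with_gil(|py| PyBytes::new(py, &data).into_py(py));
    Ok(bytes)
})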

Let me know what you find!

decathorpe commented 1 year ago

OK, it looks like there is an actual memory leak. I let the server run for a while, and it gobbled up everything it could of my 32 GB of RAM and 32 GB of swap.

Using weakrefs to debug this won't work, because creating weak references to bytes objects is not supported. And swapping out the runtimes won't work either, because we don't just use the I/O parts of tokio, but also rely heavily on its channels, RwLocks, Notify, etc. ... which aren't covered by the async-std / tokio compat library.

What I did check was keeping the bytes objects around with "normal" (strong) references, so I could see whether their refcount decreased at any point. However, the refcount of the cached objects remained "one too high" for their entire lifetime, including after my server was closed and even after the Python asyncio runtime stopped; they were only deallocated when the Python process itself terminated.

It looks like, in very rare cases, some bytes objects have a refcount of 3 immediately after being returned by awaiting my Rust Future, but those extra references might just be lingering ones kept around by the tokio task, because the refcount of these objects drops back down to 2 almost immediately.

So I'm still not sure where the additional reference to the bytes objects is coming from ...

I mean, it's entirely possible that the problem is in my code, but I can't see how that could happen, since the bytes object is created and immediately returned, and the function that returns it is only ever called from Python - the only thing in between is pyo3_asyncio::tokio::future_into_py šŸ˜•

decathorpe commented 1 year ago

Has this resulted in any swapping or problems at runtime aside from high memory usage? I'm not an expert on GC and I don't know much about Python's GC in particular, but I know some garbage collectors don't necessarily clean up resources eagerly.

It looks like these objects aren't even considered for garbage collection because their refcount isn't zero (Python frees an object as soon as its refcount drops to zero, which never happens here, so nothing is deallocated).

decathorpe commented 1 year ago

Is it possible that not awaiting the return value of an asyncio.run_coroutine_threadsafe call could cause this?

Right now I'm ignoring the values returned by it, because these callbacks return no data (but I might want to catch exceptions, now that I think of it).

However, I'm struggling to somehow await the awaitable (let's call it py_future) that's returned by this call (apparently it's a concurrent.futures.Future, not an asyncio.Future) ... I tried using pyo3_asyncio::into_future_with_locals(&locals, py_future.as_ref(py)), but that fails in Python code with TypeError: An asyncio.Future, a coroutine or an awaitable is required. Am I doing something stupid here?

This is as far as I got ...

let py_future = match Python::with_gil(|py| {
    let stream = stream.into_py(py);
    let coro = self.py_tcp_handler.call1(py, (stream, ))?;
    self.run_coroutine_threadsafe.call1(py, (coro, self.py_loop.as_ref(py)))
}) {
    Ok(f) => f,
    Err(error) => {
        // TODO
        continue;
    }
};

Python::with_gil(|py| {
    let locals = pyo3_asyncio::TaskLocals::new(self.py_loop.as_ref(py));

    // this fails with "TypeError: An asyncio.Future, a coroutine or an awaitable is required"
    let future = pyo3_asyncio::into_future_with_locals(&locals, py_future.as_ref(py)).unwrap();

    tokio::spawn(async {
        if let Err(err) = future.await {
            log::error!("TCP Connection handler callback failed: {}", err);
        }
    });
});

decathorpe commented 1 year ago

Oh, well ... I managed to get this to work by using asyncio.wrap_future, but awaiting the result of that asyncio.run_coroutine_threadsafe call caused memory to leak even faster, instead of making things better. šŸ˜­
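
For context, here is a rough reconstruction of that wrap_future route (not the project's actual code). It reuses py_future and self.py_loop from the earlier snippet, assumes pyo3::types::IntoPyDict is in scope for the kwargs, and passes the loop explicitly because this code runs outside the loop's thread:

if let Err(err) = Python::with_gil(|py| -> PyResult<()> {
    let asyncio = py.import("asyncio")?;

    // wrap the concurrent.futures.Future returned by run_coroutine_threadsafe
    // into an asyncio future bound to the target event loop
    let kwargs = [("loop", self.py_loop.as_ref(py))].into_py_dict(py);
    let wrapped = asyncio.call_method("wrap_future", (py_future.as_ref(py),), Some(kwargs))?;

    // into_future_with_locals accepts it now, since it is a real asyncio future
    let locals = pyo3_asyncio::TaskLocals::new(self.py_loop.as_ref(py));
    let rust_future = pyo3_asyncio::into_future_with_locals(&locals, wrapped)?;

    tokio::spawn(async move {
        if let Err(err) = rust_future.await {
            log::error!("TCP connection handler callback failed: {}", err);
        }
    });

    Ok(())
}) {
    log::error!("failed to schedule the TCP connection handler: {}", err);
}

As described above, awaiting the wrapped future this way did not help with the leak; the sketch only shows how the TypeError can be avoided.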

If you have any insight into what stupid thing I'm doing wrong, please, that would be very much appreciated :(

awestlake87 commented 1 year ago

I should have some time to look at it in a few hrs after work. Not sure how soon I'll be able to respond on it though (working with a broken wrist right now lol). Would you be able to share some more of the project you're working on? I think it'd help me to see the larger context around these snippets.

decathorpe commented 1 year ago

Oh no, get well soon! I'm not in a hurry, I'll just probably need to fix the memory leak eventually ...

The project is at https://github.com/decathorpe/mitmproxy_wireguard

The interesting bits are in the python/task and python/tcp_stream modules.

awestlake87 commented 1 year ago

So on Windows 11 with Python 3.10, I'm not seeing what you're seeing. Your pyo3-asyncio-memory-leak project never went above 200 MB of memory usage and regularly went back down to ~90 MB (I watched it all the way through), so at the very least, it doesn't seem to be leaking because of pyo3-asyncio's design. I think we've got to investigate different platforms/versions to figure out what's going on here.

How much memory does your example end up using on your linux machine?

decathorpe commented 1 year ago

Yeah, as I mentioned, the example project where I tried to reproduce the issue doesn't show the same behaviour ... it starts to allocate memory, but it also drops it again. Peak memory usage was probably ~200 MB, but it goes up and down, as you'd expect from a garbage-collected language.

The only difference I can see is the way in which the function / Future is called:

Not sure how that could lead to an additional, leaked reference to the bytes objects that are returned by TcpStream.read() though (unless the reference is kept around in some "context" when it shouldn't be?)

awestlake87 commented 1 year ago

Oh ok, I thought you meant it only GC'd after asyncio.run() finished. It was collecting the bytes while that coroutine was running, meaning that the refcounts must have reached 0 at some point. I think this means that pyo3-asyncio is handling the references correctly in that situation.

I'll take a closer look at your project tonight and see if anything sticks out.

awestlake87 commented 1 year ago

Just at a glance I don't see anything wrong. There were a few spots I had comments/questions on:

One thing that you might check is whether any coroutines are throwing exceptions. It sounds like asyncio can keep references to tasks that have uncaught exceptions. You can set debug=True on asyncio or use loop.set_exception_handler(). That's a long shot IMO, but it might be worth trying.

Is there any way we can test the tcp_stream piece of it without installing/configuring wireguard? I built and ran your echo server, but I haven't gone further than that.

Have you thought about / tried using a tool like pympler? It's pretty reasonable to think the leak is in Python, but we might be able to confirm that if pympler is tracking leftover bytes objects.

I'm going into surgery tomorrow to get some pins put in my wrist, so it could be awhile before I can come back to this.

decathorpe commented 1 year ago

Thanks for taking a look!

I tried replacing use of run_coroutine_threadsafe with tokio::spawn(pyo3_asyncio::tokio::into_future(coro)), but it didn't change anything.

As for the second point: If I understand correctly, the currently running event loop is associated with a given thread, so you'll have to get it on the "main" thread - otherwise there is "no" currently running loop.

The benchmarks and test scripts all already use asyncio.run(foo, debug=True) and there is no debug output from the Python asyncio runtime ...

I now also tried using gc.set_debug(gc.DEBUG_LEAK), but it didn't print anything. I'll try looking at pympler next.

decathorpe commented 1 year ago

I think we finally figured out the problem. Removing the let binding for the created PyBytes object solved the issue. It appears that the object never gets properly cleaned up once the GIL has been released. After moving things around a bit, the memory leak now seems to be gone:

https://github.com/decathorpe/mitmproxy_wireguard/commit/d589e0a2d82f39716df560802eef319f487d1c0c
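
For comparison, the shape of that change (a sketch of the pattern only; the linked commit may differ in detail) is to create and convert the PyBytes inside a single with_gil call and return it directly, so no intermediate binding is left to be dropped after the GIL has been released:

pyo3_asyncio::tokio::future_into_py(py, async move {
    let data = rx.await.map_err(connection_closed)?;

    // the &PyBytes temporary is created, converted to an owned Py<PyBytes>, and
    // dropped entirely while the GIL is still held; only the owned handle leaves
    // the closure
    Python::with_gil(|py| -> PyResult<Py<PyBytes>> {
        Ok(PyBytes::new(py, &data).into())
    })
})

Whether removing the binding alone is what matters, or the broader reshuffling in the commit, is exactly the question raised in the follow-up comments below.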

decathorpe commented 1 year ago

Thank you again for your help with debugging this issue, even if it now turns out to have been caused by just let-binding an object where it shouldn't have been ... (though I still don't know why dropping that binding at the end of the scope doesn't decrement the refcount ...)

awestlake87 commented 1 year ago

(though I still don't know why dropping that binding at the end of the scope doesn't decrement the refcount ...)

@decathorpe it's surprising to me as well that this change fixed it. Are you sure it's those lines and not some behaviour in the code around it that changed?

If you've narrowed it down to this change, then it might be worth mentioning this issue to the PyO3 maintainers. Maybe there's something I'm misunderstanding about let bindings, but to me the code looks like it should be equivalent (if not, I've probably written some leaky code somewhere as well). It seems like if there's a drop missing in there somewhere it'd be a compiler bug, which is possible but IMO unlikely. My guess would be that there might be some strange interaction happening between the ordering of the drop and the GIL release.

decathorpe commented 1 year ago

My guess would be that there might be some strange interaction happening between the ordering of the drop and the GIL release.

Yes, that's what I assume is happening. I already found similar reports in the PyO3 issue tracker, where memory leaks could be worked around by just moving things around so that object cleanup happens before the GIL is released, and if I remember correctly, this can't be fixed without major changes in PyO3.

awestlake87 commented 1 year ago

Yes, that's what I assume is happening. I already found similar reports in the PyO3 issue tracker, where memory leaks could be worked around by just moving things around so that object cleanup happens before the GIL is released, and if I remember correctly, this can't be fixed without major changes in PyO3.

It's good that they know about it at the very least. I'm alright with churn if it means we get a fix. That's why it's still version 0 after all.

@decathorpe thanks for following up! Always nice to know more about these quirks.