DataDog / glommio

Glommio is a thread-per-core crate that makes writing highly parallel asynchronous applications in a thread-per-core architecture easier for rustaceans.

RawTask memory leak #636

Open greatest-ape opened 6 months ago

greatest-ape commented 6 months ago

I've received a report of memory leaks in aquatic_ws. When running the application under heaptrack, it appears that tasks spawned to handle connections are in fact never freed. After looking over my application code, I simply can't find a logic bug that would cause the tasks not to finish in the tested scenario. More memory is leaked as more connections are created, so this is unlikely to be explained by fixed overhead.

Connection tasks are spawned and detached in https://github.com/greatest-ape/aquatic/blob/2d959ee8fc4e8acdd4e345917a7d4770c6977b6f/crates/ws/src/workers/socket/mod.rs#L162. The tasks then spawn and await tasks of their own in https://github.com/greatest-ape/aquatic/blob/2d959ee8fc4e8acdd4e345917a7d4770c6977b6f/crates/ws/src/workers/socket/connection.rs#L197.

I've attached a screenshot of the heaptrack flamegraph for leaked memory. I can send the full heaptrack file (180 MB) if you're interested.

[Screenshot: aquatic_ws_possible_mem_leak]

glommer commented 6 months ago

Definitely sounds like a leak.

A couple of questions: you don't keep a reference to the connection anywhere, right? Could it be that this is just new connections being opened?

Assuming it is not - I wonder if there's a bug triggered by a race, where a future that is not ready is not canceled properly at the runtime level. Would you be able to provide a minimal reproducer?

greatest-ape commented 6 months ago

I ran the application under heaptrack locally and waited around a minute after quitting the load test program before quitting the tracker, so I have a hard time seeing how this could be due to new connections. Even if it's about data still in kernel buffers or similar, the task should finish anyway once it notices that the client socket is closed. I can also confirm that the leak occurs even when the connection count reaches zero, i.e., the connection cleanup code is called as many times as connections are opened. In other words, the race future returns, but there is still task state that is not cleaned up, so this might very well be a bug related to that. Could it be related to the tasks being in different priority queues?

I do store task references in a sense, just not the TaskHandle. Since I want to do cleanup after connections are closed, I don't want to force-quit the whole task, but send a message to it instead, so storing the TaskHandle doesn't buy much with the current design.

I’m unsure if I’ll be able to provide a minimal reproducer.

greatest-ape commented 6 months ago

The leak is “fixed” by awaiting race on plain async blocks instead of on spawned tasks: https://github.com/greatest-ape/aquatic/commit/188da135ab6fa296c3aaee8f66badbf871e94d20

FYI, running the tasks on the same queue didn't fix the issue.

vlovich commented 3 months ago

@greatest-ape is this related to https://github.com/DataDog/glommio/issues/448 ?