chachi opened this issue 4 months ago (Open)
We are going to investigate this. Did you experience the same behaviour in case of a subscriber in addition to a queryable?
Thanks! I didn’t test a subscription, but looking at the root cause and the fix I applied to address it, I’d expect the same potential issue.
Hello @chachi, we are trying to replicate it and we have managed to get something similar. However, we are not quite sure it is actually related to `flume::send` in our experiments. Do you have any additional evidence/trace that points to `flume::send`?
In addition to that, we have observed different behaviours in `client` and `peer` modes. Could you provide additional information on your configuration?
@Mallets My evidence was that our router would completely deadlock, and a gdb `thread apply all backtrace` would show at least one thread blocked on `flume::send`, with others blocking on a mutex that the `send` path was holding, IIRC.
I found a few of these "deadlock"/runtime-starvation items, so you may have also found another one.
Describe the bug
Under heavy load, if a queryable using the `DefaultHandler` gets too far behind, the `flume` bounded queue will eventually fill up. When it does, the next `sender.send(t)` (from handlers.rs) will block and cause the runtime to get blocked. If more messages come in, they can simultaneously block all runtime threads until the program is simply hung.

To reproduce
It should work with any queryable that can take some time to process but setting up a router with an S3-backed storage with 5000+ objects and then having it replicate to another router's storage is a good way to trigger this.
It's not deterministic but I was able to reproduce it quite regularly.
System info