eclipse-zenoh / zenoh

zenoh unifies data in motion, data in-use, data at rest and computations. It carefully blends traditional pub/sub with geo-distributed storages, queries and computations, while retaining a level of time and space efficiency that is well beyond any of the mainstream stacks.
https://zenoh.io

ZRuntime can hang under load due to blocking `flume::send` in queryable callback #1052

Open chachi opened 4 months ago

chachi commented 4 months ago

Describe the bug

Under heavy load, if a queryable using the DefaultHandler falls too far behind, the bounded flume queue eventually fills up. When it does, the next `sender.send(t)` (in handlers.rs) blocks and stalls that runtime thread. If more messages arrive, they can block all runtime threads simultaneously until the program is simply hung.

To reproduce

It should reproduce with any queryable whose handler takes some time to process. Setting up a router with an S3-backed storage containing 5000+ objects and then having it replicate to another router's storage is a good way to trigger it.

It's not deterministic but I was able to reproduce it quite regularly.

System info

Mallets commented 4 months ago

We are going to investigate this. Did you experience the same behaviour in case of a subscriber in addition to a queryable?

chachi commented 4 months ago

Thanks! I didn’t test a subscription, but looking at the root cause and the fix I applied to address it, I’d expect the same potential issue.


Mallets commented 4 months ago

Hello @chachi, we are trying to replicate it and have managed to get something similar. However, in our experiments we are not quite sure it is actually related to `flume::send`. Do you have any additional evidence/trace that points to `flume::send`?

In addition to that, we have observed different behaviours in case of client and peer modes. Could you provide additional information on your configuration?

chachi commented 4 months ago

@Mallets My evidence was that our router would completely deadlock, and a gdb `thread apply all backtrace` would show at least one thread blocked on `flume::send`, with others blocking on a mutex that the send path was holding, IIRC.

I found a few of these "deadlock"/runtime-starvation issues, so you may have also found another one.