Today we have some mechanisms to detect individual tasks that block the network event loop for unreasonably long durations, but we have no way to detect event loop blockages caused by an unreasonable number of reasonably fast tasks in the queue.
The warnings in the OutboundHandler partially detect this, but they also trigger on external slowness (e.g. a slow network or receiving client), so it's not easy to pin down the cause.
I think we should add a mechanism that occasionally submits tasks to each event loop to empirically measure how long they take to execute, emitting a warning (and a thread dump) if they take longer than some threshold. That would let us reliably detect event loop latency problems without worrying about external factors.
I would suggest integrating it into the existing ThreadWatchdog mechanism, which already wakes up periodically and emits exactly the right kind of warning.
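
To make the idea concrete, here is a minimal sketch of such a probe, assuming the event loop can be treated as a plain `java.util.concurrent.Executor`. The class name `EventLoopLatencyProbe` and its `check()` method are hypothetical and are not part of the existing ThreadWatchdog; the periodic caller would be whatever thread the watchdog already uses.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not existing code: one probe instance per event loop.
final class EventLoopLatencyProbe {
    private final Executor eventLoop;
    private final long thresholdNanos;

    // Set by the watchdog thread on submission, cleared by the event loop
    // thread when the probe task finally runs.
    private volatile boolean pending;

    // Only read and written by the watchdog thread.
    private long submittedAtNanos;

    EventLoopLatencyProbe(Executor eventLoop, long threshold, TimeUnit unit) {
        this.eventLoop = eventLoop;
        this.thresholdNanos = unit.toNanos(threshold);
    }

    /**
     * Called periodically from a single watchdog thread. Returns true if a
     * previously submitted probe has been waiting in the queue for longer
     * than the threshold, which is where a real implementation would log a
     * warning and capture a thread dump.
     */
    boolean check() {
        long now = System.nanoTime();
        if (pending) {
            // The last probe has not run yet: report how long it has waited.
            return now - submittedAtNanos > thresholdNanos;
        }
        // The last probe completed (or none was sent yet): submit a new one.
        submittedAtNanos = now;
        pending = true;
        eventLoop.execute(() -> pending = false);
        return false;
    }
}
```

Because the probe is a trivial task that sits in the same queue as real work, it is delayed both by one long-running task and by a long queue of fast tasks, and it is unaffected by slow networks or slow clients. In this sketch `check()` keeps returning true on every call while the probe is stuck, so a real implementation would probably want to rate-limit the resulting warnings.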