elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Report more directly on `transport_worker` event loop latency #113433

Open DaveCTurner opened 6 days ago

DaveCTurner commented 6 days ago

Today we have some mechanisms to detect issues caused by individual tasks blocking the network event loop for unreasonably long durations, but we do not currently have a way to detect event loop blockages that are caused by an unreasonable number of reasonably fast tasks in the queue.

The warnings in the `OutboundHandler` sort of detect this, but they also trigger on external slowness (e.g. a slow network or a slow receiving client), so it's not easy to pin down the cause.

I think we should add a mechanism which occasionally submits tasks to each event loop to empirically measure how long they take to execute, emitting a warning (and a thread dump) if any of them takes longer than some threshold. That would let us reliably detect event loop latency problems without worrying about external factors.

I would suggest integrating it into the existing ThreadWatchdog mechanism which wakes up periodically anyway to emit exactly the right kind of warning.
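
For illustration only, the probing idea could be sketched roughly as follows. This is not Elasticsearch code: the class `EventLoopLatencyProbe`, its methods, and the 100ms threshold are all hypothetical, and a plain single-threaded `ExecutorService` stands in for a `transport_worker` event loop. The sketch submits a no-op task and measures how long it sits in the queue before it starts running, so it reflects only the state of the loop itself and not any network or client slowness.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names exist in Elasticsearch or Netty.
final class EventLoopLatencyProbe {

    /**
     * Submits a no-op task to the given loop and returns the time, in nanoseconds,
     * between submission and the moment the task actually started running.
     */
    static long measureQueueLatencyNanos(Executor loop, long timeoutMillis) throws Exception {
        final long submittedAt = System.nanoTime();
        final CompletableFuture<Long> startedAt = new CompletableFuture<>();
        loop.execute(() -> startedAt.complete(System.nanoTime()));
        return startedAt.get(timeoutMillis, TimeUnit.MILLISECONDS) - submittedAt;
    }

    /** Returns true if the measured latency is above the (assumed) warning threshold. */
    static boolean exceedsThreshold(long latencyNanos, long thresholdMillis) {
        return latencyNanos > TimeUnit.MILLISECONDS.toNanos(thresholdMillis);
    }

    /** Sleeps for the given time, restoring the interrupt flag if interrupted. */
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a single event loop thread.
        ExecutorService loop = Executors.newSingleThreadExecutor();
        try {
            // Simulate the problem case: many individually fast tasks queued up.
            for (int i = 0; i < 50; i++) {
                loop.execute(() -> sleepQuietly(5));
            }
            long latencyNanos = measureQueueLatencyNanos(loop, 5_000);
            if (exceedsThreshold(latencyNanos, 100)) {
                // A real implementation would presumably log a warning and a thread dump here.
                System.out.println("WARN: event loop latency was "
                    + TimeUnit.NANOSECONDS.toMillis(latencyNanos) + "ms");
            }
        } finally {
            loop.shutdownNow();
        }
    }
}
```

In this sketch no single queued task is slow, yet the probe still observes a long wait, which is the case the existing per-task detection misses. A periodic watchdog could call something like `measureQueueLatencyNanos` once per loop on each wake-up.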

elasticsearchmachine commented 6 days ago

Pinging @elastic/es-distributed (Team:Distributed)