elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Watchdog mechanism for network threads #108710

Closed DaveCTurner closed 1 week ago

DaveCTurner commented 4 months ago

It's important for us to be able to detect situations where a network thread spends too long doing non-network things. Today we log some warnings in this area, but they're not always useful (e.g. the OutboundHandler warnings include the time spent doing other things while the outbound channel is unwritable). Making these warnings more granular is hard, especially if we don't want to disturb the performance of these performance-critical threads.

Rather than pushing more timing and logging work onto these threads, it seems like a better approach would be to build a separate watchdog mechanism that runs occasionally (say, every 15s) and ensures that every network thread is either idle or has completed at least one task since the last time the watchdog ran. Built right, I reckon we could make each thread report its progress by simply adjusting a volatile long field (maybe reserving one bit as an idle flag), which seems like it should be adequately performant.
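
To make the idea concrete, here is a minimal Java sketch of the volatile-long approach. It is an illustration only, not the implementation in Elasticsearch: the names `WatchdogSketch`, `ProgressTracker`, `Watchdog`, `taskStarted`, `taskCompleted` and `check` are all hypothetical, and the sketch assumes each tracker is written by exactly one network thread and read by exactly one watchdog thread. The low bit of the field is the idle flag and the remaining bits count completed tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch: none of these names come from the Elasticsearch codebase.
final class WatchdogSketch {

    /** One tracker per network thread. Only the owning thread writes `progress`. */
    static final class ProgressTracker {
        static final long IDLE_BIT = 1L; // low bit set means "idle"

        final String threadName;
        // Single writer (the network thread), so a plain read-modify-write on a
        // volatile field is safe here; the watchdog thread only reads it.
        private volatile long progress = IDLE_BIT;
        // Touched only by the watchdog thread.
        private long lastObserved = IDLE_BIT;

        ProgressTracker(String threadName) {
            this.threadName = threadName;
        }

        /** Called by the network thread just before it runs a task: clear the idle bit. */
        void taskStarted() {
            progress = progress & ~IDLE_BIT;
        }

        /** Called by the network thread after a task: bump the counter, set the idle bit. */
        void taskCompleted() {
            progress = (progress + 2) | IDLE_BIT;
        }

        /** Called by the watchdog: true if the thread is busy and has made no progress. */
        boolean stalledSinceLastCheck() {
            long current = progress;
            boolean stalled = (current & IDLE_BIT) == 0 && current == lastObserved;
            lastObserved = current;
            return stalled;
        }
    }

    /** Periodically inspects every registered tracker. */
    static final class Watchdog {
        private final List<ProgressTracker> trackers = new CopyOnWriteArrayList<>();

        void register(ProgressTracker tracker) {
            trackers.add(tracker);
        }

        /** Returns the names of threads that were busy on the same task at two consecutive checks. */
        List<String> check() {
            List<String> stalled = new ArrayList<>();
            for (ProgressTracker tracker : trackers) {
                if (tracker.stalledSinceLastCheck()) {
                    stalled.add(tracker.threadName);
                }
            }
            return stalled;
        }
    }

    public static void main(String[] args) {
        ProgressTracker tracker = new ProgressTracker("network-thread-0");
        Watchdog watchdog = new Watchdog();
        watchdog.register(tracker);

        tracker.taskStarted();                  // thread begins a long-running task
        System.out.println(watchdog.check());   // first observation of this task
        System.out.println(watchdog.check());   // same task still running at the next check
        tracker.taskCompleted();
        System.out.println(watchdog.check());   // thread is idle again
    }
}
```

In a real deployment `check()` would presumably run from a scheduled task on its own thread (every 15s, say), and a non-empty result would trigger a warning, perhaps with a stack trace of the offending thread. The cost on the network thread is two volatile writes per task, with no timing calls or logging.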

elasticsearchmachine commented 4 months ago

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner commented 4 months ago

WIP solution at #109204