elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Hot threads cpu time related to stack traces #81006

Open henningandersen opened 2 years ago

henningandersen commented 2 years ago

The _nodes/hot_threads API is a valuable tool for diagnosing performance issues. Today, it first measures how much cpu-time each thread uses and only afterwards samples the same threads to provide a rough profile of where time is spent. With this approach, there is a risk that the reported cpu usage is unrelated to the stack traces. In the extreme, it could report 100% cpu usage while the stack traces show threads waiting for IO, waiting on mutexes, or simply back waiting on the thread pool queue.

We could sample the thread stacks while measuring the cpu usage. This adds a risk that the sampling itself affects the cpu usage. For instance, sampling thread stacks requires a safepoint, which could artificially reduce the cpu usage of some threads.

To have the best of both worlds, I propose to take and report two cpu usages: one taken before sampling thread stacks (and thus unaffected) and one taken during thread sampling (potentially affected). This would be something like (with default request parameters):

  1. snapshot cpu-time of all threads (cputime1)
  2. sleep 500ms
  3. snapshot cpu-time of all threads (cputime2)
  4. take thread stack traces, sleeping 50ms per sample (with 10 samples that totals 500ms, which keeps the cpu usage comparable to the previous interval)
  5. snapshot cpu-time of all threads (cputime3)

Then report before-cpu-time = cputime2 - cputime1 and during-cpu-time = cputime3 - cputime2.
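
The five steps could be sketched roughly as below. This is only an illustration built on the JDK's `ThreadMXBean`, not the existing hot threads code: the class `TwoPhaseCpuSampler`, its `Report` holder and the method names are hypothetical, and the sketch leaves out everything the real API does around sorting, filtering and rendering threads.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposal; not the Elasticsearch implementation.
final class TwoPhaseCpuSampler {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Both cpu-time deltas (nanoseconds, keyed by thread id) plus the sampled stacks. */
    static final class Report {
        final Map<Long, Long> beforeCpuNanos;   // cputime2 - cputime1
        final Map<Long, Long> duringCpuNanos;   // cputime3 - cputime2
        final List<ThreadInfo[]> stackSamples;  // one thread dump per sample

        Report(Map<Long, Long> before, Map<Long, Long> during, List<ThreadInfo[]> stacks) {
            this.beforeCpuNanos = before;
            this.duringCpuNanos = during;
            this.stackSamples = stacks;
        }
    }

    /** Snapshot the cpu-time of all live threads; empty if the JVM cannot measure it. */
    static Map<Long, Long> snapshotCpuTime() {
        Map<Long, Long> snapshot = new HashMap<>();
        if (!THREADS.isThreadCpuTimeSupported()) {
            return snapshot;
        }
        for (long id : THREADS.getAllThreadIds()) {
            long nanos = THREADS.getThreadCpuTime(id); // -1 if the thread died or measurement is disabled
            if (nanos >= 0) {
                snapshot.put(id, nanos);
            }
        }
        return snapshot;
    }

    /** Per-thread difference, only for threads present in both snapshots. */
    static Map<Long, Long> delta(Map<Long, Long> earlier, Map<Long, Long> later) {
        Map<Long, Long> result = new HashMap<>();
        for (Map.Entry<Long, Long> entry : later.entrySet()) {
            Long before = earlier.get(entry.getKey());
            if (before != null) {
                result.put(entry.getKey(), entry.getValue() - before);
            }
        }
        return result;
    }

    /** Steps 1-5; samples must be positive. The sampling phase sleeps intervalMillis in total. */
    static Report sample(long intervalMillis, int samples) throws InterruptedException {
        Map<Long, Long> cputime1 = snapshotCpuTime();            // step 1
        Thread.sleep(intervalMillis);                            // step 2
        Map<Long, Long> cputime2 = snapshotCpuTime();            // step 3
        List<ThreadInfo[]> stacks = new ArrayList<>();
        for (int i = 0; i < samples; i++) {                      // step 4
            stacks.add(THREADS.dumpAllThreads(false, false));
            Thread.sleep(intervalMillis / samples);
        }
        Map<Long, Long> cputime3 = snapshotCpuTime();            // step 5
        return new Report(delta(cputime1, cputime2), delta(cputime2, cputime3), stacks);
    }

    public static void main(String[] args) throws InterruptedException {
        Report report = sample(500, 10); // 500ms interval, 10 samples of 50ms each
        System.out.println("threads with before-cpu-time: " + report.beforeCpuNanos.size());
        System.out.println("threads with during-cpu-time: " + report.duringCpuNanos.size());
        System.out.println("stack samples taken: " + report.stackSamples.size());
    }
}
```

A thread whose before-cpu-time is much higher than its during-cpu-time would hint that the sampling itself (for example the safepoint) disturbed it, which is the case the second measurement is meant to expose.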

cc: @grcevski

elasticmachine commented 2 years ago

Pinging @elastic/es-core-infra (Team:Core/Infra)