Background
When using the hot_threads API we currently show how much time each thread spends doing work. However, the runnable time of a thread is also affected by the overall JVM/system health. For example, the JVM process could be spending a lot of time in GC or another runtime component, or other processes running on the system could be disrupting Elasticsearch's performance.
We currently detect GC overhead by comparing the amount of time the process spent in GC with the total elapsed time (a simplified sketch of this check follows the list below). This information is written to the logs if the overhead rises above a certain threshold, but there are a few issues with this approach:
- The current GC overhead threshold is set too high (25%); a well-behaved Java process shouldn't spend more than 5% of its time in GC. At the same time, logging at a lower threshold would produce a lot of noise in certain situations. A better approach would be to expose this metric whenever hot_threads is called, whatever the current GC overhead is.
- The way we calculate GC overhead doesn't work for concurrent collectors such as Shenandoah, ZGC, and Azul's C4. With these collectors the reported time spent in GC will always be large, because collection happens while the application threads are running. What really matters is how long our Java program is blocked because of GC, which with these collectors should be negligible.
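For reference, here is a minimal sketch of this style of check using the standard JMX GC beans; the class name and sampling window are made up for illustration, and this is not the actual Elasticsearch monitor code:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative sketch only: approximates the current style of check by comparing
// accumulated GC time to elapsed wall-clock time over a sampling window.
public class GcOverheadSketch {

    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // cumulative collection time, -1 if unavailable
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long gcBefore = totalGcTimeMillis();
        long start = System.nanoTime();

        Thread.sleep(10_000); // sampling window

        long gcMillis = totalGcTimeMillis() - gcBefore;
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        double gcOverheadPercent = 100.0 * gcMillis / elapsedMillis;
        System.out.printf("GC overhead: %.1f%%%n", gcOverheadPercent);
        // With a concurrent collector this number stays high even though application
        // threads were barely paused, which is the second problem described above.
    }
}
```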
Proposed improvement
The ultimate metric of how much our Elasticsearch application was blocked by things beyond our control (GC, the JVM, the OS, noisy neighbours) can be calculated by measuring the time dilation observed when we sleep for a known period. For example, if we ask to sleep for 500ms but on waking up find that we actually slept for 550ms, something was going on during our sleep that cost us an extra 50ms, or 10% more than requested.
This extra 50ms of time dilation is the overall overhead introduced by everything else on the platform we are running on.
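As a concrete, self-contained illustration of a single measurement (the class name and numbers are invented for the sketch):

```java
public class TimeDilationSample {
    public static void main(String[] args) throws InterruptedException {
        final long requestedMillis = 500;

        long start = System.nanoTime();
        Thread.sleep(requestedMillis);
        long actualMillis = (System.nanoTime() - start) / 1_000_000;        // e.g. 550 on a struggling host

        long dilationMillis = actualMillis - requestedMillis;               // e.g. 50
        double overheadPercent = 100.0 * dilationMillis / requestedMillis;  // e.g. 10%

        System.out.printf("requested=%dms actual=%dms overhead=%.1f%%%n",
                requestedMillis, actualMillis, overheadPercent);
    }
}
```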
The proposal is to run a service thread (much like the existing GC monitor thread) that loops, sleeping for a fixed interval. On each wake-up it calculates the time dilation, effectively deriving the total runtime overhead, and stores it in the service metrics.
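A rough sketch of what such a monitor thread could look like; RuntimeOverheadMonitor and its fields are invented names for illustration and do not correspond to existing Elasticsearch classes:

```java
// Hypothetical sketch of the proposed monitor; not actual Elasticsearch code.
public class RuntimeOverheadMonitor implements Runnable {

    private final long intervalMillis;
    // Latest observed overhead, readable by whatever renders the hot_threads response.
    private volatile double overheadPercent;

    public RuntimeOverheadMonitor(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    public double overheadPercent() {
        return overheadPercent;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMillis); // how long we expect to be asleep
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long actualMillis = (System.nanoTime() - start) / 1_000_000;
            long dilationMillis = Math.max(0, actualMillis - intervalMillis);
            // Derive and store the runtime overhead for this interval.
            overheadPercent = 100.0 * dilationMillis / intervalMillis;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        RuntimeOverheadMonitor monitor = new RuntimeOverheadMonitor(500);
        Thread thread = new Thread(monitor, "runtime-overhead-monitor");
        thread.setDaemon(true);
        thread.start();
        Thread.sleep(5_000); // let the monitor take a few samples
        System.out.printf("runtime overhead: %.1f%%%n", monitor.overheadPercent());
    }
}
```

A single sample is likely to be noisy, so it may make sense to average or decay the value over several intervals before reporting it.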
When the calculated runtime overhead rises above a certain threshold we can log it, just as we do for GC overhead today, and we would also expose it on calls to hot_threads as a separate overall overhead line. This way, when we are looking at a thread stack and wondering why a specific operation took so long, we can correlate it with this additional metric to see whether the time was actually spent outside the application's realm.
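To make the proposed reporting concrete, the extra line in the hot_threads response might look something like the following; the exact wording and placement are purely illustrative, not an existing output format:

```
Hot threads at 2024-05-01T10:15:30Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
Runtime overhead: 12.0% of wall-clock time lost outside the application (GC pauses, JVM, OS, noisy neighbours)

   91.0% (455ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][search][T#3]'
   ...
```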