elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Improve JVM runtime overhead detection #82549

Open grcevski opened 2 years ago

grcevski commented 2 years ago

Background

When using the hot_threads API, we currently show how much time each thread spends doing work. However, the runnable time available to a thread also depends on overall JVM and system health. For example, the JVM process could be spending a lot of time in GC or another runtime component, or other processes running on the system could be disrupting Elasticsearch's performance.

We currently detect GC overhead by comparing the time the process spent in GC against the total elapsed time. This information is written to the logs when the overhead exceeds a certain threshold; however, there are a few issues with it:

Proposed improvement

The ultimate measure of how much Elasticsearch was blocked by things beyond our control (GC, JVM, OS, noisy neighbours) can be obtained by measuring the time dilation observed when a thread sleeps for a given period. For example, if we ask to sleep for 500ms but find on waking that 550ms have elapsed, something during the sleep delayed us by an extra 50ms, or 10% of the requested interval.

This extra time dilation of 50ms is the overall overhead introduced by everything else in the platform we are running on.
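The arithmetic in the example above can be sketched in a few lines of Java. This is only an illustration: the class name `SleepDilation` and its methods are hypothetical and not part of Elasticsearch, and clamping negative dilation to zero is an assumption made for the sketch.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not an Elasticsearch class: turns a requested and an
// observed sleep duration into a dilation and an overhead percentage.
public final class SleepDilation {
    private SleepDilation() {}

    /** Extra time slept beyond the requested interval, in nanoseconds (clamped at zero). */
    public static long dilationNanos(long requestedNanos, long observedNanos) {
        return Math.max(0L, observedNanos - requestedNanos);
    }

    /** Dilation expressed as a percentage of the requested interval. */
    public static double overheadPercent(long requestedNanos, long observedNanos) {
        if (requestedNanos <= 0) {
            throw new IllegalArgumentException("requested interval must be positive");
        }
        return 100.0 * dilationNanos(requestedNanos, observedNanos) / requestedNanos;
    }

    public static void main(String[] args) {
        long requested = TimeUnit.MILLISECONDS.toNanos(500);
        long observed = TimeUnit.MILLISECONDS.toNanos(550);
        long extraMillis = TimeUnit.NANOSECONDS.toMillis(dilationNanos(requested, observed));
        // prints: 50 ms extra, 10.0% overhead
        System.out.println(extraMillis + " ms extra, "
            + overheadPercent(requested, observed) + "% overhead");
    }
}
```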

The proposal is to run a service thread (much like the GC monitor thread) that loops and sleeps for a given interval. On each wake-up it calculates the time dilation, derives the total runtime overhead from it, and stores the result in the service metrics.
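A minimal sketch of such a loop, in plain Java, might look like the following. It is not the proposed Elasticsearch implementation: the class `RuntimeOverheadMonitor`, the single `volatile` field standing in for "service metrics", and the 100ms interval in `main` are all assumptions made for illustration.

```java
// Hypothetical sketch of the sleep-and-measure loop; names are illustrative only.
public final class RuntimeOverheadMonitor implements Runnable {
    private final long intervalMillis;
    // Stand-in for the service metrics: the most recent overhead sample, in percent.
    private volatile double lastOverheadPercent;

    public RuntimeOverheadMonitor(long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.intervalMillis = intervalMillis;
    }

    /** Sleeps for one interval and returns the observed dilation as a percentage of it. */
    public double sampleOnce() throws InterruptedException {
        long requestedNanos = intervalMillis * 1_000_000L;
        long start = System.nanoTime(); // monotonic clock, unaffected by wall-clock changes
        Thread.sleep(intervalMillis);
        long elapsedNanos = System.nanoTime() - start;
        return 100.0 * Math.max(0L, elapsedNanos - requestedNanos) / requestedNanos;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                lastOverheadPercent = sampleOnce();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the flag and exit the loop
        }
    }

    public double lastOverheadPercent() {
        return lastOverheadPercent;
    }

    public static void main(String[] args) throws InterruptedException {
        RuntimeOverheadMonitor monitor = new RuntimeOverheadMonitor(100);
        Thread thread = new Thread(monitor, "runtime-overhead-monitor");
        thread.setDaemon(true);
        thread.start();
        Thread.sleep(350); // let a few samples accumulate
        System.out.println("last overhead: " + monitor.lastOverheadPercent() + "%");
        thread.interrupt();
        thread.join();
    }
}
```

`System.nanoTime()` is used rather than `System.currentTimeMillis()` so that wall-clock adjustments do not show up as dilation. Note that the measured value also includes the normal timer and scheduling slack of `Thread.sleep`, so a small non-zero baseline is expected even on an idle machine.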

When the calculated runtime overhead is above a certain threshold, we can write it to the logs, just as we do for GC overhead at the moment. We would also expose it in hot_threads output as a separate overall overhead line. This way, when we look at a thread stack and wonder why a specific operation took so long, we can correlate with this additional metric to see whether the time was actually spent outside the application.
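The threshold check and the extra summary line could be sketched roughly as below. Everything here is hypothetical: the class `OverheadReporter`, the wording of the summary line, and the use of `java.util.logging` (chosen only to keep the example self-contained) are assumptions, and the actual hot_threads output format is not specified in this issue.

```java
import java.util.Locale;
import java.util.logging.Logger;

// Hypothetical reporter: logs a warning when the overhead is above a threshold
// and formats a one-line summary that a hot_threads-style report could include.
public final class OverheadReporter {
    private static final Logger LOGGER = Logger.getLogger(OverheadReporter.class.getName());

    private final double warnThresholdPercent;

    public OverheadReporter(double warnThresholdPercent) {
        this.warnThresholdPercent = warnThresholdPercent;
    }

    /** True when the sampled overhead is strictly above the configured threshold. */
    public boolean exceedsThreshold(double overheadPercent) {
        return overheadPercent > warnThresholdPercent;
    }

    /** Illustrative summary line; the wording is an assumption, not a real output format. */
    public static String summaryLine(double overheadPercent, long intervalMillis) {
        return String.format(Locale.ROOT,
            "overall runtime overhead: %.1f%% over the last %dms",
            overheadPercent, intervalMillis);
    }

    /** Logs the summary line only when the threshold is exceeded. */
    public void report(double overheadPercent, long intervalMillis) {
        if (exceedsThreshold(overheadPercent)) {
            LOGGER.warning(summaryLine(overheadPercent, intervalMillis));
        }
    }

    public static void main(String[] args) {
        OverheadReporter reporter = new OverheadReporter(5.0); // placeholder threshold
        reporter.report(10.0, 500); // above the threshold: logged as a warning
        reporter.report(2.0, 500);  // below the threshold: nothing is logged
    }
}
```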

elasticmachine commented 2 years ago

Pinging @elastic/es-core-infra (Team:Core/Infra)