elastic / apm-agent-java

https://www.elastic.co/guide/en/apm/agent/java/current/index.html
Apache License 2.0
567 stars 321 forks

Measure GC overhead #1791

Open henrikno opened 3 years ago

henrikno commented 3 years ago

Is your feature request related to a problem?

Sometimes our processes struggle with GC, but this is hard to spot or alert on when a process is not yet at the point of OOMing: it is just so busy doing GC that it's effectively not getting its work done. We'd like to be aware of instances that are in this state, and possibly alert on them.

Describe the solution you'd like

It'd be nice if the APM agent could record an approximation of how much time it has spent on GC.

Elasticsearch has a similar solution: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/monitor/jvm/JvmGcMonitorService.java. It does this as a scheduled task. I think it's also possible using https://docs.oracle.com/en/java/javase/11/docs/api/jdk.management/com/sun/management/GarbageCollectionNotificationInfo.html
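To make the scheduled-task idea concrete, here is a minimal sketch of how GC overhead could be approximated by polling the standard `GarbageCollectorMXBean`s. This is not how the agent or Elasticsearch implements it; the class `GcOverheadSampler` and its method names are hypothetical, and it assumes that `getCollectionTime()` is a good enough proxy for time lost to GC (for some collectors it may not equal stop-the-world pause time).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper, not part of the APM agent: GC overhead between two samples. */
class GcOverheadSampler {
    private long lastGcMillis = totalGcMillis();
    private long lastNanos = System.nanoTime();

    /** Sum of the cumulative collection time over all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Fraction of an interval spent in GC, clamped to [0, 1]. */
    static double overhead(long gcDeltaMillis, long wallDeltaMillis) {
        if (wallDeltaMillis <= 0 || gcDeltaMillis <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) gcDeltaMillis / wallDeltaMillis);
    }

    /** Returns the overhead since the previous call (or since construction). */
    synchronized double sample() {
        long gcNow = totalGcMillis();
        long nanosNow = System.nanoTime();
        double result = overhead(gcNow - lastGcMillis, (nanosNow - lastNanos) / 1_000_000);
        lastGcMillis = gcNow;
        lastNanos = nanosNow;
        return result;
    }
}
```

Calling `sample()` from a `ScheduledExecutorService` every few seconds would yield one overhead value per interval, which could then be reported as a gauge.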

What we want is something we can eventually alert on, e.g. if overhead > 50%, your application isn't getting much real work done.

Describe alternatives you've considered

It's possible to collect it via custom metrics, e.g. Micrometer, but we have some services to which we don't want to add custom code.

Additional context

tobiasstadler commented 3 years ago

The agent already exports the number of GCs and their (cumulative) duration via the jvm.gc.count and jvm.gc.time metrics (see https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm). Shouldn't it be possible to use these for your alerting?
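As a rough illustration of that derivation: since jvm.gc.time is cumulative, the overhead for an interval is the increase in GC time divided by the elapsed wall-clock time between two samples. The sketch below assumes both values are in milliseconds and that, if the metric is reported per collector, the values have already been summed. The class and method names are hypothetical and not part of the agent or of any Elastic API.

```java
/** Hypothetical helper: derives GC overhead from two samples of a cumulative GC-time counter. */
class GcOverheadFromCounter {

    /**
     * @param t0Millis  timestamp of the earlier sample
     * @param gc0Millis cumulative GC time at the earlier sample
     * @param t1Millis  timestamp of the later sample
     * @param gc1Millis cumulative GC time at the later sample
     * @return fraction of the interval spent in GC, or -1 if the pair is unusable
     */
    static double overhead(long t0Millis, long gc0Millis, long t1Millis, long gc1Millis) {
        long wall = t1Millis - t0Millis;
        long gc = gc1Millis - gc0Millis;
        if (wall <= 0 || gc < 0) {
            return -1.0; // out-of-order samples, or the counter was reset (e.g. JVM restart)
        }
        return (double) gc / wall;
    }

    /** True if the overhead exceeds the alert threshold (e.g. 0.5 for 50%). */
    static boolean shouldAlert(double overhead, double threshold) {
        return overhead > threshold;
    }
}
```

For example, two samples taken 30 seconds apart whose cumulative GC time grew by 15 seconds give an overhead of 0.5, right at the 50% threshold mentioned in the request.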

jackshirazi commented 2 years ago

I agree you can do this from existing metrics, but I also agree this is a nice separate metric to have. I think it's something that could be added as a derived metric in the dashboard and presented as standard.