Open henrikno opened 3 years ago
The agent already exports the number of GCs and their (cumulative) duration via the jvm.gc.count and jvm.gc.time metrics (see https://www.elastic.co/guide/en/apm/agent/java/current/metrics.html#metrics-jvm). Shouldn't it be possible to use these for your alerting?
I agree you can do this from the existing metrics, but I also agree this would be a nice separate metric to have. I think it could be added as a derived metric in the dashboard and presented as standard.
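For illustration, here is a minimal sketch of how such a derived value could be computed from two consecutive samples of the cumulative jvm.gc.time metric. It assumes the value is in milliseconds and has already been summed across all collectors; the class and method names are hypothetical and not part of the agent. For concurrent collectors the reported time is not purely pause time, so the result is only an approximation.

```java
// Hypothetical helper: not part of the APM agent or the Kibana dashboards.
final class GcOverheadFromSamples {

    /**
     * Returns the fraction of wall-clock time (0.0 to 1.0) spent in GC
     * between two samples of a cumulative GC time counter.
     */
    static double overhead(long prevGcTimeMs, long currGcTimeMs,
                           long prevTimestampMs, long currTimestampMs) {
        long elapsedMs = currTimestampMs - prevTimestampMs;
        long gcDeltaMs = currGcTimeMs - prevGcTimeMs;
        // A negative delta suggests a counter reset (e.g. a JVM restart);
        // a non-positive interval means the samples are out of order.
        if (elapsedMs <= 0 || gcDeltaMs < 0) {
            return 0.0;
        }
        // Clamp to 1.0, since summed collector times can exceed the interval.
        return Math.min(1.0, (double) gcDeltaMs / elapsedMs);
    }

    public static void main(String[] args) {
        // 6000 ms of GC within a 10000 ms window.
        System.out.println(overhead(1200, 7200, 0, 10_000)); // prints 0.6
    }
}
```

An alert rule could then compare this value against a threshold such as 0.5.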
Is your feature request related to a problem?
Sometimes our processes struggle with GC, but this is not easy to spot or alert on when a process is not yet at the point of OOMing and is simply so busy doing GC that it is effectively not getting its work done. We'd like to be aware of instances that are in this state, and possibly alert on them.
Describe the solution you'd like
It'd be nice if the APM agent could record an approximation of how much time the application has spent on GC.
Elasticsearch has a similar solution: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/monitor/jvm/JvmGcMonitorService.java It does this as a scheduled task. I think it's also possible using https://docs.oracle.com/en/java/javase/11/docs/api/jdk.management/com/sun/management/GarbageCollectionNotificationInfo.html
What we want is something we can eventually alert on, e.g. if overhead > 50%, your application isn't getting much real work done.
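To make the idea concrete, here is a rough sketch of the scheduled-task approach using the standard GarbageCollectorMXBean API. It is not how the agent or Elasticsearch implements this; the class name GcOverheadPoller, the one-second interval, and the printed output are assumptions for illustration only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: polls the JVM's GC beans and reports GC overhead.
final class GcOverheadPoller {
    private long lastGcTimeMs = totalGcTimeMs();
    private long lastNanos = System.nanoTime();

    /** Sums the cumulative collection time of all collectors, in milliseconds. */
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Returns the GC overhead (0.0 to 1.0) since the previous call. */
    synchronized double sample() {
        long nowGcTimeMs = totalGcTimeMs();
        long nowNanos = System.nanoTime();
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(nowNanos - lastNanos);
        long gcDeltaMs = nowGcTimeMs - lastGcTimeMs;
        lastGcTimeMs = nowGcTimeMs;
        lastNanos = nowNanos;
        if (elapsedMs <= 0) {
            return 0.0;
        }
        // Clamp, since times summed over several collectors can exceed the interval.
        return Math.min(1.0, Math.max(0.0, (double) gcDeltaMs / elapsedMs));
    }

    public static void main(String[] args) throws InterruptedException {
        GcOverheadPoller poller = new GcOverheadPoller();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Report the overhead once per second; a real agent would record a metric instead.
        scheduler.scheduleAtFixedRate(
                () -> System.out.printf("gc overhead: %.1f%%%n", poller.sample() * 100),
                1, 1, TimeUnit.SECONDS);
        Thread.sleep(3500);
        scheduler.shutdownNow();
    }
}
```

Polling keeps the sketch simple. A notification-based variant would instead react to each collection as it completes, at the cost of more bookkeeping.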
Describe alternatives you've considered
It's possible to collect it via custom metrics, e.g. with Micrometer, but we have some services where we don't want to add custom code.
Additional context