elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html

CGroup memory utilization metric in stack monitoring for integration server is not available on ESS #8596

Open lahsivjar opened 2 years ago

lahsivjar commented 2 years ago

APM Server version (apm-server version): 8.3.*

Description of the problem including expected versus actual behavior: Stack monitoring should show memory utilization for Integrations Server, but on ESS the metric is not available.

Steps to reproduce:

  1. On ESS, open stack monitoring
  2. Open Integrations server overview
  3. Observe memory panel in Integrations Server - Resource Usage

Other details

The panel seems to plot beats_stats.metrics.beat.cgroup.memory.mem.usage.bytes, but according to the Metricbeat documentation the correct field should be either beats_stats.metrics.beat.cgroup.mem.usage.bytes or beat.stats.cgroup.memory.mem.usage.bytes.
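
To check which of these field paths is actually populated, a monitoring document can be probed for each candidate. The following is a minimal sketch, not Kibana's implementation: the helper names and the sample document are hypothetical, and it assumes the document has been fetched as nested JSON rather than with flattened dotted keys.

```python
from typing import Any, List, Optional

# Field paths quoted in this issue; the first is the one the panel seems to plot.
CANDIDATE_PATHS = [
    "beats_stats.metrics.beat.cgroup.memory.mem.usage.bytes",
    "beats_stats.metrics.beat.cgroup.mem.usage.bytes",
    "beat.stats.cgroup.memory.mem.usage.bytes",
]


def get_path(doc: dict, dotted: str) -> Optional[Any]:
    """Walk a nested dict along a dotted path; return None if any key is missing."""
    node: Any = doc
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def present_paths(doc: dict, candidates: List[str] = CANDIDATE_PATHS) -> List[str]:
    """Return the candidate paths that resolve to a value in this document."""
    return [path for path in candidates if get_path(doc, path) is not None]


# Hypothetical monitoring document, made up for illustration only.
sample_doc = {
    "beat": {"stats": {"cgroup": {"memory": {"mem": {"usage": {"bytes": 123456}}}}}}
}

print(present_paths(sample_doc))
```

If the path the panel plots is absent from the result while another candidate is present, the mismatch is on the reading side; if no candidate is present, the data is not being reported at all.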

simitt commented 2 years ago

This will almost certainly require a fix in Kibana, where the stack monitoring code lives.

kruskall commented 2 years ago

I was looking into this, and it seems the memory limit metric has the same issue: the panel seems to plot beats_stats.metrics.beat.cgroup.memory.mem.limit.bytes, but according to the Metricbeat documentation the correct field should be either beats_stats.metrics.beat.cgroup.mem.limit.bytes or beat.stats.cgroup.memory.mem.limit.bytes.

I've opened a PR to address both.

simitt commented 1 year ago

With the help of @miltonhultgren and @fearful-symmetry, the root cause was identified: cgroups v2 metric limits are currently not reported by the stats HTTP endpoint. See https://github.com/elastic/elastic-agent-system-metrics/issues/64
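
For context on what the cgroups v2 values look like at the source, the kernel exposes memory usage and limit as the files memory.current and memory.max in the cgroup directory, and memory.max contains the literal string "max" when no limit is set. The sketch below reads them directly; it is not how elastic-agent-system-metrics does it, the function names are hypothetical, and it assumes a process (for example inside a container) whose own cgroup is mounted at /sys/fs/cgroup.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_memory_max(raw: str) -> Optional[int]:
    """Parse the contents of cgroup v2 memory.max; None means no limit is set."""
    value = raw.strip()
    if value == "max":
        return None
    return int(value)


def read_cgroup_v2_memory(root: str = "/sys/fs/cgroup") -> Dict[str, Optional[int]]:
    """Read memory usage and limit in bytes from a cgroup v2 directory.

    Assumes memory.current and memory.max exist under `root`; they are
    not present in the root cgroup of the host hierarchy.
    """
    base = Path(root)
    usage = int((base / "memory.current").read_text().strip())
    limit = parse_memory_max((base / "memory.max").read_text())
    return {"usage_bytes": usage, "limit_bytes": limit}
```

A consumer that expects a numeric limit has to decide how to represent the unlimited case, which is why the parser returns None for it instead of a sentinel number.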

miltonhultgren commented 1 year ago

It could be that Kibana also doesn't handle this correctly. I took a brief look at @kruskall's PR, and it shows some places where we don't read from the new Metricbeat format, but I wanted the data fixed first so that I could verify that!

simitt commented 1 year ago

Moving this to the backlog until the underlying issues have been resolved.