[APM] Service instance runtime metrics

alex-fedotyev commented 4 years ago

Summary of the problem (If there are multiple problems or use cases, prioritize them)

Currently, APM agents collect various system and runtime metrics, which could help detect resource saturation or configuration issues. Visualizing these metrics for every agent type would make this information actionable during performance issues troubleshooting.

User stories

As App Ops, I need to correlate service performance with system and runtime performance.
As App Ops, I need to be able to identify when a specific instance is performing differently from the majority of other instances.
As App Ops, I need to quickly identify which runtime metrics are trending outside their normal range at the same time as the service is experiencing issues.

List known (technical) restrictions and requirements

Has to work with different agent types and take into account that each runtime has its own specific runtime metrics.

If in doubt, don’t hesitate to reach out to the #observability-design Slack channel.

elasticmachine commented 4 years ago

Pinging @elastic/observability-design (design)

sorenlouv commented 4 years ago

We have three issues for runtime metrics:

Design issue: https://github.com/elastic/apm/issues/301 (this)
Meta issue (?): https://github.com/elastic/apm/issues/224
Implementation issue: https://github.com/elastic/kibana/issues/63573

Are all of them needed? I'm not sure what the purpose of the meta issue is.

sorenlouv commented 4 years ago

> Visualizing these metrics for every agent type would make this information actionable during performance issues troubleshooting.

What are "these metrics"? Currently we show CPU and memory metrics for each agent (except the Java agent).

Do we want to keep showing metrics as averages across all hosts / VMs / containers, or are we going to show them per container like we do for Java?
