[APM] Runtime metrics for all agents in the APM App

nehaduggal commented 4 years ago

Summary of the problem

Most of the APM agents collect runtime metrics, which customers can visualize via the apm-contrib dashboards. The Java agent is the only agent that surfaces runtime performance data in the curated UI, on a JVMs tab that lists each reporting instance of the service. We should have a similar page for all the other agents to surface the metrics that we collect in the curated UI.

List known (technical) restrictions and requirements

For the JVM page specifically, we chose a tabular approach that shows individual instances, rather than a chart with one line per instance, because the number of reporting instances can be large. This assumption probably holds for all other agents. We should surface the runtime performance data captured by the agents in the APM App in a way that suits each language ecosystem.
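As a rough illustration of the tabular approach, the sketch below collapses raw metric samples into one row per instance. The sample data, metric names, and the `instance_table` helper are all hypothetical and are not taken from any agent or from the APM App.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (instance name, metric name, value).
samples = [
    ("instance-a", "heap_used_mb", 410.0),
    ("instance-a", "heap_used_mb", 430.0),
    ("instance-a", "thread_count", 24),
    ("instance-b", "heap_used_mb", 120.0),
    ("instance-b", "thread_count", 18),
]


def instance_table(samples):
    """Collapse raw samples into one row per instance, averaging each metric."""
    values = defaultdict(lambda: defaultdict(list))
    for instance, metric, value in samples:
        values[instance][metric].append(value)
    # Sort by instance name so the table has a stable row order.
    return {
        instance: {metric: mean(vals) for metric, vals in metrics.items()}
        for instance, metrics in sorted(values.items())
    }


table = instance_table(samples)
for instance, row in table.items():
    print(instance, row)
```

One row per instance stays readable as the instance count grows, which is the reason given above for preferring a table over one chart line per instance.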

References

elasticmachine commented 4 years ago

Pinging @elastic/observability-design (design)

alex-fedotyev commented 4 years ago

Here is a suggestion for how we could design it by leveraging existing observability UI components:

Visualize instances using the waffle explorer from the Metrics UI.
Allow users to see how the instances are performing by bringing in multiple metrics:
- Transaction metrics (requests/min, response time, error rate).
- Runtime metrics like GC%, Gen 0 size, etc. (will be slightly different per runtime).
- Container metrics (if available).
- Host metrics (if available).
Allow users to group by multiple dimensions:
- APM service attributes (service version, runtime version, cloud availability zone, etc).
- Container attributes like image name.
- K8s attributes like availability zone, pod name, etc.
- Datacenter (for on-premises, it would be nice to determine it from IP masks or host naming patterns, but that might have to be manual) or cloud datacenter.

This design would reuse a familiar design where it is relevant (service instances are similar to infrastructure).

Linking from the service view to the infra metrics would help SREs understand service performance across the farm(s) and how it relates to the performance of the infrastructure that hosts it, especially during incidents.
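To make the on-premises datacenter grouping more concrete, here is a minimal sketch that maps instance IPs to datacenters using manually maintained IP masks and then groups instances by the result. The mask table, datacenter names, and instance data are hypothetical. The only library used is Python's standard `ipaddress` module.

```python
import ipaddress
from collections import defaultdict

# Hypothetical, manually maintained mapping of IP masks to datacenter names.
DATACENTER_MASKS = {
    "10.1.0.0/16": "dc-east",
    "10.2.0.0/16": "dc-west",
}


def datacenter_for(ip):
    """Return the datacenter whose mask contains the IP, or "unknown"."""
    address = ipaddress.ip_address(ip)
    for mask, name in DATACENTER_MASKS.items():
        if address in ipaddress.ip_network(mask):
            return name
    return "unknown"


def group_by_datacenter(instances):
    """Group (instance name, IP) pairs by their derived datacenter."""
    groups = defaultdict(list)
    for name, ip in instances:
        groups[datacenter_for(ip)].append(name)
    return dict(groups)


# Hypothetical service instances and their host IPs.
instances = [
    ("checkout-1", "10.1.4.7"),
    ("checkout-2", "10.2.0.9"),
    ("checkout-3", "192.168.1.5"),
]
groups = group_by_datacenter(instances)
print(groups)
```

Instances that match no mask fall into an "unknown" group, which would make gaps in the manual mapping visible rather than silently hiding them.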

Test - Service Infrastructure

alex-fedotyev commented 4 years ago

@sorantis brought up a couple of interesting points about the proposal above:

- What kind of drill-down would we expect from the list of instances? Would it go to an APM page for instance details? How would it link to the infrastructure UI, such as the container or host view?
- Idea: add an anomaly score/severity to the list of metrics for each instance, similar to duration or error rates.

graphaelli commented 4 years ago

cc @lreuven

alex-fedotyev commented 4 years ago

Added design issue: https://github.com/elastic/apm/issues/301

formgeist commented 3 years ago

I'm bringing this back up as an opportunity to implement an updated metrics experience in the near term, one that adds a service-instance-level breakdown and the additional metrics listed for each agent below. I imagine a few agents are missing from the list since this issue was initially created.

With the switch to Elastic Charts, there should be no blockers on the visualization part. From a design perspective, there might be some guidance on the color palettes and how the visualizations should be put together and laid out. Additionally, I imagine there should be a suggested layout for the overview/list of instances similar to the Java JVM metrics experience.

Overall, I think the UI team should be able to pick this up in https://github.com/elastic/kibana/issues/63573 and ask for implementation guidance from either design or agents.

The long-term service instance metrics experience will be explored and designed in #301 in partnership with @alex-fedotyev

Thoughts? @nehaduggal @sqren @alex-fedotyev

Node

Memory: RSS
Memory: Total Heap Allocated
Memory: Heap Used
Event loop delay (ms)
Active handles
Active requests
CPU user/system time/utilization
Garbage collection (Scavenge, MarkSweepCompact, Incremental marking) - {Stretch Goal}

Ruby

Time in Garbage collection
Frequency of GC
Memory usage
Thread count

Python

Garbage collection
Memory usage (existing memory usage graph on the apm-contrib dashboard)
I/O
Thread count (Gauge)
Context switches (Counter)
- Voluntary
- Involuntary
Open file handles (Gauge)

Go

Metrics are already captured. Todo: Custom dashboard in the apm-contrib repo.

PHP

No additional metrics defined

.NET

No additional metrics defined

nehaduggal commented 3 years ago

I would rather we reconcile the new workflows that are being designed with the current UI we have for metrics, instead of tackling this multiple times. Once we have the UI, we can work on onboarding metrics from all the other agents.

elastic / apm

[APM] Runtime metrics for all agents in the APM App #224