elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
https://www.elastic.co/apm
Apache License 2.0

Support Prometheus metric libraries with APM agents #355

Open felixbarny opened 3 years ago

felixbarny commented 3 years ago

The background is that we want to support sending custom metrics with our agents. We don't want to create another metrics API, though, so we're seeking to integrate with existing ones. Each language has its own favorite metric libraries, but Prometheus has good cross-language coverage and is popular in all of them, so it seems like a good fit to align on across agents.

Prometheus doesn't have to be the one and only API we support. In fact, the Java agent already supports the Micrometer metric registry which is quite popular in that ecosystem. It's up to each agent team to decide on the priority and the choice of other metric APIs they want to support.

This issue tracks writing a cross-agent spec and building a reference implementation for supporting the Prometheus client library.

Instead of the pull-based model that is typical for Prometheus, the Elastic APM integration will send the metrics that are registered in the Prometheus metric registry directly from the application to the APM Server intake API.

Histograms will not be supported in the first iteration. After APM Server adds support for histograms (https://github.com/elastic/apm-server/issues/3195), there will be a follow up to support them as well.
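For illustration, a minimal sketch of the push model described above, not the actual agent code. It assumes a registry snapshot is available as plain tuples (the `REGISTRY_SNAPSHOT` structure and both function names are hypothetical), and it assumes the intake payload is newline-delimited JSON with one `metricset` object per label set, carrying `timestamp`, `tags`, and `samples`. Histograms are skipped, matching the first-iteration scope.

```python
import json
import time

# Hypothetical snapshot of a Prometheus registry: (name, type, labels, value).
REGISTRY_SNAPSHOT = [
    ("http_requests_total", "counter", {"method": "GET"}, 42.0),
    ("queue_depth", "gauge", {"method": "GET"}, 7.0),
    ("request_duration_seconds", "histogram", {"method": "GET"}, 0.0),
]


def build_metricsets(snapshot, timestamp_us):
    """Group samples by label set and build one metricset per group."""
    grouped = {}
    for name, metric_type, labels, value in snapshot:
        if metric_type == "histogram":
            continue  # histograms are out of scope for the first iteration
        key = tuple(sorted(labels.items()))
        grouped.setdefault(key, {})[name] = {"value": value}
    # Assumed payload shape: {"metricset": {"timestamp", "tags", "samples"}}.
    return [
        {"metricset": {"timestamp": timestamp_us, "tags": dict(key), "samples": samples}}
        for key, samples in grouped.items()
    ]


def to_ndjson(metricsets):
    """Serialize metricsets as newline-delimited JSON, one event per line."""
    return "\n".join(json.dumps(m, sort_keys=True) for m in metricsets)


if __name__ == "__main__":
    now_us = int(time.time() * 1_000_000)
    print(to_ndjson(build_metricsets(REGISTRY_SNAPSHOT, now_us)))
```

Grouping by label set keeps samples that share the same labels in a single metricset, so the labels are sent once per group rather than once per sample.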

Agents that support auto instrumentation should automatically plug into the Prometheus client so that users don't need to configure anything in order for the metrics to be sent.

TODOs

Spec issue

Agent issues

exekias commented 3 years ago

Instead of the pull-based model that is typical for Prometheus, the Elastic APM integration will send the metrics that are registered in the Prometheus metric registry directly from the application to the APM Server intake API.

Any thoughts on what the experience will be when users are collecting Prometheus metrics from both instrumented applications and other services that just expose these? I understand that for the latter, users would be using Elastic Agent with autodiscover for all Prometheus endpoints. This would lead to duplicated data, I guess, which may be OK?

felixbarny commented 3 years ago

Yep, that would lead to duplicated metrics in different indices. It probably makes sense for agents to offer an option to disable Prometheus metric collection. Not sure if we need to have a cross-agent consistent way of doing that. In the Java agent, the easiest way to implement that would be to make users set disable_instrumentations=prometheus.

But that makes me realize that we should make sure that metrics collected via APM Agents are consistent with the format of the Metricbeat Prometheus collector.

exekias commented 3 years ago

But that makes me realize that we should make sure that metrics collected via APM Agents are consistent with the format of the Metricbeat Prometheus collector.

💯

In that sense, I'm wondering, would it make sense for APM agents to inject the APM related metadata into Prometheus labels? I guess you are not really storing that as a Prometheus label, but using some other ECS fields.

felixbarny commented 3 years ago

inject the APM related metadata

Could you elaborate on what you mean with APM related metadata? Do you mean host/Docker/k8s/cloud/service metadata? Agents only send that once with each request to APM Server. The Server then folds the metadata to each event (such as a metricset) that's sent in the same request. I guess we'd just map the regular Prometheus labels to the ECS field labels.*.
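As a rough sketch of the label mapping mentioned above (an assumption about how it could look, not a spec): each Prometheus label name becomes a key under the `labels.` namespace. The function name `prometheus_labels_to_ecs` is hypothetical, and values are converted to strings here purely for illustration.

```python
def prometheus_labels_to_ecs(labels):
    """Map Prometheus label names to keys under the ECS `labels.` namespace."""
    return {f"labels.{name}": str(value) for name, value in labels.items()}


if __name__ == "__main__":
    print(prometheus_labels_to_ecs({"method": "GET", "code": 200}))
```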

exekias commented 3 years ago

Thanks for the explanation, I was thinking aloud about this part from https://github.com/elastic/observability-dev/issues/1178:

Currently in order to monitor Prometheus client metrics customers has to either export Prometheus metrics using prometheus module, which lacks deep correlation with APM via ECS fields.

I guess this refers to the service fields. It would be nice if we could still attach the right fields to the metrics in the scenario I described. Anyway, I agree that injecting these into Prometheus labels may be challenging or not worth it.

alex-fedotyev commented 3 years ago

I am curious what happens with duplicate metrics when we get to datastreams: https://docs.google.com/document/d/1y56a9fjkLi6Zen5qGC_JKYM9ljYpBA5W0fgdWilcwYc/edit

Today, when I enable apm- and metrics- on the waffle map, I end up seeing duplicate instances.

alex-fedotyev commented 3 years ago

For counters and timers, decide on whether to report the difference/delta since the last report or whether to send up cumulative values

Regarding difference/delta vs. actual value, I think it makes sense to align with how integrations collect those metrics. I am wondering how to simplify visualization of custom metrics and make it easier than it is today (we already offer TSVB, Lens, Metrics Explorer, and the Inventory waffle map to work with custom metrics).
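To make the two options concrete, here is a minimal sketch of delta reporting derived from cumulative counter values. The `DeltaTracker` class is hypothetical and not taken from any agent. It assumes that a value lower than the previous one means the counter was reset, in which case the full new value is reported.

```python
class DeltaTracker:
    """Turn cumulative counter readings into per-report deltas."""

    def __init__(self):
        self._last = {}  # metric name -> last cumulative value seen

    def delta(self, name, cumulative):
        previous = self._last.get(name)
        self._last[name] = cumulative
        if previous is None or cumulative < previous:
            # First observation or counter reset: report the full value.
            return cumulative
        return cumulative - previous


if __name__ == "__main__":
    tracker = DeltaTracker()
    for reading in (10, 15, 3):
        print(tracker.delta("http_requests_total", reading))
```

Sending cumulative values instead would mean skipping the tracker and reporting each reading as-is, leaving rate calculation to query time.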

cyrille-leclerc commented 3 years ago

👍 on @alex-fedotyev's point: can we offer the same user experience via Elastic APM and via Metricbeat Prometheus? I particularly have alignment on histogram support in mind.

One difference I would question is prefixing the metric name, as Metricbeat does for the Prometheus integration, where names are prefixed with prometheus.

nicholas-r-king commented 11 months ago

Any movement on this? This seems dead even though https://github.com/elastic/apm-agent-python/issues/1005 was completed successfully for Python. Why was this stalled for all other agents?

gregkalapos commented 11 months ago

OTel metrics changed the priority of this: https://github.com/elastic/apm/issues/691

As OTel became more popular for metrics as well, we focused on supporting OTel metrics instead of going for Prometheus. There is some overlap: for example, in Java there is a Prometheus exporter for OTel, which is mentioned in our docs.

But to address the question directly:

Why was this stalled for all other agents?

OTel metrics became more important, so we are focusing on that.

nicholas-r-king commented 11 months ago

That doesn't seem entirely true as there doesn't seem to have been any movement on any of those tickets either. No milestones, no branches, and inactive since Nov 2022.

gregkalapos commented 11 months ago

Is there anything we can help you with @nicholas-r-king? Any specific missing feature in any specific agent?