elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
https://www.elastic.co/apm
Apache License 2.0

Support OpenTelemetry `tracestate` header for consistent head-based sampling #827

Open StephanErb opened 1 year ago

StephanErb commented 1 year ago

From the Elastic documentation:

Head-based sampling is implemented in the APM agents and SDKs, and requires the sample rate to be propagated between services and the APM Server. This functionality is not currently supported by OpenTelemetry, which results in inaccurate APM throughput, latency, and error metrics. OpenTelemetry users should consider using tail-based sampling instead.

This is now outdated: OpenTelemetry supports propagating sampling information via `tracestate`, albeit in a slightly different form than Elastic does:

This document specifies an approach based on an “r-value” and a “p-value”. At a very high level, r-value is a source of randomness and p-value encodes the sampling probability. A context is sampled when p <= r.

Both fields are propagated via the OpenTelemetry tracestate under the ot vendor tag using the rules for tracestate handling. Both fields are represented as unsigned decimal integers requiring at most 6 bits of information.

This allows Trace consumers to correctly count spans simply by interpreting the p-value on a given span.
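To make the quoted rule concrete, here is a minimal sketch of reading the two fields from a `tracestate` header and applying the `p <= r` check. The function names are hypothetical, and the `ot=p:<int>;r:<int>` layout is an assumption based on my reading of the OpenTelemetry probability sampling specification; the 0..63 bound follows from the "at most 6 bits" statement above.

```python
from typing import Optional, Tuple

MAX_VALUE = 63  # largest unsigned integer that fits in 6 bits


def _to_int(raw: str) -> Optional[int]:
    """Parse an unsigned decimal integer in 0..63, else return None."""
    if raw.isascii() and raw.isdigit():
        value = int(raw)
        if value <= MAX_VALUE:
            return value
    return None


def parse_ot_tracestate(tracestate: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (p, r) from the `ot` member of a tracestate header.

    Assumed layout (hypothetical example): "es=s:0.25,ot=p:2;r:5".
    Missing or invalid fields are returned as None.
    """
    p = r = None
    for member in tracestate.split(","):
        key, _, value = member.strip().partition("=")
        if key != "ot":
            continue  # skip other vendors' entries
        for field in value.split(";"):
            name, _, raw = field.partition(":")
            if name == "p":
                p = _to_int(raw)
            elif name == "r":
                r = _to_int(raw)
    return p, r


def is_sampled(p: int, r: int) -> bool:
    """Apply the rule quoted above: a context is sampled when p <= r."""
    return p <= r
```

Because every service on the trace sees the same r-value, each one reaches a consistent decision for any given p-value, which is what makes the sampling "consistent".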

Asks

Context

axw commented 1 year ago

@StephanErb thanks for opening this! This has been on my mind, but hadn't gotten around to opening the issue yet.

This is partially done. The main missing part is your first point, about the client libraries populating both tracestate keys. I think it would also be useful to have OTel Sampler implementations that produce and handle both tracestate keys.

Elastic APM Server should use the OpenTelemetry tracestate header to estimate the full throughput metrics if available.

FYI this was implemented in v8.8.0: https://github.com/elastic/apm-server/pull/10309. Seems to be missing from the release notes.
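For readers wondering what "estimating full throughput" means in practice, here is a rough sketch. It is not the apm-server implementation. It assumes, per my reading of the OpenTelemetry probability sampling specification, that a p-value encodes a sampling probability of 2^-p, with 63 reserved for "probability zero". The function names are hypothetical.

```python
from typing import Iterable

ZERO_PROBABILITY_P = 63  # assumption: p == 63 means "sampled with probability 0"


def adjusted_count(p: int) -> int:
    """Number of original spans that one sampled span stands for.

    A span kept with probability 2**-p represents 2**p spans on average.
    """
    if p == ZERO_PROBABILITY_P:
        return 0
    return 2 ** p


def estimate_total_spans(p_values: Iterable[int]) -> int:
    """Estimate pre-sampling span count from the p-values of sampled spans."""
    return sum(adjusted_count(p) for p in p_values)
```

For example, ten sampled spans that all carry `p:3` (a 1-in-8 sampling probability) would be counted as roughly 80 spans of real traffic.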