DataDog / saluki

An experimental toolkit for building telemetry data planes in Rust.

Improve aggregation and support for aggregated metrics. #216

Closed tobz closed 2 weeks ago

tobz commented 3 weeks ago

Context

In #215, we reworked the aggregate transform to properly consider/handle multiple in-flight buckets and zero-value counters. While this brought aggregation behavior up to parity in terms of ensuring metrics were flushed at the right interval, it did highlight a particularly glaring issue: when aggregating correctly (in terms of matching the Datadog Agent), ADP emits more data overall than the Datadog Agent.

This is due to the fact that the Datadog Agent actually stores multiple data points per series/sketch. When the aggregate transform flushes, a given context (metric) might be present in two buckets, which means we'll flush two metrics -- one from each bucket -- and both of those will find their way to the Datadog Metrics destination and be sent off. In contrast, the Datadog Agent deduplicates these by merging them into a single metric with multiple data points -- one timestamp/value point from each original metric -- and then sends that single metric to the serializer and forwarder.
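As a rough illustration (using simplified, hypothetical stand-ins for saluki's actual context/metric types, not its real API), the Agent-style behavior boils down to grouping flushed points by context before anything reaches the serializer:

```rust
use std::collections::HashMap;

// Hypothetical, simplified types: a "context" is a metric name plus its tags,
// and each aggregation bucket flush yields one (timestamp, value) point.
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
struct Context {
    name: String,
    tags: Vec<String>,
}

#[derive(Debug)]
struct Series {
    context: Context,
    // One (timestamp, value) pair per bucket the context was present in.
    points: Vec<(u64, f64)>,
}

/// Merges per-bucket flushes so each context yields a single series with
/// multiple data points, mirroring the Agent-style deduplication described
/// above.
fn merge_flushed(flushed: Vec<(Context, u64, f64)>) -> Vec<Series> {
    let mut merged: HashMap<Context, Vec<(u64, f64)>> = HashMap::new();
    for (context, timestamp, value) in flushed {
        merged.entry(context).or_default().push((timestamp, value));
    }
    merged
        .into_iter()
        .map(|(context, mut points)| {
            // Keep points in time order so the serialized series is well-formed.
            points.sort_by_key(|(ts, _)| *ts);
            Series { context, points }
        })
        .collect()
}

fn main() {
    let ctx = Context { name: "requests".into(), tags: vec!["env:prod".into()] };
    // Two flushes of the same context, one from each in-flight bucket.
    let flushed = vec![(ctx.clone(), 10, 1.0), (ctx, 20, 3.0)];
    let merged = merge_flushed(flushed);
    assert_eq!(merged.len(), 1); // one series...
    assert_eq!(merged[0].points, vec![(10, 1.0), (20, 3.0)]); // ...two points
}
```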

Since the output payload format already supports multiple data points per series/sketch, this means we end up sending more data overall than necessary, and there's no great way to merge things back together in the Datadog Metrics destination without a lot of extra sorting and temporary storage.

We should explore whether there's a simple data model change we can make to Metric to better support this, since that would let us get back to parity with the Datadog Agent in terms of output data volume... and it doesn't hurt that OTLP has the same data model, so we'd be more aligned overall.
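One possible shape for such a change -- sketched here with hypothetical types rather than saluki's actual Metric definition -- would be for each metric's value to hold a list of timestamp/value points, much like OTLP's data point model:

```rust
/// Hypothetical sketch of a multi-point metric value. Instead of a single
/// timestamp/value per metric, each variant carries one point per aggregation
/// bucket the context appeared in.
#[derive(Debug)]
enum MetricValues {
    Counter(Vec<(u64, f64)>),
    Gauge(Vec<(u64, f64)>),
    // Sketches would similarly hold one sketch payload per timestamp.
}

#[derive(Debug)]
struct Metric {
    name: String,
    tags: Vec<String>,
    values: MetricValues,
}
```

With a shape like this, the aggregate transform could append a point to an existing metric at flush time instead of emitting a second metric for the same context, and the destination would serialize the points directly without any re-sorting or temporary buffering.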

This might also be a chance to better optimize the aggregate transform in the process.