Shopify / statsd-instrument

A StatsD client for Ruby apps. Provides metaprogramming methods to inject StatsD instrumentation into your code.
http://shopify.github.io/statsd-instrument
MIT License

Aggregation for Counters #366

Closed pedro-stanaka closed 2 months ago

pedro-stanaka commented 3 months ago

Summary

This pull request introduces an improvement by adding metric aggregation capabilities. This new feature aims to optimize performance by minimizing network traffic and (hopefully) memory allocations.
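
To make the idea concrete, here is a hypothetical sketch of what in-process counter aggregation amounts to (names and structure are illustrative, not the code from this PR): increments accumulate in a hash keyed by metric name and tags, and one datagram per key is flushed on an interval instead of one datagram per call.

```ruby
# Hypothetical sketch of in-process counter aggregation. `sink` is anything
# that accepts a datagram string via #<<; tags are assumed to be an array of
# "key:value" strings.
class CounterAggregatorSketch
  def initialize(sink, flush_interval: 5.0)
    @sink = sink
    @flush_interval = flush_interval
    @counters = Hash.new(0)
    @mutex = Mutex.new
    @flusher = Thread.new do
      loop do
        sleep(@flush_interval)
        flush
      end
    end
  end

  # Accumulate in memory instead of emitting a datagram per call.
  def increment(name, value = 1, tags: [])
    key = [name, tags.sort]
    @mutex.synchronize { @counters[key] += value }
  end

  # Emit a single datagram per (name, tags) pair seen since the last flush.
  def flush
    pending = @mutex.synchronize { @counters.dup.tap { @counters.clear } }
    pending.each do |(name, tags), value|
      datagram = +"#{name}:#{value}|c"
      datagram << "|##{tags.join(",")}" unless tags.empty?
      @sink << datagram
    end
  end
end
```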

Benchmark results

Sending to Dev Null

**No aggregation**: execution rates varied between **170.069k and 172.489k i/s**.
**With aggregation**: execution rates improved, ranging from **174.911k to 175.745k i/s**.

```
Benchmark Results

Without STATSD_ENABLE_AGGREGATION
Run 1: Warm-up: 16.951k i/100ms   Execution: 170.123k (± 0.9%) i/s - 864.501k in 5.082029s
Run 2: Warm-up: 17.271k i/100ms   Execution: 172.489k (± 0.8%) i/s - 863.550k in 5.006710s
Run 3: Warm-up: 17.096k i/100ms   Execution: 170.069k (± 0.9%) i/s - 854.800k in 5.026605s

With STATSD_ENABLE_AGGREGATION=true
Run 1: Warm-up: 17.586k i/100ms   Execution: 174.911k (± 1.3%) i/s - 879.300k in 5.028007s
Run 2: Warm-up: 17.580k i/100ms   Execution: 175.745k (± 2.3%) i/s - 879.000k in 5.004536s
Run 3: Warm-up: 17.527k i/100ms   Execution: 175.400k (± 2.8%) i/s - 876.350k in 5.000560s
```
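
For reference, a rough sketch of how such a comparison could be driven with benchmark-ips. The STATSD_ENABLE_AGGREGATION toggle comes from the logs above; the client wiring (null sink, singleton client) is an assumption and may differ from the actual benchmark script used here.

```ruby
require "benchmark/ips"
require "statsd-instrument"

# "Dev null" sink: no I/O, so the benchmark measures client-side overhead only.
StatsD.singleton_client = StatsD::Instrument::Client.new(
  sink: StatsD::Instrument::NullSink.new,
)

label = ENV["STATSD_ENABLE_AGGREGATION"] ? "with aggregation" : "no aggregation"

Benchmark.ips do |x|
  x.report("increment (#{label})") { StatsD.increment("benchmark.counter") }
end
```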

Sending to local UDP

```
===== UDP_sync throughput (5 threads) =====
events: 128152.0/s
===== UDP_batched throughput (5 threads) =====
events: 510461.0/s
===== Aggregation_Enabled throughput (5 threads) =====
events: 570823.6/s
```

Vernier Profiles

casperisfine commented 3 months ago

Are you sure your profiles are correct? Because CounterAggregator is nowhere to be seen in the profile, which is suspicious.

As for the feature more generally, I'm not very enthusiastic about this: it complexifies the code quite a bit for just counters, and since it is per-process, there isn't that much to aggregate.

IMO it would be much better to explore a dogstatsd-like solution that can aggregate metrics from the whole host: https://docs.datadoghq.com/developers/dogstatsd/data_aggregation/. The Datadog agent is open source, so we might be able to use it with little to no modifications: https://github.com/DataDog/datadog-agent/tree/main

pedro-stanaka commented 3 months ago

> Are you sure your profiles are correct? Because CounterAggregator is nowhere to be seen in the profile, which is suspicious.
>
> As for the feature more generally, I'm not very enthusiastic about this: it complexifies the code quite a bit for just counters, and since it is per-process, there isn't that much to aggregate.
>
> IMO it would be much better to explore a dogstatsd-like solution that can aggregate metrics from the whole host: https://docs.datadoghq.com/developers/dogstatsd/data_aggregation/. The Datadog agent is open source, so we might be able to use it with little to no modifications: https://github.com/DataDog/datadog-agent/tree/main

It is hard to see, I agree, but incrementing is now so cheap that it barely shows up in the profile:

(profile screenshot)

Regarding the other remarks, we already have aggregation enabled on the server side. The problem is that even during a single request, a process can send so many counter samples that it overloads the network for no benefit.

casperisfine commented 3 months ago

> It is hard to see, I agree, but incrementing is now so cheap that it barely shows up in the profile:

Ah, it's because the benchmark increments without tags or anything, so it's always the same key.
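
To spell that out (metric and tag values below are made up), the aggregation key is effectively the metric name plus its tag combination, so untagged increments all collapse into one key while tagged counters fan out:

```ruby
StatsD.increment("jobs.processed")                           # always the same key
StatsD.increment("jobs.processed", tags: { queue: "mail" })  # one key per queue value
StatsD.increment("jobs.processed", tags: { queue: "sync" })  # a different key
```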

> Regarding the other remarks, we already have aggregation enabled on the server side. The problem is that even during a single request, a process can send too many counter samples

I'm talking about a node local relay. That's how dogstatsd is meant to be deployed.

pedro-stanaka commented 3 months ago

> I'm talking about a node local relay. That's how dogstatsd is meant to be deployed.

We already talked in private, but just to leave it documented here: we are already aggregating at the node level (with a DaemonSet), but since pods communicate with the DaemonSet over IP (not loopback), we still get aggressive network usage, as well as CPU usage on the DaemonSet to handle all those samples.

ahayworth commented 3 months ago

I'm excited that we're pursuing aggregation in the client (finally)! I think there are a lot of gains to be had for only a modest overhead cost, as your initial benchmarks show.

One thing that stood out to me initially is that your background flush thread might need some form of health check to keep going after a fork. For example, there is a fairly involved health check of the dispatch thread for the BatchedUDPSink's Dispatch class (which contains a thread that clears out the batch queue). This PR might need something similar to survive a fork.
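
For context, a minimal sketch of the kind of fork-survival check being suggested; method and variable names here are illustrative, not taken from this PR or from BatchedUDPSink.

```ruby
# Sketch: before trusting the background flusher, check that it is alive and
# was started in the current process; after a fork only the forking thread
# survives in the child, so the flusher must be respawned there.
module ForkSafeFlusher
  def ensure_flush_thread
    return if @flusher&.alive? && @flusher_pid == Process.pid

    @flusher_pid = Process.pid
    @flusher = Thread.new do
      loop do
        sleep(@flush_interval)
        flush # assumed to exist on the including class
      end
    end
  end
end
```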

The tests for the batched and un-batched UDP sinks have test cases that exercise various forking scenarios, so that might be an interesting way to validate whether or not you need to add similar mechanisms (and if so, if those mechanisms work).

casperisfine commented 3 months ago

> as your initial benchmarks show.

This benchmark is to be taken with a huge grain of salt in this context, because it wasn't designed for aggregation. The only counter in the benchmark has no tags at all, so it aggregates extremely well. In a real-world scenario, aggregation imposes a noticeable overhead in both compute and memory.

ahayworth commented 3 months ago

> This benchmark is to be taken with a huge grain of salt in this context, because it wasn't designed for aggregation. The only counter in the benchmark has no tags at all, so it aggregates extremely well. In a real-world scenario, aggregation imposes a noticeable overhead in both compute and memory.

That's a good point; it's the ideal case for aggregation. Perhaps a benchmark with more realistic data might be useful here - and perhaps it would be useful more generally for the library?

casperisfine commented 3 months ago

> Perhaps a benchmark with more realistic data might be useful here - and perhaps it would be useful more generally for the library?

Yes, it wouldn't hurt. The current benchmark was tailored to bench the packet serialization code.
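
A hedged sketch of what "more realistic data" could look like: made-up metric and tag names with a few dimensions of moderate cardinality, so the aggregation table holds many distinct keys instead of one.

```ruby
require "benchmark/ips"
require "statsd-instrument"

QUEUES   = %w[default mailers low high].freeze
STATUSES = %w[ok retry error].freeze

Benchmark.ips do |x|
  x.report("tagged increments") do
    StatsD.increment(
      "jobs.processed",
      tags: { queue: QUEUES.sample, status: STATUSES.sample },
    )
  end
end
```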

That said, micro benchmarks will always be lies, especially in this case because one of my main concerns is non-local performance impact caused by the cache stressing the GC, which you can't really reflect in a benchmark, but only really measure in production with some sort of A/B test.

ahayworth commented 3 months ago

> performance impact caused by the cache stressing the GC, which you can't really reflect in a benchmark, but only really measure in production with some sort of A/B test.

Agreed, that is going to be difficult to suss out from a benchmark. Thankfully we do have an ideal ~~torture chamber~~ stress test for statsd! 😄

pedro-stanaka commented 2 months ago

Closing and marking as superseded by #371, which takes a more complete approach: it uses value packing for distributions/timings and also aggregates gauges.