@ahayworth @casperisfine I have updated the PR addressing most, if not all, points from your reviews. Let me know what you think of this version. Going to wait for #368 so I can benchmark this properly.
One question I had though - is there any value-packing supported for histograms? I noticed we don't call the aggregator for that at all. It's not used heavily internally as far as I can tell, but it did jump out at me.
@ahayworth we're indeed value-packing distributions/histograms. This is done in these places:
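For reference, here's roughly what a value-packed datagram looks like on the wire: several samples of the same metric joined into one payload instead of one packet each (an illustrative sketch of the datagram shape, not the gem's actual serializer):

# Three distribution samples packed into a single datagram
# (illustrative; assumes an agent that accepts multi-value payloads):
samples = [1.5, 2.0, 42.0]
datagram = "response.time:#{samples.join(':')}|d|#env:prod"
puts datagram # => response.time:1.5:2.0:42.0|d|#env:prod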
FYI:
# frozen_string_literal: true

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'benchmark-ips'
end

# Plain Ruby object key: #hash is recomputed on every lookup.
class AggregationKey
  attr_reader :name, :tags, :no_prefix, :type

  def initialize(name, tags, no_prefix, type)
    @name = name
    @tags = tags
    @no_prefix = no_prefix
    @type = type
  end

  def ==(other)
    other.is_a?(self.class) &&
      @name == other.name &&
      @tags == other.tags &&
      @no_prefix == other.no_prefix &&
      @type == other.type
  end
  alias_method :eql?, :==

  def hash
    [@name, @tags, @no_prefix, @type].hash
  end
end

# Same key, but the hash value is computed once at construction time.
class PrecomputedAggregationKey < AggregationKey
  attr_reader :hash

  def initialize(name, tags, no_prefix, type)
    super
    @hash = [@name, @tags, @no_prefix, @type].hash
  end
end

# Struct gets #hash and #eql? for free.
StructKey = Struct.new(:name, :tags, :no_prefix, :type)

registry = {}
10.times do |i|
  registry[AggregationKey.new("some.key", "some_tag:#{i}", false, :c)] = true
end
10.times do |i|
  registry[PrecomputedAggregationKey.new("some_other.key", "some_tag:#{i}", false, :c)] = true
end
10.times do |i|
  registry[StructKey.new("struct.key", "some_tag:#{i}", false, :c)] = true
end

poro = AggregationKey.new("some.key", "some_tag:1", false, :c)
precomputed = PrecomputedAggregationKey.new("some.key", "some_tag:1", false, :c)
struct = StructKey.new("some.key", "some_tag:1", false, :c)

Benchmark.ips do |x|
  x.report("PORO") { registry[poro] }
  x.report("precomputed") { registry[precomputed] }
  x.report("Struct") { registry[struct] }
  x.compare!(order: :baseline)
end
$ ruby /tmp/hash.rb
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [arm64-darwin23]
Warming up --------------------------------------
PORO 375.256k i/100ms
precomputed 586.333k i/100ms
Struct 640.637k i/100ms
Calculating -------------------------------------
PORO 3.750M (± 1.0%) i/s - 18.763M in 5.003452s
precomputed 5.924M (± 1.2%) i/s - 29.903M in 5.048967s
Struct 6.424M (± 1.4%) i/s - 32.672M in 5.087137s
Comparison:
PORO: 3750348.0 i/s
Struct: 6423931.0 i/s - 1.71x faster
precomputed: 5923525.8 i/s - 1.58x faster
$ ruby --yjit /tmp/hash.rb
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) +YJIT [arm64-darwin23]
Warming up --------------------------------------
PORO 528.414k i/100ms
precomputed 608.157k i/100ms
Struct 600.500k i/100ms
Calculating -------------------------------------
PORO 5.653M (± 1.5%) i/s - 28.534M in 5.049266s
precomputed 6.675M (± 5.2%) i/s - 33.449M in 5.028044s
Struct 6.164M (± 4.6%) i/s - 31.226M in 5.076908s
Comparison:
PORO: 5652529.0 i/s
precomputed: 6674636.5 i/s - 1.18x faster
Struct: 6164451.7 i/s - 1.09x faster
TL;DR: the precomputed PORO is a little bit faster with YJIT, but Struct wins without YJIT. The difference between the two isn't huge, so I think precomputed is fine.
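For context on why precomputing helps at all: a Hash lookup calls the key's #hash first, then #eql? against colliding entries, so memoizing the hash avoids re-hashing the four-element array on every lookup. A throwaway class makes the protocol visible:

# Minimal demonstration that Hash#[] drives lookups through #hash/#eql?:
class NoisyKey
  def hash
    puts "hash called"
    42
  end

  def eql?(other)
    puts "eql? called"
    other.is_a?(NoisyKey)
  end
end

h = { NoisyKey.new => :v } # prints "hash called"
h[NoisyKey.new]            # prints "hash called", then "eql? called"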
If I change the benchmark to also include the key allocation (which is more realistic):
Benchmark.ips do |x|
  x.report("PORO") { registry[AggregationKey.new("some.key", "some_tag:1", false, :c)] }
  x.report("precomputed") { registry[PrecomputedAggregationKey.new("some.key", "some_tag:1", false, :c)] }
  x.report("Struct") { registry[StructKey.new("some.key", "some_tag:1", false, :c)] }
  x.compare!(order: :baseline)
end
$ ruby /tmp/hash.rb
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [arm64-darwin23]
Warming up --------------------------------------
PORO 276.197k i/100ms
precomputed 327.271k i/100ms
Struct 415.442k i/100ms
Calculating -------------------------------------
PORO 2.816M (± 1.4%) i/s - 14.086M in 5.003790s
precomputed 3.350M (± 0.9%) i/s - 17.018M in 5.080119s
Struct 4.154M (± 1.5%) i/s - 21.188M in 5.101720s
Comparison:
PORO: 2815622.0 i/s
Struct: 4153942.4 i/s - 1.48x faster
precomputed: 3350230.3 i/s - 1.19x faster
$ ruby --yjit /tmp/hash.rb
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) +YJIT [arm64-darwin23]
Warming up --------------------------------------
PORO 407.112k i/100ms
precomputed 429.832k i/100ms
Struct 434.237k i/100ms
Calculating -------------------------------------
PORO 4.383M (± 1.3%) i/s - 21.984M in 5.016207s
precomputed 4.692M (± 1.0%) i/s - 23.641M in 5.039512s
Struct 4.489M (± 4.8%) i/s - 22.580M in 5.045144s
Comparison:
PORO: 4383350.0 i/s
precomputed: 4691568.5 i/s - 1.07x faster
Struct: 4489005.9 i/s - same-ish: difference falls within error
This would suggest using Struct.
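Part of why Struct pulls ahead once allocation is included is that its allocation, #hash, and #eql? are all implemented in C, with no Ruby-level dispatch for the key comparison. A quick check of the derived value semantics:

# Struct derives hashing and equality from its members (simplified
# two-field key for illustration):
Key = Struct.new(:name, :tags)
a = Key.new("some.key", "some_tag:1")
b = Key.new("some.key", "some_tag:1")
a.eql?(b)        # => true
a.hash == b.hash # => true, so both land in the same Hash bucket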
Interesting, I will test it out in my branch.
For now, here are some results with an Object as the key, using a pre-computed hash: yjit-benchmark-results.txt benchmark-results.txt
SERIES_COUNT=5
UDP Batched: 427,910.0 events/s, Total time: 2.1s
UDP Batched with Aggregation: 510,489.4 events/s, Total time: 1.76s
TBH I was expecting a much bigger difference in such a favorable scenario. I'm also surprised to see the numbers don't change much when cardinality changes, for either implementation. I'd be curious to see a profile of that benchmark.
The emitting threads are just dominated by the events which are not aggregated:
Here is the Vernier file:
I am not going to use Struct for now; it actually makes performance worse.
Pre-computed object:
▶ RUBY_YJIT_ENABLE=true ITERATIONS=6000 SERIES_COUNT=5 THREAD_COUNT=5 ./benchmark/local-udp-throughput
===== UDP batched throughput (5 threads) - total events: 900000 =====
events: 433,375.1/s. Total time: 2.08s
===== UDP batched with aggregation throughput (5 threads) - total events: 900000 =====
events: 510,911.1/s. Total time: 1.76s
===== UDP batched with aggregation and 5 second interval throughput (5 threads) - total events: 900000 =====
events: 511,985.3/s. Total time: 1.76s
Struct:
▶ RUBY_YJIT_ENABLE=true ITERATIONS=6000 SERIES_COUNT=5 THREAD_COUNT=5 ./benchmark/local-udp-throughput
===== UDP sync throughput (5 threads) - total events: 900000 =====
events: 134,456.6/s. Total time: 6.69s
===== UDP batched throughput (5 threads) - total events: 900000 =====
events: 427,699.9/s. Total time: 2.1s
===== UDP batched with aggregation throughput (5 threads) - total events: 900000 =====
events: 475,442.3/s. Total time: 1.89s
===== UDP batched with aggregation and 5 second interval throughput (5 threads) - total events: 900000 =====
events: 480,398.2/s. Total time: 1.87s
@ahayworth we're indeed value-packing distributions/histograms. This is done in these places:
@pedro-stanaka I meant that I see no calls to the aggregator from the Client#histogram method - I think maybe it was just missed?
Duh, my bad. Yeah, I missed that one. Thanks for pointing it out.
The emitting threads are just dominated by the events which are not aggregated:
It's weird, that doesn't make sense to me. The whole idea behind this PR is that emitting the metric is more expensive than grouping it. Even with SERIES_COUNT=5, the vast majority of metrics should be grouped, so it should cut down a lot on how many packets are generated and sent.
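To make the intended win concrete, here's the rough idea in isolation (a sketch of the aggregation concept, not this PR's actual classes): same-key counter samples within a flush interval collapse into a single datagram.

# 1,000 increments of one counter become a single packet at flush time:
counts = Hash.new(0)
1_000.times { counts[["page.views", "env:prod"]] += 1 }

counts.each do |(name, tags), value|
  puts "#{name}:#{value}|c|##{tags}" # => page.views:1000|c|#env:prod
end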
Like, either something isn't working as intended, the benchmark isn't doing what I think it does, or this is a ton of extra complexity for not a whole lot of performance gain.
DogStatsD is a big mess TBH. This PR concerns aggregating metrics, but in our benchmark we are also sending all the other types of "events" that only the Datadog Agent can understand, like service checks and events.
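Concretely, those are calls like the following, which have no aggregation semantics and are serialized and emitted one by one (illustrative calls against dogstatsd-ruby's public API, assuming a `statsd` client instance; the benchmark script's exact calls may differ):

# Events and service checks bypass the aggregator entirely:
statsd.event("deploy finished", "v1.2.3 rolled out to production")
statsd.service_check("app.healthy", Datadog::Statsd::OK)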
If I change the benchmark to just send metrics, the results are dramatically different:
▶ RUBY_YJIT_ENABLE=true ITERATIONS=6000 SERIES_COUNT=5 THREAD_COUNT=5 ./benchmark/local-udp-throughput
===== UDP sync throughput (5 threads) - total events: 900000 =====
events: 133,204.5/s. Total time: 6.76s
===== UDP batched throughput (5 threads) - total events: 900000 =====
events: 450,433.1/s. Total time: 2.0s
===== UDP batched with aggregation throughput (5 threads) - total events: 900000 =====
events: 836,921.2/s. Total time: 1.08s
===== UDP batched with aggregation and 5 second interval throughput (5 threads) - total events: 900000 =====
events: 837,549.0/s. Total time: 1.07s
You can see that we almost double throughput with aggregation. For Shopify itself, we have a skewed distribution of samples: we publish many more timing metrics (mostly distribution). So if we wanted to make a more "real-world" comparison, we would need to send more samples of that type here as well.
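For instance, a hypothetical tweak to the benchmark loop to mirror that skew (the names and the 80/20 ratio are made up for illustration):

# Lean the workload toward timing metrics, as in our production mix:
iterations = Integer(ENV.fetch("ITERATIONS", "6000"))
iterations.times do
  8.times { statsd.distribution("request.duration", rand(100.0)) } # ~80% timings
  2.times { statsd.increment("request.count") }                    # ~20% counters
end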
but in our benchmark we are also sending all the other types of "events" that only the Datadog Agent can understand, like service checks and events.
If I change the benchmark to just send metrics, the results are dramatically different:
Ah, just remove that from the benchmark.
@casperisfine do you have any other remarks here? Can I proceed with the merge?
This is an extension of the work in #366. It started as experimentation, but showed great potential in reducing not only network chatter but also CPU usage in the Ruby applications we run in production.