@ahayworth @casperisfine I have updated the PR addressing most, if not all, points from your reviews. Let me know what you think of this version. Going to wait for #368 so I can benchmark this properly.
One question I had though - is there any value-packing supported for histograms? I noticed we don't call the aggregator for that at all. It's not used heavily internally as far as I can tell, but it did jump out at me.
@ahayworth we're indeed value-packing distributions/histograms. This is done in these places:
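For reference, here's roughly what a value-packed datagram looks like on the wire: several samples of the same metric joined into one payload instead of one packet each (an illustrative sketch of the datagram shape, not the gem's actual serializer):

# Three distribution samples packed into a single datagram
# (illustrative; assumes an agent that accepts multi-value payloads):
samples = [1.5, 2.0, 42.0]
datagram = "response.time:#{samples.join(':')}|d|#env:prod"
puts datagram # => response.time:1.5:2.0:42.0|d|#env:prod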
FYI:
# frozen_string_literal: true

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'benchmark-ips'
end

# Plain Ruby object key: #hash is recomputed on every lookup.
class AggregationKey
  attr_reader :name, :tags, :no_prefix, :type

  def initialize(name, tags, no_prefix, type)
    @name = name
    @tags = tags
    @no_prefix = no_prefix
    @type = type
  end

  def ==(other)
    other.is_a?(self.class) &&
      @name == other.name &&
      @tags == other.tags &&
      @no_prefix == other.no_prefix &&
      @type == other.type
  end
  alias_method :eql?, :==

  def hash
    [@name, @tags, @no_prefix, @type].hash
  end
end

# Same key, but the hash value is computed once at construction time.
class PrecomputedAggregationKey < AggregationKey
  attr_reader :hash

  def initialize(name, tags, no_prefix, type)
    super
    @hash = [@name, @tags, @no_prefix, @type].hash
  end
end

# Struct gets #hash and #eql? for free.
StructKey = Struct.new(:name, :tags, :no_prefix, :type)

registry = {}
10.times do |i|
  registry[AggregationKey.new("some.key", "some_tag:#{i}", false, :c)] = true
end
10.times do |i|
  registry[PrecomputedAggregationKey.new("some_other.key", "some_tag:#{i}", false, :c)] = true
end
10.times do |i|
  registry[StructKey.new("struct.key", "some_tag:#{i}", false, :c)] = true
end

poro = AggregationKey.new("some.key", "some_tag:1", false, :c)
precomputed = PrecomputedAggregationKey.new("some.key", "some_tag:1", false, :c)
struct = StructKey.new("some.key", "some_tag:1", false, :c)

Benchmark.ips do |x|
  x.report("PORO") { registry[poro] }
  x.report("precomputed") { registry[precomputed] }
  x.report("Struct") { registry[struct] }
  x.compare!(order: :baseline)
end
$ ruby /tmp/hash.rb
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [arm64-darwin23]
Warming up --------------------------------------
PORO 375.256k i/100ms
precomputed 586.333k i/100ms
Struct 640.637k i/100ms
Calculating -------------------------------------
PORO 3.750M (± 1.0%) i/s - 18.763M in 5.003452s
precomputed 5.924M (± 1.2%) i/s - 29.903M in 5.048967s
Struct 6.424M (± 1.4%) i/s - 32.672M in 5.087137s
Comparison:
PORO: 3750348.0 i/s
Struct: 6423931.0 i/s - 1.71x faster
precomputed: 5923525.8 i/s - 1.58x faster
$ ruby --yjit /tmp/hash.rb
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) +YJIT [arm64-darwin23]
Warming up --------------------------------------
PORO 528.414k i/100ms
precomputed 608.157k i/100ms
Struct 600.500k i/100ms
Calculating -------------------------------------
PORO 5.653M (± 1.5%) i/s - 28.534M in 5.049266s
precomputed 6.675M (± 5.2%) i/s - 33.449M in 5.028044s
Struct 6.164M (± 4.6%) i/s - 31.226M in 5.076908s
Comparison:
PORO: 5652529.0 i/s
precomputed: 6674636.5 i/s - 1.18x faster
Struct: 6164451.7 i/s - 1.09x faster
TL;DR: the precomputed PORO is a little bit faster with YJIT, but Struct wins without YJIT. The difference between the two isn't huge, so I think precomputed is fine.
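For context on why precomputing helps at all: a Hash lookup calls the key's #hash first, then #eql? against colliding entries, so memoizing the hash avoids re-hashing the four-element array on every lookup. A throwaway class makes the protocol visible:

# Minimal demonstration that Hash#[] drives lookups through #hash/#eql?:
class NoisyKey
  def hash
    puts "hash called"
    42
  end

  def eql?(other)
    puts "eql? called"
    other.is_a?(NoisyKey)
  end
end

h = { NoisyKey.new => :v } # prints "hash called"
h[NoisyKey.new]            # prints "hash called", then "eql? called"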
If I change the benchmark to also include the key allocation (which is more realistic):
Benchmark.ips do |x|
  x.report("PORO") { registry[AggregationKey.new("some.key", "some_tag:1", false, :c)] }
  x.report("precomputed") { registry[PrecomputedAggregationKey.new("some.key", "some_tag:1", false, :c)] }
  x.report("Struct") { registry[StructKey.new("some.key", "some_tag:1", false, :c)] }
  x.compare!(order: :baseline)
end
$ ruby /tmp/hash.rb
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [arm64-darwin23]
Warming up --------------------------------------
PORO 276.197k i/100ms
precomputed 327.271k i/100ms
Struct 415.442k i/100ms
Calculating -------------------------------------
PORO 2.816M (± 1.4%) i/s - 14.086M in 5.003790s
precomputed 3.350M (± 0.9%) i/s - 17.018M in 5.080119s
Struct 4.154M (± 1.5%) i/s - 21.188M in 5.101720s
Comparison:
PORO: 2815622.0 i/s
Struct: 4153942.4 i/s - 1.48x faster
precomputed: 3350230.3 i/s - 1.19x faster
$ ruby --yjit /tmp/hash.rb
ruby 3.3.3 (2024-06-12 revision f1c7b6f435) +YJIT [arm64-darwin23]
Warming up --------------------------------------
PORO 407.112k i/100ms
precomputed 429.832k i/100ms
Struct 434.237k i/100ms
Calculating -------------------------------------
PORO 4.383M (± 1.3%) i/s - 21.984M in 5.016207s
precomputed 4.692M (± 1.0%) i/s - 23.641M in 5.039512s
Struct 4.489M (± 4.8%) i/s - 22.580M in 5.045144s
Comparison:
PORO: 4383350.0 i/s
precomputed: 4691568.5 i/s - 1.07x faster
Struct: 4489005.9 i/s - same-ish: difference falls within error
This would suggest using Struct.
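Part of why Struct pulls ahead once allocation is included is that its allocation, #hash, and #eql? are all implemented in C, with no Ruby-level dispatch for the key comparison. A quick check of the derived value semantics:

# Struct derives hashing and equality from its members (simplified
# two-field key for illustration):
Key = Struct.new(:name, :tags)
a = Key.new("some.key", "some_tag:1")
b = Key.new("some.key", "some_tag:1")
a.eql?(b)        # => true
a.hash == b.hash # => true, so both land in the same Hash bucket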
Interesting, I will test it out in my branch.
For now, here are some results with an Object as the key, using a pre-computed hash: yjit-benchmark-results.txt benchmark-results.txt
SERIES_COUNT=5
UDP Batched: 427,910.0 events/s, Total time: 2.1s
UDP Batched with Aggregation: 510,489.4 events/s, Total time: 1.76s
TBH I was expecting a much bigger difference in such a favorable scenario. I'm also surprised to see the numbers don't change much when cardinality changes, for either implementation. I'd be curious to see a profile of that benchmark.
The emitting threads are just dominated by the events which are not aggregated:
Here is the Vernier file:
I am not going to use Struct for now; it actually makes performance worse.
Pre-computed object:
▶ RUBY_YJIT_ENABLE=true ITERATIONS=6000 SERIES_COUNT=5 THREAD_COUNT=5 ./benchmark/local-udp-throughput
===== UDP batched throughput (5 threads) - total events: 900000 =====
events: 433,375.1/s. Total time: 2.08s
===== UDP batched with aggregation throughput (5 threads) - total events: 900000 =====
events: 510,911.1/s. Total time: 1.76s
===== UDP batched with aggregation and 5 second interval throughput (5 threads) - total events: 900000 =====
events: 511,985.3/s. Total time: 1.76s
Struct:
▶ RUBY_YJIT_ENABLE=true ITERATIONS=6000 SERIES_COUNT=5 THREAD_COUNT=5 ./benchmark/local-udp-throughput
===== UDP sync throughput (5 threads) - total events: 900000 =====
events: 134,456.6/s. Total time: 6.69s
===== UDP batched throughput (5 threads) - total events: 900000 =====
events: 427,699.9/s. Total time: 2.1s
===== UDP batched with aggregation throughput (5 threads) - total events: 900000 =====
events: 475,442.3/s. Total time: 1.89s
===== UDP batched with aggregation and 5 second interval throughput (5 threads) - total events: 900000 =====
events: 480,398.2/s. Total time: 1.87s
@ahayworth we're indeed value-packing distributions/histograms. This is done in these places:
@pedro-stanaka I meant that I see no calls to the aggregator from the Client#histogram method - I think maybe it was just missed?
Duh, my bad. Yeah, I missed that one. Thanks for pointing it out.
The emitting threads are just dominated by the events which are not aggregated:
It's weird, that doesn't make sense to me. The whole idea behind this PR is that emitting the metric is more expensive than grouping it. Even with SERIES_COUNT=5, the vast majority of metrics should be grouped, so it should cut down a lot on how many packets are generated and sent.
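To make the intended win concrete, here's the rough idea in isolation (a sketch of the aggregation concept, not this PR's actual classes): same-key counter samples within a flush interval collapse into a single datagram.

# 1,000 increments of one counter become a single packet at flush time:
counts = Hash.new(0)
1_000.times { counts[["page.views", "env:prod"]] += 1 }

counts.each do |(name, tags), value|
  puts "#{name}:#{value}|c|##{tags}" # => page.views:1000|c|#env:prod
end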
Like, either something isn't working as intended, the benchmark isn't doing what I think it does, or this is a ton of extra complexity for not a whole lot of performance gain.
DogStatsD is a big mess TBH. This PR concerns aggregating metrics, but in our benchmark we are also sending all the other types of "events" that only the Datadog Agent can understand, like service checks and events.
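Concretely, those are calls like the following, which have no aggregation semantics and are serialized and emitted one by one (illustrative calls against dogstatsd-ruby's public API, assuming a `statsd` client instance; the benchmark script's exact calls may differ):

# Events and service checks bypass the aggregator entirely:
statsd.event("deploy finished", "v1.2.3 rolled out to production")
statsd.service_check("app.healthy", Datadog::Statsd::OK)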
If I change the benchmark to just send metrics, the results are dramatically different:
▶ RUBY_YJIT_ENABLE=true ITERATIONS=6000 SERIES_COUNT=5 THREAD_COUNT=5 ./benchmark/local-udp-throughput
===== UDP sync throughput (5 threads) - total events: 900000 =====
events: 133,204.5/s. Total time: 6.76s
===== UDP batched throughput (5 threads) - total events: 900000 =====
events: 450,433.1/s. Total time: 2.0s
===== UDP batched with aggregation throughput (5 threads) - total events: 900000 =====
events: 836,921.2/s. Total time: 1.08s
===== UDP batched with aggregation and 5 second interval throughput (5 threads) - total events: 900000 =====
events: 837,549.0/s. Total time: 1.07s
You can see that we almost double throughput with aggregation. For Shopify itself, we have a skewed distribution of samples: we publish many more timing metrics (mostly distribution). So if we wanted to make a more "real-world" comparison, we would need to send more samples of that type here as well.
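For instance, a hypothetical tweak to the benchmark loop to mirror that skew (the names and the 80/20 ratio are made up for illustration):

# Lean the workload toward timing metrics, as in our production mix:
iterations = Integer(ENV.fetch("ITERATIONS", "6000"))
iterations.times do
  8.times { statsd.distribution("request.duration", rand(100.0)) } # ~80% timings
  2.times { statsd.increment("request.count") }                    # ~20% counters
end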
but in our benchmark we are also sending all the other types of "events" that only the Datadog Agent can understand, like service checks and events.
If I change the benchmark to just send metrics, the results are dramatically different:
Ah, just remove that from the benchmark.
@casperisfine do you have any other remarks here? Can I proceed with the merge?
This is an extension of the work in #366. It started as experimentation, but showed great potential in reducing not only network chatter but also CPU usage in the Ruby applications we run in production.