Address message aggregation overhead

richardstartin commented 2 years ago

We frequently see high overhead in message aggregation:

Screenshot 2022-09-28 at 13 26 45

This PR consists of 3 parts:

Use Map.putIfAbsent to avoid doubling the work done in Map.get then Map.put (note the almost identical stack traces next to eachother). This has to be done via MethodHandles but this can be greatly simplified when JDK7 support is dropped.
Tweak Message for better hashing (note the TreeMap frames below HashMap which indicate hash collisions):
- include the type in hashCode to avoid collisions in more cases
- Use Arrays.hashCode instead of Objects.hash for the tags because the latter triggers an IDE warning for ambiguous varargs parameters
- Remove the mutable done field from equals which can lead to memory leaks if the value changes after inserting a key into the map
- short circuit equals when compared with this
- Implement Comparable for better degradation when has collisions do occur
Change the aggregate types to EnumMap and make it static since it can be. The profile doesn't justify the change and I can drop the commit, but it's a generally worthwhile change.

richardstartin commented 2 years ago

This has led to a really good improvement after deploying to the same service

Screenshot 2022-09-29 at 12 48 06

richardstartin commented 2 years ago

It appears that the defective equals was already partially addressed in #194 but hasn't been released yet. The changes here are still necessary though:

In AlphaNumericMessage.equals there is no need to call super.equals twice
Message.equals must not depend on mutable done if it is to be used as a Map key
equals(this) can be short-cut
Implementing Comparable for map keys limits how bad things can get if hashes collide
Using putIfAbsent halves the cost which halves how bad things can get when hashes collide

However, the improvement we see after deploying this PR cannot be solely attributed to it and will be partially because of #194

richardstartin commented 2 years ago

I can confirm that the majority of the improvement we see is explained by the changes already made in #194 - it would be great if that could be released and pushed out to users. It's not critical to merge the changes made here, though I do still consider them beneficial.

richardstartin commented 2 years ago

Having monitored the service on the master branch of dogstatsd overnight, we have seen cases where implementing comparable and using Map.putIfAbsent would have been beneficial:

Since this takes place within a lock, it can be seen to contribute to the contention profile, so it's worth minimising time spent holding the lock

richardstartin commented 2 years ago

I created a very simple benchmark here to demonstrate the benefits of this change. The benchmark aggregates numeric messages using 4 benchmark threads, contending on an aggregator with 4 shards, measuring aggregation throughput at saturation. Increasing the number of shards doesn’t appear to change much in this set up, but may well do in highly concurrent environments.

The benchmark is parameterised:

it has a set of tag values which produce hash collisions
it has another set of tag values which do not produce hash collisions

These values may look silly, but hash collisions are a fact of life with polynomial hash codes, even if contrived cases look... contrived.

Running on the released 4.1.0 (make sure to clear out your .m2 cache before building to make sure it is downloaded from Maven Central though because the project doesn't use a SNAPSHOT version), when there are hash collisions, the throughput roughly halves:

Benchmark                                       (numAspects)  (numTags)                                (rawTags)   Mode  Cnt      Score     Error   Units
DogStatsDAggregationBenchmark.aggregateCounter             1         10  C}~,C~_,D^~,D__,D`@,Da!,E?~,E@_,EA@,EB!  thrpt    5  20778.006 ± 259.774  ops/ms
DogStatsDAggregationBenchmark.aggregateCounter             1         10                      0,1,2,3,4,5,6,7,8,9  thrpt    5  39062.510 ± 778.435  ops/ms

On this branch, the throughput still degrades under collisions, but is improved in both cases

Benchmark                                       (numAspects)  (numTags)                                (rawTags)   Mode  Cnt      Score      Error   Units
DogStatsDAggregationBenchmark.aggregateCounter             1         10  C}~,C~_,D^~,D__,D`@,Da!,E?~,E@_,EA@,EB!  thrpt    5  26633.658 ±  850.338  ops/ms
DogStatsDAggregationBenchmark.aggregateCounter             1         10                      0,1,2,3,4,5,6,7,8,9  thrpt    5  48218.414 ± 1379.077  ops/ms

However, the case with collisions here isn’t pathological enough to cause degradation to HashMap$TreeNode storage as was seen happening on the master branch (see screenshot in comment), which is where implementing Comparable comes in because Comparable keys produce more balanced trees. If I had more time for this I would construct benchmark parameters where there are enough collisions (there need to be more than 8 keys in a bucket in a HashMap with size larger than 64 for this to happen) but I feel the benchmark already justifies the simple changes made here.

jbachorik commented 2 years ago

Hi @vickenty - I wonder, are there any fundamental issues left for this PR to be resolved? If not, could we get it approved and merged? This would be a nice 'performance win' :)

DataDog / java-dogstatsd-client

Address message aggregation overhead #203