Closed benjamin-bader closed 5 years ago
On second look, LongAdder
seems like it could be simpler to implement in c++ than in Java. It may even be more performant - instead of an array of pointers to individually-allocated cells, we can just make a single aligned buffer of cache-line-sized counters, and take advantage of locality (maybe? does locality come into play when each element of the array is the size of one cache line?).
We could also just do it Java-style and reduce up-front memory costs, since each cell is quite large.
Just so it's clear from reading the issue (and not also perusing the JDK sources), a LongAdder is (loosely defined) a hashtable of atomic counters indexed by thread ID. The table is initially empty, but as threads contend for the base counter, it grows. It continues to grow with contention until it reaches a size >= the number of CPU cores. Meanwhile, each thread maintains its own local index into the table. If a thread contends with another for a counter, it will "rehash" its index and try again.
Eventually, with threads that are assigned to processors, this will stabilize into a contention-free perfect hash. Nobody (who would care about this codebase) actually writes code that assigns one thread per processor, but extensive experience has shown that this design is much more performant than the naïve approach of just incrementing a single atomic counter.
The devil is in the details, of course - to keep things speedy, Java uses a hand-rolled spinlock for creating or updating the table, it's all very delicate, and is the kind of concurrent code only Doug Lea could write. Reading it feels a bit like watching videos from the Isle of Man TT.
Finally, all this writing to myself is motivated by observed performance in a service whose code or benchmarks I'm unfortunately not able to make public. Suffice to say that counters using simple std::atomic_long
were a surprising bottleneck.
I have a more-or-less transliteration of LongAdder that is functional, and while benchmarking shows it to be 2x faster than simple atomics under contention, benchmarking also shows it to be 5x slower than Java's LongAdder. I'll throw up a PR, but it seems like something is really off for there to be such a delta between the implementations.
Mystery solved - running a release build of our benchmark shows it as 1.7x faster than the Java equivalent. Not so bad after all! This was a rookie benchmarking error, no doubt I'm making many more.
details: late-2018 15" MBP, 2.2 Ghz Core i7, 16GB RAM bench is 24 threads each doing 1,000,000 increments (see LongAdderTests.cc and LongAdderBench.java) median of 10 executions: 30ms (cppmetrics) vs 52ms (java LongAdder)
Dropwizard metrics uses a
LongAdder
to defer/avoid synchronization in their counter implementations. No such thing exists in the c++ stdlib, unfortunately, and counters definitely suffer for the lack. They're subtle things to implement, so I haven't attempted it, but it would be important to implement before using this in a high-throughput and/or low-latency service.