tail latency benchmark - Githubissues

jonathanstrong commented 6 years ago

Hey,

Just did a benchmark that's a bit different from what you have published in the crate and thought it would be helpful to you to have the results. This is a latency test, under the following conditions:

4 consuming threads, each on an isolated (logical) core (xeon cpu, ubuntu 16.04 lowlatency kernel, on a cloud server with "dedicated vCpu," whatever that is). These threads spin on try_recv and are (basically) the only activity on the core.
20 producing threads (across four other cores, which were also performing other non-saturating work) that (relatively) sporadically send messages (each sleep small randomized duration between each send) to each of the consuming threads. At first I tried to clone the Receiver, but then realized that messages would only go to one of the cloned instances, so used a Vec<Receiver<_>>, spinning on try_send one at a time. (Note: the legend in the benchmark is wrong on this front, what was really tested was a Vec of mpsc channels.)
The point of this test is to model situations where many threads are listening for incoming data (via tcp, etc) and several threads (with some authoritative state) are standing ready to process any data supplied by the listeners as quickly as possible when it arrives.
The producing threads send a chrono::DateTime<Utc>, and when received, a receiving thread compares the sent time to a new call to Utc::now() and records the duration in nanos with an HdrHistogram (there is some measurement noise here, but best method I have found so far).
Test ran over four hours, and around 11.2 million messages were sent/received.
Same test was run with multiqueue's BroadcastSender and BroadcastReceiver for comparison. multiqueue is modeled after lmax disruptor.
I don't have time at the moment to produce a sanitized version of the code for you, but it was very similar to this gist I previously shared with multiqueue.

Here is a timeline/log histogram view of the results:

A couple other looks at the histogram:

Full results (should work with typical HdrHistogram log viewers when decompressed, "v2z" is extension I came up with, not sure what other people use): multiqueue-crossbeam-channel-tail-latency.v2z.gz

I appreciate that you guys have posted detailed benchmarks of your library compared to others as well as working on an improved channels implementation in general. Hope this helps you!

ghost commented 6 years ago

Thanks so much for the very detailed report!

It looks like the tail latency of crossbeam-channel is slightly worse than the latency of multiqueue. I'm curious if you have any commentary on these results? Would the difference in tail latency between these two crates matter in your use cases?

Also, FYI: This crate is going through a big revamp in #41 and the performance characteristics will almost surely change afterwards (for better or worse, but hopefully better).

jonathanstrong commented 6 years ago

Yes, it's slightly worse, although it was surprising to me that it was so close since the other library is more geared towards latency (as far as I know). In my use case this is a key metric and I'm using whatever is fastest, within reason.

Thanks for the heads up about the upcoming changes. I'm definitely watching this and the other crossbeam libraries with interest as they develop. Will post back with any big changes I notice. Had also wanted to check mpsc against crossbeam_channel head-to-head.

One other question, do you have any plans (or would you consider) offering a "broadcast"-type channel in the library directly? By broadcast I mean spmc (or mpmc), but each consumer gets every message. I'm not sure how common it is elsewhere but in my use cases it's frequent to need to send the same data to multiple threads, and it's been tougher to find good examples or info about how to do that the best way.

Object905 commented 6 years ago

@jonathanstrong take a look at the bus crate.

ghost commented 6 years ago

@jonathanstrong There's a simple broadcasting adapter in this PR: https://github.com/crossbeam-rs/crossbeam-channel/pull/33

Would something like that work for you?

ghost commented 6 years ago

@jonathanstrong

Just published version 0.2.0 of crossbeam-channel, which brings noticeable performance improvements in my benchmarks. I wonder how it'd fare on your tail latency benchmark so if you could run it again, that'd be awesome!

jonathanstrong commented 6 years ago

may be a few days, but will do! thanks for the heads up.

jonathanstrong commented 6 years ago

results are in!

benchmark details:

designed to mimic having a moderate number of listening threads waiting on IO of some kind, with the goal of acting as quickly as possible when it arrives
20 senders (one thread each), 4 consumers (one thread each)
each consumer thread pinned to a cpu core that has nothing else happening on it (via isolcpus). also, generally no other activity on the machine during the tests
consumer threads busy spin on the receiver
producer threads send sporadically, sleeping in between sends
producer threads send the current time, and consumer threads compare the time when they receive the message to calculate elapsed duration (potential measurement noise from the system clock, but best method I have found)
compiled in release mode with lto, but not target-cpu=native
server is from a cloud provider with a virtualized environment, but a "cpu optimized" premium tier
used several system commands to give the process higher priority
ran for 10 minutes each (couldn't do 4 hours this time, unfortunately). Around 45k messages processed by each consumer
used crossbeam bounded with capacity 8
each producer thread has a Vec of Senders, and sends a copy of the time to each sequentially
compared against multiqueue::broadcast_queue (mpmc) and std::mpsc channels (mpsc with same Vec<Sender<_>> approach as crossbeam). multiqueue (paging @schets) is based on the LMAX disruptor pattern and is designed with latency in mind specifically

First up, std::mpsc. At first glance it had dramatically worse results than the other two libraries, with an ugly 2ms worst case:

However, this only happened once at the beginning of the run, which is pretty common when measuring latency (but didn't happen to the others). Hopefully any programs relying on std are allowed to warm up before money or lives are on the line. Here is the data with the spike excluded:

99.99% at 30u, with a worst case of ~100u. Not terrible.

Now, for our main event, crossbeam_channel vs multiqueue:

On worst case, multiqueue still edges out crossbeam by a bit, but crossbeam is arguably better across the board.

Excluding the worst spike for each:

Closeup of a distinct edge for crossbeam around 99%:

Both libraries show a persistent, up-to 10u edge over std::mpsc from 90% on:

Raw hdrhistogram log data: crossbeam-v0.2-latency-bench.v2z.gz

But wait, there's more! Since the results were so close, I ran crossbeam and multiqueue for another 20 minutes each (around 950k messages):

Best look (worst spike removed for each):

The additional data generally confirms the first run. Crossbeam is a titch slower at the far tail end (possibly measurement noise), and a titch faster around 99%.

Raw hdrhistogram log data for the second run: crossbeam-v0.2-latency-bench-2.gz

Final notes: I plan on submitting the benchmark as a pull request at some point, but need to clean it up and untangle some proprietary code from it first. Note, however, it's unlikely you would get similar results running this on your laptop while doing other work.

Edit: realized after the fact that some of the charts label the unit as milliseconds. That's by mistake, all the measurements are in nanoseconds.

schets commented 6 years ago

Nice benchmarks! One thing I noticed is that the multiqueue gist you posted here uses a blocking wait internally, so even though you only try_recv the senders still have to deal with that (albeit post-send, so it's more of a throughput thing).

I suspect the latency differences near the tails come about from minor implementation differences - multiqueue-broadcast for example has to explicitly track reader indices and writer-contention on reading/updating that internal view might be related to the difference around the 99th%, especially with such a small queue. I suspect that multiqueue winning at the tail comes about as a result of crossbeam-channel having to write to multiple receivers to achieve broadcast if I understand your benchmark right.

Multiqueue supports multiple writers, but most of the implementation is optimized for single writers with broadcast spsc using relatively large queue sizes.

I'm actually in the process of porting over some optimizations to crossbeam-channel that might give it the upper hand in your implementation

crossbeam-rs / crossbeam

tail latency benchmark #200

benchmark details: