crossbeam-rs / crossbeam

Tools for concurrent programming in Rust

tail latency benchmark #200

Open jonathanstrong opened 6 years ago

jonathanstrong commented 6 years ago

Hey,

Just ran a benchmark that's a bit different from what you've published for the crate and thought the results might be helpful to you. This is a latency test, run under the following conditions:

Here is a timeline/log histogram view of the results:

image

A couple other looks at the histogram:

image

image

Full results (should work with typical HdrHistogram log viewers when decompressed; "v2z" is an extension I came up with, not sure what other people use): multiqueue-crossbeam-channel-tail-latency.v2z.gz
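For reference, the general shape of the harness is roughly the sketch below. It uses std::sync::mpsc as a stand-in for the channel under test and the hdrhistogram crate for recording; the message count, pacing, and percentiles printed here are placeholders, not the actual benchmark parameters.

```rust
// Rough sketch of a channel tail-latency harness (not the actual benchmark):
// send an Instant through the channel, busy-poll on the receiving side, and
// record the elapsed nanoseconds into an HdrHistogram.
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

use hdrhistogram::Histogram;

const MESSAGES: usize = 10_000; // placeholder count

fn main() {
    let (tx, rx) = mpsc::channel::<Instant>();
    // 3 significant figures is plenty of resolution for latency percentiles.
    let mut hist = Histogram::<u64>::new(3).expect("histogram");

    let producer = thread::spawn(move || {
        for _ in 0..MESSAGES {
            tx.send(Instant::now()).unwrap();
            // Pace the sends so we measure per-message latency, not backlog.
            thread::sleep(Duration::from_micros(100));
        }
    });

    for _ in 0..MESSAGES {
        // Busy-poll so receive latency isn't dominated by thread parking.
        let sent_at = loop {
            match rx.try_recv() {
                Ok(t) => break t,
                Err(mpsc::TryRecvError::Empty) => std::hint::spin_loop(),
                Err(mpsc::TryRecvError::Disconnected) => return,
            }
        };
        hist.record(sent_at.elapsed().as_nanos() as u64).unwrap();
    }
    producer.join().unwrap();

    for q in [0.50, 0.99, 0.9999] {
        println!("p{:<7} {} ns", q * 100.0, hist.value_at_quantile(q));
    }
    println!("max      {} ns", hist.max());
}
```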

I appreciate that you've posted detailed benchmarks of your library compared to others, as well as your ongoing work on an improved channels implementation. Hope this helps!

ghost commented 6 years ago

Thanks so much for the very detailed report!

It looks like the tail latency of crossbeam-channel is slightly worse than the latency of multiqueue. I'm curious if you have any commentary on these results? Would the difference in tail latency between these two crates matter in your use cases?

Also, FYI: This crate is going through a big revamp in #41 and the performance characteristics will almost surely change afterwards (for better or worse, but hopefully better).

jonathanstrong commented 6 years ago

Yes, it's slightly worse, although I was surprised it was so close, since the other library is more geared toward latency (as far as I know). In my use case this is a key metric, and I'm using whatever is fastest, within reason.

Thanks for the heads up about the upcoming changes. I'm definitely watching this and the other crossbeam libraries with interest as they develop. Will post back with any big changes I notice. Had also wanted to check mpsc against crossbeam_channel head-to-head.

One other question, do you have any plans (or would you consider) offering a "broadcast"-type channel in the library directly? By broadcast I mean spmc (or mpmc), but each consumer gets every message. I'm not sure how common it is elsewhere but in my use cases it's frequent to need to send the same data to multiple threads, and it's been tougher to find good examples or info about how to do that the best way.
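For context, the naive pattern I've been hand-rolling looks roughly like the sketch below: a fan-out wrapper that keeps one sender per subscriber and clones each message to all of them. It uses std::sync::mpsc purely for illustration, and the Broadcaster type and its methods are made-up names, not anything from crossbeam.

```rust
// Naive broadcast fan-out: one Sender per consumer, clone the message to each.
// Sketch only; the same pattern applies to any mpsc/mpmc channel, at the cost
// of one clone and one send per subscriber per message.
use std::sync::mpsc::{channel, Receiver, Sender};

struct Broadcaster<T: Clone> {
    outputs: Vec<Sender<T>>,
}

impl<T: Clone> Broadcaster<T> {
    fn new() -> Self {
        Broadcaster { outputs: Vec::new() }
    }

    /// Register a new consumer; it receives every message sent from now on.
    fn subscribe(&mut self) -> Receiver<T> {
        let (tx, rx) = channel();
        self.outputs.push(tx);
        rx
    }

    /// Deliver `msg` to every live subscriber, pruning any that have hung up.
    fn send(&mut self, msg: T) {
        self.outputs.retain(|tx| tx.send(msg.clone()).is_ok());
    }
}

fn main() {
    let mut bus = Broadcaster::new();
    let a = bus.subscribe();
    let b = bus.subscribe();
    bus.send(42u32);
    assert_eq!(a.recv().unwrap(), 42);
    assert_eq!(b.recv().unwrap(), 42);
}
```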

Object905 commented 6 years ago

@jonathanstrong take a look at the bus crate.

ghost commented 6 years ago

@jonathanstrong There's a simple broadcasting adapter in this PR: https://github.com/crossbeam-rs/crossbeam-channel/pull/33

Would something like that work for you?

ghost commented 6 years ago

@jonathanstrong

Just published version 0.2.0 of crossbeam-channel, which brings noticeable performance improvements in my benchmarks. I wonder how it'd fare on your tail latency benchmark, so if you could run it again, that'd be awesome!

jonathanstrong commented 6 years ago

may be a few days, but will do! thanks for the heads up.

jonathanstrong commented 6 years ago

results are in!

benchmark details:

First up, std::mpsc. At first glance it had dramatically worse results than the other two libraries, with an ugly 2ms worst case:

image

However, this only happened once at the beginning of the run, which is pretty common when measuring latency (but didn't happen to the others). Hopefully any programs relying on std are allowed to warm up before money or lives are on the line. Here is the data with the spike excluded:

image

99.99% at 30µs, with a worst case of ~100µs. Not terrible.

Now, for our main event, crossbeam_channel vs multiqueue:

image

At the worst case, multiqueue still edges out crossbeam by a bit, but crossbeam is arguably better across the rest of the distribution.

Excluding the worst spike for each:

image

Closeup of a distinct edge for crossbeam around 99%:

image

Both libraries show a persistent edge of up to 10µs over std::mpsc from the 90th percentile on:

image

Raw hdrhistogram log data: crossbeam-v0.2-latency-bench.v2z.gz

But wait, there's more! Since the results were so close, I ran crossbeam and multiqueue for another 20 minutes each (around 950k messages):

image

Best look (worst spike removed for each):

image

The additional data generally confirms the first run. Crossbeam is a titch slower at the far tail end (possibly measurement noise), and a titch faster around 99%.

Raw hdrhistogram log data for the second run: crossbeam-v0.2-latency-bench-2.gz

Final notes: I plan on submitting the benchmark as a pull request at some point, but need to clean it up and untangle some proprietary code from it first. Note, however, that it's unlikely you would get similar results running this on your laptop while doing other work.

Edit: I realized after the fact that some of the charts label the unit as milliseconds. That's a mistake; all the measurements are in nanoseconds.

schets commented 6 years ago

Nice benchmarks! One thing I noticed is that the multiqueue gist you posted here uses a blocking wait internally, so even though you only try_recv, the senders still have to deal with that (albeit post-send, so it's more of a throughput concern).

I suspect the latency differences near the tails come from minor implementation differences. For example, multiqueue's broadcast mode has to explicitly track reader indices, and writer contention on reading/updating that internal view might be related to the difference around the 99th percentile, especially with such a small queue. I suspect multiqueue's win at the far tail comes from crossbeam-channel having to write to multiple receivers to achieve broadcast, if I understand your benchmark right.
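Roughly, the bookkeeping I mean looks like the toy model below. This is not multiqueue's actual implementation (which is a lock-free bounded ring), and the names are made up; it only shows the shape of per-reader cursor tracking, as opposed to fanning each message out to N separate channels.

```rust
// Toy model of broadcast via per-reader cursors: the writer appends each
// message once, and every reader tracks its own index into a shared log.
use std::sync::{Arc, RwLock};

struct SharedLog<T> {
    items: RwLock<Vec<T>>,
}

struct Reader<T> {
    log: Arc<SharedLog<T>>,
    cursor: usize, // this reader's position, tracked per reader rather than per message
}

impl<T: Clone> Reader<T> {
    fn try_recv(&mut self) -> Option<T> {
        let items = self.log.items.read().unwrap();
        let next = items.get(self.cursor).cloned();
        if next.is_some() {
            self.cursor += 1;
        }
        next
    }
}

fn send<T>(log: &SharedLog<T>, msg: T) {
    // One write per message regardless of reader count. A real bounded queue
    // also has to consult the slowest reader's cursor before reusing a slot,
    // which is where contention on that shared view can show up near the tail.
    log.items.write().unwrap().push(msg);
}

fn main() {
    let log = Arc::new(SharedLog { items: RwLock::new(Vec::new()) });
    let mut r1 = Reader { log: Arc::clone(&log), cursor: 0 };
    let mut r2 = Reader { log: Arc::clone(&log), cursor: 0 };
    send(&log, "tick");
    assert_eq!(r1.try_recv(), Some("tick"));
    assert_eq!(r2.try_recv(), Some("tick"));
    assert_eq!(r1.try_recv(), None);
}
```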

Multiqueue supports multiple writers, but most of the implementation is optimized for single writers with broadcast spsc using relatively large queue sizes.

I'm actually in the process of porting some optimizations over to crossbeam-channel that might give it the upper hand in your benchmark.