Thanks for reaching out. I'm always happy to take a look at performance concerns.
I peeked at your binding code (and python-snappy's binding code), and I didn't see anything obviously amiss.
As far as your benchmarks go, while they do appear to be comparing apples to apples, they are deeply biased. Not only are you benchmarking a single data set, but you're benchmarking a pathological one: a file consisting of the same byte repeated. It's entirely plausible that the reference implementation is better optimized for that case while not being faster in all cases.
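To make that concrete, here is a rough, hypothetical sketch (not part of either benchmark suite) that times the snap crate's raw encoder (snap 1.x API) on a repeated-byte buffer versus slightly more varied text. The absolute numbers mean little without a proper harness, but the gap between the two inputs shows how strongly throughput depends on the shape of the data:

```rust
use std::time::Instant;

fn time_compress(label: &str, data: &[u8]) {
    let mut enc = snap::raw::Encoder::new();
    let start = Instant::now();
    // Compress the whole buffer once; a real benchmark would run many
    // iterations under a harness like criterion.
    let compressed = enc.compress_vec(data).unwrap();
    let elapsed = start.elapsed();
    println!(
        "{}: {} -> {} bytes in {:?}",
        label,
        data.len(),
        compressed.len(),
        elapsed
    );
}

fn main() {
    // Pathological input: a single byte repeated, like the original benchmark.
    let repeated = vec![b'a'; 10 * 1024 * 1024];
    // Slightly more varied input: a repeating phrase rather than one byte.
    let varied: Vec<u8> = (0..10 * 1024 * 1024)
        .map(|i| b"the quick brown fox 0123456789"[i % 30])
        .collect();

    time_compress("repeated byte", &repeated);
    time_compress("varied text", &varied);
}
```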
You can actually run benchmarks directly against both the C++ reference implementation and the Rust implementation. Moreover, the benchmarks consist of many different types of data commonly found in the wild. You can run them like so:
$ cd bench
$ cargo bench --features cpp -- --save-baseline snap01
And to view a nice comparative output, you'll want to cargo install critcmp and use that:
$ critcmp snap01 -g '(?:cpp|snap)(.*)'
group snap01/cpp snap01/snap
----- ---------- -----------
/compress/zflat00_html 1.00 93.8±0.53µs 1041.4 MB/sec 1.02 95.8±0.53µs 1019.4 MB/sec
/compress/zflat01_urls 1.00 1170.5±8.21µs 572.0 MB/sec 1.04 1213.6±6.82µs 551.7 MB/sec
/compress/zflat02_jpg 1.00 7.1±0.06µs 16.2 GB/sec 1.05 7.4±0.09µs 15.4 GB/sec
/compress/zflat03_jpg_200 1.18 263.0±1.39ns 725.2 MB/sec 1.00 222.3±1.27ns 858.1 MB/sec
/compress/zflat04_pdf 1.00 10.5±0.12µs 9.1 GB/sec 1.00 10.5±0.06µs 9.1 GB/sec
/compress/zflat05_html4 1.00 394.1±2.24µs 991.1 MB/sec 1.02 401.6±4.23µs 972.7 MB/sec
/compress/zflat06_txt1 1.00 394.7±2.71µs 367.5 MB/sec 1.00 395.0±3.95µs 367.2 MB/sec
/compress/zflat07_txt2 1.00 350.7±3.12µs 340.4 MB/sec 1.00 350.6±1.81µs 340.5 MB/sec
/compress/zflat08_txt3 1.00 1048.4±4.65µs 388.2 MB/sec 1.00 1048.1±7.25µs 388.3 MB/sec
/compress/zflat09_txt4 1.00 1437.1±7.30µs 319.8 MB/sec 1.01 1447.7±14.25µs 317.4 MB/sec
/compress/zflat10_pb 1.00 84.7±0.50µs 1335.8 MB/sec 1.02 86.4±0.48µs 1308.8 MB/sec
/compress/zflat11_gaviota 1.06 307.4±2.87µs 571.8 MB/sec 1.00 290.4±1.43µs 605.4 MB/sec
/decompress/uflat00_html 1.00 36.3±0.21µs 2.6 GB/sec 1.03 37.4±0.23µs 2.5 GB/sec
/decompress/uflat01_urls 1.00 433.2±2.38µs 1545.7 MB/sec 1.05 455.1±2.39µs 1471.3 MB/sec
/decompress/uflat02_jpg 1.00 4.4±0.04µs 25.9 GB/sec 1.00 4.4±0.03µs 26.0 GB/sec
/decompress/uflat03_jpg_200 1.06 120.4±0.68ns 1583.9 MB/sec 1.00 113.2±0.69ns 1684.8 MB/sec
/decompress/uflat04_pdf 1.00 5.6±0.03µs 17.1 GB/sec 1.09 6.1±0.07µs 15.6 GB/sec
/decompress/uflat05_html4 1.00 161.4±1.00µs 2.4 GB/sec 1.06 171.8±1.17µs 2.2 GB/sec
/decompress/uflat06_txt1 1.01 145.2±0.79µs 999.2 MB/sec 1.00 143.4±0.47µs 1011.4 MB/sec
/decompress/uflat07_txt2 1.02 129.0±0.70µs 925.2 MB/sec 1.00 126.4±0.66µs 944.5 MB/sec
/decompress/uflat08_txt3 1.01 385.9±1.72µs 1054.7 MB/sec 1.00 380.3±5.06µs 1070.2 MB/sec
/decompress/uflat09_txt4 1.01 531.9±2.95µs 864.0 MB/sec 1.00 527.7±2.34µs 870.9 MB/sec
/decompress/uflat10_pb 1.00 31.9±0.19µs 3.5 GB/sec 1.09 34.8±0.24µs 3.2 GB/sec
/decompress/uflat11_gaviota 1.00 140.2±0.84µs 1253.6 MB/sec 1.06 148.0±1.60µs 1188.0 MB/sec
As you can see, the reference implementation has the edge in a number of cases, but not all and not by much. The reference implementation has likely grown some optimizations over the years since I wrote this implementation in Rust. Anyone is welcome to find them and port them to Rust code. I don't think I'll be doing it myself any time soon because as far as I can tell, I'd still call this library "on par" with the C++ implementation.
For your case, I'd recommend that your next step be making your benchmark less biased. The best route would probably be to reproduce the benchmarks used by this crate and then run them on your end. If you see vastly different results comparatively, then perhaps there is something wrong in the binding code somewhere. But if you see results comparatively consistent with the above, then the mystery is solved: the reference implementation just happens to be quite a bit better on one pathological case.
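If reproducing the full criterion setup is more than you need at first, even a crude standalone sketch over several real files gives a much better picture than a single repeated-byte input. The following is a hypothetical example (not this crate's benchmark code) that compresses and decompresses whatever file paths you pass it, for instance the corpus files shipped with the reference implementation:

```rust
use std::{env, fs, time::Instant};

fn main() {
    // Pass one or more corpus files on the command line, e.g. the
    // snappy test data (html, urls.10K, fireworks.jpeg, ...).
    for path in env::args().skip(1) {
        let data = fs::read(&path).expect("failed to read input file");

        let mut enc = snap::raw::Encoder::new();
        let start = Instant::now();
        let compressed = enc.compress_vec(&data).unwrap();
        let c_time = start.elapsed();

        let mut dec = snap::raw::Decoder::new();
        let start = Instant::now();
        let decompressed = dec.decompress_vec(&compressed).unwrap();
        let d_time = start.elapsed();

        // Sanity check that the round trip is lossless.
        assert_eq!(data, decompressed);
        println!(
            "{}: {} -> {} bytes, compress {:?}, decompress {:?}",
            path,
            data.len(),
            compressed.len(),
            c_time,
            d_time
        );
    }
}
```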
Thank you for such a thorough reply!
I hope I did not imply that this disparity was in any way the fault of snap; I figured I must have messed up somewhere, or that the difference came from overhead in the bindings.
Indeed, I got the benchmark data wrong; I'm glad you pointed that out and explained how best to act on it. I'll be sure to reference this crate's benchmarks and make mine less biased. Given the thorough benchmarks in this crate, I suspected the problem was on my end, so I'm happy to have your opinion on it now.
Can't thank you enough for your thoughtful response, and again for all the work you do in Rust. :clap:
Hey @BurntSushi
I added a PR; it seems it was indeed my failure to generate better data. Would you mind looking it over, and in particular, was it okay to copy the data used in this crate? I did add the COPYING file, but maybe I should also note that it came from this crate?
Thanks again!
@milesgranger I took the data from the reference implementation. I've also taken ideas from the reference implementation, which is why this crate uses the same license as the reference implementation. Other snappy implementations, like the one in Go, also use the benchmark data.
Hi there,
First, thanks for a lovely lib here, and for all your other contributions in Rust! :100:
As a learning exercise I've made a Python lib wrapping various de/compression libs in Rust: cramjam. In running the benchmarks for the Snappy comparison, it appears slower than python-snappy by a fair margin (granted, both are quite quick); however, given your benchmarks I was expecting them to be a bit closer.
Here is where I expose the interface; I was curious whether you could give me any pointers on things I may have missed.
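For reference, a binding of this sort typically ends up looking something like the following. This is a simplified, hypothetical sketch using PyO3 and the snap crate, not the actual cramjam code, and the exact PyO3 API differs between versions:

```rust
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;
use pyo3::types::PyBytes;

/// Compress raw bytes with Snappy and return a Python `bytes` object.
#[pyfunction]
fn snappy_compress(py: Python<'_>, data: &[u8]) -> PyResult<PyObject> {
    let compressed = snap::raw::Encoder::new()
        .compress_vec(data)
        .map_err(|e| PyValueError::new_err(e.to_string()))?;
    // Copying the result into a PyBytes is one place where binding
    // overhead can creep in.
    Ok(PyBytes::new(py, &compressed).into())
}

/// Decompress Snappy-compressed bytes back into a Python `bytes` object.
#[pyfunction]
fn snappy_decompress(py: Python<'_>, data: &[u8]) -> PyResult<PyObject> {
    let decompressed = snap::raw::Decoder::new()
        .decompress_vec(data)
        .map_err(|e| PyValueError::new_err(e.to_string()))?;
    Ok(PyBytes::new(py, &decompressed).into())
}

// Hypothetical module name; the real package exposes its own layout.
#[pymodule]
fn cramjam_sketch(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(snappy_compress, m)?)?;
    m.add_function(wrap_pyfunction!(snappy_decompress, m)?)?;
    Ok(())
}
```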
Appreciate the help. :smiley: