Thanks for reaching out. I'm always happy to take a look at performance concerns.
I peeked at your binding code (and python-snappy's binding code), and I didn't see anything obviously amiss.
As far as your benchmarks go, while they do appear to be comparing apples to apples, they are deeply biased. Not only are you benchmarking a single data set, but you're benchmarking a pathological one: a file consisting of the same byte repeated. It's entirely plausible that the reference implementation is better optimized for that case while not being faster in all cases.
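To make that concrete, here is a rough, hypothetical sketch (not part of either benchmark suite) that times the snap crate's raw encoder (snap 1.x API) on a repeated-byte buffer versus slightly more varied text. The absolute numbers mean little without a proper harness, but the gap between the two inputs shows how strongly throughput depends on the shape of the data:

```rust
use std::time::Instant;

fn time_compress(label: &str, data: &[u8]) {
    let mut enc = snap::raw::Encoder::new();
    let start = Instant::now();
    // Compress the whole buffer once; a real benchmark would run many
    // iterations under a harness like criterion.
    let compressed = enc.compress_vec(data).unwrap();
    let elapsed = start.elapsed();
    println!(
        "{}: {} -> {} bytes in {:?}",
        label,
        data.len(),
        compressed.len(),
        elapsed
    );
}

fn main() {
    // Pathological input: a single byte repeated, like the original benchmark.
    let repeated = vec![b'a'; 10 * 1024 * 1024];
    // Slightly more varied input: a repeating phrase rather than one byte.
    let varied: Vec<u8> = (0..10 * 1024 * 1024)
        .map(|i| b"the quick brown fox 0123456789"[i % 30])
        .collect();

    time_compress("repeated byte", &repeated);
    time_compress("varied text", &varied);
}
```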
You can actually run benchmarks directly against both the C++ reference implementation and the Rust implementation. Moreover, the benchmarks consist of many different types of data commonly found in the wild. You can run them like so:
$ cd bench
$ cargo bench --features cpp -- --save-baseline snap01
And to view a nice comparative output, you'll want to cargo install critcmp and use that:
$ critcmp snap01 -g '(?:cpp|snap)(.*)'
group snap01/cpp snap01/snap
----- ---------- -----------
/compress/zflat00_html 1.00 93.8±0.53µs 1041.4 MB/sec 1.02 95.8±0.53µs 1019.4 MB/sec
/compress/zflat01_urls 1.00 1170.5±8.21µs 572.0 MB/sec 1.04 1213.6±6.82µs 551.7 MB/sec
/compress/zflat02_jpg 1.00 7.1±0.06µs 16.2 GB/sec 1.05 7.4±0.09µs 15.4 GB/sec
/compress/zflat03_jpg_200 1.18 263.0±1.39ns 725.2 MB/sec 1.00 222.3±1.27ns 858.1 MB/sec
/compress/zflat04_pdf 1.00 10.5±0.12µs 9.1 GB/sec 1.00 10.5±0.06µs 9.1 GB/sec
/compress/zflat05_html4 1.00 394.1±2.24µs 991.1 MB/sec 1.02 401.6±4.23µs 972.7 MB/sec
/compress/zflat06_txt1 1.00 394.7±2.71µs 367.5 MB/sec 1.00 395.0±3.95µs 367.2 MB/sec
/compress/zflat07_txt2 1.00 350.7±3.12µs 340.4 MB/sec 1.00 350.6±1.81µs 340.5 MB/sec
/compress/zflat08_txt3 1.00 1048.4±4.65µs 388.2 MB/sec 1.00 1048.1±7.25µs 388.3 MB/sec
/compress/zflat09_txt4 1.00 1437.1±7.30µs 319.8 MB/sec 1.01 1447.7±14.25µs 317.4 MB/sec
/compress/zflat10_pb 1.00 84.7±0.50µs 1335.8 MB/sec 1.02 86.4±0.48µs 1308.8 MB/sec
/compress/zflat11_gaviota 1.06 307.4±2.87µs 571.8 MB/sec 1.00 290.4±1.43µs 605.4 MB/sec
/decompress/uflat00_html 1.00 36.3±0.21µs 2.6 GB/sec 1.03 37.4±0.23µs 2.5 GB/sec
/decompress/uflat01_urls 1.00 433.2±2.38µs 1545.7 MB/sec 1.05 455.1±2.39µs 1471.3 MB/sec
/decompress/uflat02_jpg 1.00 4.4±0.04µs 25.9 GB/sec 1.00 4.4±0.03µs 26.0 GB/sec
/decompress/uflat03_jpg_200 1.06 120.4±0.68ns 1583.9 MB/sec 1.00 113.2±0.69ns 1684.8 MB/sec
/decompress/uflat04_pdf 1.00 5.6±0.03µs 17.1 GB/sec 1.09 6.1±0.07µs 15.6 GB/sec
/decompress/uflat05_html4 1.00 161.4±1.00µs 2.4 GB/sec 1.06 171.8±1.17µs 2.2 GB/sec
/decompress/uflat06_txt1 1.01 145.2±0.79µs 999.2 MB/sec 1.00 143.4±0.47µs 1011.4 MB/sec
/decompress/uflat07_txt2 1.02 129.0±0.70µs 925.2 MB/sec 1.00 126.4±0.66µs 944.5 MB/sec
/decompress/uflat08_txt3 1.01 385.9±1.72µs 1054.7 MB/sec 1.00 380.3±5.06µs 1070.2 MB/sec
/decompress/uflat09_txt4 1.01 531.9±2.95µs 864.0 MB/sec 1.00 527.7±2.34µs 870.9 MB/sec
/decompress/uflat10_pb 1.00 31.9±0.19µs 3.5 GB/sec 1.09 34.8±0.24µs 3.2 GB/sec
/decompress/uflat11_gaviota 1.00 140.2±0.84µs 1253.6 MB/sec 1.06 148.0±1.60µs 1188.0 MB/sec
As you can see, the reference implementation has the edge in a number of cases, but not all and not by much. The reference implementation has likely grown some optimizations over the years since I wrote this implementation in Rust. Anyone is welcome to find them and port them to Rust code. I don't think I'll be doing it myself any time soon because as far as I can tell, I'd still call this library "on par" with the C++ implementation.
For your case, I'd recommend that your next step be making your benchmark less biased. The best route would probably be to reproduce the benchmarks used by this crate and then run them on your end. If you see vastly different results comparatively, then perhaps there is something wrong in the binding code somewhere. But if you see results comparatively consistent with the above, then the mystery is solved: the reference implementation just happens to be quite a bit better on one pathological case.
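If reproducing the full criterion setup is more than you need at first, even a crude standalone sketch over several real files gives a much better picture than a single repeated-byte input. The following is a hypothetical example (not this crate's benchmark code) that compresses and decompresses whatever file paths you pass it, for instance the corpus files shipped with the reference implementation:

```rust
use std::{env, fs, time::Instant};

fn main() {
    // Pass one or more corpus files on the command line, e.g. the
    // snappy test data (html, urls.10K, fireworks.jpeg, ...).
    for path in env::args().skip(1) {
        let data = fs::read(&path).expect("failed to read input file");

        let mut enc = snap::raw::Encoder::new();
        let start = Instant::now();
        let compressed = enc.compress_vec(&data).unwrap();
        let c_time = start.elapsed();

        let mut dec = snap::raw::Decoder::new();
        let start = Instant::now();
        let decompressed = dec.decompress_vec(&compressed).unwrap();
        let d_time = start.elapsed();

        // Sanity check that the round trip is lossless.
        assert_eq!(data, decompressed);
        println!(
            "{}: {} -> {} bytes, compress {:?}, decompress {:?}",
            path,
            data.len(),
            compressed.len(),
            c_time,
            d_time
        );
    }
}
```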
Thank you for such a thorough reply!
I hope I did not imply that this disparity was in any way the fault of snap; I figured I must have messed up somewhere, or that the difference came from overhead in the bindings.
Indeed, I got the benchmark data wrong; I'm glad you pointed that out and explained how best to act on it. I'll be sure to reference this crate's benchmarks and make mine less biased. Given the thorough benchmarks in this crate, I suspected the problem was on my end, so I'm happy to have your opinion on it now.
Can't thank you enough for your thoughtful response, and again for all the work you do in Rust. :clap:
Hey @BurntSushi
I added a PR; it seems it was indeed my failure to generate better data. Would you mind looking it over, and in particular, was it okay to copy the data used in this crate? I did add the COPYING file, but maybe I should also note that it came from this crate?
Thanks again!
@milesgranger I took the data from the reference implementation. I've also taken ideas from the reference implementation, which is why this crate uses the same license as the reference implementation. Other snappy implementations, like the one in Go, also use the benchmark data.
Hi there,
First, thanks for a lovely lib here, and for all your other contributions in Rust! :100:
As a learning exercise I've made a Python lib wrapping various de/compression libs in Rust: cramjam. In running the benchmarks for the Snappy comparison, it appears slower than python-snappy by a fair margin (granted, both are quite quick); however, given your benchmarks I was expecting them to be a bit closer.
Here is where I expose the interface; I was curious whether you could give me any pointers on things I may have missed.
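For reference, a binding of this sort typically ends up looking something like the following. This is a simplified, hypothetical sketch using PyO3 and the snap crate, not the actual cramjam code, and the exact PyO3 API differs between versions:

```rust
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;
use pyo3::types::PyBytes;

/// Compress raw bytes with Snappy and return a Python `bytes` object.
#[pyfunction]
fn snappy_compress(py: Python<'_>, data: &[u8]) -> PyResult<PyObject> {
    let compressed = snap::raw::Encoder::new()
        .compress_vec(data)
        .map_err(|e| PyValueError::new_err(e.to_string()))?;
    // Copying the result into a PyBytes is one place where binding
    // overhead can creep in.
    Ok(PyBytes::new(py, &compressed).into())
}

/// Decompress Snappy-compressed bytes back into a Python `bytes` object.
#[pyfunction]
fn snappy_decompress(py: Python<'_>, data: &[u8]) -> PyResult<PyObject> {
    let decompressed = snap::raw::Decoder::new()
        .decompress_vec(data)
        .map_err(|e| PyValueError::new_err(e.to_string()))?;
    Ok(PyBytes::new(py, &decompressed).into())
}

// Hypothetical module name; the real package exposes its own layout.
#[pymodule]
fn cramjam_sketch(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(snappy_compress, m)?)?;
    m.add_function(wrap_pyfunction!(snappy_decompress, m)?)?;
    Ok(())
}
```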
Appreciate the help. :smiley: