PSeitz / lz4_flex

Fastest pure Rust implementation of LZ4 compression/decompression.
MIT License

Implement frame format #13

Closed by arthurprs 3 years ago

arthurprs commented 3 years ago

Hello there :wave: I managed to get a mostly complete and efficient implementation of the frame format on top of your work. It'd be great if we could eventually reach a production-level LZ4 implementation in Rust :tada:

Please forgive the size of the PR; it's a bit larger than it should be, as I sneaked a few improvements into it. While I was grokking the codebase I fixed or changed a few bits to be clearer and more idiomatic.

It'd be great if you could take an initial look :smile:

TODO:

PSeitz commented 3 years ago

Nice work! What I'm currently working on, and what is probably the cause of the merge conflicts, is supporting slices as parameters instead of Vecs (https://github.com/PSeitz/lz4_flex/issues/11). This is done to better support the frame format and also to apply some performance optimizations. There are some implications that are not so easy to handle in the unsafe decompression, which I'm currently looking into. It's probably better to use the new API.

Currently it accepts a Sink, which is a u8 slice plus a position. I'm not completely sure whether the slice and position should be wrapped in a struct or accepted as separate parameters in the API.
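A Sink along those lines could be sketched as below. This is purely illustrative; the names and the actual type in lz4_flex may differ:

```rust
// Illustrative sketch of a Sink: an output byte slice plus a write position.
// Hypothetical names; the real lz4_flex type may look different.
struct Sink<'a> {
    output: &'a mut [u8],
    pos: usize,
}

impl<'a> Sink<'a> {
    fn new(output: &'a mut [u8]) -> Self {
        Sink { output, pos: 0 }
    }

    /// Append bytes at the current position, advancing it.
    fn push(&mut self, data: &[u8]) {
        self.output[self.pos..self.pos + data.len()].copy_from_slice(data);
        self.pos += data.len();
    }

    /// Bytes written so far.
    fn written(&self) -> &[u8] {
        &self.output[..self.pos]
    }
}
```

Wrapping the pair in a struct keeps the invariant `pos <= output.len()` in one place instead of threading two loose parameters through every call.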

arthurprs commented 3 years ago

EDIT: nevermind this comment, I misunderstood your comment. I should've checked the latest master code first.

~Honestly, after writing this changeset I don't think supporting a Vec for output internally is worth the effort. The API level is unaffected, as we can just create a properly sized Vec and use its backing slice internally. I agree this may have some slight performance implications in the unchecked decompressor, but that one is extremely unsafe anyway.~

~But I really don't agree that a Vec helps in the frame format. The frame format enforces a strict max block size, which can be used to determine the max output size for the compressor and the size of the output slice for the decompressor.~

arthurprs commented 3 years ago

I fixed the conflicts with master. I think the compressor is in a good state, if you want to start a closer review. I'll move onto the safe decompressor when time allows.

arthurprs commented 3 years ago

I can reproduce some regressions with LTO enabled, but much smaller ones :thinking: I suspect there are differences in the benchmark setups, so I added a lz4_flex_master variant to the benchmarks (ed6e060) so they can be run side by side.

W/O LTO: faster (see LTO results below)

Compress/lz4_flexx_rust/725                                                                             
                        time:   [1.2502 us 1.2516 us 1.2529 us]
                        thrpt:  [551.83 MiB/s 552.44 MiB/s 553.05 MiB/s]
                 change:
                        time:   [-1.7243% -1.5582% -1.3910%] (p = 0.00 < 0.05)
                        thrpt:  [+1.4106% +1.5829% +1.7545%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
Compress/lz4_flexx_rust_master/725                                                                             
                        time:   [1.7235 us 1.7285 us 1.7336 us]
                        thrpt:  [398.83 MiB/s 400.01 MiB/s 401.18 MiB/s]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
Compress/lz4_flexx_rust/34308                                                                            
                        time:   [92.714 us 93.004 us 93.467 us]
                        thrpt:  [350.06 MiB/s 351.80 MiB/s 352.90 MiB/s]
                 change:
                        time:   [-2.1264% -1.2183% +0.0655%] (p = 0.01 < 0.05)
                        thrpt:  [-0.0654% +1.2333% +2.1726%]
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
Compress/lz4_flexx_rust_master/34308                                                                            
                        time:   [101.93 us 102.23 us 102.68 us]
                        thrpt:  [318.65 MiB/s 320.04 MiB/s 320.98 MiB/s]
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe

W/ LTO: 10% regression for 725 bytes, 4% regression for 32 KB. The massive improvement from W/O LTO to W/ LTO for master suggests it was missing some inline annotations (I added a few in this PR) and/or it's optimizing better (because of the smaller code size).
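For reference, the LTO runs here just enable link-time optimization in the release profile, which is a standard Cargo setting:

```toml
# Cargo.toml — enable (fat) LTO for release builds
[profile.release]
lto = true
```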

Compress/lz4_flexx_rust/725                                                                             
                        time:   [1.2312 us 1.2455 us 1.2695 us]
                        thrpt:  [544.66 MiB/s 555.15 MiB/s 561.58 MiB/s]
                 change:
                        time:   [-1.7319% -1.1970% -0.5335%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5363% +1.2115% +1.7624%]
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  6 (6.00%) high mild
  6 (6.00%) high severe
Compress/lz4_flexx_rust_master/725                                                                             
                        time:   [1.1240 us 1.1265 us 1.1298 us]
                        thrpt:  [611.99 MiB/s 613.75 MiB/s 615.13 MiB/s]
                 change:
                        time:   [-35.620% -35.148% -34.771%] (p = 0.00 < 0.05)
                        thrpt:  [+53.307% +54.197% +55.327%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe
Compress/lz4_flexx_rust/34308                                                                            
                        time:   [90.973 us 91.036 us 91.102 us]
                        thrpt:  [359.14 MiB/s 359.40 MiB/s 359.65 MiB/s]
                 change:
                        time:   [-3.4910% -2.4520% -1.7169%] (p = 0.00 < 0.05)
                        thrpt:  [+1.7469% +2.5136% +3.6173%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
Compress/lz4_flexx_rust_master/34308                                                                            
                        time:   [87.705 us 87.840 us 87.979 us]
                        thrpt:  [371.89 MiB/s 372.48 MiB/s 373.05 MiB/s]
                 change:
                        time:   [-16.416% -15.292% -14.409%] (p = 0.00 < 0.05)
                        thrpt:  [+16.834% +18.052% +19.640%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe
PSeitz commented 3 years ago

Ah sorry, I changed Cargo.toml on master to use unsafe by default during the last dev session, so the comparison is invalid.

PSeitz commented 3 years ago

Looks really good so far, nice job! It would be good to add some benchmarks with the frame format. It would also be good to enable COMPRESSION66K; this is a JSON data example, which often has different performance characteristics.

arthurprs commented 3 years ago

I was able to reduce all sources of overhead using const generics, and the non-dict version is now somewhat faster. There's overhead when using an ext_dict, but that's sort of unavoidable.

[image: benchmark results]

[image: benchmark results]
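The const-generics trick referred to above can be sketched as follows. This is a simplified illustration, not the actual lz4_flex code: the dictionary branch is parameterized by a `const bool`, so the compiler monomorphizes two versions and the non-dict one carries no runtime check:

```rust
// Hypothetical sketch: resolving a back-reference that may reach into an
// external dictionary. With `USE_DICT = false` the branch is compiled away.
fn resolve_backref<const USE_DICT: bool>(
    output: &[u8],
    ext_dict: &[u8],
    offset: usize,
) -> u8 {
    if USE_DICT && offset > output.len() {
        // The match starts before the current output: read from ext_dict.
        ext_dict[ext_dict.len() - (offset - output.len())]
    } else {
        // The match is entirely within the bytes decoded so far.
        output[output.len() - offset]
    }
}
```

Calling `resolve_backref::<false>` gives the fast path with no ext_dict overhead, while `resolve_backref::<true>` handles the linked-block case.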

PSeitz commented 3 years ago

Nice usage of const generics!

PSeitz commented 3 years ago

I added some comments directly on the commits; I hope that's fine.

billyb2 commented 3 years ago

Just out of curiosity, how would I use the frame decompression? Could you please give an example? Thank you 😄

arthurprs commented 3 years ago

@billyb2 here are some examples; it's likely that similar functions will be added to the crate.

```rust
use std::io::{Read, Write};

use lz4_flex::frame::{FrameDecoder, FrameEncoder};

fn compress_frame(input: &[u8]) -> Vec<u8> {
    let mut enc = FrameEncoder::new(Vec::new());
    enc.write_all(input).unwrap();
    enc.finish().unwrap()
}

fn decompress_frame(input: &[u8]) -> Result<Vec<u8>, lz4_flex::frame::Error> {
    let mut de = FrameDecoder::new(input);
    let mut out = Vec::new();
    de.read_to_end(&mut out)?;
    Ok(out)
}
```
arthurprs commented 3 years ago

I think the coverage setup is not working as expected. Specifically, I don't think it's merging the results of the multiple runs (w/ different features).

PSeitz commented 3 years ago

I think the coverage setup is not working as expected. Specifically, I don't think it's merging the results of the multiple runs (w/ different features).

Yes, it doesn't correctly aggregate runs with different feature flags; you can ignore it.

arthurprs commented 3 years ago

@PSeitz Thank you for your patience. I think it's ready for review. I added as much testing as I could, so I'm cautiously optimistic it's working.

PSeitz commented 3 years ago

@arthurprs thank you, really impressive job there. Reviewing all of it may take a little while.

PSeitz commented 3 years ago

Looks really good so far.

Did I understand correctly, that in the linked mode, the ext_dict feature is used to give access to the previous bytes?

arthurprs commented 3 years ago

Did I understand correctly, that in the linked mode, the ext_dict feature is used to give access to the previous bytes?

Exactly. When data has to wrap back to the beginning of the buffer, we can use ext_dict to reference data that is still part of the compression window but is not contiguous memory-wise.
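Conceptually, this can be sketched as below. It is a simplified illustration, not the actual lz4_flex code: when the next block would overrun the ring buffer, the write position wraps to the start, and the still-valid tail of the old data becomes the non-contiguous ext_dict window:

```rust
use std::ops::Range;

/// Simplified sketch of linked-block mode with a ring buffer.
/// Returns (range for the next block, range exposed as ext_dict).
fn next_block_ranges(
    buf_len: usize,
    pos: usize,
    block_size: usize,
) -> (Range<usize>, Range<usize>) {
    if pos + block_size <= buf_len {
        // Block fits after `pos`; prior data is a contiguous prefix dict.
        (pos..pos + block_size, 0..pos)
    } else {
        // Wrap around: the new block overwrites the start of the buffer,
        // and the surviving tail of the old data is the ext_dict.
        (0..block_size, block_size..pos)
    }
}
```

The point is that after a wrap, the window preceding the block in the stream sits *after* it in memory, which is exactly the situation ext_dict handles.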

PSeitz commented 3 years ago

Thank you for your patience, really exceptional quality. One thing that's missing, I think, are tests for the features exposed via FrameInfo: content_size, block_size, and the two checksums. Currently only defaults are tested?

arthurprs commented 3 years ago

Ahh, yes :thinking: Only block mode is tested in depth. I'll try to add a few tests.

Related to this. Can you enable the github actions for this PR? It says

1 workflow awaiting approval
First-time contributors need a maintainer to approve running workflows.
PSeitz commented 3 years ago

Related to this. Can you enable the github actions for this PR? It says

1 workflow awaiting approval
First-time contributors need a maintainer to approve running workflows.

It's quite annoying; I constantly need to click "Approve and run" until one PR is merged.

After a contributor has at least one pull request merged into a project's repository, any future pull requests from that contributor's fork will automatically run workflows.

arthurprs commented 3 years ago

Thank you for nudging me about the tests, there were some horrendous bugs :facepalm: :facepalm: :facepalm: :facepalm: :facepalm:

PSeitz commented 3 years ago

Thank you for nudging me about the tests, there were some horrendous bugs facepalm facepalm facepalm facepalm facepalm

anything that is not covered by tests doesn't work or will be broken eventually :)

PSeitz commented 3 years ago

Time to merge finally :D Really great job!

arthurprs commented 3 years ago

Thank you :tada: Looking forward to using this very soon.

PSeitz commented 3 years ago

Released with 0.8.0.