chore: Use twox-hash 2.0 xxhash64 oneshot api instead of custom implementation

apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator

https://datafusion.apache.org/comet

Apache License 2.0

823 stars 163 forks source link

chore: Use twox-hash 2.0 xxhash64 oneshot api instead of custom implementation #1041

Closed NoeB closed 3 weeks ago

NoeB commented 3 weeks ago

Which issue does this PR close?

Closes #1032

Rationale for this change

What changes are included in this PR?

Use two-hash 2.0 onshot api instead of custom hash implementation
Upgrades Rust to 1.81.0 (is required by twox-hash Version 2.0)

How are these changes tested?

NoeB commented 3 weeks ago

I am unsure if license.txt and Notice.txt also need to be updated The references regarding the code from twox-hash got introduced with #575

NoeB commented 3 weeks ago

Cargo Bench results:

hash/xxhash64/8192      time:   [414.63 µs 415.13 µs 415.67 µs]
                        change: [-6.6364% -6.3063% -5.9526%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

andygrove commented 3 weeks ago

I am unsure if license.txt and Notice.txt also need to be updated The references regarding the code from twox-hash got introduced with #575

Yes, the twox_hash references should be removed from those files as part of this PR

NoeB commented 3 weeks ago

How should I proceed with the pipeline failure which comes from a new clippy rule (manual_pattern_char_comparison) introduced in 1.81.0? Should I apply the suggestion or create a pull request with the upgrade to 1.81.0 and apply it there or do you have a different suggestion?

NoeB commented 3 weeks ago

I am unsure if license.txt and Notice.txt also need to be updated The references regarding the code from twox-hash got introduced with #575

Yes, the twox_hash references should be removed from those files as part of this PR

I have added a new commit which removes the references

andygrove commented 3 weeks ago

How should I proceed with the pipeline failure which comes from a new clippy rule (manual_pattern_char_comparison) introduced in 1.81.0? Should I apply the suggestion or create a pull request with the upgrade to 1.81.0 and apply it there or do you have a different suggestion?

I think it would be fine to apply the change in this PR.

codecov-commenter commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 34.47%. Comparing base (845b654) to head (cae64d1). Report is 2 commits behind head on main.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #1041 +/- ## ========================================= Coverage 34.46% 34.47% - Complexity 888 889 +1 ========================================= Files 113 113 Lines 43580 43580 Branches 9658 9658 ========================================= + Hits 15021 15024 +3 + Misses 25507 25505 -2 + Partials 3052 3051 -1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

andygrove commented 3 weeks ago

I got slightly different cargo bench results, but I saw no regression in overall TPC-H performance.

hash/xxhash64/8192      time:   [306.49 µs 307.37 µs 308.27 µs]
                        change: [+5.5768% +5.9235% +6.2740%] (p = 0.00 < 0.05)
                        Performance has regressed.

NoeB commented 3 weeks ago

Interesting, I just reran the bench and I get the same improvements as before. Should I compare the library implementation with the DataFusion custom one to see if there is a difference?

Shall I add the following command to the docs: cd native && cargo clippy --color=never --all-targets --workspace -- -D warnings in the ( Submitting a PR section)? I think it would have saved me several CI runs.

andygrove commented 3 weeks ago

Interesting, I just reran the bench and I get the same improvements as before. Should I compare the library implementation with the DataFusion custom one to see if there is a difference?

It is possible that the results could vary depending on hardware. Out of interest what platform are you testing on? I am testing on an AMD Ryzen 9 CPU with Linux.

5% is a small price to pay to remove the custom implementation and go back to using the dependency, IMO.

Shall I add the following command to the docs: cd native && cargo clippy --color=never --all-targets --workspace -- -D warnings in the ( Submitting a PR section)? I think it would have saved me several CI runs.

That would be great as a separate PR, thank you.

NoeB commented 3 weeks ago

Thank you @andygrove for reviewing the PR and starting the CI. I tested it on an Apple M1 Max running macOS Sequoia.

That would be great as a separate PR, thank you.

I will do that later