beacon-biosignals / StableHashTraits.jl

Compute hashes over any Julia object simply and reproducibly
MIT License
9 stars 3 forks source link

Move some hash computations to compile time #30

Closed haberdashPI closed 1 year ago

haberdashPI commented 1 year ago

A final set of optimizations (before releasing 1.1) that moves a number of computations to compile time. In particular a number of hashed values are a function of the type of an object, and have been moved to @generated functions.

This brings most performance within spitting distance of the "raw" hash computations.

This changes the hash values for HashVersion{2}, but these have yet to be included in a release.

Before

12×5 DataFrame
 Row │ benchmark   hash       base        trait       ratio     
     │ SubStrin…   SubStrin…  String      String      Float64   
─────┼──────────────────────────────────────────────────────────
   1 │ structs     crc        70.250 μs   49.386 ms   703.011
   2 │ symbols     crc        12.166 μs   5.328 ms    437.928
   3 │ strings     crc        12.166 μs   4.777 ms    392.686
   4 │ tuples      crc        71.417 μs   10.082 ms   141.177
   5 │ dataframes  crc        70.208 μs   290.167 μs    4.13296
   6 │ numbers     crc        35.167 μs   126.375 μs    3.59357
   7 │ structs     sha256     532.833 μs  60.885 ms   114.266
   8 │ tuples      sha256     533.000 μs  12.937 ms    24.2711
   9 │ symbols     sha256     833.208 μs  7.607 ms      9.13022
  10 │ strings     sha256     833.417 μs  6.600 ms      7.9192
  11 │ dataframes  sha256     532.833 μs  775.917 μs    1.45621
  12 │ numbers     sha256     270.916 μs  374.417 μs    1.38204

After

12×5 DataFrame
 Row │ benchmark   hash       base        trait       ratio     
     │ SubStrin…   SubStrin…  String      String      Float64   
─────┼──────────────────────────────────────────────────────────
   1 │ structs     crc        71.542 μs   1.116 ms    15.6027
   2 │ tuples      crc        71.459 μs   918.917 μs  12.8594
   3 │ dataframes  crc        71.458 μs   257.166 μs   3.59884
   4 │ numbers     crc        35.916 μs   126.666 μs   3.52673
   5 │ symbols     crc        635.875 μs  629.000 μs   0.989188
   6 │ strings     crc        655.500 μs  561.292 μs   0.856281
   7 │ structs     sha256     543.166 μs  3.045 ms     5.60525
   8 │ tuples      sha256     543.125 μs  2.594 ms     4.77668
   9 │ symbols     sha256     1.494 ms    2.264 ms     1.51544
  10 │ strings     sha256     1.484 ms    2.196 ms     1.47992
  11 │ dataframes  sha256     543.125 μs  749.125 μs   1.37929
  12 │ numbers     sha256     270.958 μs  371.833 μs   1.37229
codecov[bot] commented 1 year ago

Codecov Report

Merging #30 (40e2b80) into main (fbc0c1a) will increase coverage by 0.23%. The diff coverage is 98.13%.

@@            Coverage Diff             @@
##             main      #30      +/-   ##
==========================================
+ Coverage   95.75%   95.98%   +0.23%     
==========================================
  Files           1        1              
  Lines         212      274      +62     
==========================================
+ Hits          203      263      +60     
- Misses          9       11       +2     
Files Coverage Δ
src/StableHashTraits.jl 95.98% <98.13%> (+0.23%) :arrow_up:

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

haberdashPI commented 1 year ago

One other question I have is why the reference test files needed to change if the default HashVersion{1} is still being used and these changes are non-breaking?

Just to address this before I take a look through your other comments.

I have reference tests for each HashVersion: I want to test that given a particular hash version the hash value does not change for a given object. The function test_hash is defined on lines 31-32 of runtests.jl such that it will use the context set in the for loop.

The reference tests for HashVersion{1} should not change, since those exist in the released version of StableHashTraits. HashVersion{2} was introduced in #29, and has not been included in a release, so it is safe to change the references tests there.

haberdashPI commented 1 year ago

To elaborate on my comment above: in the reference tests, ref01_2_crc is reference test 1, using HashVersion{2} with a crc32c hash function. You will find that none of the reference tests matching ref[nun]_1_[hash] have changed, confirming that no HashVersion{1} return values have changed in the tests.