Closed cespare closed 3 years ago
I ran xxhashbench on this, and it doesn't quite give the speedup that the assembler version gives.
Linux/amd64, old=this PR, new=#42:
name old time/op new time/op delta
Hashes/xxhash,direct,bytes,n=5B-8 5.95ns ± 2% 5.92ns ± 1% -0.39% (p=0.028 n=18+18)
Hashes/xxhash,direct,string,n=5B-8 7.92ns ± 1% 5.91ns ± 0% -25.39% (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8 19.2ns ± 2% 18.9ns ± 1% -1.44% (p=0.000 n=19+18)
Hashes/xxhash,direct,string,n=100B-8 19.5ns ± 1% 17.7ns ± 1% -9.16% (p=0.000 n=17+20)
Hashes/xxhash,direct,bytes,n=4KB-8 278ns ± 1% 276ns ± 1% -0.67% (p=0.000 n=18+16)
Hashes/xxhash,direct,string,n=4KB-8 280ns ± 1% 276ns ± 1% -1.29% (p=0.000 n=20+20)
Hashes/xxhash,direct,bytes,n=10MB-8 769µs ± 1% 766µs ± 1% -0.40% (p=0.020 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8 772µs ± 2% 775µs ± 2% ~ (p=0.191 n=19+19)
name old speed new speed delta
Hashes/xxhash,direct,bytes,n=5B-8 840MB/s ± 2% 844MB/s ± 1% +0.48% (p=0.017 n=19+18)
Hashes/xxhash,direct,string,n=5B-8 631MB/s ± 1% 846MB/s ± 0% +34.04% (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8 5.20GB/s ± 2% 5.28GB/s ± 2% +1.51% (p=0.000 n=19+19)
Hashes/xxhash,direct,string,n=100B-8 5.12GB/s ± 1% 5.64GB/s ± 2% +10.05% (p=0.000 n=18+20)
Hashes/xxhash,direct,bytes,n=4KB-8 14.4GB/s ± 2% 14.5GB/s ± 1% +0.48% (p=0.002 n=17+17)
Hashes/xxhash,direct,string,n=4KB-8 14.3GB/s ± 1% 14.5GB/s ± 0% +1.53% (p=0.000 n=20+17)
Hashes/xxhash,direct,bytes,n=10MB-8 13.0GB/s ± 1% 13.1GB/s ± 1% +0.40% (p=0.020 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8 13.0GB/s ± 2% 12.9GB/s ± 2% ~ (p=0.191 n=19+19)
My actual input strings are somewhere between the 5 and 100B marks, typically.
Yes, that's because xxhashbench uses indirect calls. What I was getting at in https://github.com/cespare/xxhash/pull/42#pullrequestreview-535076183 and https://github.com/cespare/xxhash/issues/22#issuecomment-730782352 is that I don't think the indirect call performance matters that much.
I intend to delete xxhashbench (or at least rework it to use only direct calls). For now, I think we can focus on the new benchmarks I added in the xxhash package itself.
Sorry, I read that but I was still running the wrong benchmarks. Here's the main package ones cherry-picked onto #42 (old) vs. this PR (new):
name old speed new speed delta
Sum64/4B-8 733MB/s ± 1% 796MB/s ± 0% +8.63% (p=0.000 n=10+10)
Sum64/100B-8 5.44GB/s ± 1% 5.56GB/s ± 1% +2.11% (p=0.000 n=10+10)
Sum64/4KB-8 14.3GB/s ± 1% 14.4GB/s ± 1% +0.88% (p=0.000 n=10+9)
Sum64/10MB-8 12.9GB/s ± 2% 12.9GB/s ± 2% ~ (p=0.605 n=9+9)
Sum64String/4B-8 769MB/s ± 1% 810MB/s ± 1% +5.28% (p=0.000 n=10+9)
Sum64String/100B-8 5.73GB/s ± 1% 5.69GB/s ± 1% -0.80% (p=0.002 n=10+10)
Sum64String/4KB-8 14.5GB/s ± 1% 14.5GB/s ± 1% ~ (p=0.297 n=9+9)
Sum64String/10MB-8 13.1GB/s ± 1% 13.0GB/s ± 1% ~ (p=0.089 n=10+10)
DigestBytes/4B-8 267MB/s ± 0% 266MB/s ± 1% ~ (p=0.079 n=9+10)
DigestBytes/100B-8 3.46GB/s ± 0% 3.46GB/s ± 0% ~ (p=0.546 n=9+9)
DigestBytes/4KB-8 13.7GB/s ± 0% 13.6GB/s ± 0% -0.43% (p=0.000 n=10+10)
DigestBytes/10MB-8 13.0GB/s ± 1% 13.0GB/s ± 0% ~ (p=0.758 n=9+7)
DigestString/4B-8 230MB/s ± 1% 228MB/s ± 2% ~ (p=0.075 n=10+10)
DigestString/100B-8 2.83GB/s ± 1% 2.84GB/s ± 1% ~ (p=0.631 n=10+10)
DigestString/4KB-8 13.4GB/s ± 3% 13.5GB/s ± 1% ~ (p=0.661 n=10+9)
DigestString/10MB-8 13.0GB/s ± 1% 13.0GB/s ± 0% ~ (p=0.074 n=8+9)
So this version is actually quite a bit faster in the common case of direct calls.
With help from @josharian I got a different way to write the conversions that allow both Sum64String
and (*Digest).WriteString
to be inlined. I updated the description and benchmarks.
I'm pretty happy with this for now so I think I'll merge.
This is an alternative approach to #42.
Ideally the compiler would do mid-stack inlining for
Sum64String
since it's a minimal unsafe wrapper aroundSum64
:Unfortunately, the weight the inliner computes is too high. I filed https://github.com/golang/go/issues/42739.
In the meantime, I found some tricks (with help from @josharian) to generate a lower cost that gets us below the threshold value.
Additionally, add tests to confirm that
Sum64String
and(*Digest).WriteString
are inlined.Benchmarks:
And with
-tags purego
:/cc @greatroar