Allow Sum64String and (*Digest).WriteString to be inlined

cespare commented 3 years ago

This is an alternative approach to #42.

Ideally the compiler would do mid-stack inlining for Sum64String since it's a minimal unsafe wrapper around Sum64:

func Sum64String(s string) uint64 {
    var b []byte
    bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
    bh.Data = (*reflect.StringHeader)(unsafe.Pointer(&s)).Data
    bh.Len = len(s)
    bh.Cap = len(s)
    return Sum64(b)
}

Unfortunately, the weight the inliner computes is too high. I filed https://github.com/golang/go/issues/42739.

In the meantime, I found some tricks (with help from @josharian) to generate a lower cost that gets us below the threshold value.

Additionally, add tests to confirm that Sum64String and (*Digest).WriteString are inlined.

Benchmarks:

name                  old time/op    new time/op    delta
Sum64String/4B-12       4.78ns ± 1%    3.57ns ± 4%  -25.27%  (p=0.000 n=8+10)
Sum64String/100B-12     14.5ns ± 1%    12.9ns ± 0%  -10.76%  (p=0.000 n=9+10)
Sum64String/4KB-12       229ns ± 0%     229ns ± 1%     ~     (p=0.395 n=7+10)
Sum64String/10MB-12      628µs ± 1%     630µs ± 2%     ~     (p=1.000 n=9+10)
DigestString/4B-12      11.4ns ± 1%     9.7ns ± 1%  -14.95%  (p=0.000 n=10+10)
DigestString/100B-12    23.6ns ± 1%    21.3ns ± 2%   -9.65%  (p=0.000 n=10+10)
DigestString/4KB-12      241ns ± 1%     239ns ± 0%   -0.67%  (p=0.001 n=10+7)
DigestString/10MB-12     627µs ± 1%     628µs ± 1%     ~     (p=0.631 n=10+10)

name                  old speed      new speed      delta
Sum64String/4B-12      837MB/s ± 1%  1124MB/s ± 2%  +34.42%  (p=0.000 n=10+9)
Sum64String/100B-12   6.88GB/s ± 2%  7.72GB/s ± 1%  +12.16%  (p=0.000 n=10+10)
Sum64String/4KB-12    17.5GB/s ± 0%  17.5GB/s ± 1%     ~     (p=0.408 n=8+10)
Sum64String/10MB-12   15.9GB/s ± 1%  15.9GB/s ± 2%     ~     (p=1.000 n=9+10)
DigestString/4B-12     350MB/s ± 1%   411MB/s ± 1%  +17.55%  (p=0.000 n=10+10)
DigestString/100B-12  4.23GB/s ± 1%  4.69GB/s ± 1%  +10.84%  (p=0.000 n=10+9)
DigestString/4KB-12   16.6GB/s ± 1%  16.7GB/s ± 0%   +0.67%  (p=0.001 n=10+8)
DigestString/10MB-12  16.0GB/s ± 1%  15.9GB/s ± 1%     ~     (p=0.631 n=10+10)

And with -tags purego:

name                  old time/op    new time/op    delta
Sum64String/4B-12       5.57ns ± 1%    4.22ns ± 1%  -24.14%  (p=0.000 n=10+9)
Sum64String/100B-12     16.0ns ± 1%    14.8ns ± 0%   -7.27%  (p=0.000 n=10+6)
Sum64String/4KB-12       327ns ± 2%     325ns ± 1%     ~     (p=0.050 n=10+10)
Sum64String/10MB-12      866µs ± 3%     856µs ± 0%   -1.05%  (p=0.002 n=9+8)
DigestString/4B-12      11.2ns ± 1%    10.0ns ± 1%  -10.90%  (p=0.000 n=10+9)
DigestString/100B-12    25.5ns ± 1%    22.8ns ± 0%  -10.62%  (p=0.000 n=10+9)
DigestString/4KB-12      342ns ± 1%     340ns ± 1%   -0.56%  (p=0.018 n=9+10)
DigestString/10MB-12     877µs ± 1%     878µs ± 2%     ~     (p=0.400 n=10+9)

name                  old speed      new speed      delta
Sum64String/4B-12      718MB/s ± 1%   947MB/s ± 1%  +31.82%  (p=0.000 n=10+9)
Sum64String/100B-12   6.26GB/s ± 1%  6.75GB/s ± 1%   +7.81%  (p=0.000 n=10+10)
Sum64String/4KB-12    12.2GB/s ± 2%  12.3GB/s ± 1%   +0.70%  (p=0.022 n=10+9)
Sum64String/10MB-12   11.6GB/s ± 3%  11.7GB/s ± 0%   +1.05%  (p=0.002 n=9+8)
DigestString/4B-12     357MB/s ± 1%   401MB/s ± 1%  +12.32%  (p=0.000 n=10+9)
DigestString/100B-12  3.93GB/s ± 1%  4.40GB/s ± 0%  +11.95%  (p=0.000 n=10+9)
DigestString/4KB-12   11.7GB/s ± 1%  11.8GB/s ± 1%   +0.68%  (p=0.011 n=10+10)
DigestString/10MB-12  11.4GB/s ± 1%  11.4GB/s ± 2%     ~     (p=0.400 n=10+9)

/cc @greatroar

greatroar commented 3 years ago

I ran xxhashbench on this, and it doesn't quite give the speedup that the assembler version gives.

Linux/amd64, old=this PR, new=#42:

name                                  old time/op    new time/op    delta
Hashes/xxhash,direct,bytes,n=5B-8       5.95ns ± 2%    5.92ns ± 1%   -0.39%  (p=0.028 n=18+18)
Hashes/xxhash,direct,string,n=5B-8      7.92ns ± 1%    5.91ns ± 0%  -25.39%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8     19.2ns ± 2%    18.9ns ± 1%   -1.44%  (p=0.000 n=19+18)
Hashes/xxhash,direct,string,n=100B-8    19.5ns ± 1%    17.7ns ± 1%   -9.16%  (p=0.000 n=17+20)
Hashes/xxhash,direct,bytes,n=4KB-8       278ns ± 1%     276ns ± 1%   -0.67%  (p=0.000 n=18+16)
Hashes/xxhash,direct,string,n=4KB-8      280ns ± 1%     276ns ± 1%   -1.29%  (p=0.000 n=20+20)
Hashes/xxhash,direct,bytes,n=10MB-8      769µs ± 1%     766µs ± 1%   -0.40%  (p=0.020 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8     772µs ± 2%     775µs ± 2%     ~     (p=0.191 n=19+19)

name                                  old speed      new speed      delta
Hashes/xxhash,direct,bytes,n=5B-8      840MB/s ± 2%   844MB/s ± 1%   +0.48%  (p=0.017 n=19+18)
Hashes/xxhash,direct,string,n=5B-8     631MB/s ± 1%   846MB/s ± 0%  +34.04%  (p=0.000 n=19+19)
Hashes/xxhash,direct,bytes,n=100B-8   5.20GB/s ± 2%  5.28GB/s ± 2%   +1.51%  (p=0.000 n=19+19)
Hashes/xxhash,direct,string,n=100B-8  5.12GB/s ± 1%  5.64GB/s ± 2%  +10.05%  (p=0.000 n=18+20)
Hashes/xxhash,direct,bytes,n=4KB-8    14.4GB/s ± 2%  14.5GB/s ± 1%   +0.48%  (p=0.002 n=17+17)
Hashes/xxhash,direct,string,n=4KB-8   14.3GB/s ± 1%  14.5GB/s ± 0%   +1.53%  (p=0.000 n=20+17)
Hashes/xxhash,direct,bytes,n=10MB-8   13.0GB/s ± 1%  13.1GB/s ± 1%   +0.40%  (p=0.020 n=18+19)
Hashes/xxhash,direct,string,n=10MB-8  13.0GB/s ± 2%  12.9GB/s ± 2%     ~     (p=0.191 n=19+19)

My actual input strings are somewhere between the 5 and 100B marks, typically.

cespare commented 3 years ago

Yes, that's because xxhashbench uses indirect calls. What I was getting at in https://github.com/cespare/xxhash/pull/42#pullrequestreview-535076183 and https://github.com/cespare/xxhash/issues/22#issuecomment-730782352 is that I don't think the indirect call performance matters that much.

I intend to delete xxhashbench (or at least rework it to use only direct calls). For now, I think we can focus on the new benchmarks I added in the xxhash package itself.

greatroar commented 3 years ago

Sorry, I read that but I was still running the wrong benchmarks. Here's the main package ones cherry-picked onto #42 (old) vs. this PR (new):

name                 old speed      new speed      delta
Sum64/4B-8            733MB/s ± 1%   796MB/s ± 0%  +8.63%  (p=0.000 n=10+10)
Sum64/100B-8         5.44GB/s ± 1%  5.56GB/s ± 1%  +2.11%  (p=0.000 n=10+10)
Sum64/4KB-8          14.3GB/s ± 1%  14.4GB/s ± 1%  +0.88%  (p=0.000 n=10+9)
Sum64/10MB-8         12.9GB/s ± 2%  12.9GB/s ± 2%    ~     (p=0.605 n=9+9)
Sum64String/4B-8      769MB/s ± 1%   810MB/s ± 1%  +5.28%  (p=0.000 n=10+9)
Sum64String/100B-8   5.73GB/s ± 1%  5.69GB/s ± 1%  -0.80%  (p=0.002 n=10+10)
Sum64String/4KB-8    14.5GB/s ± 1%  14.5GB/s ± 1%    ~     (p=0.297 n=9+9)
Sum64String/10MB-8   13.1GB/s ± 1%  13.0GB/s ± 1%    ~     (p=0.089 n=10+10)
DigestBytes/4B-8      267MB/s ± 0%   266MB/s ± 1%    ~     (p=0.079 n=9+10)
DigestBytes/100B-8   3.46GB/s ± 0%  3.46GB/s ± 0%    ~     (p=0.546 n=9+9)
DigestBytes/4KB-8    13.7GB/s ± 0%  13.6GB/s ± 0%  -0.43%  (p=0.000 n=10+10)
DigestBytes/10MB-8   13.0GB/s ± 1%  13.0GB/s ± 0%    ~     (p=0.758 n=9+7)
DigestString/4B-8     230MB/s ± 1%   228MB/s ± 2%    ~     (p=0.075 n=10+10)
DigestString/100B-8  2.83GB/s ± 1%  2.84GB/s ± 1%    ~     (p=0.631 n=10+10)
DigestString/4KB-8   13.4GB/s ± 3%  13.5GB/s ± 1%    ~     (p=0.661 n=10+9)
DigestString/10MB-8  13.0GB/s ± 1%  13.0GB/s ± 0%    ~     (p=0.074 n=8+9)

So this version is actually quite a bit faster in the common case of direct calls.

cespare commented 3 years ago

With help from @josharian I got a different way to write the conversions that allow both Sum64String and (*Digest).WriteString to be inlined. I updated the description and benchmarks.

I'm pretty happy with this for now so I think I'll merge.

cespare / xxhash

Allow Sum64String and (*Digest).WriteString to be inlined #50