cespare / xxhash

A Go implementation of the 64-bit xxHash algorithm (XXH64)
MIT License

Replace ADDQ $c, r by LEAQ c(r), r in amd64 assembly #39

Closed: greatroar closed this issue 4 years ago

greatroar commented 4 years ago

On older CPUs such as the i7-3770K, this gives a small speedup:

name                                  old speed      new speed      delta
Hashes/xxhash,direct,bytes,n=5B-8      854MB/s ± 0%   851MB/s ± 1%    ~     (p=0.720 n=9+10)
Hashes/xxhash,direct,string,n=5B-8     569MB/s ± 1%   571MB/s ± 1%    ~     (p=0.123 n=10+10)
Hashes/xxhash,digest,bytes,n=5B-8      234MB/s ± 1%   235MB/s ± 1%  +0.43%  (p=0.029 n=10+10)
Hashes/xxhash,digest,string,n=5B-8     210MB/s ± 0%   211MB/s ± 1%  +0.64%  (p=0.000 n=9+10)
Hashes/xxhash,direct,bytes,n=100B-8   5.57GB/s ± 0%  5.68GB/s ± 1%  +1.99%  (p=0.000 n=7+10)
Hashes/xxhash,direct,string,n=100B-8  4.91GB/s ± 0%  5.05GB/s ± 1%  +2.96%  (p=0.000 n=10+10)
Hashes/xxhash,digest,bytes,n=100B-8   3.09GB/s ± 0%  3.15GB/s ± 0%  +1.82%  (p=0.000 n=10+10)
Hashes/xxhash,digest,string,n=100B-8  2.58GB/s ± 0%  2.61GB/s ± 0%  +0.94%  (p=0.000 n=10+10)
Hashes/xxhash,direct,bytes,n=4KB-8    14.5GB/s ± 0%  14.5GB/s ± 1%  +0.30%  (p=0.017 n=10+9)
Hashes/xxhash,direct,string,n=4KB-8   14.3GB/s ± 0%  14.4GB/s ± 1%  +1.08%  (p=0.000 n=7+10)
Hashes/xxhash,digest,bytes,n=4KB-8    13.4GB/s ± 1%  13.6GB/s ± 1%  +1.24%  (p=0.000 n=10+10)
Hashes/xxhash,digest,string,n=4KB-8   13.2GB/s ± 0%  13.5GB/s ± 1%  +2.48%  (p=0.000 n=9+10)
Hashes/xxhash,direct,bytes,n=10MB-8   13.1GB/s ± 1%  13.2GB/s ± 1%  +1.12%  (p=0.000 n=10+10)
Hashes/xxhash,direct,string,n=10MB-8  13.1GB/s ± 0%  13.2GB/s ± 0%  +0.50%  (p=0.000 n=10+10)
Hashes/xxhash,digest,bytes,n=10MB-8   13.1GB/s ± 1%  13.1GB/s ± 1%    ~     (p=0.053 n=9+10)
Hashes/xxhash,digest,string,n=10MB-8  13.1GB/s ± 1%  13.2GB/s ± 0%  +0.61%  (p=0.008 n=10+9)
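
To make the change concrete, here is a minimal sketch in Go (Plan 9) assembly. The register and offset are purely illustrative, not the exact lines from the package's amd64 assembly:

    // Before: the pointer increment is a plain immediate add, which
    // competes with the surrounding multiplies and rotates for the
    // general-purpose ALU ports.
    ADDQ $32, SI        // p += 32

    // After: the same increment written as an address computation.
    // LEAQ typically issues to a different set of execution ports than
    // ADDQ, which is where the hoped-for win on older CPUs comes from;
    // it also leaves the flags untouched.
    LEAQ 32(SI), SI     // p += 32
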
cespare commented 4 years ago

Hi @greatroar! Thanks for the contribution.

I wasn't familiar with this optimization, so I did some background reading (especially the comments in this SO answer). The comments seem to suggest that the optimization made more sense on older CPUs than it does on newer ones.

I compared the LEA and ADD instructions for a few architectures in Agner Fog's instruction tables; in particular, I looked at Skylake, Coffee Lake, and Zen 2. I think these are the relevant numbers:

LEA r,m (2 address components):

arch         latency  reciprocal throughput  execution ports
Skylake      1        0.5                    1, 5
Coffee Lake  1        0.5                    1, 5
Zen 2        1-2      0.25                   ?

ADD r,i:

arch         latency  reciprocal throughput  execution ports
Skylake      1        0.25                   0, 1, 5, 6
Coffee Lake  1        0.25                   0, 1, 5, 6
Zen 2        1        0.3                    ?

If I'm interpreting this correctly, the theoretical advantage doesn't apply to these newer architectures. In particular, the original reason for the optimization (reducing ALU contention, basically) doesn't hold for the Intel architectures, which can run the ADD on four different execution ports, a superset of the ones that can handle the LEA.

I tried the benchmarks myself on a couple of AWS machines. Here's a Xeon Platinum 8124M (Skylake):

name                                  old speed      new speed      delta
Hashes/xxhash,direct,bytes,n=5B-2      911MB/s ± 0%   909MB/s ± 0%  -0.26%  (p=0.000 n=10+10)
Hashes/xxhash,direct,string,n=5B-2     614MB/s ± 0%   612MB/s ± 0%  -0.35%  (p=0.000 n=9+9)
Hashes/xxhash,digest,bytes,n=5B-2      219MB/s ± 0%   219MB/s ± 0%    ~     (p=0.470 n=10+10)
Hashes/xxhash,digest,string,n=5B-2     196MB/s ± 0%   196MB/s ± 0%    ~     (p=0.419 n=10+10)
Hashes/xxhash,direct,bytes,n=100B-2   5.70GB/s ± 0%  5.70GB/s ± 0%  +0.14%  (p=0.000 n=10+10)
Hashes/xxhash,direct,string,n=100B-2  4.82GB/s ± 0%  4.81GB/s ± 0%  -0.16%  (p=0.000 n=10+10)
Hashes/xxhash,digest,bytes,n=100B-2   3.20GB/s ± 0%  3.17GB/s ± 0%  -1.00%  (p=0.000 n=10+10)
Hashes/xxhash,digest,string,n=100B-2  2.88GB/s ± 0%  2.86GB/s ± 0%  -0.49%  (p=0.000 n=8+10)
Hashes/xxhash,direct,bytes,n=4KB-2    13.0GB/s ± 0%  13.0GB/s ± 0%  -0.17%  (p=0.000 n=10+9)
Hashes/xxhash,direct,string,n=4KB-2   12.9GB/s ± 0%  12.9GB/s ± 0%  +0.01%  (p=0.035 n=9+10)
Hashes/xxhash,digest,bytes,n=4KB-2    12.4GB/s ± 0%  12.5GB/s ± 0%  +0.21%  (p=0.000 n=10+10)
Hashes/xxhash,digest,string,n=4KB-2   12.3GB/s ± 0%  12.3GB/s ± 0%  -0.13%  (p=0.000 n=10+10)
Hashes/xxhash,direct,bytes,n=10MB-2   13.2GB/s ± 0%  13.2GB/s ± 0%    ~     (p=0.356 n=10+9)
Hashes/xxhash,direct,string,n=10MB-2  13.2GB/s ± 1%  13.1GB/s ± 1%  -0.88%  (p=0.000 n=9+10)
Hashes/xxhash,digest,bytes,n=10MB-2   13.2GB/s ± 0%  13.1GB/s ± 0%  -0.42%  (p=0.011 n=10+10)
Hashes/xxhash,digest,string,n=10MB-2  13.2GB/s ± 0%  13.1GB/s ± 1%  -0.60%  (p=0.008 n=9+10)

Not much of a difference, but in general it looks like things got slightly worse.

Here's an AMD EPYC 7R32 (Zen 2, Rome):

name                                  old speed      new speed      delta
Hashes/xxhash,direct,bytes,n=5B-2     1.09GB/s ± 1%  1.10GB/s ± 0%    ~     (p=0.077 n=9+9)
Hashes/xxhash,direct,string,n=5B-2     647MB/s ± 0%   661MB/s ± 1%  +2.17%  (p=0.000 n=9+10)
Hashes/xxhash,digest,bytes,n=5B-2      179MB/s ± 0%   179MB/s ± 0%    ~     (p=0.527 n=10+8)
Hashes/xxhash,digest,string,n=5B-2     163MB/s ± 0%   163MB/s ± 0%    ~     (p=0.282 n=8+10)
Hashes/xxhash,direct,bytes,n=100B-2   5.78GB/s ± 0%  5.78GB/s ± 0%    ~     (p=0.633 n=10+8)
Hashes/xxhash,direct,string,n=100B-2  4.82GB/s ± 1%  4.73GB/s ± 1%  -1.97%  (p=0.000 n=10+10)
Hashes/xxhash,digest,bytes,n=100B-2   2.98GB/s ± 0%  2.94GB/s ± 0%  -1.30%  (p=0.000 n=9+10)
Hashes/xxhash,digest,string,n=100B-2  2.84GB/s ± 0%  2.81GB/s ± 0%  -1.29%  (p=0.000 n=10+10)
Hashes/xxhash,direct,bytes,n=4KB-2    12.4GB/s ± 0%  12.2GB/s ± 1%  -1.40%  (p=0.000 n=10+8)
Hashes/xxhash,direct,string,n=4KB-2   12.3GB/s ± 0%  12.2GB/s ± 0%  -0.59%  (p=0.000 n=9+9)
Hashes/xxhash,digest,bytes,n=4KB-2    12.0GB/s ± 0%  11.7GB/s ± 1%  -2.62%  (p=0.000 n=10+10)
Hashes/xxhash,digest,string,n=4KB-2   11.9GB/s ± 0%  11.5GB/s ± 0%  -3.01%  (p=0.000 n=10+10)
Hashes/xxhash,direct,bytes,n=10MB-2   13.1GB/s ± 0%  12.7GB/s ± 0%  -3.07%  (p=0.000 n=9+9)
Hashes/xxhash,direct,string,n=10MB-2  13.0GB/s ± 1%  12.7GB/s ± 0%  -2.50%  (p=0.000 n=10+9)
Hashes/xxhash,digest,bytes,n=10MB-2   13.1GB/s ± 0%  12.7GB/s ± 0%  -3.11%  (p=0.000 n=9+9)
Hashes/xxhash,digest,string,n=10MB-2  13.1GB/s ± 0%  12.7GB/s ± 0%  -3.18%  (p=0.000 n=9+10)

This shows a more noticeable slowdown.

On the other hand, when I compile a small C program with a literal addition, I see that both gcc -O2 and clang -O2 use LEA rather than ADD.

Go uses ADD, but the Go compiler has obviously had much less optimization work put into it than those other compilers.
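
As a quick way to double-check the Go side (a hedged sketch; add13 and the file name are made up), one can compile a trivial function and read the assembly that go tool compile -S prints:

    // add13.go: throwaway snippet for inspecting code generation.
    // Build with:  go tool compile -S add13.go
    // The listing shows which instruction the amd64 backend emits for
    // the literal addition (an ADDQ, per the observation above).
    package p

    func add13(x uint64) uint64 {
        return x + 13
    }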

I'm not sure whether to interpret this as "the compilers know key stuff that I don't" or "the compiler optimizations were mostly written years ago and may be out of date".

Anyway, given these data, ISTM that while this technique may provide a small boost on older CPUs, it doesn't seem like a clear win on new ones and may be slightly detrimental. It also makes the code harder to understand. Therefore, I'm leaning toward not including this change. Thoughts? (Other stuff I should read?)

greatroar commented 4 years ago

The Go compiler tends to be pretty knowledgeable about the latest amd64 models, whereas GCC and Clang target (I believe) some "average-age" CPU by default and provide -mcpu to get optimized code for specific models. If you're willing to put in a final test, you could compile your C program with -mcpu set to the latest CPU models and see what that produces. Otherwise, feel free to close.

cespare commented 4 years ago

The hypothesis makes sense to me, but trying various combinations of -mcpu, -march, and -mtune to force code generation tuned for recent Intel generations didn't change the LEA to an ADD (for either gcc or clang). Possibly I'm holding it wrong.

Thanks for the interesting discussion topic. I learned a few things :)