Closed: greatroar closed this 4 years ago
Hi @greatroar! Thanks for the contribution.
I wasn't familiar with this optimization, so I did some background reading (especially the comments in this SO answer). The comments seem to suggest that the optimization may have made more sense on older CPUs but not on new ones.
I compared the LEA and ADD instructions for a few architectures on Agner Fog's tables; in particular, I looked at Skylake, Coffee Lake, and Zen 2. I think these are the relevant numbers:
`LEA r,m`, 2 address components:

| arch | latency | reciprocal throughput | execution ports |
| --- | --- | --- | --- |
| Skylake | 1 | 0.5 | 1, 5 |
| Coffee Lake | 1 | 0.5 | 1, 5 |
| Zen 2 | 1-2 | 0.25 | ? |

`ADD r,i`:

| arch | latency | reciprocal throughput | execution ports |
| --- | --- | --- | --- |
| Skylake | 1 | 0.25 | 0, 1, 5, 6 |
| Coffee Lake | 1 | 0.25 | 0, 1, 5, 6 |
| Zen 2 | 1 | 0.3 | ? |
If I'm interpreting this correctly, the theoretical advantage doesn't apply to these newer architectures. In particular, the original reason for this optimization (reducing ALU contention, basically) doesn't hold on the Intel parts, which can run the ADD on four different execution ports, a superset of the ones that can handle the LEA.
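For context, the transformation under discussion looks roughly like this (a sketch in Go assembler syntax; the registers and constant are illustrative, not taken from the actual patch):

```asm
// Current code: add an immediate in place; on Skylake this can issue
// on any of ports 0, 1, 5, or 6.
ADDQ $16777619, AX

// LEA variant: compute AX + constant into a second register without
// clobbering AX or the flags, but restricted to ports 1 and 5.
LEAQ 16777619(AX), BX
```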
I tried the benchmarks myself on a couple of AWS machines. Here's a Xeon Platinum 8124M (Skylake):
```
name old speed new speed delta
Hashes/xxhash,direct,bytes,n=5B-2 911MB/s ± 0% 909MB/s ± 0% -0.26% (p=0.000 n=10+10)
Hashes/xxhash,direct,string,n=5B-2 614MB/s ± 0% 612MB/s ± 0% -0.35% (p=0.000 n=9+9)
Hashes/xxhash,digest,bytes,n=5B-2 219MB/s ± 0% 219MB/s ± 0% ~ (p=0.470 n=10+10)
Hashes/xxhash,digest,string,n=5B-2 196MB/s ± 0% 196MB/s ± 0% ~ (p=0.419 n=10+10)
Hashes/xxhash,direct,bytes,n=100B-2 5.70GB/s ± 0% 5.70GB/s ± 0% +0.14% (p=0.000 n=10+10)
Hashes/xxhash,direct,string,n=100B-2 4.82GB/s ± 0% 4.81GB/s ± 0% -0.16% (p=0.000 n=10+10)
Hashes/xxhash,digest,bytes,n=100B-2 3.20GB/s ± 0% 3.17GB/s ± 0% -1.00% (p=0.000 n=10+10)
Hashes/xxhash,digest,string,n=100B-2 2.88GB/s ± 0% 2.86GB/s ± 0% -0.49% (p=0.000 n=8+10)
Hashes/xxhash,direct,bytes,n=4KB-2 13.0GB/s ± 0% 13.0GB/s ± 0% -0.17% (p=0.000 n=10+9)
Hashes/xxhash,direct,string,n=4KB-2 12.9GB/s ± 0% 12.9GB/s ± 0% +0.01% (p=0.035 n=9+10)
Hashes/xxhash,digest,bytes,n=4KB-2 12.4GB/s ± 0% 12.5GB/s ± 0% +0.21% (p=0.000 n=10+10)
Hashes/xxhash,digest,string,n=4KB-2 12.3GB/s ± 0% 12.3GB/s ± 0% -0.13% (p=0.000 n=10+10)
Hashes/xxhash,direct,bytes,n=10MB-2 13.2GB/s ± 0% 13.2GB/s ± 0% ~ (p=0.356 n=10+9)
Hashes/xxhash,direct,string,n=10MB-2 13.2GB/s ± 1% 13.1GB/s ± 1% -0.88% (p=0.000 n=9+10)
Hashes/xxhash,digest,bytes,n=10MB-2 13.2GB/s ± 0% 13.1GB/s ± 0% -0.42% (p=0.011 n=10+10)
Hashes/xxhash,digest,string,n=10MB-2 13.2GB/s ± 0% 13.1GB/s ± 1% -0.60% (p=0.008 n=9+10)
```
Not much of a difference, but in general it looks like things got slightly worse.
Here's an AMD EPYC 7R32 (Zen2 Rome):
```
name old speed new speed delta
Hashes/xxhash,direct,bytes,n=5B-2 1.09GB/s ± 1% 1.10GB/s ± 0% ~ (p=0.077 n=9+9)
Hashes/xxhash,direct,string,n=5B-2 647MB/s ± 0% 661MB/s ± 1% +2.17% (p=0.000 n=9+10)
Hashes/xxhash,digest,bytes,n=5B-2 179MB/s ± 0% 179MB/s ± 0% ~ (p=0.527 n=10+8)
Hashes/xxhash,digest,string,n=5B-2 163MB/s ± 0% 163MB/s ± 0% ~ (p=0.282 n=8+10)
Hashes/xxhash,direct,bytes,n=100B-2 5.78GB/s ± 0% 5.78GB/s ± 0% ~ (p=0.633 n=10+8)
Hashes/xxhash,direct,string,n=100B-2 4.82GB/s ± 1% 4.73GB/s ± 1% -1.97% (p=0.000 n=10+10)
Hashes/xxhash,digest,bytes,n=100B-2 2.98GB/s ± 0% 2.94GB/s ± 0% -1.30% (p=0.000 n=9+10)
Hashes/xxhash,digest,string,n=100B-2 2.84GB/s ± 0% 2.81GB/s ± 0% -1.29% (p=0.000 n=10+10)
Hashes/xxhash,direct,bytes,n=4KB-2 12.4GB/s ± 0% 12.2GB/s ± 1% -1.40% (p=0.000 n=10+8)
Hashes/xxhash,direct,string,n=4KB-2 12.3GB/s ± 0% 12.2GB/s ± 0% -0.59% (p=0.000 n=9+9)
Hashes/xxhash,digest,bytes,n=4KB-2 12.0GB/s ± 0% 11.7GB/s ± 1% -2.62% (p=0.000 n=10+10)
Hashes/xxhash,digest,string,n=4KB-2 11.9GB/s ± 0% 11.5GB/s ± 0% -3.01% (p=0.000 n=10+10)
Hashes/xxhash,direct,bytes,n=10MB-2 13.1GB/s ± 0% 12.7GB/s ± 0% -3.07% (p=0.000 n=9+9)
Hashes/xxhash,direct,string,n=10MB-2 13.0GB/s ± 1% 12.7GB/s ± 0% -2.50% (p=0.000 n=10+9)
Hashes/xxhash,digest,bytes,n=10MB-2 13.1GB/s ± 0% 12.7GB/s ± 0% -3.11% (p=0.000 n=9+9)
Hashes/xxhash,digest,string,n=10MB-2 13.1GB/s ± 0% 12.7GB/s ± 0% -3.18% (p=0.000 n=9+10)
```
This shows a more noticeable slowdown.
On the other hand, when I compile a small C program with a literal addition, I see that both `gcc -O2` and `clang -O2` use LEA rather than ADD.
Go uses ADD, but the Go compiler has obviously had much less optimization work put into it than gcc and clang.
I'm not sure whether to interpret this as "the compilers know key stuff that I don't" or "the compiler optimizations were mostly written years ago and may be out of date".
Anyway, given these data, ISTM that while this technique may provide a small boost on older CPUs, it doesn't seem like a clear win on new ones and may be slightly detrimental. It also makes the code harder to understand. Therefore, I'm leaning toward not including this change. Thoughts? (Other stuff I should read?)
The Go compiler tends to be pretty knowledgeable about the latest amd64 models, whereas GCC and Clang target (I believe) some "average-age" CPU by default and provide `-mcpu` to get optimized code for specific models. If you're willing to put in a final test, you could compile your C program with `-mcpu` set to the latest CPU models and see what that produces. Otherwise, feel free to close.
The hypothesis makes sense to me, but various combinations of `-mcpu`, `-march`, and `-mtune` to try to force code generation optimized for various recent Intel generations didn't change the LEA to an ADD (for gcc or clang). Possibly I'm holding it wrong.
Thanks for the interesting discussion topic. I learned a few things :)
On older CPUs such as the i7-3770K, this gives a small speedup: