cespare / xxhash

A Go implementation of the 64-bit xxHash algorithm (XXH64)
MIT License

Assembler implementation for arm64 #51

Closed · greatroar closed this 1 year ago

greatroar commented 3 years ago

Fixes #45, but I haven't yet benchmarked this on actual ARM hardware. On amd64 with qemu, I get the following results after cherry-picking #50 (I have a version that includes a variant of #42 as well):

name                 old speed      new speed      delta
Sum64/4B-8            104MB/s ± 1%   142MB/s ± 1%  +36.00%  (p=0.000 n=10+10)
Sum64/100B-8         1.51GB/s ± 2%  1.97GB/s ± 1%  +30.42%  (p=0.000 n=10+10)
Sum64/4KB-8          4.66GB/s ± 1%  5.48GB/s ± 1%  +17.56%  (p=0.000 n=9+9)
Sum64/10MB-8         4.79GB/s ± 0%  5.45GB/s ± 0%  +13.78%  (p=0.000 n=10+9)
Sum64String/4B-8      108MB/s ± 1%   149MB/s ± 1%  +38.39%  (p=0.000 n=10+8)
Sum64String/100B-8   1.59GB/s ± 2%  2.01GB/s ± 1%  +25.84%  (p=0.000 n=9+9)
Sum64String/4KB-8    4.67GB/s ± 1%  5.54GB/s ± 1%  +18.76%  (p=0.000 n=9+10)
Sum64String/10MB-8   4.78GB/s ± 1%  5.46GB/s ± 0%  +14.08%  (p=0.000 n=9+10)
DigestBytes/4B-8     37.4MB/s ± 1%  37.7MB/s ± 1%     ~     (p=0.093 n=10+10)
DigestBytes/100B-8    634MB/s ± 0%   657MB/s ± 0%   +3.57%  (p=0.000 n=10+9)
DigestBytes/4KB-8    4.33GB/s ± 0%  4.87GB/s ± 1%  +12.50%  (p=0.000 n=10+9)
DigestBytes/10MB-8   4.90GB/s ± 0%  5.47GB/s ± 1%  +11.54%  (p=0.000 n=10+10)
DigestString/4B-8    31.5MB/s ± 1%  31.6MB/s ± 1%     ~     (p=0.447 n=10+10)
DigestString/100B-8   561MB/s ± 1%   576MB/s ± 1%   +2.67%  (p=0.000 n=10+9)
DigestString/4KB-8   4.22GB/s ± 1%  4.72GB/s ± 0%  +11.77%  (p=0.000 n=9+9)
DigestString/10MB-8  4.90GB/s ± 0%  5.47GB/s ± 0%  +11.56%  (p=0.000 n=9+9)

I haven't yet tried whether NEON instructions provide any further speedup. The code assumes that unaligned access is safe; there is a system control register flag that can forbid unaligned access, but the Go compiler also assumes it's turned off.
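For anyone reproducing these numbers: the benchmark shape implied by the Sum64/4B, Sum64/100B, ... labels is roughly the sketch below, not the repo's exact bench_test.go. The old and new columns come from running it before and after the change and comparing with benchstat.

import "testing"

func BenchmarkSum64(b *testing.B) {
	sizes := []struct {
		name string
		n    int
	}{{"4B", 4}, {"100B", 100}, {"4KB", 4000}, {"10MB", 10e6}}
	for _, s := range sizes {
		in := make([]byte, s.n)
		b.Run(s.name, func(b *testing.B) {
			b.SetBytes(int64(s.n)) // makes benchstat report MB/s
			for i := 0; i < b.N; i++ {
				Sum64(in)
			}
		})
	}
}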

cespare commented 3 years ago

Awesome! It might take me a while to find the time to review this because I'm not too familiar with arm64 yet.

I haven't yet benchmarked this on actual ARM hardware.

I don't have physical access to hardware, but I can test on graviton2 instances in EC2 later if that helps.

cespare commented 3 years ago

@greatroar sorry for the delay. I forgot about this for a while. I started taking a look but it will take me a bit to get up to speed on arm assembly. I'm pretty swamped with work atm so it may be a bit before I finish reviewing. Feel free to ping me if you haven't heard back in a couple of weeks.

greatroar commented 2 years ago

Rebased to fix the merge commit. I've also finally found time to run this on actual hardware, a Raspberry Pi 4B. Benchmark results:

name                 old speed      new speed      delta
Sum64/4B-4            180MB/s ± 0%   251MB/s ± 0%  +39.13%  (p=0.000 n=10+10)
Sum64/100B-4          994MB/s ± 0%  1135MB/s ± 0%  +14.25%  (p=0.000 n=10+9)
Sum64/4KB-4          1.92GB/s ± 0%  1.93GB/s ± 0%   +0.43%  (p=0.000 n=10+10)
Sum64/10MB-4         1.88GB/s ± 0%  1.88GB/s ± 0%     ~     (p=0.754 n=10+10)
Sum64String/4B-4      133MB/s ± 4%   228MB/s ± 0%  +71.37%  (p=0.000 n=10+9)
Sum64String/100B-4    949MB/s ± 0%  1103MB/s ± 0%  +16.17%  (p=0.000 n=10+10)
Sum64String/4KB-4    1.92GB/s ± 0%  1.93GB/s ± 0%   +0.40%  (p=0.000 n=9+8)
Sum64String/10MB-4   1.88GB/s ± 0%  1.88GB/s ± 0%     ~     (p=0.146 n=10+8)
DigestBytes/4B-4     61.9MB/s ± 0%  61.9MB/s ± 0%     ~     (p=0.158 n=10+9)
DigestBytes/100B-4    695MB/s ± 0%   719MB/s ± 0%   +3.37%  (p=0.000 n=10+10)
DigestBytes/4KB-4    1.89GB/s ± 0%  1.90GB/s ± 0%   +0.43%  (p=0.000 n=9+10)
DigestBytes/10MB-4   1.88GB/s ± 0%  1.89GB/s ± 0%   +0.92%  (p=0.000 n=10+9)
DigestString/4B-4    58.9MB/s ± 0%  58.5MB/s ± 1%   -0.60%  (p=0.032 n=8+10)
DigestString/100B-4   669MB/s ± 0%   696MB/s ± 1%   +4.05%  (p=0.000 n=10+10)
DigestString/4KB-4   1.89GB/s ± 0%  1.89GB/s ± 0%   +0.34%  (p=0.000 n=10+10)
DigestString/10MB-4  1.88GB/s ± 0%  1.89GB/s ± 0%   +0.90%  (p=0.000 n=10+10)

I'll admit this is disappointing, but there is a speedup for small inputs. I find it hard to explain why the speedup stops at 4KiB. This CPU has a 32KiB L1 cache, so it shouldn't be running at L2 speed for that input size.

cespare commented 2 years ago

Hey @greatroar, sorry I still haven't had time to review the code. It's still on my radar.

I did have a few minutes to spin up an EC2 instance to try the benchmarks on a beefier arm64 chip (I used a c6g.metal instance). The results are similar to your Raspberry Pi results but even more pronounced:

name                  old speed      new speed      delta
Sum64/4B-64            435MB/s ± 1%   680MB/s ± 0%  +56.47%  (p=0.000 n=10+10)
Sum64/100B-64         2.01GB/s ± 0%  2.07GB/s ± 0%   +3.17%  (p=0.000 n=9+10)
Sum64/4KB-64          3.28GB/s ± 0%  3.28GB/s ± 0%   -0.03%  (p=0.000 n=8+8)
Sum64/10MB-64         3.33GB/s ± 0%  3.33GB/s ± 0%   -0.01%  (p=0.000 n=9+9)
Sum64String/4B-64      373MB/s ± 0%   501MB/s ± 2%  +34.29%  (p=0.000 n=8+10)
Sum64String/100B-64   1.99GB/s ± 0%  2.08GB/s ± 0%   +4.89%  (p=0.000 n=9+8)
Sum64String/4KB-64    3.28GB/s ± 0%  3.28GB/s ± 0%   +0.11%  (p=0.000 n=10+8)
Sum64String/10MB-64   3.33GB/s ± 0%  3.33GB/s ± 0%   -0.01%  (p=0.006 n=10+9)
DigestBytes/4B-64      179MB/s ± 0%   179MB/s ± 0%   -0.07%  (p=0.001 n=7+7)
DigestBytes/100B-64   1.63GB/s ± 0%  1.64GB/s ± 0%   +0.53%  (p=0.000 n=9+8)
DigestBytes/4KB-64    3.25GB/s ± 0%  3.25GB/s ± 0%     ~     (p=0.138 n=10+10)
DigestBytes/10MB-64   3.33GB/s ± 0%  3.33GB/s ± 0%     ~     (p=0.051 n=10+8)
DigestString/4B-64     139MB/s ± 3%   137MB/s ± 2%     ~     (p=0.497 n=10+9)
DigestString/100B-64  1.52GB/s ± 1%  1.53GB/s ± 1%   +1.03%  (p=0.001 n=10+10)
DigestString/4KB-64   3.24GB/s ± 0%  3.23GB/s ± 0%   -0.07%  (p=0.000 n=9+10)
DigestString/10MB-64  3.33GB/s ± 0%  3.33GB/s ± 0%   -0.00%  (p=0.006 n=9+10)

It's suspicious that the speed is so precisely similar for larger block sizes. Seems like we must be hitting a hard bottleneck of some kind, or else the generated code on the main path must be almost the same as the handwritten code.

greatroar commented 2 years ago

The generated code is very similar; the main difference is the use of LDP/STP to transfer 16-byte blocks in the handwritten code, and that doesn't seem to buy much. The reason the handwritten code is faster on smaller inputs may be that it does everything in registers and loads the primes faster, while the compiled code spills some values to the stack.
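For reference, each lane of the main loop, whether compiled or handwritten, computes the standard XXH64 round, which in pure Go is essentially:

import "math/bits"

const (
	prime1 uint64 = 11400714785074694791
	prime2 uint64 = 14029467366897019727
)

// round is the per-lane XXH64 step: multiply-add, rotate, multiply.
// The main loop runs four of these for every 32-byte block.
func round(acc, input uint64) uint64 {
	acc += input * prime2
	acc = bits.RotateLeft64(acc, 31)
	return acc * prime1
}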

Imanmahmoodi commented 2 years ago

😃

lizthegrey commented 2 years ago

Please let me know if you need help testing this; I'm always happy to volunteer to make things run faster on arm64.

cespare commented 2 years ago

@lizthegrey the problem is more that the testing has shown that the improvement is marginal.

lizthegrey commented 2 years ago

We see a 7% improvement in time spent in xxhash from adopting this within the @klauspost zstd library. https://share.getcloudapp.com/GGuEkRnR

IMO this is worth merging.

cespare commented 2 years ago

@lizthegrey it would be helpful to know how the results you've screenshotted correspond to the microbenchmarks we've been discussing above. Do the zstd benchmarks run xxhash over mostly very short strings, for instance?

I took another look at the benchmarks using Go 1.18beta1. (I also added a 16-byte comparison on main.) Here are the results, again using a c6g.metal EC2 instance:

name                  old speed      new speed      delta
Sum64/4B-64            469MB/s ± 0%   672MB/s ± 0%  +43.24%  (p=0.000 n=8+10)
Sum64/16B-64          1.12GB/s ± 0%  1.61GB/s ± 0%  +43.24%  (p=0.000 n=9+10)
Sum64/100B-64         2.04GB/s ± 0%  2.07GB/s ± 0%   +1.67%  (p=0.000 n=10+10)
Sum64/4KB-64          3.28GB/s ± 0%  3.28GB/s ± 0%   -0.09%  (p=0.000 n=10+10)
Sum64/10MB-64         3.33GB/s ± 0%  3.33GB/s ± 0%     ~     (p=0.094 n=10+9)
Sum64String/4B-64      441MB/s ± 0%   499MB/s ± 0%  +13.33%  (p=0.000 n=10+7)
Sum64String/16B-64    1.08GB/s ± 1%  1.43GB/s ± 0%  +32.17%  (p=0.000 n=10+10)
Sum64String/100B-64   2.02GB/s ± 0%  2.08GB/s ± 0%   +3.12%  (p=0.000 n=9+10)
Sum64String/4KB-64    3.28GB/s ± 0%  3.28GB/s ± 0%   +0.00%  (p=0.028 n=9+9)
Sum64String/10MB-64   3.33GB/s ± 0%  3.33GB/s ± 0%     ~     (p=0.148 n=10+10)
DigestBytes/4B-64      207MB/s ± 1%   206MB/s ± 0%   -0.42%  (p=0.000 n=10+9)
DigestBytes/16B-64     669MB/s ± 0%   666MB/s ± 0%   -0.52%  (p=0.000 n=10+8)
DigestBytes/100B-64   1.85GB/s ± 0%  1.80GB/s ± 0%   -2.84%  (p=0.000 n=10+10)
DigestBytes/4KB-64    3.27GB/s ± 0%  3.26GB/s ± 0%   -0.09%  (p=0.000 n=10+10)
DigestBytes/10MB-64   3.33GB/s ± 0%  3.33GB/s ± 0%     ~     (p=0.403 n=10+10)
DigestString/4B-64     200MB/s ± 0%   199MB/s ± 0%   -0.24%  (p=0.000 n=8+10)
DigestString/16B-64    645MB/s ± 0%   646MB/s ± 0%     ~     (p=0.065 n=9+10)
DigestString/100B-64  1.82GB/s ± 0%  1.71GB/s ± 1%   -6.20%  (p=0.000 n=10+10)
DigestString/4KB-64   3.27GB/s ± 0%  3.26GB/s ± 0%   -0.30%  (p=0.000 n=8+10)
DigestString/10MB-64  3.33GB/s ± 0%  3.33GB/s ± 0%   +0.01%  (p=0.002 n=10+9)

We still see that the asm is significantly faster only for small strings (4 and 16 bytes) when using the one-shot hashing functions, and the advantage has narrowed a little due to compiler improvements for arm64 in Go 1.18. There is also a measurable slowdown from the asm in a couple of the benchmarks.

For context, here is a comparison on amd64 (using a c5.metal instance) between purego and asm:

name                  old speed      new speed       delta
Sum64/4B-96            882MB/s ± 1%   1065MB/s ± 1%  +20.74%  (p=0.000 n=9+9)
Sum64/16B-96          2.41GB/s ± 0%   3.01GB/s ± 2%  +25.25%  (p=0.000 n=10+10)
Sum64/100B-96         5.66GB/s ± 1%   6.60GB/s ± 1%  +16.58%  (p=0.000 n=8+9)
Sum64/4KB-96          10.0GB/s ± 1%   14.5GB/s ± 0%  +45.92%  (p=0.000 n=10+8)
Sum64/10MB-96         10.2GB/s ± 1%   15.1GB/s ± 0%  +47.33%  (p=0.000 n=9+9)
Sum64String/4B-96      838MB/s ± 0%    902MB/s ± 0%   +7.64%  (p=0.000 n=8+9)
Sum64String/16B-96    2.32GB/s ± 1%   2.70GB/s ± 1%  +16.43%  (p=0.000 n=9+10)
Sum64String/100B-96   5.59GB/s ± 0%   6.21GB/s ± 3%  +11.24%  (p=0.000 n=8+10)
Sum64String/4KB-96    10.0GB/s ± 0%   14.5GB/s ± 0%  +45.67%  (p=0.000 n=7+10)
Sum64String/10MB-96   10.3GB/s ± 0%   15.1GB/s ± 1%  +47.09%  (p=0.000 n=9+9)
DigestBytes/4B-96      391MB/s ± 1%    393MB/s ± 0%   +0.51%  (p=0.004 n=9+9)
DigestBytes/16B-96    1.29GB/s ± 1%   1.29GB/s ± 1%     ~     (p=0.780 n=9+10)
DigestBytes/100B-96   3.94GB/s ± 1%   4.37GB/s ± 2%  +10.83%  (p=0.000 n=8+10)
DigestBytes/4KB-96    9.79GB/s ± 1%  14.08GB/s ± 0%  +43.82%  (p=0.000 n=9+9)
DigestBytes/10MB-96   10.3GB/s ± 0%   15.1GB/s ± 1%  +47.08%  (p=0.000 n=10+10)
DigestString/4B-96     382MB/s ± 1%    385MB/s ± 1%   +0.82%  (p=0.008 n=8+9)
DigestString/16B-96   1.26GB/s ± 2%   1.27GB/s ± 1%     ~     (p=0.079 n=9+10)
DigestString/100B-96  3.92GB/s ± 2%   4.14GB/s ± 1%   +5.53%  (p=0.000 n=10+10)
DigestString/4KB-96   9.79GB/s ± 1%  14.00GB/s ± 0%  +43.04%  (p=0.000 n=8+9)
DigestString/10MB-96  10.3GB/s ± 1%   15.1GB/s ± 1%  +47.40%  (p=0.000 n=9+9)

You can see that the assembly version offers a clear benefit over pure Go across the board, with the largest improvement going to the biggest inputs.


I'm just not that excited about adding an assembly implementation which only produces benefits on input sizes that are about the same size as the hash itself, which is not a Pareto improvement on the benchmarks we've got, and which seems likely to get worse relative to pure Go with each passing release, given the large improvements to arm64 that keep happening in the compiler.

That said, I would like to sit down and understand the implementation and see if there are ways to improve it, particularly the handling of full blocks. I'll try to make some time for that soon.

AWSjswinney commented 2 years ago

I'm just not that excited about adding an assembly implementation which only produces benefits on input sizes that are about the same size as the hash itself, which is not a Pareto improvement on the benchmarks we've got, and which seems likely to get worse relative to pure Go with each passing release, given the large improvements to arm64 that keep happening in the compiler.

This is a valid concern. Let me add another data point: Graviton 2 and the Raspberry Pi 4 may not see a significant improvement because they are most likely bottlenecked on the floating-point/SIMD pipelines. Other CPUs, such as Graviton 3, will see improvement from this change.

I tested this code as written on a c7g.2xlarge with Go 1.17.

name                 old time/op    new time/op    delta
Sum64/4B-8             5.13ns ± 0%    3.09ns ± 0%  -39.82%  (p=0.000 n=10+10)
Sum64/100B-8           20.8ns ± 0%    12.4ns ± 0%  -40.43%  (p=0.000 n=10+9)
Sum64/4KB-8             351ns ± 0%     248ns ± 0%  -29.39%  (p=0.000 n=10+10)
Sum64/10MB-8            850µs ± 0%     602µs ± 0%  -29.22%  (p=0.000 n=10+10)
Sum64String/4B-8       5.39ns ± 0%    3.61ns ± 0%  -33.00%  (p=0.000 n=10+10)
Sum64String/100B-8     21.2ns ± 0%    13.1ns ± 0%  -38.17%  (p=0.000 n=10+10)
Sum64String/4KB-8       351ns ± 0%     248ns ± 0%  -29.22%  (p=0.000 n=10+10)
Sum64String/10MB-8      850µs ± 0%     603µs ± 0%  -29.10%  (p=0.000 n=10+10)
DigestBytes/4B-8       12.6ns ± 0%    12.6ns ± 0%   +0.35%  (p=0.000 n=10+10)
DigestBytes/100B-8     27.4ns ± 0%    24.6ns ± 0%  -10.00%  (p=0.000 n=9+10)
DigestBytes/4KB-8       355ns ± 0%     258ns ± 0%  -27.48%  (p=0.000 n=10+10)
DigestBytes/10MB-8      850µs ± 0%     602µs ± 0%  -29.18%  (p=0.000 n=10+9)
DigestString/4B-8      15.9ns ± 0%    15.9ns ± 0%   +0.10%  (p=0.041 n=10+10)
DigestString/100B-8    27.8ns ± 0%    25.9ns ± 0%   -6.95%  (p=0.000 n=10+9)
DigestString/4KB-8      356ns ± 0%     259ns ± 0%  -27.15%  (p=0.000 n=9+10)
DigestString/10MB-8     849µs ± 0%     602µs ± 0%  -29.03%  (p=0.000 n=10+10)

name                 old speed      new speed      delta
Sum64/4B-8            779MB/s ± 0%  1295MB/s ± 0%  +66.18%  (p=0.000 n=10+10)
Sum64/100B-8         4.80GB/s ± 0%  8.06GB/s ± 0%  +67.86%  (p=0.000 n=10+9)
Sum64/4KB-8          11.4GB/s ± 0%  16.2GB/s ± 0%  +41.62%  (p=0.000 n=9+10)
Sum64/10MB-8         11.8GB/s ± 0%  16.6GB/s ± 0%  +41.29%  (p=0.000 n=10+10)
Sum64String/4B-8      742MB/s ± 0%  1108MB/s ± 0%  +49.26%  (p=0.000 n=10+10)
Sum64String/100B-8   4.73GB/s ± 0%  7.64GB/s ± 0%  +61.72%  (p=0.000 n=9+10)
Sum64String/4KB-8    11.4GB/s ± 0%  16.1GB/s ± 0%  +41.27%  (p=0.000 n=9+10)
Sum64String/10MB-8   11.8GB/s ± 0%  16.6GB/s ± 0%  +41.04%  (p=0.000 n=10+10)
DigestBytes/4B-8      319MB/s ± 0%   318MB/s ± 0%   -0.35%  (p=0.000 n=10+10)
DigestBytes/100B-8   3.65GB/s ± 0%  4.06GB/s ± 0%  +11.11%  (p=0.000 n=9+10)
DigestBytes/4KB-8    11.3GB/s ± 0%  15.5GB/s ± 0%  +37.90%  (p=0.000 n=10+10)
DigestBytes/10MB-8   11.8GB/s ± 0%  16.6GB/s ± 0%  +41.20%  (p=0.000 n=10+9)
DigestString/4B-8     252MB/s ± 0%   252MB/s ± 0%     ~     (p=0.072 n=10+10)
DigestString/100B-8  3.60GB/s ± 0%  3.87GB/s ± 0%   +7.46%  (p=0.000 n=9+9)
DigestString/4KB-8   11.2GB/s ± 0%  15.4GB/s ± 0%  +37.27%  (p=0.000 n=10+10)
DigestString/10MB-8  11.8GB/s ± 0%  16.6GB/s ± 0%  +40.90%  (p=0.000 n=10+10)

This instance type is currently in customer preview, so it's not widely available, but once it is, this code will definitely be worthwhile.

AWSjswinney commented 2 years ago

I haven't yet tried whether NEON instructions provide any further speedup. The code assumes that unaligned access is safe; there is a system control register flag that can forbid unaligned access, but the Go compiler also assumes it's turned off.

There are no NEON instructions that do multiply-adds with 64-bit inputs and outputs, so using scalar instructions as written is probably the best we can do for now. However, SVE does include instructions like this, so it's possible a newer implementation could see even better performance on CPUs with SVE support.

cespare commented 2 years ago

@AWSjswinney very nice! That's helpful and does give a lot more motivation for this change. (Also, the absolute numbers from that chip look great.)

lizthegrey commented 2 years ago

SVE

The challenge is going to be detecting whether the CPU supports SVE, as well as Go implementing support for the SVE opcodes. Until then I think we have to default to the compatible path and just go with what's supported by default in Go. See https://github.com/golang/go/issues/44734

There's no GOARM64= yet, only GOARM= and GOAMD64=, so adding that might also be helpful for letting Go know which variant to use.

cespare commented 2 years ago

The challenge is going to be detecting whether the CPU supports SVE

Some googling led me to this page, which says:

Support for the execution of SVE instructions in userspace can also be detected by reading the CPU ID register ID_AA64PFR0_EL1 using an MRS instruction, and checking that the value of the SVE field is nonzero.

We could probably do that at startup. But one question is whether adding a (predictable) branch to every hash function call is too slow.
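A minimal sketch of that startup-time dispatch; hasSVE, sum64Scalar, and sum64SVE are hypothetical names, not anything in this repo:

// Pick an implementation once at startup; afterwards every call pays
// only a predictable indirect call.
var sum64Impl func([]byte) uint64 = sum64Scalar // hypothetical scalar fallback

func init() {
	if hasSVE() { // hypothetical feature probe, run once
		sum64Impl = sum64SVE // hypothetical SVE routine
	}
}

func Sum64(b []byte) uint64 { return sum64Impl(b) }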

as well as Go implementing support for the SVE opcodes.

Do you mean adding the instructions to the Go assembler? If so, that's not necessary; we can always just emit the bytes directly.
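A sketch of what that could look like in a Go arm64 assembly file; readPFR0 is a hypothetical wrapper declared in Go as func readPFR0() uint64, and WORD emits a raw 32-bit encoding (0xd5380400 should be MRS ID_AA64PFR0_EL1 into the first register):

#include "textflag.h"

// func readPFR0() uint64 (hypothetical)
TEXT ·readPFR0(SB), NOSPLIT, $0-8
	WORD $0xd5380400   // MRS ID_AA64PFR0_EL1, R0, emitted raw
	MOVD R0, ret+0(FP)
	RET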

There's no GOARM64= yet, only GOARM= and GOAMD64=, so adding that might also be helpful for letting Go know which variant to use.

As far as I know those env variables are only useful for telling the compiler what instructions it's allowed to use. A missing piece here is having build tags to select certain CPU features. That's tracked by https://github.com/golang/go/issues/45454. So we'd need both new GOARM64 variants that allow selecting a baseline feature level which includes SVE, and https://github.com/golang/go/issues/45454 to make that available as a build tag. With all of that we'd be able to avoid runtime feature detection.
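Purely hypothetical illustration: by analogy with the amd64.v2-style tags that GOAMD64 defines, a GOARM64 feature level plus the build tags from that issue could select a file at compile time with no runtime probe. The arm64.sve tag below does not exist:

//go:build arm64.sve

package xxhash

// Sum64 would be backed unconditionally by an SVE assembly routine.
//go:noescape
func Sum64(b []byte) uint64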

klauspost commented 2 years ago

Support for the execution of SVE instructions in userspace can also be detected by reading the CPU ID register ID_AA64PFR0_EL1 using an MRS instruction, and checking that the value of the SVE field is nonzero.

Careful. This is a protected instruction, and requires the OS to intercept the call and handle it. Not all operating systems do that, so your program can crash on init.

In cpuid the implementation is rather complex. There may be simpler methods, but this is what I found, not being any sort of ARM expert.

On Linux, we first attempt it by hooking into the runtime's internal CPU detection, which the runtime populates with data that isn't exposed publicly. AFAICT there is no other way to get this information:

import _ "unsafe" // required for go:linkname

//go:linkname hwcap internal/cpu.HWCap
var hwcap uint

If this isn't set, /proc/self/auxv is read (it isn't available on Android, though, so that must fail gracefully).

If the CPUID flag isn't set by any of the above, we do not read MIDR_EL1, since it may crash.

For OSX, I have not found any method for determining features, so just a base set is enabled.
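Once hwcap is populated by one of the paths above, a feature check such as SVE is a single bit test. A minimal sketch, assuming the HWCAP_SVE bit (1 << 22) from the Linux arm64 uapi headers:

// HWCAP_SVE is bit 22 of the arm64 hwcap word in the Linux uapi headers.
const hwcapSVE = 1 << 22

// hasSVE reports whether the kernel advertises userspace SVE support.
func hasSVE() bool { return hwcap&hwcapSVE != 0 }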

lizthegrey commented 2 years ago

Do you mean adding the instructions to the Go assembler? If so, that's not necessary; we can always just emit the bytes directly.

That has a bit of a cost in terms of readability, but it'll work in the short term...

In any event: there's some benefit for some real-world use cases and no evidence of significant harm. As @AWSjswinney says, the code is bottlenecked on floating-point math on all but the newest hardware, but there are significant speedups on the newest hardware. Should we make the decision to go ahead and merge this as it is?

AWSjswinney commented 2 years ago

Adding an SVE implementation would be speculative at this point, and it wouldn't be beneficial until CPUs with SVE are widely available, which will take a while. By that time, the Go assembler should support all of the necessary SVE instructions. I haven't looked at the amd64 implementation much, but I suspect something similar may be possible for that architecture with SSE or AVX. It's just an idea for the future.

I definitely think it makes sense to merge as is for now.

lizthegrey commented 2 years ago

Ping @cespare, any chance we can merge this so we can keep iterating on it in the future? The timing differences are basically noise when the bottleneck is the multiply units on older hardware, and the code shows substantial benefit when that bottleneck is removed on the newest generation of hardware.

cespare commented 2 years ago

@lizthegrey this remains on my TODO list. I need to understand and review the code.

lizthegrey commented 2 years ago

c7g (Graviton 3) is now generally available. Profiling shows this assembly code running at least 17% faster than on Graviton 2.

https://flamegraph.com/share/18a16815-db9e-11ec-b50b-a2e3bd482a0d

snadampal commented 2 years ago

Any hope to get this reviewed and merged soon?