Open 1f604 opened 1 year ago
I don't know enough about NEON performance to give you a good answer, but I can say that my own experience on the RPi4 was pretty similar. It's a small boost over portable. I've seen cases where other ARM CPUs get closer to a 2x speedup though.
It's just about guaranteed that there's low-hanging fruit in the NEON code, and that we could speed it up by finding my dumb mistakes. `blake3_neon.c` is a pretty naive port of the SSE4.1 implementation, and it's the only NEON code I've ever written.
I would have assumed that the NEON version would be at least 400% faster than portable. Is this expected?
Native integers on ARM64 are 64 bits wide. NEON registers are 128 bits wide, so the maximum attainable speedup is 2x. (In theory it can be faster if NEON offers operations that the scalar integer instructions lack and that can be leveraged, but this is rather unusual.)
BLAKE3 uses 32-bit words, though, so I think it makes sense to say the maximum attainable speedup is 4x?
The microarchitecture matters more than the instruction set. The Raspberry Pi 4 uses a Cortex-A72, and looking at the instruction properties we see that we can execute 2 scalar adds/xor/rotations per cycle (and some rotations may come for free). With NEON, we have 2 adds and xors per cycle, but no native rotation instructions, which are replaced by shl+shr+orr, the first 2 of which can only be dispatched to one execution unit per cycle.
The arithmetic operation count of the BLAKE3 core can be approximated by 336 adds, 224 xors, and 224 rotations. Since there is sufficient parallelism within a round, the bottleneck is instruction throughput, and we can lower-bound the scalar cost at (336 + 224 + 224) / 2 / 64 ≈ 6.125 cycles per byte. For NEON, on the other hand, we have (336/2 + 224/2 + 224·(1 + 1 + 1/2)) / (64·4) ≈ 3.28 cycles per byte, the denominator reflecting four 64-byte blocks hashed in parallel. Looking at eBASH, we see a measured value of 4.78 cycles per byte.
So based on basic arithmetic costs alone, we are limited to at best a little under 2x speedup for NEON on this chip. The remainder of the overhead could be attributed to the rest of the compression function operations (e.g., transposing the message into place) or poor GCC code generation; this microarchitecture is not very wide, so instruction scheduling could still make a significant difference here. Hard to say without looking at specifics.
Hi all,
I compiled `example.c` with and without NEON support on my Raspberry Pi 4 and got these results (using the same 2GB test file):
I also installed Rust and b3sum and got these results:
The running time is clearly not IO-dominated: xxhash hashed the same file in 2 seconds, while the NEON-compiled example took 9.5 seconds. Piping the file into b3sum instead of calling `b3sum file` directly does add about 2 seconds to the running time, but even after shaving those off, it's clear that most of the time is spent in the CPU rather than on IO.
So the results show that the NEON version of BLAKE3 is only about 26% faster than the portable version.
I don't understand why compiling with and without NEON doesn't seem to make that much of a difference.
I would have assumed that the NEON version would be at least 400% faster than portable. Is this expected?
Maybe it is due to GCC producing bad NEON code? Is there an assembly version?
I am using GCC 10.2.1.
Thanks a lot!
EDIT: Compiling with clang 11.0.1-2 instead of GCC improved performance by about 7% (9.5s -> 8.9s average). I did not notice a difference after PGO with either GCC or clang.