BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0

NEON version slower on Arm Cortex-A9 #403

Open tsat-psv opened 5 days ago

tsat-psv commented 5 days ago

I've compiled the C example with and without NEON optimizations, setting the -DBLAKE3_USE_NEON=1 -O3 -mfpu=neon-vfpv4 compiler flags, and to my surprise the non-NEON variant seems to perform better. I've tested on a ~30 MB file (both from RAM and from flash, to rule out I/O), and here are the results:

Without NEON:

time ./b3sum < /dev/mtd5ro
5420676b03e59d74cd44331c200ea841cd247374f307ce838dc6a0d367f73774
real    0m 2.11s
user    0m 0.67s
sys     0m 0.08s

With NEON:

time ./b3sum < /dev/mtd5ro
5420676b03e59d74cd44331c200ea841cd247374f307ce838dc6a0d367f73774
real    0m 2.17s
user    0m 0.76s
sys     0m 0.04s
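
In both runs above, user + sys (about 0.75 s and 0.80 s) is well under half of the real time, so the `time` figures also include process startup and waiting on the device. One way to compare the two builds more directly is to time only the hashing of a buffer that is already in memory. Below is a minimal sketch of such a harness, not part of the BLAKE3 example: `hash_fn`, `placeholder_hash` and `bench_hash` are hypothetical names, and `placeholder_hash` is a stand-in (FNV-1a, not BLAKE3) so that the sketch builds without the library. It assumes a POSIX system that provides `CLOCK_PROCESS_CPUTIME_ID`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical callback type: hash `len` bytes at `data` into a 32-byte `out`. */
typedef void (*hash_fn)(const uint8_t *data, size_t len, uint8_t out[32]);

/*
 * Stand-in hash (64-bit FNV-1a, NOT BLAKE3), used only so this sketch is
 * self-contained. For a real measurement, replace the body with the calls
 * from blake3.h:
 *   blake3_hasher h;
 *   blake3_hasher_init(&h);
 *   blake3_hasher_update(&h, data, len);
 *   blake3_hasher_finalize(&h, out, BLAKE3_OUT_LEN);
 */
void placeholder_hash(const uint8_t *data, size_t len, uint8_t out[32]) {
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;              /* FNV prime */
    }
    memset(out, 0, 32);
    memcpy(out, &h, sizeof h);
}

/* CPU time (user + sys) consumed by this process so far, in seconds. */
static double cpu_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Hash the same in-memory buffer `reps` times and return the best
 * (smallest) CPU time of a single run. Taking the minimum reduces the
 * influence of cold caches and scheduler noise.
 */
double bench_hash(hash_fn fn, const uint8_t *buf, size_t len,
                  int reps, uint8_t out[32]) {
    assert(fn != NULL && reps > 0);
    double best = -1.0;
    for (int r = 0; r < reps; r++) {
        double t0 = cpu_seconds();
        fn(buf, len, out);
        double dt = cpu_seconds() - t0;
        if (best < 0.0 || dt < best) {
            best = dt;
        }
    }
    return best;
}
```

Calling `bench_hash` from a small `main` on a ~30 MB `malloc`'d buffer, once in a binary built with the NEON flags and once without, would give two CPU-only numbers that exclude the read from `/dev/mtd5ro`.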

I saw that there were some changes in the 1.5.1 release, so I also tried 1.5.0, but the results are the same.

Any suggestions on what might cause this?