I've compiled the C example with and without NEON optimizations, setting the -DBLAKE3_USE_NEON=1 -O3 -mfpu=neon-vfpv4 compiler flags for the NEON build, and to my surprise the non-NEON variant seems to perform better. I tested with a ~30MB file (both from RAM and from flash, to rule out I/O), and here are the results:
Without NEON:
time ./b3sum < /dev/mtd5ro
5420676b03e59d74cd44331c200ea841cd247374f307ce838dc6a0d367f73774
real 0m 2.11s
user 0m 0.67s
sys 0m 0.08s
With NEON:
time ./b3sum < /dev/mtd5ro
5420676b03e59d74cd44331c200ea841cd247374f307ce838dc6a0d367f73774
real 0m 2.17s
user 0m 0.76s
sys 0m 0.04s
I saw that there were some changes in the 1.5.1 release, so I also tried 1.5.0, but the results are the same.
Any suggestions on what might cause this?
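To take I/O and process startup out of the picture entirely, the hashing call could also be timed on its own over a buffer that is already in memory. Below is a minimal sketch of that idea, not what I actually ran. The names `hash_fn`, `standin_hash` and `best_cpu_seconds` are made up for illustration, and the stand-in would have to be replaced by a small wrapper around `blake3_hasher_init`, `blake3_hasher_update` and `blake3_hasher_finalize` from blake3.h to measure the real thing.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature: hash `len` bytes of `data` into a 32-byte digest. */
typedef void (*hash_fn)(const uint8_t *data, size_t len, uint8_t out[32]);

/* Keeps the compiler from discarding the hash call as dead code. */
static volatile uint8_t sink;

/*
 * Stand-in so the sketch builds without blake3.h. This is NOT a real hash:
 * it just XOR-folds the input into 32 bytes. For a real measurement, swap in
 * a wrapper that calls blake3_hasher_init / blake3_hasher_update /
 * blake3_hasher_finalize (with BLAKE3_OUT_LEN as the output length).
 */
static void standin_hash(const uint8_t *data, size_t len, uint8_t out[32]) {
    for (size_t i = 0; i < 32; i++) {
        out[i] = 0;
    }
    for (size_t i = 0; i < len; i++) {
        out[i % 32] ^= data[i];
    }
}

/*
 * Runs `fn` over the same in-memory buffer `runs` times and returns the
 * smallest CPU time of a single run, in seconds. Taking the minimum is an
 * assumption of this sketch: it is meant to damp scheduler noise.
 * Returns a negative value if `runs` is not positive.
 */
static double best_cpu_seconds(hash_fn fn, const uint8_t *data, size_t len,
                               int runs) {
    uint8_t out[32];
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        fn(data, len, out);
        clock_t end = clock();
        sink = out[0];
        double secs = (double)(end - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || secs < best) {
            best = secs;
        }
    }
    return best;
}
```

Building this once with and once without the NEON flags and calling `best_cpu_seconds` with the same ~30MB buffer would compare CPU time only, which is closer to the `user` figures above than to `real`.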