BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0

optimize neon loadu_128/storeu_128 #384

Closed divinity76 closed 3 months ago

divinity76 commented 4 months ago

vld1q_u8 and vst1q_u8 have no alignment requirements.
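For readers following along, here is a minimal sketch of what byte-wise 128-bit load/store helpers can look like. It is not the actual diff from this PR. The names loadu_128 and storeu_128 come from the PR title; the vec128 typedef, the memcpy fallback for non-NEON targets, and the roundtrip_ok helper are assumptions added so the example compiles anywhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

// Hypothetical alias so both branches expose the same type name.
typedef uint32x4_t vec128;

// Load 16 bytes with the byte-element intrinsic, then reinterpret the
// register as four 32-bit lanes. src does not need to be 4-byte aligned.
static inline vec128 loadu_128(const uint8_t src[16]) {
  return vreinterpretq_u32_u8(vld1q_u8(src));
}

// Reinterpret the four 32-bit lanes as bytes and store them byte-wise.
static inline void storeu_128(vec128 v, uint8_t dest[16]) {
  vst1q_u8(dest, vreinterpretq_u8_u32(v));
}

#else

// Assumed portable fallback for targets without NEON: plain memcpy.
typedef struct { uint32_t w[4]; } vec128;

static inline vec128 loadu_128(const uint8_t src[16]) {
  vec128 v;
  memcpy(&v, src, 16);
  return v;
}

static inline void storeu_128(vec128 v, uint8_t dest[16]) {
  memcpy(dest, &v, 16);
}

#endif

static_assert(sizeof(vec128) == 16, "vec128 must hold exactly 16 bytes");

// Hypothetical helper: returns 1 if 16 bytes survive a load/store
// round trip unchanged, 0 otherwise.
static int roundtrip_ok(const uint8_t *src) {
  uint8_t out[16];
  storeu_128(loadu_128(src), out);
  return memcmp(src, out, 16) == 0;
}
```

Calling roundtrip_ok(buf + 1) on a byte buffer exercises the deliberately misaligned case, which is the situation the unaligned helpers exist for.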

This improves performance on Oracle Cloud's VM.Standard.A1.Flex by about 1.15% on a 16*1024 input, from approximately 13920 nanoseconds down to approximately 13800 nanoseconds.

oconnor663 commented 3 months ago

I see a ~1% improvement on the Graviton2 CPU on my AWS instance too. Thanks!

oconnor663 commented 3 months ago

Released as part of v1.5.1.

divinity76 commented 3 months ago

I wonder if this might have made big endian work too 🤔 (doesn't really matter, nothing runs big endian)