ashvardanian / SimSIMD

Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, C, and Swift, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0

Enhanced Load Masking for Prefixes and Suffixes #29

Closed ashvardanian closed 5 months ago

ashvardanian commented 8 months ago

SimSIMD predominantly relies on unaligned loads. Where AVX-512 is available, masked loads already let us skip serial handling of tail elements. A faster, albeit more advanced, scheme is possible: assuming that if any byte of a 64-byte cache line belongs to the vector, the entire cache line is readable, we can switch to aligned loads exclusively. Each load would then fetch a full cache line, performing some redundant work on bytes outside the vector, but avoiding unaligned loads entirely.
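The bookkeeping for this scheme could be sketched as follows. The `span_t` struct and `span_cache_lines` helper are hypothetical names for illustration, not part of SimSIMD: given a byte range, they compute the aligned address of the first cache line it touches and the byte masks that an AVX-512 masked load (e.g. `_mm512_maskz_load_epi8`) would apply to the first and last lines.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper, not part of SimSIMD: describes which 64-byte cache
// lines a byte range [start, start + count) touches, and which bytes of the
// first and last lines are actually part of the range. Mask bit `i`
// corresponds to byte `i` of an aligned 64-byte load.
typedef struct {
    uintptr_t first_line; // aligned address of the first cache line
    size_t num_lines;     // number of 64-byte lines spanned
    uint64_t head_mask;   // valid bytes within the first line
    uint64_t tail_mask;   // valid bytes within the last line
} span_t;

static span_t span_cache_lines(uintptr_t start, size_t count) {
    span_t s;
    uintptr_t end = start + count; // one past the last byte
    s.first_line = start & ~(uintptr_t)63;
    uintptr_t last_line = (end - 1) & ~(uintptr_t)63;
    s.num_lines = (last_line - s.first_line) / 64 + 1;
    size_t head_offset = start - s.first_line; // bytes to skip in line 0
    size_t tail_bytes = end - last_line;       // valid bytes in the last line
    s.head_mask = ~(uint64_t)0 << head_offset; // clear the low `head_offset` bits
    s.tail_mask = tail_bytes == 64             // avoid UB of a 64-bit shift
                      ? ~(uint64_t)0
                      : (((uint64_t)1 << tail_bytes) - 1);
    if (s.num_lines == 1) { // one line: both masks must agree
        s.head_mask &= s.tail_mask;
        s.tail_mask = s.head_mask;
    }
    return s;
}
```

With these masks, the kernel's main loop would issue plain aligned loads for the interior lines and masked aligned loads only for the first and last ones.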

To avoid false positives from memory sanitizers, which would flag the intentional reads past the logical buffer bounds, the affected kernels should be annotated with __attribute__((no_sanitize_address)) and __attribute__((no_sanitize_thread)).
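A minimal sketch of how those attributes might be applied; `sum_line_masked` is a hypothetical function, not SimSIMD code. The spellings below are the GCC-style ones named in this issue; Clang also accepts the string form `__attribute__((no_sanitize("address")))`.

```c
#include <stdint.h>

// Hypothetical example: a function that reads a whole aligned 64-byte cache
// line and discards the bytes excluded by `mask`. Because it may touch bytes
// outside the logical buffer, ASan/TSan instrumentation is suppressed for it.
__attribute__((no_sanitize_address)) //
__attribute__((no_sanitize_thread))  //
static uint32_t sum_line_masked(uint8_t const *line, uint64_t mask) {
    uint32_t sum = 0;
    for (int i = 0; i < 64; ++i)
        if (mask & ((uint64_t)1 << i)) sum += line[i];
    return sum;
}
```

On compilers that lack these attributes they can be hidden behind a feature-test macro, so the annotation degrades to a no-op rather than a build error.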