ashvardanian / SimSIMD

Up to 200x Faster Inner Products and Vector Similarity — for Python, JavaScript, Rust, C, and Swift, supporting f64, f32, f16 real & complex, i8, and binary vectors using SIMD for both x86 AVX2 & AVX-512 and Arm NEON & SVE 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0

Enhanced Load Masking for Prefixes and Suffixes #29

Closed ashvardanian closed 5 months ago

ashvardanian commented 8 months ago

SimSIMD predominantly relies on unaligned loads. Where AVX-512 is available, masked loads already let us skip serial handling of tail elements. A faster, albeit more advanced, scheme is possible: assuming that if any byte of a 64-byte cache line belongs to the vector, the entire cache line is readable, we can switch to aligned loads exclusively. Each load would then fetch a full cache line, performing some redundant work on bytes outside the vector, but avoiding unaligned loads entirely.
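The bookkeeping for this scheme could be sketched as follows. The `span_t` struct and `span_cache_lines` helper are hypothetical names for illustration, not part of SimSIMD: given a byte range, they compute the aligned address of the first cache line it touches and the byte masks that an AVX-512 masked load (e.g. `_mm512_maskz_load_epi8`) would apply to the first and last lines.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper, not part of SimSIMD: describes which 64-byte cache
// lines a byte range [start, start + count) touches, and which bytes of the
// first and last lines are actually part of the range. Mask bit `i`
// corresponds to byte `i` of an aligned 64-byte load.
typedef struct {
    uintptr_t first_line; // aligned address of the first cache line
    size_t num_lines;     // number of 64-byte lines spanned
    uint64_t head_mask;   // valid bytes within the first line
    uint64_t tail_mask;   // valid bytes within the last line
} span_t;

static span_t span_cache_lines(uintptr_t start, size_t count) {
    span_t s;
    uintptr_t end = start + count; // one past the last byte
    s.first_line = start & ~(uintptr_t)63;
    uintptr_t last_line = (end - 1) & ~(uintptr_t)63;
    s.num_lines = (last_line - s.first_line) / 64 + 1;
    size_t head_offset = start - s.first_line; // bytes to skip in line 0
    size_t tail_bytes = end - last_line;       // valid bytes in the last line
    s.head_mask = ~(uint64_t)0 << head_offset; // clear the low `head_offset` bits
    s.tail_mask = tail_bytes == 64             // avoid UB of a 64-bit shift
                      ? ~(uint64_t)0
                      : (((uint64_t)1 << tail_bytes) - 1);
    if (s.num_lines == 1) { // one line: both masks must agree
        s.head_mask &= s.tail_mask;
        s.tail_mask = s.head_mask;
    }
    return s;
}
```

With these masks, the kernel's main loop would issue plain aligned loads for the interior lines and masked aligned loads only for the first and last ones.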

To avoid false positives from memory sanitizers, which would flag the intentional reads past the logical buffer bounds, the affected kernels should be annotated with __attribute__((no_sanitize_address)) and __attribute__((no_sanitize_thread)).
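A minimal sketch of how those attributes might be applied; `sum_line_masked` is a hypothetical function, not SimSIMD code. The spellings below are the GCC-style ones named in this issue; Clang also accepts the string form `__attribute__((no_sanitize("address")))`.

```c
#include <stdint.h>

// Hypothetical example: a function that reads a whole aligned 64-byte cache
// line and discards the bytes excluded by `mask`. Because it may touch bytes
// outside the logical buffer, ASan/TSan instrumentation is suppressed for it.
__attribute__((no_sanitize_address)) //
__attribute__((no_sanitize_thread))  //
static uint32_t sum_line_masked(uint8_t const *line, uint64_t mask) {
    uint32_t sum = 0;
    for (int i = 0; i < 64; ++i)
        if (mask & ((uint64_t)1 << i)) sum += line[i];
    return sum;
}
```

On compilers that lack these attributes they can be hidden behind a feature-test macro, so the annotation degrades to a no-op rather than a build error.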