Up to 200x Faster Dot Products & Similarity Metrics — for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD on AVX2, AVX-512, NEON, SVE, & SVE2 📐
Handling loads and stores with SIMD is tricky: not because of the up-casting, but because of the down-casting at the end of the loop. In AVX2 it's a drag! We leave that for another day and use AVX2 only for the actual math and value clipping. The current variant operates at 15-19 GB/s, as opposed to under 500 MB/s for the serial code.
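As a rough illustration of that split, here is a minimal C sketch, not the actual library kernel: the multiply and the clipping run through AVX2 intrinsics, while the final narrowing store to `i8` stays scalar. The function name `scale_f32_to_i8` and its signature are hypothetical, chosen only for this example.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: scale f32 inputs by `alpha` and down-cast to i8.
 * AVX2 does the math and the clipping to [-128, 127]; the narrowing
 * store is left scalar, mirroring the approach described above. */
void scale_f32_to_i8(float const *in, size_t n, float alpha, int8_t *out) {
    __m256 const alpha_vec = _mm256_set1_ps(alpha);
    __m256 const lo = _mm256_set1_ps(-128.0f);
    __m256 const hi = _mm256_set1_ps(127.0f);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 x = _mm256_loadu_ps(in + i);          /* wide load */
        __m256 y = _mm256_mul_ps(x, alpha_vec);      /* math in AVX2 */
        y = _mm256_max_ps(_mm256_min_ps(y, hi), lo); /* clipping in AVX2 */
        float tmp[8];
        _mm256_storeu_ps(tmp, y);
        for (int j = 0; j < 8; ++j)                  /* scalar down-cast & store */
            out[i + j] = (int8_t)tmp[j];
    }
    for (; i < n; ++i) { /* serial tail */
        float y = in[i] * alpha;
        y = y < -128.0f ? -128.0f : (y > 127.0f ? 127.0f : y);
        out[i] = (int8_t)y;
    }
}
```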