ashvardanian / SimSIMD

Up to 200x Faster Dot Products & Similarity Metrics for Python, Rust, C, JS, and Swift, supporting f64, f32, f16 real & complex, i8, and bit vectors using SIMD for AVX2, AVX-512, NEON, SVE, & SVE2 📐
https://ashvardanian.com/posts/simsimd-faster-scipy/
Apache License 2.0
988 stars, 59 forks

Faster FMA on Haswell #216

Closed by ashvardanian 3 weeks ago

ashvardanian commented 3 weeks ago

Handling loads and stores with SIMD is tricky. The problem isn't the up-casting on the way in, it's the down-casting at the end of the loop. In AVX2 it's a drag! We leave vectorized loads and stores for another day and use AVX2 only for the actual math and value clipping. The current Haswell variant runs at 15-19 GB/s, as opposed to under 500 MB/s for serial code.
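A rough sketch of that split, not the code from this PR: the up-cast from `u8` and the down-cast back to `u8` stay scalar, while the multiply-add and the clipping to [0, 255] run in AVX2 registers. Every `sketch_*` name is hypothetical. The formula `alpha * a * b + beta * c` and the round-half-up step are assumptions about what the `fma_u8` kernel computes, so treat the arithmetic as illustrative. The AVX2 path is guarded to x86-64 with GCC or Clang and falls back to the serial loop elsewhere.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_X86 1
#else
#define SKETCH_X86 0
#endif

/* Serial baseline: up-cast, compute, clip, and down-cast one element at a time.
 * Assumed formula (not confirmed by the issue): alpha * a[i] * b[i] + beta * c[i]. */
static void sketch_fma_u8_serial(uint8_t const *a, uint8_t const *b, uint8_t const *c, size_t n,
                                 float alpha, float beta, uint8_t *out) {
    for (size_t i = 0; i < n; ++i) {
        float x = alpha * (float)a[i] * (float)b[i] + beta * (float)c[i];
        x = x < 0.0f ? 0.0f : (x > 255.0f ? 255.0f : x); /* clip to the u8 range */
        out[i] = (uint8_t)(x + 0.5f);                     /* round half up; x is non-negative here */
    }
}

#if SKETCH_X86
/* AVX2 variant: scalar up-cast and down-cast, vector math and clipping, 8 lanes per step. */
__attribute__((target("avx2,fma")))
static void sketch_fma_u8_avx2(uint8_t const *a, uint8_t const *b, uint8_t const *c, size_t n,
                               float alpha, float beta, uint8_t *out) {
    __m256 const alpha_vec = _mm256_set1_ps(alpha);
    __m256 const beta_vec = _mm256_set1_ps(beta);
    __m256 const lo_vec = _mm256_setzero_ps();
    __m256 const hi_vec = _mm256_set1_ps(255.0f);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        float fa[8], fb[8], fc[8], fr[8];
        for (int k = 0; k < 8; ++k) { /* scalar up-cast: u8 -> f32 */
            fa[k] = (float)a[i + k];
            fb[k] = (float)b[i + k];
            fc[k] = (float)c[i + k];
        }
        __m256 ab = _mm256_mul_ps(_mm256_mul_ps(alpha_vec, _mm256_loadu_ps(fa)), _mm256_loadu_ps(fb));
        __m256 r = _mm256_fmadd_ps(beta_vec, _mm256_loadu_ps(fc), ab); /* beta * c + alpha * a * b */
        r = _mm256_min_ps(_mm256_max_ps(r, lo_vec), hi_vec);           /* clip while still in registers */
        _mm256_storeu_ps(fr, r);
        for (int k = 0; k < 8; ++k) out[i + k] = (uint8_t)(fr[k] + 0.5f); /* scalar down-cast */
    }
    sketch_fma_u8_serial(a + i, b + i, c + i, n - i, alpha, beta, out + i); /* leftover tail */
}
#endif

/* Picks the AVX2 path when the CPU reports AVX2 and FMA, otherwise the serial loop. */
void sketch_fma_u8(uint8_t const *a, uint8_t const *b, uint8_t const *c, size_t n,
                   float alpha, float beta, uint8_t *out) {
#if SKETCH_X86
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")) {
        sketch_fma_u8_avx2(a, b, c, n, alpha, beta, out);
        return;
    }
#endif
    sketch_fma_u8_serial(a, b, c, n, alpha, beta, out);
}
```

Calling `sketch_fma_u8(a, b, c, 1536, alpha, beta, out)` on three 1536-byte inputs corresponds to one "pair" in the benchmark below. The staging buffers keep the casts trivially correct at the cost of extra memory traffic, which is the part a fully vectorized load/store path would eventually remove.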

------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
fma_u8_haswell<1536d>/min_time:10.000/threads:1          248 ns          248 ns     56523758 abs_delta=8.20566 bytes=18.6111G/s pairs=4.03886M/s relative_error=2.16737m
wsum_u8_haswell<1536d>/min_time:10.000/threads:1         197 ns          197 ns     71164289 abs_delta=7.76442 bytes=15.5983G/s pairs=5.07757M/s relative_error=2.86599m
fma_u8_sapphire<1536d>/min_time:10.000/threads:1        70.9 ns         70.9 ns    197581878 abs_delta=9.2812 bytes=64.9908G/s pairs=14.1039M/s relative_error=2.45142m
wsum_u8_sapphire<1536d>/min_time:10.000/threads:1       51.2 ns         51.2 ns    275604255 abs_delta=8.89144 bytes=60.0323G/s pairs=19.5418M/s relative_error=3.28203m
fma_u8_serial<1536d>/min_time:10.000/threads:1          9749 ns         9748 ns      1428411 abs_delta=1.66854 bytes=472.69M/s pairs=102.58k/s relative_error=440.882u
wsum_u8_serial<1536d>/min_time:10.000/threads:1         9455 ns         9455 ns      1488320 abs_delta=2.32787 bytes=324.901M/s pairs=105.762k/s relative_error=859.403u
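
To read the counters above: the `bytes` column appears to be `pairs` per second times the input bytes per call, with three `u8` inputs for `fma` and two for `wsum`. That is an inference from the numbers, not something the benchmark output states. A small sketch of the arithmetic, with hypothetical helper names:

```c
#include <stddef.h>

/* Input bytes read per second, assuming 1-byte (u8) scalars and that only
 * input vectors are counted: an inference from the table, not a documented rule. */
double sketch_bytes_per_second(double pairs_per_second, size_t dimensions, size_t input_vectors) {
    return pairs_per_second * (double)dimensions * (double)input_vectors;
}

/* How many times faster the optimized kernel is, from per-call latencies. */
double sketch_speedup(double baseline_ns, double optimized_ns) {
    return baseline_ns / optimized_ns;
}
```

For `fma_u8_haswell`, 4.03886M pairs/s × 1536 dimensions × 3 inputs gives about 18.61 GB/s, matching the reported `bytes=18.6111G/s`. For `wsum_u8_haswell`, 5.07757M × 1536 × 2 gives about 15.60 GB/s. On latency, `sketch_speedup(9749, 248)` puts the Haswell `fma` kernel at roughly 39x the serial one.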