Faster FMA on Haswell

Handling loads and stores with SIMD is tricky, not because of the up-casting, but because of the down-casting at the end of the loop. In AVX2 it's a drag! We leave that for another day and use AVX2 only for the actual math and value clipping. The current variant runs at 15-19 GB/s (15.6 GB/s for `wsum`, 18.6 GB/s for `fma`), as opposed to under 500 MB/s for the serial code.
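For orientation, here is a minimal scalar sketch of the pipeline described above: up-cast the `u8` inputs to `float`, do the math, clip to the `u8` range, and down-cast on store. It is not the code from this change. The names `fma_u8_sketch` and `clip_to_u8` are hypothetical, and the formula `alpha * a[i] * b[i] + beta * c[i]` is an assumption about what the `fma` kernel computes. In the AVX2 variant, only the middle two steps (math and clipping) would run on 8-lane `float` vectors.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: clip to [0, 255], then round half up while narrowing.
 * The negated comparison also sends NaN to 0 instead of invoking UB on the cast. */
static uint8_t clip_to_u8(float x) {
    if (!(x > 0.0f)) return 0;
    if (x > 255.0f) return 255;
    return (uint8_t)(x + 0.5f);
}

/* Assumed semantics: out[i] = clip(alpha * a[i] * b[i] + beta * c[i]). */
static void fma_u8_sketch(const uint8_t *a, const uint8_t *b, const uint8_t *c,
                          size_t n, float alpha, float beta, uint8_t *out) {
    for (size_t i = 0; i < n; ++i) {
        /* Up-cast: cheap in SIMD as well, a widening load per vector. */
        float ai = (float)a[i], bi = (float)b[i], ci = (float)c[i];
        /* The actual math: the part that maps onto AVX2 FMA instructions. */
        float r = alpha * ai * bi + beta * ci;
        /* Clip, then down-cast: the narrowing store is the awkward part in AVX2. */
        out[i] = clip_to_u8(r);
    }
}
```

With `alpha = 0.5` and `beta = 1`, the inputs `a = 200, b = 2, c = 100` give 300 before clipping and 255 after, which is why the clipping step cannot be skipped.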

------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------
fma_u8_haswell<1536d>/min_time:10.000/threads:1          248 ns          248 ns     56523758 abs_delta=8.20566 bytes=18.6111G/s pairs=4.03886M/s relative_error=2.16737m
wsum_u8_haswell<1536d>/min_time:10.000/threads:1         197 ns          197 ns     71164289 abs_delta=7.76442 bytes=15.5983G/s pairs=5.07757M/s relative_error=2.86599m
fma_u8_sapphire<1536d>/min_time:10.000/threads:1        70.9 ns         70.9 ns    197581878 abs_delta=9.2812 bytes=64.9908G/s pairs=14.1039M/s relative_error=2.45142m
wsum_u8_sapphire<1536d>/min_time:10.000/threads:1       51.2 ns         51.2 ns    275604255 abs_delta=8.89144 bytes=60.0323G/s pairs=19.5418M/s relative_error=3.28203m
fma_u8_serial<1536d>/min_time:10.000/threads:1          9749 ns         9748 ns      1428411 abs_delta=1.66854 bytes=472.69M/s pairs=102.58k/s relative_error=440.882u
wsum_u8_serial<1536d>/min_time:10.000/threads:1         9455 ns         9455 ns      1488320 abs_delta=2.32787 bytes=324.901M/s pairs=105.762k/s relative_error=859.403u

ashvardanian / SimSIMD #216