heflinstephenraj-sa-14411 closed this 8 months ago
If you mean this section, then I've used AVX-512 as described in the post.
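AVX2 code can only process 8 floats at a time (256-bit registers, 32-bit floats), so 1536 dimensions need at least 192 iterations. Even if each iteration took a single CPU cycle, about 0.33 ns on a 3 GHz core, the loop would take at least 64 ns. In reality, each iteration takes several cycles, so 7.30 ns is well below what AVX2 can deliver for this kernel.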
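To make that estimate easy to re-run with other numbers, here is a rough sketch of the same arithmetic. The helper name `lower_bound_ns` and its parameters are made up for illustration and are not part of any library. The one-cycle-per-iteration default is the same optimistic assumption as above, not a measurement.

```python
def lower_bound_ns(dims, register_bits, clock_ghz, cycles_per_iter=1.0, float_bits=32):
    """Optimistic lower bound on the time of one pass over `dims` floats.

    Hypothetical helper: assumes one register-wide step per loop iteration
    and `cycles_per_iter` cycles per iteration (1.0 is a best case).
    """
    lanes = register_bits // float_bits   # floats per SIMD register
    iterations = -(-dims // lanes)        # ceiling division
    ns_per_cycle = 1.0 / clock_ghz        # 3 GHz -> ~0.33 ns per cycle
    return iterations, iterations * cycles_per_iter * ns_per_cycle


# AVX2: 256-bit registers hold 8 floats
avx2_iters, avx2_ns = lower_bound_ns(1536, 256, 3.0)
# AVX-512: 512-bit registers hold 16 floats
avx512_iters, avx512_ns = lower_bound_ns(1536, 512, 3.0)

print(f"AVX2:    {avx2_iters} iterations, at least {avx2_ns:.0f} ns")
print(f"AVX-512: {avx512_iters} iterations, at least {avx512_ns:.0f} ns")
```

Under these assumptions the floor is 64 ns for AVX2 and 32 ns for AVX-512. Raising `cycles_per_iter` to a more realistic value moves both numbers up proportionally.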
Does that answer your question?
Regarding your blog post: which AVX version did you use to reach 118 ns for floats? I ran the same experiment and got 7.30 ns for Time and 350 ns for CPU time in the Google Benchmark report for avx2_f32_cos_1536d. Could you please clarify which of the two values I should look at, and whether your 118 ns is Time or CPU time? @ashvardanian