javiabellan closed 3 months ago
Hi @javiabellan, thanks for the contribution! The `while` is actually identical to a combination of `if` and `goto`, so the performance should be the same.
I am not sure the AVX-512 versions have room for improvement, but for Apple one can try to replace NEON with AMX. Let me know if you can test those 🤗
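To make the `while` versus `if` plus `goto` point concrete, here is a minimal sketch in plain C. The function names `sum_while` and `sum_goto` are hypothetical and exist only for this illustration; they are not part of the library. Both spell out the same control flow, and a compiler will typically lower them to the same compare-and-branch sequence, although that is worth confirming in the generated assembly for a given compiler and flags.

```c
#include <stddef.h>

/* Hypothetical helper: sums n floats with a plain while loop. */
float sum_while(const float *a, size_t n) {
    float s = 0.0f;
    while (n) {          /* one test per iteration */
        s += *a++;
        --n;
    }
    return s;
}

/* Hypothetical helper: the same loop written with if + goto. */
float sum_goto(const float *a, size_t n) {
    float s = 0.0f;
loop:
    if (n) {             /* the same single test per iteration */
        s += *a++;
        --n;
        goto loop;       /* jump back, exactly what while does implicitly */
    }
    return s;
}
```

Comparing the output of `cc -O2 -S` for the two functions is a quick way to check whether they compile to the same loop on a particular toolchain.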
I see. I was thinking about the condition at the assembly level. Maybe checking whether `n` is not zero (`JNZ`) is faster than checking `n < 16`, but I'm not sure about that.
I think the improvement comes from each loop iteration, where the proposed code has one condition (`while(n)`) instead of two `if`s (`if (n < 16)` and `if (n)`). The main idea is to not check at every iteration whether we are on the final tail: we know that a priori, and we avoid ever landing on the tail inside the loop, by modifying `n` with `n -= n_tail`.
The main disadvantage of the proposed code is the extra computation of `int n_tail = n & 15` and `n -= n_tail`, but this is O(1). The code also gets larger, because the `ab_vec = _mm512_fmadd_ps(a_vec, b_vec, ab_vec);` line is duplicated.
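The loop structure described above can be sketched in portable C. This is only an illustration under stated assumptions, not the actual kernel: the function name `dot_tail_split` and the `LANES` macro are hypothetical, and a 16-element scalar accumulator stands in for the 512-bit register so the example runs without AVX-512 hardware. In a real kernel, each inner `for` loop would correspond to one `_mm512_fmadd_ps` call, which is why that line ends up duplicated.

```c
#include <stddef.h>

#define LANES 16  /* number of floats in one 512-bit register */

/* Hypothetical sketch: dot product with the tail length computed up front. */
float dot_tail_split(const float *a, const float *b, size_t n) {
    float acc[LANES] = {0};            /* stands in for ab_vec */
    size_t n_tail = n & (LANES - 1);   /* n & 15: leftover elements, O(1) */
    n -= n_tail;                       /* n is now a multiple of 16 */

    /* Main loop: a single condition per iteration, full blocks only. */
    while (n) {
        for (size_t i = 0; i < LANES; ++i)
            acc[i] += a[i] * b[i];     /* one fused multiply-add per block */
        a += LANES;
        b += LANES;
        n -= LANES;
    }

    /* Tail: handled once, outside the loop (the duplicated multiply-add). */
    for (size_t i = 0; i < n_tail; ++i)
        acc[i] += a[i] * b[i];

    /* Horizontal reduction of the accumulator lanes. */
    float sum = 0.0f;
    for (size_t i = 0; i < LANES; ++i)
        sum += acc[i];
    return sum;
}
```

For example, with `n = 37` the main loop runs twice over full 16-element blocks and the tail covers the remaining 5 elements. Whether this beats the version with the check inside the loop would need to be measured.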
Looking at the AVX-512 dot product, I tried to avoid the `if` inside the loop to make the code faster. Here is an (untested) idea of the proposed code: