Closed ejmahler closed 9 months ago
Nice!
> It's clear that the compiler doesn't automatically fuse mul -> add into an FMA,
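Yes, compilers don't usually fuse a multiply and an add into an FMA on their own. An FMA rounds once instead of twice, so the result is a tiny bit more accurate, but that also means it isn't bit-identical to the separate mul + add, so the compiler can't treat the two as equivalent.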
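To make the "one less rounding" point concrete, here is a minimal sketch in plain Rust using only `f64::mul_add` from the standard library. The function names (`separate`, `fused`, `demo_inputs`) are hypothetical and exist only for this illustration; the inputs are picked so that the exact product has a low bit that gets lost when the product is rounded on its own.

```rust
/// Multiply then add as two separate operations: the product is rounded
/// to f64 first, then the sum is rounded again.
pub fn separate(a: f64, b: f64, c: f64) -> f64 {
    a * b + c
}

/// Fused multiply-add: a * b + c is evaluated exactly and rounded once.
pub fn fused(a: f64, b: f64, c: f64) -> f64 {
    a.mul_add(b, c)
}

/// Hypothetical demo inputs: x = 1 + 2^-30 and c = -(1 + 2^-29).
/// The exact value of x * x is 1 + 2^-29 + 2^-60, and the 2^-60 term
/// does not fit in an f64 next to the leading 1.
pub fn demo_inputs() -> (f64, f64) {
    let eps = 1.0 / (1u64 << 30) as f64; // 2^-30, exactly representable
    let x = 1.0 + eps; // exactly representable
    let c = -(1.0 + 2.0 * eps); // exactly representable
    (x, c)
}
```

With these inputs, `separate(x, x, c)` rounds the product to 1 + 2^-29 and returns exactly 0, while `fused(x, x, c)` keeps the low term and returns 2^-60. Both are "correct" under their own rounding rules, which is why the substitution changes results and isn't done silently.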
Small PR that makes explicit use of FMA in butterfly3.
Mild performance gains: 5-10% for the smallest butterflies, shrinking as an FFT does more non-butterfly3 work. Still a clear win.
It's clear that the compiler doesn't automatically fuse mul -> add into an FMA, and I notice that the NEON prime butterflies don't make any explicit use of FMA, so we stand to gain a lot by rewriting them to use FMA explicitly. That's a bigger task, since it requires updating the automated script, so it's not included here.
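For readers who want to see what "explicit use of FMA" in a size-3 butterfly can look like, here is a scalar sketch. It is not the code from this PR and does not use the library's types or SIMD intrinsics: the `Complex` struct and `butterfly3_fma` function are hypothetical, written only to show where a multiply followed by an add can be replaced with `f64::mul_add`. It assumes a forward transform with twiddle w = exp(-2πi/3).

```rust
/// Minimal complex type so the sketch needs no external crates (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Complex {
    pub re: f64,
    pub im: f64,
}

/// Scalar 3-point DFT with explicit fused multiply-adds (illustrative only).
/// Uses y1 = x0 + w_re * (x1 + x2) + i * w_im * (x1 - x2), and y2 is the
/// same expression with the sign of the last term flipped.
pub fn butterfly3_fma(x0: Complex, x1: Complex, x2: Complex) -> [Complex; 3] {
    // w = exp(-2*pi*i/3) = -1/2 - i*sqrt(3)/2
    let w_re = -0.5_f64;
    let w_im = -(3.0_f64.sqrt()) / 2.0;

    let sum = Complex { re: x1.re + x2.re, im: x1.im + x2.im };
    let diff = Complex { re: x1.re - x2.re, im: x1.im - x2.im };

    let y0 = Complex { re: x0.re + sum.re, im: x0.im + sum.im };

    // temp = x0 + w_re * sum, one FMA per component instead of mul + add.
    let temp = Complex {
        re: sum.re.mul_add(w_re, x0.re),
        im: sum.im.mul_add(w_re, x0.im),
    };

    // i * w_im * diff = (-w_im * diff.im, w_im * diff.re); fold the final
    // add or subtract into the multiply as well.
    let y1 = Complex {
        re: diff.im.mul_add(-w_im, temp.re),
        im: diff.re.mul_add(w_im, temp.im),
    };
    let y2 = Complex {
        re: diff.im.mul_add(w_im, temp.re),
        im: diff.re.mul_add(-w_im, temp.im),
    };

    [y0, y1, y2]
}
```

Each output component here costs one fused operation where the unfused form would need a multiply and a separate add. A SIMD version would follow the same shape with the platform's fused intrinsics in place of `mul_add`.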