Closed SuperFluffy closed 5 years ago
This change makes sense, since it's a free win for the native kernels. The fallbacks need to be updated (as their debug assertions so neatly point out), so they will do some more work, but that's probably good.
For the native kernels I'd pick a benchmark that uses the masked kernel, such as the 127 size or something similar, and benchmark that. For the fallback kernel we can just benchmark generally, using any one of them as representative; why not the layout benchmarks?
Thanks!
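For readers following along, the split of work discussed in this thread can be sketched roughly as below. This is a minimal illustration, not the crate's actual code: the tile sizes `MR` and `NR`, the packed layout, and the function names `kernel_tile` and `masked_writeback` are all hypothetical. The point is only that once the kernel scales its result by `alpha`, the masked loop for a partial edge tile (as occurs for sizes like 127 that are not a multiple of the tile size) needs only an addition per element.

```rust
// Hypothetical tile dimensions; real kernels pick these per architecture.
const MR: usize = 4;
const NR: usize = 4;

/// Stand-in for a microkernel (assumed layout: `a` holds k groups of MR
/// values, `b` holds k groups of NR values). Writes alpha * (A * B) for one
/// full MR x NR tile into the scratch buffer `ab`.
fn kernel_tile(k: usize, alpha: f64, a: &[f64], b: &[f64], ab: &mut [f64; MR * NR]) {
    ab.fill(0.0);
    for p in 0..k {
        for i in 0..MR {
            for j in 0..NR {
                ab[i * NR + j] += a[p * MR + i] * b[p * NR + j];
            }
        }
    }
    // alpha is applied once here, inside the kernel.
    for x in ab.iter_mut() {
        *x *= alpha;
    }
}

/// Masked write-back for an edge tile where only `rows` x `cols` entries
/// are valid. Since `ab` is already scaled by alpha, the loop only scales
/// the existing C entry by beta and adds.
fn masked_writeback(
    ab: &[f64; MR * NR],
    rows: usize,
    cols: usize,
    beta: f64,
    c: &mut [f64],
    c_row_stride: usize,
) {
    for i in 0..rows {
        for j in 0..cols {
            let dst = &mut c[i * c_row_stride + j];
            *dst = beta * *dst + ab[i * NR + j];
        }
    }
}
```

With this arrangement a fallback kernel has to apply `alpha` itself, which is the extra work mentioned above, while the masked path no longer multiplies by `alpha` per written element.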
The multiplication by `alpha` should be performed by the actual kernel. This leaves the masked kernel loop to only do addition when constructing the C matrix.

Looks like there is a reliable gain for `f64` double precision of about 3-4% on my system (`avx` and `fma` enabled) at no cost:

This is for `MMTEST_FEATURE=fallback`: