Closed danieldk closed 4 years ago
There is another interesting problem: the SIMD intrinsics are not inlined. Apparently this is intentional, to avoid inlining in such a way that speculative execution could run such instructions on CPUs that do not support them. I'll change it to force inlining.
Seems on par with the old implementation now. However, from inspecting the generated instructions, I recall the compiler unrolling the loops before, which does not seem to happen now.
I have made this a draft PR, because I haven't checked the performance impact at all. From reading `stdarch` it seems that feature detection is cached, so we wouldn't be using the expensive `cpuid` instruction on every call. But I want to profile and inspect the assembly a bit. If it works out, dynamic feature detection is really nice: we don't have to explicitly compile for e.g. AVX anymore, but can compile without any extra features and AVX would be used if the CPU is capable.
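
For illustration, here is a minimal sketch of what dynamic feature detection can look like. It is not the code from this PR: the function names `dot`, `dot_avx` and `dot_generic` are hypothetical, and the AVX path relies on the compiler auto-vectorising plain Rust rather than on explicit intrinsics. The only assumptions about the standard library are the `is_x86_feature_detected!` macro and the `#[target_feature]` attribute.

```rust
// Generic implementation. It is marked `#[inline(always)]` so that it can be
// inlined into `dot_avx` below and compiled there with AVX enabled.
#[inline(always)]
fn dot_generic(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Hypothetical AVX variant: the same body, but the compiler is allowed to
// emit AVX instructions for this function only. Calling it is unsafe because
// the caller must guarantee that the CPU supports AVX.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    dot_generic(a, b)
}

// Public entry point: picks an implementation at run time, so the crate can
// be compiled without any extra target features.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx") {
            // Safe to call: we just checked that AVX is available.
            return unsafe { dot_avx(a, b) };
        }
    }
    // Fallback for CPUs without AVX and for non-x86 targets.
    dot_generic(a, b)
}
```

A caller would simply use `dot(&x, &y)`; whether the AVX path is taken depends on the CPU the binary runs on, not on the flags it was compiled with.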