Closed by Dawoodoz 9 months ago
Maybe start with 256-bit AVX/AVX2, because AVX2 is already used for gather instructions.
Some low-end laptops have 256-bit AVX (floating-point operations) but not AVX2 (integer operations). Because AVX alone does not provide 256-bit integer operations, it is not useful without AVX2.
Once the simd.h header became bloated, I started generating the emulated implementations using template functions and macros. Might as well add I16x8 and I16x16 types for sound processing then.
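A minimal sketch of how one template plus one macro could generate the emulated fallbacks. The names `EmulatedVector` and `DEFINE_ELEMENTWISE_OPERATOR` are hypothetical and not taken from simd.h, and `I16x8` and `I16x16` here are only stand-ins for the proposed types.

```cpp
#include <cstdint>

// Hypothetical scalar fallback: a fixed number of lanes stored in a plain array.
template <typename T, int LANES>
struct EmulatedVector {
    T lanes[LANES];
};

// Generates one element-wise operator for every lane type and lane count.
// The intermediate result is narrowed back to the lane type T.
#define DEFINE_ELEMENTWISE_OPERATOR(OP) \
    template <typename T, int LANES> \
    EmulatedVector<T, LANES> operator OP(const EmulatedVector<T, LANES> &a, \
                                         const EmulatedVector<T, LANES> &b) { \
        EmulatedVector<T, LANES> result; \
        for (int i = 0; i < LANES; i++) { \
            result.lanes[i] = static_cast<T>(a.lanes[i] OP b.lanes[i]); \
        } \
        return result; \
    }

DEFINE_ELEMENTWISE_OPERATOR(+)
DEFINE_ELEMENTWISE_OPERATOR(-)
DEFINE_ELEMENTWISE_OPERATOR(*)

// Stand-ins for the 16-bit types suggested for sound processing.
using I16x8 = EmulatedVector<int16_t, 8>;
using I16x16 = EmulatedVector<int16_t, 16>;
```

With this approach, adding a new lane type or lane count costs one `using` line instead of a hand-written set of functions, which is the kind of saving that matters once the header grows.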
An experimental implementation with 256-bit SIMD passed regression tests, but it still needs to be used in different parts of the library and documented.
Right now, there is no way to check at runtime if the computer has all the SIMD extensions that the program was built with, because inline assembly is not forward compatible with instruction sets 200 years into the future and is thus not allowed in this library. It would be nice if running the 256-bit AVX2 binary on a computer without AVX2 could refer the user to the 128-bit SSE2 version, or if a launcher could tell the user which features are detected and recommend the right version of the program.
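For illustration only, a launcher along those lines could look roughly like the sketch below. It assumes GCC or Clang on x86, where the `__builtin_cpu_supports` compiler builtin avoids inline assembly. That builtin is a compiler extension and not portable C++, so it may not fit the library's rules. The helper names `recommendBuild` and `detectFeatures` are hypothetical.

```cpp
#include <string>

// Hypothetical helper: picks which prebuilt binary to recommend to the user.
inline std::string recommendBuild(bool hasSSE2, bool hasAVX2) {
    if (hasAVX2) return "AVX2 (256-bit)";
    if (hasSSE2) return "SSE2 (128-bit)";
    return "scalar fallback";
}

// Detection through a compiler builtin instead of inline assembly.
// __builtin_cpu_supports is a GCC/Clang extension for x86 targets only,
// so every other compiler or architecture returns false (unknown) here.
inline bool detectFeatures(bool &hasSSE2, bool &hasAVX2) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    hasSSE2 = __builtin_cpu_supports("sse2") != 0;
    hasAVX2 = __builtin_cpu_supports("avx2") != 0;
    return true;
#else
    hasSSE2 = false;
    hasAVX2 = false;
    return false;
#endif
}
```

The launcher itself would have to be compiled without AVX2 enabled, otherwise it could crash on the very machines it is meant to help.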
A workaround could be to execute an AVX2 operation within try-catch and check if it triggered a crash on the computer (on most systems an unsupported instruction raises an illegal-instruction signal or structured exception rather than a C++ exception, so a platform-specific handler would be needed), but some systems translate the non-existing instructions to run anyway at a slower speed, so profiling a test run with calculations would be needed too.
Either way, having one binary for all extensions automatically would not be possible, because the library provides lots of portable SIMD intrinsics for writing your own filters, not just pre-made filters. A runtime check on each instruction would obviously be slower than not using SIMD at all.
A system for arbitrary length vectors taking advantage of AVX and AVX2 has been designed to be forward compatible with AVX3 (AVX-512). It should then be possible to use U8xX in a program today as U8x16 or U8x32, and automatically get it as U8x64 after recompiling with a future version of the library and AVX3 enabled.
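The idea can be sketched with an emulated type, under the assumption that user code only ever refers to the lane count through the vector type. `U8xN`, `EXAMPLE_VECTOR_BYTES`, `saturatedAdd` and `brighten` are hypothetical names for this example, and `U8xX` is just a stand-in for the library's type.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical emulated vector of unsigned bytes with a compile-time lane count.
template <int LANES>
struct U8xN {
    uint8_t lanes[LANES];
    static constexpr int laneCount = LANES;
};

// Hypothetical build setting: 16 today, 32 with AVX2, 64 with a future 512-bit build.
#ifndef EXAMPLE_VECTOR_BYTES
#define EXAMPLE_VECTOR_BYTES 16
#endif
using U8xX = U8xN<EXAMPLE_VECTOR_BYTES>;

// Element-wise addition that clamps at 255 instead of wrapping around.
template <int LANES>
U8xN<LANES> saturatedAdd(const U8xN<LANES> &a, const U8xN<LANES> &b) {
    U8xN<LANES> result;
    for (int i = 0; i < LANES; i++) {
        int sum = int(a.lanes[i]) + int(b.lanes[i]);
        result.lanes[i] = uint8_t(sum > 255 ? 255 : sum);
    }
    return result;
}

// A filter written once against any vector type V.
// byteCount must be a multiple of V::laneCount, which is what padded buffers provide.
template <typename V>
void brighten(uint8_t *pixels, std::size_t byteCount, uint8_t amount) {
    V offset;
    for (int i = 0; i < V::laneCount; i++) {
        offset.lanes[i] = amount;
    }
    for (std::size_t i = 0; i < byteCount; i += V::laneCount) {
        V value;
        std::memcpy(value.lanes, pixels + i, V::laneCount);
        value = saturatedAdd(value, offset);
        std::memcpy(pixels + i, value.lanes, V::laneCount);
    }
}
```

Calling `brighten<U8xX>(pixels, byteCount, 100)` stays the same source line whether `U8xX` is 16, 32 or 64 lanes wide; only the padding requirement on the buffer changes.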
Porting to AVX3 will however have to wait until affordable computers have access to the extension.
I looked for a new high-end CPU because my desktop broke down from old age a few months ago, but even the worst Core i9 14900K would not have any AVX3 support. Only the server models have partial support for 512-bit SIMD, and those sound like hairdryers and take many minutes to boot, making for an insufferable user experience.
Need to wait until processors that people can actually use have both float and integer 512-bit vector support. 256-bit AVX2 will have to be enough for now.
Someone thought that the library might as well push for maximum performance all the way, by sacrificing some determinism. One can create another SIMD header containing longer vectors that beginners wanting determinism across hardware don't have to use.
Solutions:
Using vectors longer than 128 bits as a fixed-length type across all platforms would risk running out of registers on ARMv7, where 128-bit quad registers are the largest available.
Using variable-length SIMD vectors would make it very difficult for people without access to all platforms to participate in the development, because the same code would behave differently on different platforms. One could however create emulator modes that simulate different fixed SIMD lengths without caring about performance. One could set the length of the default vector to 128, 256, 512, or 1024 bits and have buffers and images aligned accordingly. Another problem is that code for very long vectors often has to be split into a short-vector path and a long-vector path, so that small images get less padding while large images running with 1024-bit SIMD get more.
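To make the padding cost concrete, here is a small sketch of how a row stride could be rounded up for each emulated vector length. `paddedStride` and `paddingBytes` are hypothetical helpers, not functions from the library.

```cpp
#include <cstddef>

// Rounds a row of rowBytes up to a whole number of vectors that are vectorBits wide.
constexpr std::size_t paddedStride(std::size_t rowBytes, std::size_t vectorBits) {
    return ((rowBytes + vectorBits / 8 - 1) / (vectorBits / 8)) * (vectorBits / 8);
}

// Bytes per row that exist only to satisfy the alignment.
constexpr std::size_t paddingBytes(std::size_t rowBytes, std::size_t vectorBits) {
    return paddedStride(rowBytes, vectorBits) - rowBytes;
}
```

A 10-pixel-wide RGBA row is 40 bytes. Its stride becomes 48 bytes at 128 bits, 64 bytes at 256 and 512 bits, and 128 bytes at 1024 bits, so the widest setting spends 88 bytes per row on padding. A 1920-pixel-wide RGBA row (7680 bytes) needs no padding at any of these lengths, which is why small images are the ones that suffer.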
Another problem is that 512-bit SIMD is only supported on the more expensive processor models, so compilers don't enable this feature by default. One would have to manually compile different versions, just like when enabling AVX2 for faster texture sampling.
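As a rough illustration of what compiling different versions means in practice: compilers define preprocessor macros according to the enabled extensions (for example `-mavx2` or `-mavx512f` on GCC and Clang), so each build can report which one it is. The function name `compiledSimdLevel` is hypothetical.

```cpp
// Reports the widest SIMD extension this particular binary was compiled for.
// The macros are predefined by the compiler depending on the build flags.
inline const char *compiledSimdLevel() {
#if defined(__AVX512F__)
    return "AVX-512 (512-bit)";
#elif defined(__AVX2__)
    return "AVX2 (256-bit)";
#elif defined(__SSE2__) || defined(_M_X64)
    return "SSE2 (128-bit)";
#elif defined(__ARM_NEON)
    return "NEON (128-bit)";
#else
    return "scalar";
#endif
}
```

This only describes how the binary was built, not what the CPU running it supports.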