Dawoodoz / DFPSR

Fast realtime software rendering library for C++14 using SSE/AVX/NEON. 2D, 3D and isometric rendering with minimal system dependencies.
https://dawoodoz.com/dfpsr.html

Should AVX-3 (a.k.a. AVX-512) be used? #56

Closed by Dawoodoz 9 months ago

Dawoodoz commented 1 year ago

Someone thought that the library might as well push for maximum performance all the way, by sacrificing some determinism. One can create another SIMD header containing longer vectors that beginners wanting determinism across hardware don't have to use.

Solutions:

Another problem is that 512-bit SIMD is only supported on the more expensive processor models, so compilers don't enable this feature by default. One would have to manually compile different versions, just like when enabling AVX2 for faster texture sampling.
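As a rough illustration of what "manually compile different versions" means in practice, the sketch below reads the predefined compiler macros that GCC, Clang and MSVC set when an extension is enabled at build time (for example `-mavx2` or `-mavx512f` on GCC and Clang). The helper names `compiledSimdBits` and `compiledSimdName` are made up for this example and are not part of DFPSR.

```cpp
// Hypothetical helpers, not part of DFPSR.
// Reports the widest SIMD width that this particular build was compiled for.
// The result is fixed at compile time; it says nothing about the CPU the
// binary later runs on.
constexpr int compiledSimdBits() {
#if defined(__AVX512F__)
    return 512; // built with AVX-512 foundation enabled
#elif defined(__AVX2__)
    return 256; // built with AVX2 enabled
#elif defined(__SSE2__) || defined(_M_X64) || defined(__ARM_NEON)
    return 128; // baseline 128-bit SIMD (SSE2 or NEON)
#else
    return 0;   // no SIMD extension detected, scalar fallback
#endif
}

// Human-readable name for the same compile-time selection.
inline const char *compiledSimdName() {
    return compiledSimdBits() == 512 ? "512-bit"
         : compiledSimdBits() == 256 ? "256-bit"
         : compiledSimdBits() == 128 ? "128-bit"
         : "scalar";
}
```

Building the same source once per target (for example with and without `-mavx2`) would then give one binary per extension level, each reporting a different width.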

Dawoodoz commented 1 year ago

Maybe start with 256-bit AVX/AVX2, because AVX2 is already used for gather instructions.

Dawoodoz commented 1 year ago

Low-end laptops have 256-bit AVX (floating-point operations) but not AVX2 (integer operations). Because AVX does not handle integers, it is not useful without AVX2.

Dawoodoz commented 1 year ago

Once the simd.h header became bloated, I started generating the emulated implementations using template functions and macros. Might as well add I16x8 and I16x16 types for sound processing then.
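A minimal sketch of how emulated vector types could be generated with template functions and macros, assuming a plain array per vector and one loop per operation. The names `EmulatedVector`, `EMULATED_BINARY_OP` and `splat` are invented for this example, and the `I16x8` and `I16x16` aliases here are stand-ins; none of this is the actual content of simd.h.

```cpp
#include <cstdint>

// Hypothetical emulated vector: N lanes of type T stored in a plain array.
template <typename T, int N>
struct EmulatedVector {
    T lanes[N];
};

// Fills every lane with the same value.
template <typename T, int N>
inline EmulatedVector<T, N> splat(T value) {
    EmulatedVector<T, N> result;
    for (int i = 0; i < N; i++) { result.lanes[i] = value; }
    return result;
}

// One macro generates a lane-by-lane operator for every element type and length,
// so adding a new vector type needs no extra hand-written code.
#define EMULATED_BINARY_OP(OP) \
template <typename T, int N> \
inline EmulatedVector<T, N> operator OP (const EmulatedVector<T, N> &a, const EmulatedVector<T, N> &b) { \
    EmulatedVector<T, N> result; \
    for (int i = 0; i < N; i++) { \
        result.lanes[i] = static_cast<T>(a.lanes[i] OP b.lanes[i]); \
    } \
    return result; \
}

EMULATED_BINARY_OP(+)
EMULATED_BINARY_OP(-)
EMULATED_BINARY_OP(*)
#undef EMULATED_BINARY_OP

// Stand-in aliases for 16-bit integer vectors, for example for sound samples.
using I16x8 = EmulatedVector<std::int16_t, 8>;
using I16x16 = EmulatedVector<std::int16_t, 16>;
```

With this in place, `splat<std::int16_t, 8>(3) * splat<std::int16_t, 8>(4)` gives an `I16x8` with 12 in every lane. A hardware-accelerated build would replace the loops with intrinsics while keeping the same interface.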

Dawoodoz commented 1 year ago

An experimental implementation with 256-bit SIMD passed regression tests, but it still needs to be used in different parts of the library and documented.

Dawoodoz commented 1 year ago

Right now, there is no way to check at runtime whether the computer has all the SIMD extensions that the program was built with, because inline assembly is not forward compatible with instruction sets 200 years into the future and is therefore not allowed in this library. It would be nice if running the 256-bit AVX2 binary on a computer without AVX2 could refer the user to the 128-bit SSE2 version, or if a launcher could tell the user which features are detected and recommend the right version of the program.

A workaround could be to execute an AVX2 operation within try-catch and check if it triggered a crash on the computer, but some systems translate the non-existing instructions to run anyway at a slower speed, so profiling a test run with calculations would be needed too.

Either way, a single binary that automatically covers all extensions would not be possible, because the library provides many portable SIMD intrinsics for writing your own filters, not just pre-made filters. Having a runtime check on each assembler instruction would obviously be slower than not using SIMD at all.
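For the launcher idea, one possible direction on GCC and Clang targeting x86 is the compiler builtin `__builtin_cpu_supports`, which queries the CPU without hand-written inline assembly. This is only a sketch under that assumption: the builtin is compiler-specific, is unavailable on other compilers and architectures, and whether it fits the library's portability rules is an open question. The function name `recommendedBuild` and the returned strings are made up for this example. The launcher itself would have to be compiled without AVX2 enabled so that it can start on any machine.

```cpp
#include <string>

// Hypothetical launcher helper, not part of DFPSR.
// Returns which build of the program to recommend on the current machine.
inline std::string recommendedBuild() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    // Compiler builtins provided by GCC and Clang on x86.
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) { return "avx2"; }
    if (__builtin_cpu_supports("sse2")) { return "sse2"; }
    return "scalar";
#else
    // No detection available with this compiler or architecture.
    return "unknown";
#endif
}
```

A launcher could then start the matching executable, or print the result together with a recommendation. The check happens once at startup, so it adds no cost to the SIMD code itself.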

Dawoodoz commented 1 year ago

A system for arbitrary-length vectors taking advantage of AVX and AVX2 has been designed to be forward compatible with AVX3. A program should then be able to use U8xX today as U8x16 or U8x32, and automatically get it as U8x64 after recompiling with a future version of the library and AVX3 enabled.

Porting to AVX3 will however have to wait until affordable computers have access to the extension.
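The arbitrary-length idea can be sketched roughly as follows, assuming the width is chosen by a compile-time alias and the filter is written once against that alias. The array-based `U8Vec` type, the `laneCount` constant and the helper functions are invented for this example and only emulate what real vector types would do; the choice of `__AVX512BW__` as the condition for 64 lanes is also an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical emulated stand-in for fixed-width byte vectors.
template <int N>
struct U8Vec {
    static constexpr int laneCount = N;
    std::uint8_t lanes[N];
};
using U8x16 = U8Vec<16>;
using U8x32 = U8Vec<32>;
using U8x64 = U8Vec<64>;

// The X-length alias follows whatever the build enables.
#if defined(__AVX512BW__)
using U8xX = U8x64;
#elif defined(__AVX2__)
using U8xX = U8x32;
#else
using U8xX = U8x16;
#endif

inline U8xX loadU8(const std::uint8_t *source) {
    U8xX v;
    for (int i = 0; i < U8xX::laneCount; i++) { v.lanes[i] = source[i]; }
    return v;
}

inline void storeU8(std::uint8_t *target, const U8xX &v) {
    for (int i = 0; i < U8xX::laneCount; i++) { target[i] = v.lanes[i]; }
}

// Lane-wise addition clamped to 255.
inline U8xX saturatedAdd(const U8xX &a, const U8xX &b) {
    U8xX r;
    for (int i = 0; i < U8xX::laneCount; i++) {
        int sum = int(a.lanes[i]) + int(b.lanes[i]);
        r.lanes[i] = static_cast<std::uint8_t>(sum > 255 ? 255 : sum);
    }
    return r;
}

// Written once against U8xX, so the same source works for 16, 32 or 64 lanes.
inline void addBuffers(const std::uint8_t *a, const std::uint8_t *b, std::uint8_t *out, std::size_t length) {
    constexpr std::size_t step = U8xX::laneCount;
    std::size_t i = 0;
    for (; i + step <= length; i += step) {
        storeU8(out + i, saturatedAdd(loadU8(a + i), loadU8(b + i)));
    }
    // Remaining elements that do not fill a whole vector are handled one by one.
    for (; i < length; i++) {
        int sum = int(a[i]) + int(b[i]);
        out[i] = static_cast<std::uint8_t>(sum > 255 ? 255 : sum);
    }
}
```

Because `addBuffers` never mentions a concrete lane count, recompiling with a wider extension enabled changes only the alias, not the filter code.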

Dawoodoz commented 9 months ago

Looked for a new high-end CPU because my desktop broke down from old age a few months ago, but even the worst Core i9 14900K would not have any AVX-3 support. Only the server models have partial support for 512-bit SIMD, and those sound like hairdryers and take many minutes to boot, making an insufferable user experience.

Need to wait until processors that people can actually use have both float and integer 512-bit vector support. 256-bit AVX2 will have to be enough for now.