Add gatherf_byte_inds for byte indices from memory

rygorous commented 3 weeks ago

All the gathers in the codebase pass a vint for indices that has just been initialized from an array of uint8_ts in memory.

This is significant because for the NEON/SSE emulation paths, there is no native gather instruction to begin with and the first step is to get the indices back to the integer pipe and split them into individual pieces. In this case it is definitely better to just load the indices on the int pipes to begin with; this formulation facilitates that. (Needs to be a template because unlike the original gatherf, there is no vint argument that implies the vector width for overload resolution.)
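To illustrate why the new entry point has to be a template, here is a minimal portable sketch. The `vfloat4` and `vfloat8` structs are hypothetical stand-ins for the library's SIMD wrappers, and the loop bodies are plain scalar code rather than the actual NEON/SSE implementations. The arguments `(const float*, const uint8_t*)` are the same for every vector width, so the caller has to name the result type.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the library's SIMD float types (assumption:
// the real types wrap a hardware register, not a plain array).
struct vfloat4 { float lane[4]; };
struct vfloat8 { float lane[8]; };

// Only declared: the vector width cannot be deduced from the arguments,
// so each width gets an explicit specialization chosen by the caller.
template<typename T>
T gatherf_byte_inds(const float* base, const uint8_t* indices);

template<>
inline vfloat4 gatherf_byte_inds<vfloat4>(const float* base, const uint8_t* indices)
{
    assert(base != nullptr && indices != nullptr);
    vfloat4 r;
    for (int i = 0; i < 4; i++)
    {
        // The byte index is read as a scalar and used directly as an
        // address offset; it never passes through a vector register.
        r.lane[i] = base[indices[i]];
    }
    return r;
}

template<>
inline vfloat8 gatherf_byte_inds<vfloat8>(const float* base, const uint8_t* indices)
{
    assert(base != nullptr && indices != nullptr);
    vfloat8 r;
    for (int i = 0; i < 8; i++)
    {
        r.lane[i] = base[indices[i]];
    }
    return r;
}
```

A call site would then read `gatherf_byte_inds<vfloat4>(table, idx)`, with the width spelled out explicitly.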

Additionally, the gathers in this codebase don't actually make use of predication (the predicates are always all on). That means we have a subset of gather functionality that is fairly easy to emulate manually: indices are readily available on the integer pipes, and no predication, so all we need to do is perform a known number of single-element loads and assemble the result.

Therefore, provide an option to avoid gather instructions even on AVX2 where they do exist. Gather performance is middling on newer Intel uArchs and outright bad on older (pre-Skylake) P-core Intel uArchs, Intel's E-cores, and AMD's offerings. At least on my home Zen 4, doing the 8 broadcasts + shuffles is much faster than using the native gather instructions, to the tune of a ~13.5% reduction in total coding time.
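As a rough sketch of what such an option could look like, the function below picks between the native gather and a manual assembly at compile time. The macro `EXAMPLE_USE_NATIVE_GATHER` and the function `gather8_byte_inds` are hypothetical names invented for this example, not the identifiers used in the PR. The manual path does one broadcast load per index and merges lanes with blends, which is one possible way to do the assembly; the PR's exact instruction sequence may differ. A plain scalar loop is included so the sketch still builds on targets without AVX.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical build switch for this sketch only:
// 1 = use the hardware gather instruction, 0 = assemble the result manually.
#ifndef EXAMPLE_USE_NATIVE_GATHER
#define EXAMPLE_USE_NATIVE_GATHER 0
#endif

// Writes base[indices[i]] to out[i] for i in 0..7. No predication.
inline void gather8_byte_inds(const float* base, const uint8_t* indices, float* out)
{
    assert(base != nullptr && indices != nullptr && out != nullptr);
#if defined(__AVX2__) && EXAMPLE_USE_NATIVE_GATHER
    // Zero-extend the 8 byte indices to 32-bit lanes, then issue the gather.
    __m128i bytes = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(indices));
    __m256i inds = _mm256_cvtepu8_epi32(bytes);
    __m256 r = _mm256_i32gather_ps(base, inds, 4);
    _mm256_storeu_ps(out, r);
#elif defined(__AVX__)
    // One broadcast load per index; blend mask (1 << i) takes only lane i
    // from the freshly loaded value and keeps the other lanes unchanged.
    __m256 r = _mm256_broadcast_ss(base + indices[0]);
    r = _mm256_blend_ps(r, _mm256_broadcast_ss(base + indices[1]), 1 << 1);
    r = _mm256_blend_ps(r, _mm256_broadcast_ss(base + indices[2]), 1 << 2);
    r = _mm256_blend_ps(r, _mm256_broadcast_ss(base + indices[3]), 1 << 3);
    r = _mm256_blend_ps(r, _mm256_broadcast_ss(base + indices[4]), 1 << 4);
    r = _mm256_blend_ps(r, _mm256_broadcast_ss(base + indices[5]), 1 << 5);
    r = _mm256_blend_ps(r, _mm256_broadcast_ss(base + indices[6]), 1 << 6);
    r = _mm256_blend_ps(r, _mm256_broadcast_ss(base + indices[7]), 1 << 7);
    _mm256_storeu_ps(out, r);
#else
    // Portable fallback so the sketch compiles without AVX support.
    for (int i = 0; i < 8; i++)
    {
        out[i] = base[indices[i]];
    }
#endif
}
```

Making the choice a compile-time setting fits the measurements in this thread, since which path is faster depends on the target microarchitecture.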

Test results (compiled with MSVC 2022):

On Intel Skylake-X, using the manual gathers is appreciably slower than the native gather instructions (+6% coding time in my tests).
On AMD Zen 2 and Zen 4, avoiding gathers is much faster (as noted above, 13.5% reduction on Zen 4).
On Intel Redwood Cove and Intel Crestmont, avoiding gathers comes out around 3-4% faster, depending on the test.

solidpixel commented 3 weeks ago

On my home machine (Intel i5-6500K, CoffeeLake):

SSE4.1 - 3-4% faster by avoiding the byte-to-int conversion.
NoGather AVX2 - comes in around 6% slower (tested with both Clang 14 and GCC 11)

solidpixel commented 3 weeks ago

On my laptop (Apple M1)

NEON - 2% faster by avoiding the byte-to-int conversion.

ARM-software / astc-encoder

Add gatherf_byte_inds for byte indices from memory #511