ARM-software / astc-encoder

The Arm ASTC Encoder, a compressor for the Adaptive Scalable Texture Compression data format.
https://developer.arm.com/graphics
Apache License 2.0
1.08k stars 241 forks

Add gatherf_byte_inds for byte indices from memory #511

rygorous closed 3 weeks ago

rygorous commented 3 weeks ago

All the gathers in the codebase pass a vint for indices that has just been initialized from an array of uint8_ts in memory.

This is significant because the NEON/SSE emulation paths have no native gather instruction to begin with, so the first step is to move the indices back to the integer pipe and split them into individual scalars. In that case it is definitely better to just load the indices on the integer pipe in the first place; this formulation facilitates that. (It needs to be a template because, unlike the original gatherf, there is no vint argument that implies the vector width for overload resolution.)

Additionally, the gathers in this codebase don't actually make use of predication (the predicates are always all on). That means we have a subset of gather functionality that is fairly easy to emulate manually: indices are readily available on the integer pipes, and no predication, so all we need to do is perform a known number of vector loads and assemble the result.
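For readers skimming the thread, here is a minimal portable sketch of the idea, not the code in this PR. The types `vfloat4_sketch` and `vfloat8_sketch` and the function `gatherf_byte_inds_sketch` are hypothetical stand-ins I made up for illustration; the real library uses its own SIMD wrapper classes and intrinsics. It shows the two points above: the vector type has to be given as an explicit template argument because neither parameter implies the width, and with byte indices already in memory and no predication, the gather reduces to a fixed number of loads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 4-wide float vector (not the library's class).
struct vfloat4_sketch
{
    static constexpr std::size_t lane_count = 4;
    float lane[lane_count];
};

// Hypothetical stand-in for an 8-wide float vector (not the library's class).
struct vfloat8_sketch
{
    static constexpr std::size_t lane_count = 8;
    float lane[lane_count];
};

// Unpredicated gather with byte indices read straight from memory.
// VecT must be named at the call site: neither argument carries the
// vector width, so overload resolution alone cannot pick a version.
template <typename VecT>
VecT gatherf_byte_inds_sketch(const float* base, const std::uint8_t* indices)
{
    assert(base != nullptr && indices != nullptr);

    VecT result{};
    // A known, fixed number of scalar loads; every lane is always active.
    for (std::size_t i = 0; i < VecT::lane_count; ++i)
    {
        result.lane[i] = base[indices[i]];
    }
    return result;
}
```

A call would look like `gatherf_byte_inds_sketch<vfloat8_sketch>(table, idx)`, where `idx` points at at least eight `uint8_t` values. A real SIMD path would replace the loop with per-lane loads assembled into a register, but the contract is the same.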

Therefore, provide an option to avoid gather instructions even on AVX2 where they do exist. Gather performance is middling on newer Intel uArchs and outright bad on older (pre-Skylake) P-core Intel uArchs, Intel's E-cores, and AMD's offerings. At least on my home Zen 4, doing the 8 broadcasts + shuffles is much faster than using the native gather instructions, to the tune of a ~13.5% reduction in total coding time.

Test results (using MSVC 2022 as the compiler):

solidpixel commented 3 weeks ago

On my home machine (Intel i5-6500K, CoffeeLake):

solidpixel commented 3 weeks ago

On my laptop (Apple M1)