Byte shuffle / table lookup operations

Maratyszcza commented 7 years ago

All modern SIMD instruction sets provide an instruction to shuffle bytes of one of more SIMD registers using indices from another SIMD register. Such instructions can be used to do parallel table lookup in a small table that fits into few SIMD registers. The instructions are:

PSHUFB xmm, xmm/mem on x86 with SSSE3 (according to Steam hardware survey supported on 97.32% of machines). Parallel table lookup in a 16-entry table. Due to special handling of negative indices, it is easy to extend this operation to larger tables.
VPPERM xmm, xmm, xmm, xmm/mem on x86 with AMD XOP (supported on AMD Bulldozer-Piledriver-Streamroller-Excavator, but not on AMD Zen). Due to 2 source registers, does parallel table lookup in a 32-entry table.
VTBL.8 Dd, {reg-list}, Dm on ARM NEON, where reg-list is a list of 1, 2, 3, or 4 D (64-bit) registers. Depending on the number of registers in the list, this instruction does table lookup in a table of 8, 16, 24, or 32 entries. Complementary instruction VTBX makes it easy to extend table lookup to more than 32 entries.
TBL Vd.<T>, {reg-list}, Vm.<T> on ARM64, where reg-list is a list of 1, 2, 3, or 4 SIMD (128-bit) registers. This instruction enables parallel table lookup in a table of 16, 32, 48, or 64 entries, and complementary instruction TBX extends the operation to larger tables.
VPERM vrT, vrA, vrB, vrC on POWER/PowerPC with Altivec. Performs parallel table lookup in a 32-entry table.

I suggest WASM SIMD should expose the following 5 instructions:

i8x16.byteshuffle(idx: v128, table: v128) -> v128 - shuffle a byte vector using elements of another byte vector as indices
i8x16.byteshuffle(idx: v128, tablelo: v128, tablehi: v128) -> v128 - shuffle two concatenated byte vectors using elements of a byte vector as indices
i8x16.lookup(idx: v128, entry0: ImmByte, ..., entry15: ImmByte) -> v128 - table lookup in a hard-coded table of 16 byte elements.
i8x16.lookup(idx: v128, entry0: ImmByte, ..., entry31: ImmByte) -> v128 - table lookup in a hard-coded table of 32 byte elements.
i8x16.lookup(idx: v128, entry0: ImmByte, ..., entry63: ImmByte) -> v128 - table lookup in a hard-coded table of 64 byte elements.

Separate versions with hard-coded table elements are provided for efficiency: depending on architecture, modifying table entries may enable more efficient implementation. E.g. implementation of a 32-byte shift via PSHUFB instruction would find beneficial to XOR the second 16 elements with the first 16 elements.

rrwinterton commented 6 years ago

I would agree with the capability to do shuffles. The pshufb is very useful in ARGB and YUV conversions as well as matrix manipulation. If I understand the proposal the table data and index is your mask? Could the first two suggestions be made into 1 API? I guess there isn't a p residence for memory or register operations. Perms are nice but can be expensive. Nice for reverse lookups so agree good again to use. I my case I believe i used this for a hamming code but I think I need to go back and look but that is a use case. I probably wouldn't call it a lookup in the API because there are other uses but agree I like the instruction selection.

lemaitre commented 6 years ago

Hi, I just created a merge request on shuffling and permutations, and I think there is some overlap: #30 Here are the overlaps (might need some argument reordering):

i8x16.byteshuffle(idx: v128, table: v128) -> v128 == v8x16.permuteVar(a: v128, s: v128) -> v128
i8x16.byteshuffle(idx: v128, tablelo: v128, tablehi: v128) -> v128 == v8x16.shuffleVar(a: v128, b: v128, s: v128) -> v128

zeux commented 5 years ago

FWIW something along the lines of i8x16.byteshuffle is a feature I'd love to see in wasm.

As mentioned in the OP, it is supported by SSSE3 on x86 side via pshufb, and by NEON on ARM side via vtbl.

My library, https://github.com/zeux/meshoptimizer, provides a fast decoder for index & vertex data. It's currently ~5x slower with wasm (when compiled with Emscripten) compared to the native C++ version, and most of this difference seems to be due to the lack of SIMD, see https://github.com/zeux/meshoptimizer/commit/e9120f06467ece4b7dc816b8cf95194b151d623e. This decoder requires variable byte permute - it's not really feasible to emulate this since it scalarizes most of the work required to decode. In general, pshufb is critical for any sort of accelerated decompression.

zeux commented 5 years ago

In the linked PR, @tlively and @dtig requested more precise benchmark numbers that are more specific to byte shuffles. To that end, I've implemented an SSE2 fallback for the decoder (see https://github.com/zeux/meshoptimizer/commit/bbeafeb14b37d7b23b281251a82cd350c9387af3) - the only instruction the decoder used that wasn't in SSE2 was pshufb. I'd expect this version to map reasonably well to wasm-without-dynamic shuffle.

The performance numbers are as follows - they are all measured on gcc 7.3, x64, i7-8700K:

config=release-avx: 2.67 GB/s vertex, 2.57 GB/s vertex oct
config=release-ssse3: 2.62 GB/s vertex, 2.53 GB/s vertex oct
config=release-sse2: 1.58 GB/s vertex, 1.42 GB/s vertex oct
config=release: 0.88 GB/s vertex, 0.84 GB/s vertex oct

You can reproduce these numbers by checking out https://github.com/zeux/meshoptimizer/tree/wasm-simd and running something like

make config=release-ssse3 dev files=data/buddha.obj

(buddha.obj is a standard Buddha model that you can download from, for example, https://casual-effects.com/data/)

The difference between "vertex" and "vertex oct" is that "vertex" compresses an array of 16-byte vertex structures whereas "vertex oct" compresses an array of 12-byte vertex structures. It's the same code, but the data is different and it affects the use of pshufb/emulated pshufb so I'm including both for completeness.

You can see that the scalar version is ~3x slower than the AVX version; the AVX version is a tiny bit faster than SSSE3 version - I believe this is just due to better register allocation because of a use of 3-operand instructions, I'm including this because I'd expect browsers to use 3-operand instructions - SSSE3 is the baseline version with pshufb, and SSE2 replaces _mm_shuffle_epi8 with a partial scalar emulation path that only handles the table lookup.

The emulated pshufb path results in SSE2 version being 1.65x and 1.78x slower than SSSE3 on the two different data sets.

Note that the algorithm only uses pshufb in two places in the midst of many other SSE instructions, so I'd expect this to be far from the worst practical case, but even then the performance loss is substantial. I'm not sure how WASM will compare - baseline scalar implementation is currently substantially slower in WASM vs C; if memory accesses are ranged-checked against WASM heap, for example, it's possible that the emulation will be comparatively slower vs a true vector instruction.

Happy to answer any questions about the workload or to do more testing.

AndrewScheidecker commented 5 years ago

@zeux It looks like your use could be mapped to something like the AVX-512 vpcompressb instruction. Is that correct?

Once you do a shuffle with a mask read from memory, it's infeasible for WASM backends to pattern match that dynamic shuffle to more specialized instructions on different targets. I do think WASM needs a dynamic shuffle, but it seems like it would be worthwhile having specialized instructions for things like compress/expand.

(there's also some past discussion of this in #8 that I didn't see referenced in the more recent issues)

gnzlbg commented 5 years ago

but it seems like it would be worthwhile having specialized instructions for things like compress/expand.

FWIW LLVM has portable support for these: https://llvm.org/docs/LangRef.html#masked-vector-expanding-load-and-compressing-store-intrinsics , and they are in the pipeline for Rust's packed SIMD module.

zeux commented 5 years ago

@AndrewScheidecker Yeah this is correct I think. AVX-512 isn't a realistic SIMD target for broad use at this point, and I don't think this is available in other instructions sets - so it feels that here the correct thing to do is to expose dynamic shuffle since that has wide-spread support (and the uses are more broad than expanding loads). I'm not sure how well the compiler would be able to emulate expanding load without relying on a similar look up table which I haven't seen compilers synthesize before.

AndrewScheidecker commented 5 years ago

AVX-512 isn't a realistic SIMD target for broad use at this point, and I don't think this is available in other instructions sets

And the byte-wise compress/expand are in AVX-512 VBMI2, which isn't even in a shipping CPU yet.

so it feels that here the correct thing to do is to expose dynamic shuffle since that has wide-spread support (and the uses are more broad than expanding loads)

I think compress/expand and other specialized shuffles are no substitute for a dynamic shuffle instruction, but there are advantages to adding instructions that directly express patterns that would otherwise be a table+shuffle, even if it just turns into a table+shuffle on most architectures: it allows applications to transparently support future hardware extensions, which turn accelerates development and adoption of those extensions. The risk is that if WASM adds instructions that don't map well to future hardware, then they become redundant when new instructions that do are added.

I'm not sure how well the compiler would be able to emulate expanding load without relying on a similar look up table which I haven't seen compilers synthesize before.

It's not too hard for a WASM backend to translate a compress or expand instruction to a table+shuffle, but going the other direction isn't feasible. That said, LLVM does not yet correctly lower these portable compress/expand intrinsics on anything other than -mcpu=icelake-client or -mcpu=icelake-server: https://godbolt.org/z/mp7Z3V vs https://godbolt.org/z/eUq2Qn.

zeux commented 5 years ago

It's not too hard for a WASM backend to translate a compress or expand instruction to a table+shuffle, but going the other direction isn't feasible.

Sure - I agree with that; I was writing this from the perspective of LLVM that I've never seen generate large lookup tables for emulating a single instruction. It feels like the pragmatic thing to do now is to expose dynamic shuffles; if and when expand/compress becomes available and widespread, compress/expand instruction could be considered separately.

dtig commented 5 years ago

Is there an equivalent of the compress/expand shuffles on ARM as well?

gnzlbg commented 5 years ago

Is there an equivalent of the compress/expand shuffles on ARM as well?

In ARM SVE you would execute COMPACT on a vector with a mask to move the selected elements to the front (or the back?), and then just write the front to memory. In ARM NEON I don't know. If you have a run-time mask you have to go from the mask to a vector of shuffle indices, and then do a dynamic shuffle. If the mask is small (e.g. for a 4 element vector), you might be able to just do a table look up to get it, but I never implemented this myself (it makes more sense to just work with shuffle indices directly). On Aarch64 you can have really big tables for the lookups, but I don't recall how big they are.

WebAssembly / simd

Byte shuffle / table lookup operations #24