WebAssembly / simd

Branch of the spec repo scoped to discussion of SIMD in WebAssembly

Load/store interleaved instructions #119

Open AndrewScheidecker opened 4 years ago

AndrewScheidecker commented 4 years ago

https://github.com/WebAssembly/simd/issues/118 was started in response to another issue concerning the performance of loading deinterleaved data. The code in question is a port of some OpenCV SSE code to WASM SIMD: it loads 16 interleaved RGB pixels and deinterleaves them into 16 Rs, 16 Gs, and 16 Bs.

This is an instance of a common pattern, and the way to do it optimally is very ISA-specific: the OpenCV code they were porting has not just SSE2 and Neon variants, but also SSSE3 and SSE4.1 variants! The corresponding Neon code maps directly to Neon's 3-way interleaved load instruction: ld3.

Given that this kind of interleaved load is very common, and the optimal code for it is very ISA and ISA-extension specific, I think we should consider adding instructions to WASM SIMD to allow runtimes to generate code that is optimized for the specific target that the code is running on.

These instructions are direct translations of the ARM Neon interleaved load instructions:

v8x16.load_interleaved_2 <memarg> : (i32) -> (v128, v128)
v8x16.load_interleaved_3 <memarg> : (i32) -> (v128, v128, v128)
v8x16.load_interleaved_4 <memarg> : (i32) -> (v128, v128, v128, v128)
v16x8.load_interleaved_2 <memarg> : (i32) -> (v128, v128)
v16x8.load_interleaved_3 <memarg> : (i32) -> (v128, v128, v128)
v16x8.load_interleaved_4 <memarg> : (i32) -> (v128, v128, v128, v128)
v32x4.load_interleaved_2 <memarg> : (i32) -> (v128, v128)
v32x4.load_interleaved_3 <memarg> : (i32) -> (v128, v128, v128)
v32x4.load_interleaved_4 <memarg> : (i32) -> (v128, v128, v128, v128)
v64x2.load_interleaved_2 <memarg> : (i32) -> (v128, v128)
v64x2.load_interleaved_3 <memarg> : (i32) -> (v128, v128, v128)
v64x2.load_interleaved_4 <memarg> : (i32) -> (v128, v128, v128, v128)

AxB.load_interleaved_C loads B*C interleaved A-bit elements from contiguous memory at the given address, and deinterleaves them into C vectors of shape AxB. Pseudo-code (here A is the element type):

template<typename A, int B, int C>
void load_interleaved(const A mem[B*C], A result[C][B]) {
  for(int i = 0; i < B; ++i) {
    for(int j = 0; j < C; ++j) {
      result[j][i] = mem[i * C + j];
    }
  }
}
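
To make the pseudo-code concrete, here is a minimal scalar sketch of the v8x16.load_interleaved_3 case applied to the 16-pixel RGB example from the top of this issue. It reuses the load_interleaved name from the pseudo-code; the std::array-based return type and the deinterleave_rgb16 helper are illustrative assumptions, not part of the proposal.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference for AxB.load_interleaved_C: reads B*C interleaved
// elements of type A from mem and splits them into C vectors of B lanes.
template <typename A, std::size_t B, std::size_t C>
std::array<std::array<A, B>, C> load_interleaved(const A* mem) {
    assert(mem != nullptr);
    std::array<std::array<A, B>, C> result{};
    for (std::size_t i = 0; i < B; ++i) {
        for (std::size_t j = 0; j < C; ++j) {
            // Lane i of vector j comes from the j-th component of element group i.
            result[j][i] = mem[i * C + j];
        }
    }
    return result;
}

// Hypothetical helper (name is an assumption): the v8x16, 3-way case.
// Takes 16 packed RGB pixels (48 bytes) and returns {Rs, Gs, Bs}.
inline std::array<std::array<std::uint8_t, 16>, 3>
deinterleave_rgb16(const std::uint8_t* rgb) {
    return load_interleaved<std::uint8_t, 16, 3>(rgb);
}
```

A runtime would be expected to lower the same operation to ld3 on Neon, or to a shuffle sequence chosen for the available SSE level on x86; the loop above only pins down the element ordering.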

The complementary store_interleaved instructions are probably worthwhile as well, but I'd like to see what folks think of the load instructions first.
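
For reference, a scalar sketch of what the complementary store might look like, assuming it is the exact inverse of the load pseudo-code above. The store_interleaved name, its signature, and the operand order are assumptions made for illustration; this issue does not define them.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed semantics for a hypothetical AxB.store_interleaved_C: takes C
// vectors of B lanes each and writes B*C interleaved elements to mem.
template <typename A, std::size_t B, std::size_t C>
void store_interleaved(const std::array<std::array<A, B>, C>& vectors, A* mem) {
    assert(mem != nullptr);
    for (std::size_t i = 0; i < B; ++i) {
        for (std::size_t j = 0; j < C; ++j) {
            // Inverse of the load: lane i of vector j goes to slot i*C + j.
            mem[i * C + j] = vectors[j][i];
        }
    }
}
```

Under this assumption, storing the result of an interleaved load back to memory reproduces the original bytes, which would correspond to st2/st3/st4 on Neon.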

tlively commented 4 years ago

Interesting idea! These instructions all have multivalue types, so this has a soft dependency on the multivalue proposal. I don't think we want MVP SIMD to depend on multivalue, so I'm going to tag this as post-MVP.

AndrewScheidecker commented 4 years ago

Some WIP benchmarks from prototyping this in WAVM: https://gist.github.com/AndrewScheidecker/7d2075d0fb6cc1e2c71b57fa54e9d850

> These instructions all have multivalue types, so this has a soft dependency on the multivalue proposal. I don't think we want MVP SIMD to depend on multivalue, so I'm going to tag this as post-MVP.

I agree that we don't want to add a dependency on multivalue. However, I think it would be pretty easy to add instructions with multiple results to the SIMD spec/reference interpreter without pulling in anything from the multivalue repo.