Lokathor / wide

A crate to help you go wide. By which I mean use SIMD stuff.
https://docs.rs/wide
zlib License

Provide matrix transpose for 8x8 types #128

Closed mcroomp closed 1 year ago

mcroomp commented 1 year ago

Last PR for a while I promise :) I'm working on image processing, and this is the last piece I need that would significantly benefit from running wide and it's not easy to implement with the current building blocks.

This implements an 8x8 transpose for i32x8, i16x8, and f32x8. For the 32-bit types it only accelerates on AVX; I don't think 128-bit SIMD gains much over plain scalar here, but if someone wants to prove me wrong...
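For reference, the semantics being added are just a plain row/column swap over eight row vectors. This is a minimal scalar sketch using `[i32; 8]` arrays as a stand-in for wide's `i32x8` (the PR's actual version uses shuffle/unpack intrinsics under AVX; this only pins down the expected result):

```rust
// Reference semantics of an 8x8 transpose over eight row vectors.
// [[i32; 8]; 8] stands in for [i32x8; 8]; out[c][r] == rows[r][c].
fn transpose_8x8(rows: [[i32; 8]; 8]) -> [[i32; 8]; 8] {
    let mut out = [[0i32; 8]; 8];
    for r in 0..8 {
        for c in 0..8 {
            out[c][r] = rows[r][c];
        }
    }
    out
}
```

A transpose is its own inverse, so applying it twice round-trips the input, which makes it easy to check any SIMD version against this scalar one.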

Lokathor commented 1 year ago

This is certainly not the best way to have SIMD matrix support. The better way is to keep a normal matrix layout but make each element an f32xN (or whichever) instead of an f32. The ultraviolet crate is an example.

We can still add this new code; it's just not the best use of SIMD, is all.
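To make the layout Lokathor is describing concrete: each matrix element holds N independent lanes, so one matrix operation processes N matrices at once. A minimal sketch below, using a hypothetical `Mat2xWide` 2x2 type with `[f32; 8]` arrays standing in for wide's `f32x8` (ultraviolet's actual types differ; this only illustrates the shape of the idea):

```rust
// One "wide" 2x2 matrix: each element carries 8 independent lanes,
// so multiplying two of these multiplies 8 matrix pairs at once.
#[derive(Clone, Copy, Debug)]
struct Mat2xWide {
    m: [[[f32; 8]; 2]; 2],
}

// Lane-wise add and multiply; with real f32x8 these are single SIMD ops.
fn add_lanes(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
    core::array::from_fn(|i| a[i] + b[i])
}

fn mul_lanes(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
    core::array::from_fn(|i| a[i] * b[i])
}

impl Mat2xWide {
    // Ordinary 2x2 matrix multiply, written once, applied to all 8 lanes.
    fn mul(self, rhs: Self) -> Self {
        let mut m = [[[0.0f32; 8]; 2]; 2];
        for r in 0..2 {
            for c in 0..2 {
                let mut acc = [0.0f32; 8];
                for k in 0..2 {
                    acc = add_lanes(acc, mul_lanes(self.m[r][k], rhs.m[k][c]));
                }
                m[r][c] = acc;
            }
        }
        Mat2xWide { m }
    }
}
```

The point is that no transpose or shuffle is ever needed: the SIMD width maps to "how many matrices", not to "a row of one matrix".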

mcroomp commented 1 year ago

> This is certainly not the best way to have SIMD matrix support. The better way is to keep a normal matrix layout but make each element an f32xN (or whichever) instead of an f32. The ultraviolet crate is an example.
>
> We can still add this new code; it's just not the best use of SIMD, is all.

I agree, this definitely isn't designed for general-purpose matrix work. But image processing/compression is full of 8x8 DCTs, Gaussian blurs, and so on, which generally follow the pattern of a separable 2D convolution expressed as two 1D convolutions with a transpose in the middle:

convolve_1D -> transpose -> convolve_1D -> transpose

The scenario for me is the transpose during an 8x8 DCT/IDCT, where everything is already in eight 1D vectors, but I need a transpose to repeat the second pass along the y-axis of the DCT. Using SIMD for the transpose brings its overhead from 40% down to 10%.
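The convolve/transpose pattern above can be sketched on an 8x8 scalar block. This is not the DCT from the PR, just a hypothetical 3-tap box filter with clamped edges to show why the transpose sits in the middle: both passes then run along rows, which is the memory-friendly (and SIMD-friendly) direction.

```rust
// 8x8 transpose, scalar reference version.
fn transpose(a: [[f32; 8]; 8]) -> [[f32; 8]; 8] {
    let mut out = [[0.0; 8]; 8];
    for r in 0..8 {
        for c in 0..8 {
            out[c][r] = a[r][c];
        }
    }
    out
}

// 1D convolution applied along each row, with edge clamping so every
// output pixel sees all three taps.
fn convolve_rows(a: [[f32; 8]; 8], k: [f32; 3]) -> [[f32; 8]; 8] {
    let mut out = [[0.0; 8]; 8];
    for r in 0..8 {
        for c in 0..8 {
            let mut acc = 0.0;
            for (j, kv) in k.iter().enumerate() {
                let src = (c + j).saturating_sub(1).min(7);
                acc += a[r][src] * kv;
            }
            out[r][c] = acc;
        }
    }
    out
}

// Separable 2D filter: convolve_1D -> transpose -> convolve_1D -> transpose.
fn blur_2d(a: [[f32; 8]; 8], k: [f32; 3]) -> [[f32; 8]; 8] {
    transpose(convolve_rows(transpose(convolve_rows(a, k)), k))
}
```

With a normalized kernel such as `[0.25, 0.5, 0.25]`, a constant image passes through unchanged, which is a handy sanity check for both the convolution and the transpose.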

mcroomp commented 1 year ago

Added an example IDCT implementation to t_usefulness, which is also a good exercise of wide's operations to ensure we get exactly the same behavior on all architectures.