Lokathor / wide

A crate to help you go wide. By which I mean use SIMD stuff.
https://docs.rs/wide
zlib License
279 stars 23 forks source link

Add slice methods? v2 #111

Closed torokati44 closed 2 years ago

torokati44 commented 2 years ago

Sorry for necrobumping #105, let's continue here then.

Your understanding of what I'm doing is correct. I'm processing a planar image with 8 bits per pixel per component of depth, and I plan to speed it up by doing 4 adjacent pixels at a time in SIMD. And since it's a colorspace conversion operation, it needs to duplicate (upscale) chroma samples from the source on load. If there is a better lane arrangement for this case, please let me know.

Thanks for the pod_align_to tip, it's a neat function, and could come in handy, but unfortunately in this case, on each iteration I also have to convert from &[u8; 4] to i32x4, and then back for storing the results - for more bits of precision to work with in between. And I doubt that loading an [u8; 16], then throwing out 3/4 of it with a shuffle/interleave would be faster than doing it with a simple element-wise upcast, but that's just my guess. EDIT: Maybe some sort of simple gather operation would also come in handy for this!

And on the store side, I also have to transpose the 3 i32x4 (4xR, 4xG, 4xB) results to get an interleaved output image.

torokati44 commented 2 years ago

Basically, writing this:

let r: &[i32; 4] = bytemuck::cast_ref::<i32x4, [i32; 4]>(&r);
let g: &[i32; 4] = bytemuck::cast_ref::<i32x4, [i32; 4]>(&g);
let b: &[i32; 4] = bytemuck::cast_ref::<i32x4, [i32; 4]>(&b);

// The output alpha values are fixed
rgba.copy_from_slice(&[
    r[0] as u8, g[0] as u8, b[0] as u8, 255, r[1] as u8, g[1] as u8, b[1] as u8, 255,
    r[2] as u8, g[2] as u8, b[2] as u8, 255, r[3] as u8, g[3] as u8, b[3] as u8, 255,
]);

instead of this:

let r: [i32; 4] = r.to_array();
let g: [i32; 4] = g.to_array();
let b: [i32; 4] = b.to_array();

// The output alpha values are fixed
rgba.copy_from_slice(&[
    r[0] as u8, g[0] as u8, b[0] as u8, 255, r[1] as u8, g[1] as u8, b[1] as u8, 255,
    r[2] as u8, g[2] as u8, b[2] as u8, 255, r[3] as u8, g[3] as u8, b[3] as u8, 255,
]);

is much faster. It just looks even less neat. (rgba is a 16-long mutably borrowed chunk of the large output slice.)

Lokathor commented 2 years ago

I will not have time to look closely at this until later, but my initial suggestion is that, when I was doing a similar thing with the rust version of handmade hero, what I did was:

Since the alpha lane and color lanes generally did different ops, just having separate simd registers for each channel amd processing four pixels at a time that way ended up being easiest to do.

But I can come back and have a closer look at your scenario, hopefully later today.

torokati44 commented 2 years ago

In my case, the input is YUV (rather, something like Y'Cb'Cr', or whatever), so I have to do different things to each of the incoming components, then combine them to get the output RGB components - but the same thing for all pixels. And with the input being planar, no transpose is needed on load, only an integer width extension for each lane.

That's why I kept the lanes corresponding to pixels, and only had to transpose at the end. There is also no alpha involved in these operations, it just has to be written to the output as a constant.

Thankfully this simple conversion can also be done with only integer operations - at one point considering the lanes as being in 16.16 fixed-point representation, then shifting them back to the right to make them regular integers again.

torokati44 commented 2 years ago

All of this, I feel like, is not that important to the point of this issue, as it's only supposed to highlight that being able to copy from vector values into slices without having to copy them once more is useful sometimes, and having a concise way to do it would be nice.

Lokathor commented 2 years ago

Yeah, I had kinda assumed that the optimizer would delete the intermediate value, but apparently not.

Lokathor commented 2 years ago

Alright, yeah, so re-reading, I think one improvement might be to keep things in SIMD longer, because you can pack the channel_x4 values into the final pixel_x4 using SIMD, and I bet that would give you quite the boost. Instead of moving around 12 individual bytes (outside of simd), it would be just a few extra shift and bitor ops (in simd) using values you've already got set up in the registers.

Assuming that r4, g4, and b4 each have one channel value per lane that's already in the 0 to 255 range, you can merge them in simd like this:

let a4 = i32::from(255);

r4 = r4 << 24;
g4 = g4 << 16;
b4 = b4 << 8;

let merged: i32x4 = (r4 | g4) | (b4 | a4);

// assuming that some appropriate RGBA8 type exists in your crate that's marked as Pod:
let out4: [RGBA8; 4] = bytemuck::cast(merged);
out_slice[position..(position+4)].copy_from_slice(&out4);

(unfortunately the ShrAssign and ShlAssign traits currently aren't implemented for i32x4 it seems, so you can't use <<=, but oh well, someone could PR that I guess)


That said, I see your point that slice methods can reduce some pain here.

I still don't think that Index is quite appropriate for SIMD types, but if you wanted to PR some as_slice methods to turn &self into &[T] for the various simd types and their elements (and mut versions, and even the AsRef and AsMut traits and such if you like) that would probably be fine.

I just think that it should remain an explicit conversion, rather than automatic, because reading or writing a particular lane within a SIMD register forces the value out of register and onto the stack. And yes LLVM can handle that transparently in terms of API, but you still get a performance loss as the CPU has to push around data.

torokati44 commented 2 years ago

Oh, that's a neat idea, and it could indeed work, thanks a lot! Will give it a try. Well, our RGBA8 type in this particular context is [u8; 4]... :D Even an explicit conversion to a slice ref would be just fine for me. I'd love to send PRs to at least some of the things you mentioned, but I can't promise that I will actually find the time/energy to do so... :/

Lokathor commented 2 years ago

No rush! Since the portable-simd project is finally usable in Nightly I'm hoping that people can slowly move to that and eventually we won't need wide at all, since it'll all be in the standard library.

torokati44 commented 2 years ago

Yeah, I actually wanted to do this experiment with portable-simd (as std::simd), but then um...: https://github.com/rust-lang/rust/issues/89647 So, no nightly Rust for me, I guess!

torokati44 commented 2 years ago

Resolved by #112.