Lokathor / wide

A crate to help you go wide. By which I mean use SIMD stuff.
https://docs.rs/wide
zlib License

add slice methods #105

Closed ImUrX closed 3 years ago

ImUrX commented 3 years ago

i'm thinking of implementing Index and IndexMut, but i want to know what you meant by the note about avoiding them in hot code. by that i mean, how should i implement that note? lol

closes #60

Lokathor commented 3 years ago

I'd kinda forgotten about the details of #60. These are converting the SIMD value to an array. That's useful, but in that case the usual name would be to_array rather than as_slice. This would be most similar to to_bits in the standard library.

Other than the name the code is all fine.
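
A minimal sketch of the naming point above, using only the standard library. `F32x4` and `lane_sum` here are hypothetical stand-ins for illustration, not the real `wide` types or API. The idea is that a `to_*` method takes the value and returns an owned copy (like `f32::to_bits`), whereas an `as_*` method such as `Vec::as_slice` hands back a borrowed view.

```rust
/// Hypothetical stand-in for a 4-lane f32 SIMD value (NOT the real `wide::f32x4`).
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C, align(16))]
pub struct F32x4(pub [f32; 4]);

impl F32x4 {
    /// `to_*` naming: consumes a copy of `self` and returns an owned array,
    /// in the same spirit as `f32::to_bits` returning an owned `u32`.
    pub fn to_array(self) -> [f32; 4] {
        self.0
    }
}

/// Hypothetical helper: sums the lanes by first copying them out into an array.
pub fn lane_sum(v: F32x4) -> f32 {
    v.to_array().iter().sum()
}
```

Because the result is an owned `[f32; 4]`, nothing borrows from the SIMD value afterwards, which is why `as_slice` would have been a misleading name for it.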

ImUrX commented 3 years ago

yeah it's useful, i was thinking of as_slice like the Vec method, but i just noticed that one returns a reference instead, so i should really rename it to to_array

i saw that there are instructions for extracting and replacing lanes which could be pretty nice for implementing Index and IndexMut, should i continue on that idea too?

ImUrX commented 3 years ago

But now that i think about it, Index and IndexMut expect references to the value, which i don't think is gonna be possible

Lokathor commented 3 years ago

So what happens is that it works, but the compiler pushes the register onto the stack, performs the memory operation on the reference, and then pops the value off the stack. It's completely inefficient, which is why wide doesn't offer it.

If you still really want to do it, you could use bytemuck to fiddle a reference to a SIMD value into a reference to a slice or something like that. But the hope is that the difficulty of doing such a thing will help prevent people from casually using Index and then losing a ton of performance.
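
For illustration only, here is roughly what "fiddling a reference" could look like. This sketch uses a hypothetical stand-in type `F32x4` and a raw pointer cast from the standard library rather than bytemuck itself; with bytemuck, `cast_ref` would do the same reinterpretation safely after checking size and alignment. The helper name `as_array_ref` is also an assumption made for this example.

```rust
use std::ops::Index;

/// Hypothetical stand-in for a 4-lane f32 SIMD value (NOT the real `wide::f32x4`).
#[derive(Clone, Copy)]
#[repr(C, align(16))]
pub struct F32x4(pub [f32; 4]);

/// Reinterprets a reference to the SIMD stand-in as a reference to its lanes.
pub fn as_array_ref(v: &F32x4) -> &[f32; 4] {
    // SAFETY: `F32x4` is `repr(C)` with a single `[f32; 4]` field, so both types
    // have the same size, and `F32x4` is at least as strictly aligned.
    unsafe { &*(v as *const F32x4 as *const [f32; 4]) }
}

impl Index<usize> for F32x4 {
    type Output = f32;

    /// Returns a reference to one lane. Needing a reference is what can force
    /// the whole value out of a register and into memory, as described above.
    fn index(&self, lane: usize) -> &f32 {
        &as_array_ref(self)[lane]
    }
}
```

This works, but it carries exactly the cost described in the comment above, so it is best kept out of hot loops.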

ImUrX commented 3 years ago

you are right, i will just correct the names for this pr

torokati44 commented 2 years ago

I found that, in the use case of simply applying a mathematical function to a big array of input numbers, and putting the results into a different big array of output values, the to_array() function is really inefficient, as it likely always creates an unnecessary intermediate copy of the lanes (possibly even allocates, even if only on the stack). And I found no other way of accessing the individual lane values to copy them into the output array. But using bytemuck::cast_ref, this bulk processing is much faster (on wasm32 target at least), as the values can go straight from input array to vector registers to output array (I haven't verified this though).

Lokathor commented 2 years ago

I'm not entirely sure I understand your situation so let me see if I follow what you're saying:

In a situation like this, since you already mentioned bytemuck, I would suggest using pod_align_to. Then you can do the entire middle portion as an iteration over simd values.

If this doesn't work out for you please open a new issue though. I'm interested in having things work, but old PRs that are already merged are where things quickly get lost and forgotten about.
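
A rough sketch of the pod_align_to idea suggested above, under some assumptions: it uses the standard library's unsafe `align_to_mut` (bytemuck's `pod_align_to` is a safe wrapper over the same kind of split), a hypothetical stand-in type `F32x4` instead of the real `wide` type, and an in-place squaring operation rather than separate input and output arrays, to keep the example short.

```rust
/// Hypothetical stand-in for a 4-lane f32 SIMD value (NOT the real `wide::f32x4`).
#[derive(Clone, Copy)]
#[repr(C, align(16))]
pub struct F32x4(pub [f32; 4]);

impl F32x4 {
    /// Squares every lane; a real SIMD type would do this in one instruction.
    fn square(self) -> Self {
        let [a, b, c, d] = self.0;
        F32x4([a * a, b * b, c * c, d * d])
    }
}

/// Squares every element of `data` in place, handling the aligned middle
/// portion four lanes at a time and the unaligned ends one element at a time.
pub fn square_in_place(data: &mut [f32]) {
    // SAFETY: `F32x4` is 16 bytes of plain `f32` data with no padding, and every
    // bit pattern of four `f32`s is a valid `F32x4`.
    let (head, body, tail) = unsafe { data.align_to_mut::<F32x4>() };

    for x in head.iter_mut() {
        *x = *x * *x;
    }
    // The "entire middle portion as an iteration over simd values".
    for v in body.iter_mut() {
        *v = v.square();
    }
    for x in tail.iter_mut() {
        *x = *x * *x;
    }
}
```

The result does not depend on how the slice happens to be split, so the function gives the same answer for aligned and unaligned inputs; only the share of work done in the middle loop changes.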