Shnatsel opened 6 years ago
There aren't any 32-bit SIMD registers on x86. Typically the "cross-platform" way to do something like that would be to repeat the 4 `u8`s across the width of a vector.
The stackoverflow thread you reference doesn't initialize the mask to anything, so I'm not sure about the exact pattern you're looking to xor with your byte stream. However, you should be able to do something like
```rust
byte_stream.simd_iter().simd_map(|v| v ^ u32s(YOUR_4_BYTE_PATTERN_HERE).be_u8s()).scalar_fill(&mut output)
```
> Casting the input `[u8; 4]` to `u32`, calling `vecs::u32s()` and then downcasting repeatedly to get a SIMD vector of `u8`, but `Downcast` does not seem to do what I want at all.
`Downcast` preserves the numeric value of each element, or saturates it if it's too big to fit in the newly-sized integer. You can use `mem::transmute` or faster's `Transmute` trait to re-type the vector without changing its contents.
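A scalar analogue of that distinction, as a sketch using only `std` (no faster vector types): a value-preserving downcast saturates, while a transmute only re-labels the same bits.

```rust
fn main() {
    let x: u32 = 0x1122_3344;

    // "Downcast"-style: keep the numeric value, saturating on overflow.
    let saturated = if x > u16::MAX as u32 { u16::MAX } else { x as u16 };
    assert_eq!(saturated, 0xFFFF);

    // "Transmute"-style: reinterpret the same four bytes as two u16 lanes.
    let lanes: [u16; 2] = unsafe { std::mem::transmute(x) };
    // On a little-endian host this prints [0x3344, 0x1122].
    println!("{:#x?}", lanes);
}
```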
Typically the "cross-platform" way to do something like that would be to repeat the 4
u8s
across the width of a vector.
Yes, this is exactly what I'm trying to do. However, currently the only way to do it is to convert `[u8; 4]` to a `u32`, call `u32s()` on that, and then convert it to a vector of `u8`s, which requires me to care about host endianness. It would be great if Faster could expose an API to repeat `[u8; 4]` across the width of a vector without touching endianness at all.
Thanks a lot for the code snippet! If I manage to get it to do what I want, I'll open a PR to add my code to examples.
You may be able to use the `Endian` trait (`vec.to_be()` and `vec.to_le()`) to ensure the endianness is correct across platforms. Since it's a computation on a constant, LLVM should be able to just compile the whole thing down to a `movaps` (or similar).
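As a tiny `std`-only illustration of the byte-order concern being discussed (nothing faster-specific; `from_ne_bytes`/`to_be_bytes` are today's stable names for these conversions):

```rust
fn main() {
    let mask = [0x11u8, 0x22, 0x33, 0x44];

    // Round-tripping through a u32 with the *same* byte order works on any host.
    assert_eq!(u32::from_ne_bytes(mask).to_ne_bytes(), mask);

    // Mixing orders (build with native order, emit big-endian bytes) reverses
    // the pattern on a little-endian host such as x86.
    let mixed = u32::from_ne_bytes(mask).to_be_bytes();
    println!("{:x?}", mixed); // [44, 33, 22, 11] on x86
}
```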
Ignoring endianness for now, the following code seems to do roughly what I want:
```rust
let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_collect().truncate(buf.len())
```
However, this is 2x slower than scalar code that does the same in place. I've tried rewriting this to mutate in place with `.simd_iter_mut()`, but I'm not sure what to chain it with to mutate the input: `simd_map()` insists I should return something, and there are no uses of `.simd_iter_mut()` in the examples. Is in-place mutation even supported at this point?
In-place mutation is in progress, but I think you may be better served by a `scalar_fill` call instead of `scalar_collect`, as the latter performs a heap allocation. You also shouldn't need to truncate the output buffer, as faster takes care of misaligned and arbitrarily-sized data for you.
The input can be almost arbitrarily large and is not guaranteed to fit on the stack, so it's going to be either a heap allocation or an in-place mutation.
FWIW I've also tried the following, but it was 25% slower than `.scalar_collect()`:
```rust
let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
let mut output = vec![0u8; buf.len()];
buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_fill(&mut output);
```
Also: curiously, on an AMD CPU with AVX, using `RUSTFLAGS="-C target-cpu=native"` actually makes the code 10% slower compared to `RUSTFLAGS="-C target-cpu=x86-64"` on a 10 KB input buffer.
Hm, that's interesting. Are you using Zen? I'll see if I can find anything weird in the disassembly and bench that code on my boxes.
Nope, plain old FX-4300. I'm on the nightly compiler, obviously, so code generation might vary from version to version. The compiler version I've tested this on is `rustc 1.28.0-nightly (60efbdead 2018-06-23)`.
I'm benchmarking with Criterion; I can share the entire test harness if you're interested.
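A minimal Criterion harness along these lines might look like the sketch below; the function name `apply_mask`, the mask value, and the buffer size are placeholders rather than the actual code from the benchmark.

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Placeholder for the masking routine under test.
fn apply_mask(buf: &mut [u8], mask: [u8; 4]) {
    for (i, byte) in buf.iter_mut().enumerate() {
        *byte ^= mask[i & 3];
    }
}

fn bench_mask(c: &mut Criterion) {
    c.bench_function("apply_mask 10KB", |b| {
        let mut buf = vec![0xA5u8; 10 * 1024];
        let mask = [0x12, 0x34, 0x56, 0x78];
        b.iter(|| apply_mask(&mut buf, mask));
    });
}

criterion_group!(benches, bench_mask);
criterion_main!(benches);
```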
That would be awesome, thanks. You won't see a speedup with AVX compared to SSE2 for what you're doing, but compiling for your native CPU shouldn't slow you down by that much.
https://github.com/Shnatsel/tungstenite-rs/tree/mask-simd - it's under `benches/` in the `mask-simd` branch.
I've been working on it in the tungstenite codebase (a WebSocket protocol implementation in Rust), but it's a self-contained file and can probably be decoupled fairly easily.
Also, with SIMD disabled I get performance roughly equal to the naive per-byte XOR - `apply_mask_fallback()` in that file. The polyfill does not have to be that slow: for example, the function `apply_mask_fast32()` in the same file, which does not use SIMD instructions but operates on 32 bits at a time, is 20x faster than `apply_mask_fallback()`.
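For readers who don't follow the link, the two scalar strategies compared here look roughly like the sketch below; the real `apply_mask_fallback()` and `apply_mask_fast32()` live in the linked file and differ in detail (alignment handling, etc.).

```rust
// Naive per-byte XOR.
fn apply_mask_fallback(buf: &mut [u8], mask: [u8; 4]) {
    for (i, byte) in buf.iter_mut().enumerate() {
        *byte ^= mask[i & 3];
    }
}

// XOR 32 bits at a time, then finish the tail byte-wise. Using native-endian
// conversions in both directions keeps the byte layout intact on any host.
fn apply_mask_fast32(buf: &mut [u8], mask: [u8; 4]) {
    let mask_u32 = u32::from_ne_bytes(mask);
    let mut chunks = buf.chunks_exact_mut(4);
    for chunk in &mut chunks {
        let mut word = [0u8; 4];
        word.copy_from_slice(chunk);
        let xored = u32::from_ne_bytes(word) ^ mask_u32;
        chunk.copy_from_slice(&xored.to_ne_bytes());
    }
    // The remainder starts at a multiple of 4, so the mask phase is 0.
    for (i, byte) in chunks.into_remainder().iter_mut().enumerate() {
        *byte ^= mask[i & 3];
    }
}
```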
Turns out AVX requires a switch to a higher power consumption state, and this takes time; until that happens, the core runs at significantly lower frequencies. So using AVX is not worthwhile unless it's used for long enough that the time before the switch to the higher power state becomes negligible. Source
This is one possible explanation for the performance drop on AVX.
The main reason you're not seeing a speedup is that `faster` isn't well-optimized for AVX-only CPUs at the moment and uses the SSE xor instruction. If you compile on a machine with AVX2, you will see the speedup.
Depending on how busy I am with work, a fix for this may or may not make it into the next release (along with runtime detection).
Here's the main loop, according to `cargo-disassemble`:
```asm
.LBB3_7:
    cmp rax, rsi
    ja .LBB3_8
    cmp rax, rcx
    ja .LBB3_23
    vpxor xmm0, xmm1, xmmword ptr [r8 + rax]
    vmovdqu xmmword ptr [rdx + rax], xmm0
    lea rdi, [rax + 16]
    add rax, 32
    cmp rax, rsi
    mov rax, rdi
    jbe .LBB3_7
```
The jump at the end makes sense, but the two at the beginning shouldn't be there. It looks like a bounds check, which hints that I may be missing an unchecked load somewhere.
I'm trying to get into SIMD by implementing a trivial operation: XOR unmasking of a byte stream, as required by the WebSocket specification. The implementation in x86 intrinsics is actually very straightforward, but I have a hard time wrapping my head around expressing it in terms of the Faster iterator API.

The part I'm having trouble with is getting an input `[u8; 4]` to cycle within a SIMD vector of `u8`. I have looked at:

- `load()`, which does accept `&[u8]` as input, but its behavior in case of a length mismatch is completely undocumented. It's also not obvious what the `offset` parameter does.
- Casting the input `[u8; 4]` to `u32`, calling `vecs::u32s()` and then downcasting repeatedly to get a SIMD vector of `u8`, but `Downcast` does not seem to do what I want at all.
- Requesting a SIMD vector of length 4 and loading `[u8; 4]` into it (lengths now match, so it should work), then downcasting repeatedly until I get a vector of `u8` of arbitrary length. Except there seems to be no way to request a SIMD vector of length 4 and arbitrary type.
- `From<u32x4>` is implemented for `u8x16`, so I could replace `Downcast` with it in approach 2 and probably get the correct result, except I have no idea how such conversions interact with host endianness.

I actually expected this to be a trivial task. I guess for someone familiar with SIMD it is, but for the likes of me, a snippet in the examples/ folder that loads `[u8; 4]` into a vector would go a long way. Or perhaps even a convenience function in the API that deals with endianness properly, to make it harder to mess up.
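For comparison, here is a sketch of the "very straightforward" x86-intrinsics version mentioned at the top, written so that host endianness never enters the picture: the repeated pattern is built as plain bytes and loaded directly. The function name and structure are illustrative, not taken from any existing implementation.

```rust
#[cfg(target_arch = "x86_64")]
unsafe fn xor_mask_sse2(buf: &mut [u8], mask: [u8; 4]) {
    use std::arch::x86_64::*;

    // Repeat the 4-byte mask across 16 bytes; no integer conversions, hence
    // no endianness concerns.
    let mut pattern_bytes = [0u8; 16];
    for (i, byte) in pattern_bytes.iter_mut().enumerate() {
        *byte = mask[i & 3];
    }
    let pattern = _mm_loadu_si128(pattern_bytes.as_ptr() as *const __m128i);

    let mut chunks = buf.chunks_exact_mut(16);
    for chunk in &mut chunks {
        let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let x = _mm_xor_si128(v, pattern);
        _mm_storeu_si128(chunk.as_mut_ptr() as *mut __m128i, x);
    }
    // The tail starts at a multiple of 16, so the mask phase is 0.
    for (i, byte) in chunks.into_remainder().iter_mut().enumerate() {
        *byte ^= mask[i & 3];
    }
}
```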