AdamNiederer / faster

SIMD for humans
Mozilla Public License 2.0

No obvious way to xor byte stream with [u8; 4] #46

Open Shnatsel opened 6 years ago

Shnatsel commented 6 years ago

I'm trying to get into SIMD by implementing a trivial operation: XOR unmasking of a byte stream as required by the WebSocket specification. The implementation in x86 intrinsics is actually very straightforward, but I have a hard time wrapping my head around expressing it in terms of Faster iterators API.
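
For reference, here's roughly what I mean by the intrinsics version — just a sketch of the idea, not my actual code (SSE2, and it assumes the buffer length is a multiple of 16; real code would handle the tail bytes separately):

use std::arch::x86_64::*;

// Sketch only: XOR-unmask `buf` in place with a repeating 4-byte mask.
// Assumes SSE2 (always available on x86_64) and buf.len() % 16 == 0.
unsafe fn apply_mask_sse2(buf: &mut [u8], mask: [u8; 4]) {
    // Broadcast the 4-byte mask across all four 32-bit lanes of the register.
    let mask_vec = _mm_set1_epi32(i32::from_ne_bytes(mask));
    for chunk in buf.chunks_exact_mut(16) {
        let ptr = chunk.as_mut_ptr() as *mut __m128i;
        // Unaligned load, XOR, unaligned store.
        let masked = _mm_xor_si128(_mm_loadu_si128(ptr), mask_vec);
        _mm_storeu_si128(ptr, masked);
    }
}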

The part I'm having trouble with is getting an input [u8; 4] to cycle within a SIMD vector of u8. I have looked at:

  1. load(), which does accept &[u8] as input, but its behavior in case of a length mismatch is completely undocumented. It's also not obvious what the offset parameter does.
  2. Casting the input [u8; 4] to u32, calling vecs::u32s() and then downcasting repeatedly to get a SIMD vector of u8, but Downcast doesn't seem to do what I want at all.
  3. Getting a SIMD vector of length 4 with an arbitrary element type, loading [u8; 4] into it (the lengths now match, so it should work), then downcasting repeatedly until I get a vector of u8 of arbitrary length. Except there seems to be no way to request a SIMD vector of length 4 with an arbitrary element type.
  4. After over an hour of head-scratching I've noticed that From<u32x4> is implemented for u8x16, so I could replace Downcast with it in approach 2 and probably get the correct result, except I have no idea how such conversions interact with host endianness.

I actually expected this to be a trivial task. I guess for someone familiar with SIMD it is, but for the likes of me, a snippet in the examples/ folder that loads [u8; 4] into a vector would go a long way. Or perhaps even a convenience function in the API that deals with endianness properly, to make it harder to mess up.

AdamNiederer commented 6 years ago

There aren't any 32-bit SIMD registers on x86. Typically the "cross-platform" way to do something like that would be to repeat the 4 u8s across the width of a vector.

The Stack Overflow thread you reference doesn't initialize the mask to anything, so I'm not sure about the exact pattern you're looking to xor with your byte stream. However, you should be able to do something like:

byte_stream.simd_iter().simd_map(|v| v ^ u32s(YOUR_4_BYTE_PATTERN_HERE).be_u8s()).scalar_fill(&mut output)

> Casting the input [u8; 4] to u32, calling vecs::u32s() and then downcasting repeatedly to get a SIMD vector of u8, but Downcast doesn't seem to do what I want at all.

Downcast preserves the numeric value of each element, or saturates it if it's too big to fit in the newly-sized integer. You can use mem::transmute or faster's Transmute trait to re-type the vector without changing its contents.
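
If it helps, here's the same distinction in scalar terms — plain std Rust, not faster's API, just to illustrate what each operation does to the bits:

let x: u32 = 0x1122_3344;

// Downcast-like: tries to preserve the numeric value, saturating when it
// doesn't fit — 0x1122_3344 doesn't fit in a u8, so you'd get 0xFF.
let saturated = if x > u8::max_value() as u32 { u8::max_value() } else { x as u8 };

// Transmute-like: reinterprets the same four bytes as-is — on a
// little-endian host this is [0x44, 0x33, 0x22, 0x11].
let reinterpreted: [u8; 4] = unsafe { std::mem::transmute(x) };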

Shnatsel commented 6 years ago

> Typically the "cross-platform" way to do something like that would be to repeat the 4 u8s across the width of a vector.

Yes, this is exactly what I'm trying to do. However, currently the only way to do it is to convert the [u8; 4] to a u32, call u32s() on that, and then convert the result to a vector of u8s, which requires me to care about host endianness. It would be great if Faster could expose an API to repeat a [u8; 4] across the width of a vector without touching endianness at all.
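
To make the concern concrete, here's the scalar side of it in plain std Rust (a sketch, not faster's API): because XOR is a bytewise operation, what matters is the in-memory byte layout of the broadcast value, and that depends on which conversion you pick.

let mask: [u8; 4] = [0x12, 0x34, 0x56, 0x78];
let native = u32::from_ne_bytes(mask); // in-memory layout stays [0x12, 0x34, 0x56, 0x78] on any host
let big    = u32::from_be_bytes(mask); // in-memory layout matches the mask only on big-endian hosts
let little = u32::from_le_bytes(mask); // in-memory layout matches the mask only on little-endian hosts
// Broadcasting `native` and reinterpreting the vector as bytes reproduces the
// mask pattern on any host; `big` or `little` only do so on hosts with the
// matching byte order.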

Thanks a lot for the code snippet! If I manage to get it to do what I want, I'll open a PR to add my code to examples.

AdamNiederer commented 6 years ago

You may be able to use the Endian trait (vec.to_be() and vec.to_le()) to ensure the endianness is correct across platforms. Since it's a computation on a constant, LLVM should be able to just compile the whole thing down to a movaps (or similar).

Shnatsel commented 6 years ago

Ignoring endianness for now, the following code seems to do roughly what I want:

let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_collect().truncate(buf.len())

However, this is 2x slower than scalar code that does the same thing in place. I've tried rewriting this to mutate in place with .simd_iter_mut(), but I'm not sure what to chain it with to mutate the input: simd_map() insists I should return something, and there are no uses of .simd_iter_mut() in the examples. Is in-place mutation even supported at this point?

AdamNiederer commented 6 years ago

In-place mutation is in progress, but I think you may be better served by a scalar_fill call instead of scalar_collect, as the latter performs a heap allocation. You also shouldn't need to truncate the output buffer, as faster takes care of misaligned and arbitrarily-sized data for you.

Shnatsel commented 6 years ago

The input can be almost arbitrarily large and is not guaranteed to fit on the stack, so it's going to be either a heap allocation or an in-place mutation.

FWIW I've also tried the following, but it was 25% slower than .scalar_collect():

let mask_u32 = u32::from_bytes(mask);
let pattern = faster::u32s(mask_u32).be_u8s();
let mut output = vec![0u8; buf.len()];
buf.simd_iter(u8s(0)).simd_map(|v| v ^ pattern).scalar_fill(&mut output);

Also, curiously, on an AMD CPU with AVX, using RUSTFLAGS="-C target-cpu=native" actually makes the code 10% slower compared to RUSTFLAGS="-C target-cpu=x86-64" on a 10 KB input buffer.

AdamNiederer commented 6 years ago

Hm, that's interesting. Are you using Zen? I'll see if I can find anything weird in the disassembly and bench that code on my boxes.

Shnatsel commented 6 years ago

Nope, plain old FX-4300. I'm on a nightly compiler, obviously, so code generation might vary from version to version. The compiler version I've tested this on is rustc 1.28.0-nightly (60efbdead 2018-06-23).

I'm benchmarking with Criterion; I can share the entire test harness if you're interested.

AdamNiederer commented 6 years ago

That would be awesome, thanks. You won't see a speedup with AVX compared to SSE2 for what you're doing, but compiling for your native CPU shouldn't slow you down by that much.

Shnatsel commented 6 years ago

https://github.com/Shnatsel/tungstenite-rs/tree/mask-simd - it's under benches/ in the mask-simd branch. I've been working on it in the tungstenite codebase (a WebSocket protocol implementation in Rust), but it's a self-contained file and can probably be decoupled fairly easily.

Shnatsel commented 6 years ago

Also, with SIMD disabled I get performance roughly equal to the naive per-byte XOR (apply_mask_fallback() in that file). The polyfill does not have to be that slow: for example, apply_mask_fast32() in the same file, which does not use SIMD instructions but operates on 32 bits at a time, is 20x faster than apply_mask_fallback().
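
For context, the 32-bits-at-a-time idea looks roughly like this — a sketch of the approach, not the exact apply_mask_fast32 from that file:

// Sketch: XOR whole native-endian u32 words, then handle the tail per byte.
fn apply_mask_u32(buf: &mut [u8], mask: [u8; 4]) {
    let mask_u32 = u32::from_ne_bytes(mask);
    let mut chunks = buf.chunks_exact_mut(4);
    for chunk in &mut chunks {
        let word = u32::from_ne_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        chunk.copy_from_slice(&(word ^ mask_u32).to_ne_bytes());
    }
    // The remainder starts at a multiple of 4, so mask[i] lines up correctly.
    for (i, byte) in chunks.into_remainder().iter_mut().enumerate() {
        *byte ^= mask[i];
    }
}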

Shnatsel commented 6 years ago

It turns out that AVX requires a switch to a higher power-consumption state, and that switch takes time; until it happens, the CPU runs AVX code at significantly lower frequencies. So using AVX is not worthwhile unless you're going to use it for long enough that the time spent before the switch to the higher power state becomes negligible (source).

This is one possible explanation for the performance drop on AVX.

AdamNiederer commented 6 years ago

The main reason you're not seeing a speedup is that faster isn't well optimized for AVX-only CPUs at the moment, and uses the SSE xor instruction. If you compile on a machine with AVX2, you will see the speedup.

Depending on how busy I am with work, a fix for this may or may not make it into the next release (along with runtime detection).

AdamNiederer commented 6 years ago

Here's the main loop, according to cargo-disassemble:

.LBB3_7:
    cmp rax, rsi
    ja  .LBB3_8
    cmp rax, rcx
    ja  .LBB3_23
    vpxor   xmm0, xmm1, xmmword ptr [r8 + rax]
    vmovdqu xmmword ptr [rdx + rax], xmm0
    lea rdi, [rax + 16]
    add rax, 32
    cmp rax, rsi
    mov rax, rdi
    jbe .LBB3_7

The jump at the end makes sense, but the two at the beginning shouldn't be there. They look like a bounds check, which hints that I may be missing an unchecked load somewhere.
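
To be clear about the terminology (generic Rust, not faster's actual loader): indexing a slice with [] emits a bounds check like the two compares above, while an unchecked load trusts the caller and lets LLVM drop the branch.

// Illustration only.
fn load_checked(data: &[u8], i: usize) -> u8 {
    data[i] // panics (and therefore branches) if i is out of bounds
}

unsafe fn load_unchecked(data: &[u8], i: usize) -> u8 {
    *data.get_unchecked(i) // caller guarantees i < data.len(); no check emitted
}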