Inefficient x64 codegen for 8x16 shifts

abrown commented 5 years ago

While attempting to lower shl and shr (https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#bit-shifts) in cranelift, I observed that following instructions would involve a non-optimal lowering to x86:

i8x16.shl
i8x16.shr_s
i8x16.shr_u
i64x2.shr_s

I see that, e.g., v8 lowers i8x16.shl to 10 instructions (https://github.com/v8/v8/blob/5097dcb706b4438cf2ba3da5dfacfbc36643759c/src/compiler/backend/x64/code-generator-x64.cc#L3358-L3398) and am concerned that unsuspecting users may be hit with performance cliffs on x86. Are there better ideas out there on how to lower this? Are there workloads out there that would use these instructions?

tlively commented 5 years ago

See some recent discussion of shift semantics in https://github.com/WebAssembly/simd/issues/100.

dtig commented 5 years ago

The shifts were included as these are useful operations, narrow shifts in the initial version for codecs, and 64x2 shifts with benchmarking results from @ngzhian. The codegen on Intel hardware is suboptimal, though I'm not sure that it introduces a performance cliff as mostly SIMD instructions are used. The rationale for some of these commonly used instructions is that these are useful operations to have, and expensive to emulate in the absence of operations that directly lower to codegen.

abrown commented 4 years ago

@dtig, we captured in the notes that there are some uses of 8x16 shifts in vp8 code. Looking at https://github.com/webmproject/libvpx, I see some uses of 8x16 shifts in the ARM tree (e.g. https://github.com/webmproject/libvpx/tree/master/vpx_dsp/arm/loopfilter_4_neon.asm) but obviously not in the x86 tree (e.g. https://github.com/webmproject/libvpx/blob/master/vpx_dsp/x86/loopfilter_avx2.c). Is this the use you were referring to?

It is unclear to me, unfamiliar with the code, if using a more granular 8x16 shift is an advantage for performance reasons or simply for convenience reasons like you describe above ("useful operations"). If the reason is convenience only, there is likely a case to be made for dropping these 8x16 shifts to avoid penalizing x86 since in that case there is an alternate, performance-equivalent, code sequence (along the lines of whatever x86 has to do). If the reason is performance, that is a different story. Is there any way to understand which of these reasons, performance or convenience, weighs heavier in this instruction selection? I'm thinking of some form of benchmark or some explanation why the 8-bit lanes are preferable to 16-bit lanes.

You mentioned there were other benchmarks making use of the 8x16 shift; can you link to them?

nfrechette commented 4 years ago

Shifts are commonly used with fixed point arithmetic but it isn't as common on 8 bit values (16/32/64 bit being the most common). I also imagine that it might be even less common for the type of code that will target wasm.

dtig commented 4 years ago

@dtig, we captured in the notes that there are some uses of 8x16 shifts in vp8 code. Looking at https://github.com/webmproject/libvpx, I see some uses of 8x16 shifts in the ARM tree (e.g. https://github.com/webmproject/libvpx/tree/master/vpx_dsp/arm/loopfilter_4_neon.asm) but obviously not in the x86 tree (e.g. https://github.com/webmproject/libvpx/blob/master/vpx_dsp/x86/loopfilter_avx2.c). Is this the use you were referring to?

These are some examples, there should be more, but from the feedback we've gotten anything in the dsp directories is on the critical path for performance.

It is unclear to me, unfamiliar with the code, if using a more granular 8x16 shift is an advantage for performance reasons or simply for convenience reasons like you describe above ("useful operations"). If the reason is convenience only, there is likely a case to be made for dropping these 8x16 shifts to avoid penalizing x86 since in that case there is an alternate, performance-equivalent, code sequence (along the lines of whatever x86 has to do). If the reason is performance, that is a different story. Is there any way to understand which of these reasons, performance or convenience, weighs heavier in this instruction selection? I'm thinking of some form of benchmark or some explanation why the 8-bit lanes are preferable to 16-bit lanes.

From what I understand, what you are recommending here is to get rid of the 8x16 shift, and document an accepted code pattern for applications that use these shifts. This is something we've strongly avoided, because this is a way applications can encode performance cliffs. Can you add specific details about alternate, performance-equivalent code sequences on relevant architectures? (At least Neon, ARMv7 are required, and any others are optional, but useful to have). The performance definitely weighs higher IMO, though other opinions here are useful as well. The convenience reason here is that the x86 instruction set is asymmetrical in many different ways, but where possible it would be nice to not bias, or encode the same inconsistencies in a portable-SIMD proposal. Most of the operations that are harder to emulate on x86 aren't included in this proposal except for basic operations, emulating Shifts without documenting accepted patterns is very slow on ARM, and developers that are aware of those patterns can use them without compromising any existing performance on x86.

You mentioned there were other benchmarks making use of the 8x16 shift; can you link to them?

abrown commented 4 years ago

what you are recommending here is to get rid of the 8x16 shift, and document an accepted code pattern for applications that use these shifts

I don't think I am, but let me explain. I am recommending that we do get rid of the 8x16 shift if adding the instruction is only for convenience; if there is a performance-critical reason to keep those shifts then I would change my tune, but that is why I am 1) asking for feedback about the effect of 8x16 shifts (vs 16x8 shift) on libvpx performance and 2) asking for other workloads that might use these instructions in a performance-critical way.

emulating Shifts without documenting accepted patterns is very slow on ARM

The way I see it, looking at libvpx, there is an ARM implementation of the DSP using 8x16 shifts and an x86 implementation of the DSP using 16x8 shifts (e.g., PSRAW in x86 SSE2). If we were to drop the 8x16 shifts, the libvpx team could still compile the x86 version with its 16x8 shifts to Wasm and Wasm runtimes could still lower to ARM NEON in a single 16x8 VSHR. But if we retain the 8x16 shifts and an unsuspecting developer compiled the ARM version of the DSP to Wasm, Wasm runtimes would encounter a ~10 instruction penalty on x86 architectures. Why encode a penalty to an architecture into the spec? Right now I'm arguing we should avoid this, but I could be convinced this would an OK trade-off if the 8x16 approach had a performance advantage over the 16x8 approach over several workloads--hence my questions about other uses of the 8x16 shifts.

it would be nice to not bias, or encode the same inconsistencies in a portable-SIMD proposal

I agree. It would be nicer and more symmetrical if x86 had 8x16 shifts but, as @nfrechette mentions, perhaps their use is uncommon (understanding that that could be reinforced by the instructions just not being present in x86 😄). I'll set up a meeting to discuss further but hopefully this gives some perspective on where I'm coming from.

abrown commented 4 years ago

@arunetm mentioned that it might be a good idea to get a sense from the community how 8x16 shift instructions would be used. For context, I'm creating a poll within this issue; could you respond using the emoji reactions to this comment and tag other developers with opinions on 8x16 shift inclusion in Wasm SIMD? cc: @dtig, @Maratyszcza, @tlively, @penzn, @nfrechette, @zeux

If you consider 8x16 shifts to be uncommon instructions (e.g. absent in real-world applications, easily emulated by 16x8 shifts, etc.), vote with a 👀.
If you consider 8x16 shifts to be nice-to-have instructions (e.g. for symmetry between ISAs, ease-of-use when doing SIMD programming), vote with a 🎉.
If you consider 8x16 shifts to be performance-critical instructions (e.g. necessary for performance speed-ups in real-world applications), vote with a 🚀.

(I think it makes sense to be able to vote for multiple of these so feel free to react with multiple emojis).

abrown commented 4 years ago

To circle back to the "performance cliff" argument, here are some measurements: I created x86 assembly code for the v8 implementation of i8x16.shl (an 11 instruction lowering) and, having no x86 encoding for i8x16.shl, created x86 assembly code for i16x8.shl (a 1 instruction lowering). The entire gist of the code is available here.

On my machine, the emulated v8 implementation took ~0.112s to complete 10M iterations and the 16x8 implementation took ~0.023s, roughly a 5x speed-up. I am comparing an emulated 8x16 shift to a native 16x8 shift to make this point: a developer using 16x8 shifts would assume that in changing their code to use 8x16 shifts they would be sacrificing lane-width in order to gain more lanes, but unwittingly they would incurring this 5x slow-down in some form. If we include 8x16 shifts in the spec we hide this slow-down until it negatively affects runtimes on certain machines--this is what I meant by creating a "performance cliff" when adding instructions for ease-of-use.

arunetm commented 4 years ago

Thanks, @abrown for the numbers. Quick question, Is vp9 using i16x8.shl in the IA version of their kernel and i8x16.shl on ARM?

zeux commented 4 years ago

I feel like in almost all cases the shift amount is going to be a constant in practice. For that, the v8 lowering is much less efficient compared to the optimum lowering available for SSE - shifting right by 2 for example is:

.LCPI0_0:
        .zero   16,63
shr2(long long __vector(2)):                           # @shr2(long long __vector(2))
        psrlw   xmm0, 2
        pand    xmm0, xmmword ptr [rip + .LCPI0_0]
        ret

(I believe v8 has some issues with constant loads but TBH I think this is a large limitation that will have to be lifted for efficient SIMD codegen of many instructions, including but not limited to v128.const and dynamic shuffle)

So I'm not sure that the performance cliff has to be quite as steep.

dtig commented 4 years ago

Thanks @abrown for initiating offline conversations and adding numbers. @zeux makes a good point that the performance cliff in practice is not expected to be this large. I've mentioned this here, and in other issues - the current V8 implementation is a prototype implementation that is behind a flag for experimentation, and is in some cases not the most optimal codegen for every opcode. We don't have a nice way to load a constant for when the codegen is more complex, and have to emulate that, using immediate optimizations for constant shifts, using correct AVX codegen, avoiding the SSE-AVX performance cliff etc. These are all on our radar and will be fixed in the future, but our highest priority at this stage is to make sure that all operations have a prototype implementation to test with. Which is why it's great to have multiple independent implementations as all the engines don't have the same limitations.

zeux commented 4 years ago

These are all on our radar and will be fixed in the future, but our highest priority at this stage is to make sure that all operations have a prototype implementation to test with.

Yeah I fully understand that (although we're going to start hitting impedance mismatch pretty soon - I've encountered this with v128.const where llvm is very eager to use v128.const; Chrome doesn't support it atm at all but if it does using the currently available means, LLVM will assume that replacing code with v128.const is a great idea for performance and Chrome will use inefficient codegen).

My point was that while the general 8-bit shift by a variable is a bit painful to implement, for the common case (shift by a constant) it's just one extra instruction for x64. So it seems perfectly reasonable to include this instruction into the baseline instruction set - emulating this instruction in "userspace" requires replicating the code above (with 16-bit shift and a constant mask) which is going to be same speed for x64 and worse for ARM.

In fact, if v8 wanted to fix the performance even for variable shifts, it would have been enough to have a 128-byte lookup table with 8 masks, and the code becomes something like

and     edi, 7
movd    xmm1, edi
psrlw   xmm0, xmm1
shl     rdi, 4
pand    xmm0, xmmword ptr [rdi + masks]

Still not great, but ~2x shorter than the currently emitted sequence (not sure what the perf characteristics are).

arunetm commented 4 years ago

Agree that we have a great case for keeping byte shifts if these are useful and const shifts make the common case. @zeux do you know of any workload that relies on byte shifts for performance? They are very helpful in understanding and optimizing for performance and for considerations relating to future hw arch changes.

zeux commented 4 years ago

@arunetm I don't have great examples where byte shifts are critical - this is why I voted 👀 . But I also voted 🎉 because I think orthogonality and symmetry in a SIMD instruction set are important, and I don't see the codegen here as being particularly problematic.

FWIW my geometry decompression code (used in meshoptimizer) that tends to be somewhat unusual in the SIMD facilities it uses does use byte-shift on ARM and word-shift on Intel instead; my current WASM port only uses word-shift because it was ported from the Intel version. I don't think it would matter too much whether I used byte shift or not one way or the other if the codegen was perfect for constants as I suggested.

mingqiusun commented 4 years ago

@zeux what is your rational for using byte-shift on ARM? do you see a speedup over using word-shift on ARM? How much?

zeux commented 4 years ago

@mingqiusun For my workload, byte-shift is "intuitively" a correct operation - it's just a natural implementation of the algorithm. Word-shift is a workaround for the lack of byte-shifts. I don't think there will be a noticeable performance difference on ARM between me using a byte-shift and a word-shift in that particular case since I need to do a bitwise and with a small constant after a few shifts are done. So the reason why I used a byte shift on ARM is that I was implementing the code for NEON from scratch (not porting from SSE), and byte shift was a more obviously correct approach. As I wrote in the previous comment, I don't think it would matter too much.

FWIW the exact code example on ARM (unpacking a sequence of 16 4-bit values) looks like this:

uint8x8x2_t sel44 = vzip_u8(vshr_n_u8(sel4, 4), vand_u8(sel4, vdup_n_u8(15)));
uint8x16_t sel = vcombine_u8(sel44.val[0], sel44.val[1]);

Whereas equivalent SSE code looks like this:

__m128i sel44 = _mm_unpacklo_epi8(_mm_srli_epi16(sel4, 4), sel4);
__m128i sel = _mm_and_si128(sel44, _mm_set1_epi8(15));

If I used word shifts on ARM I'd need to move the and after combine which maybe would have increased the latency and resulted in a slightly less optimal code but I'm not sure that's the case.

abrown commented 4 years ago

@zeux, thanks for the alternate codegen sequences. The point you make about constant shift amounts sounds intuitively right to me: I see a lot of code like a << 2 around. Thinking about Wasm (and cranelift specifically), I think we might need an optimization to track that values pushed to the stack are constant so that we can do what you describe, but this doesn't seem too hard to do (note to self: see if this has already been added to cranelift).

The thing I worry more about is the unintended consequence of a cache miss. I implemented the sequences you described, one for a constant shift amount (using a mask) and another for a dynamic shift amount (again using a mask, though I didn't generate the whole table), and observed that, for 10M iterations, the constant shift was only ~1.98x slower and the dynamic shift only ~2.33x slower than a 16x8 shift. This was better than the V8 sequences, but what if the mask was not in the cache? I added a CLFLUSH after this line and the constant shift then took ~205x longer than a 16x8 shift (that's ~103x slower than without the CLFLUSH, which seems inline with the cost of a miss on my machine). My assumption here is that CLFLUSH is not doing a write back to memory since that cache line is not dirty and this is just forcing the shift operation you described to do a memory read.

This worst-case scenario seems very bad performance-wise--how often would this worst-case happen? In cranelift at least those constants might be stored in the instruction cache because of how functions are emitted (constants are emitted after the function body) but that doesn't help my data cache where I would be looking for the mask. I suspect other runtimes may have similar issues.

zeux commented 4 years ago

@abrown Yeah - this is a valid concern. A couple of thoughts on this:

I$/D$ split is only at L1 level, so if you store the constants inline, and we assume the code section is largely cached in L2/L3, you're only looking at an L1 miss. The same risk applies to executing code otherwise - when you call a function, the function might not be in I$ (L1 miss) and the code page might not be in L2/L3 worst case. Your first execution pass will be expensive.

For constant shifts, the codegen I'm proposing just needs a single 16b lane per unique shift offset, I'd expect typical code to use 2-3 unique shift offsets. The full table is 128 bytes which is 2 cache lines. So if the code is hit semi-frequently, it seems likely that it will stay in cache, regardless of where the data is (next to code or separately).

This in general is very common in native codegen - compiler always extracts a lot of data to memory tables of various sort. Definitely in general if you can avoid loads from memory it helps for this reason and others, but I don't think avoiding data loads completely is a viable codegen strategy.

Finally, what I'm proposing isn't fundamentally different from v128.const. In theory, v128.const has the same caveat - it implies a load from some memory location, so the cost imbalance is very high - but I don't think we should be too concerned with this, at least before we get proof that this is a problem.

And as a side note, I think you will need to track constant-ness for other shifts as well - SSE has word shifts by an immediate value or by a vector, and always synthesizing vector shifts takes a few extra instructions.

abrown commented 4 years ago

@zeux, thanks for the detailed reply and apologies for the delayed response:

we assume the code section is largely cached in L2/L3, you're only looking at an L1 miss [...] Your first execution pass will be expensive.

This makes sense to me, especially in the Cranelift case. I guess the reason I have delayed a response is because I go back and forth on this a bit:

on the one hand, the V8 10-instruction code sequence seems bad, but it is predictably bad
on the other hand, the sequences you wrote out above (thanks again, BTW) range from a bit better than V8 (~4 L1 cache hit + ~2 inst. cycles) in hot loops to a bit worse most of the time (~12 L2 cache hit + ~ 2 inst. cycles) to occasionally very, very bad (100+ cycles to access memory); I used Skylake latencies for my example cycle counts.

Perhaps the CPU can soak up some of cache lookup cycles by doing other work but it is the "perhaps" that doesn't jive with my understanding of Wasm's predictable performance. Either option doesn't seem particularly good for x86 users,

what I'm proposing isn't fundamentally different from v128.const

Yes, and I have wondered if for larger lane sizes, e.g., it might not be faster on average to use pairs of mov imm + pinsr to generate the constants. But this clearly has code size and especially complexity implications. I think those make me lean toward your approach for now.

nfrechette commented 4 years ago

I agree with @zeux that loading constants with fewer instructions is generally the way to go. It improves the code density and by reducing the number of instructions and registers used, it increases the likelihood that a function will be inlined.

In the best case scenario, it should be faster, especially it it allows a function to inline that otherwise wouldn't. Even inlining aside, an L1 hit is fast and L3 isn't too bad as the CPU might very well be able to hide part of that latency with the surrounding instructions.

In the worst case scenario, sure, it might be considerably slower if we have a cache miss and no other work can be done to hide the latency. But I would think that this is very rare. Cache misses are common and here again, if a preceeding cache miss is hit a bit before our constant load, its latency will be largely hidden by the preceding cache miss (and perhaps entirely free).

Native code generation generally does lookup tables like this because in many scenarios with out of order cpus, it ends up being faster overall. Is it always optimal? Perhaps not but that isn't something the compiler can determine without runtime statistics/instrumentation. The compiler has to pick the most likely optimal generation sequence while ensuring that performance degrades gracefully if it proves to not be optimal. Here, a single cache miss that is likely hidden by whatever data is being worked on before it hits is a sensible choice.

If you look at it in a vacuum, the potential cache miss sounds like a bad idea but once its in real code, it isn't likely to negatively impact performance and much more likely to be more efficient.

zeux commented 4 years ago

on the one hand, the V8 10-instruction code sequence seems bad, but it is predictably bad on the other hand, the sequences you wrote out above [are...] occasionally very, very bad (100+ cycles to access memory)

I think it's incorrect to assume that 10-instruction code sequence is predictably bad - it will hit the memory latency if the sequence is cold, just as the sequence that loads constants, especially if the constant block is laid out before the function in question. It's the same really, the only difference is that you need 1 L2->L1 load for code (L2 -> I$) and 2 L2->L1 loads for code+data (L2 -> I$, D$). But this means you're going to pay an extra L2 latency cost which is 12 cycles, not 100+.

abrown commented 4 years ago

I think it's incorrect to assume that 10-instruction code sequence is predictably bad

I understand what your point about cache latency but my comparison for "bad" was against an ideal 1-instruction lowering with no memory access. When I compare our sequences against that, they certainly aren't "good."

zeux commented 4 years ago

Speaking of shifts by constants, I ran into an issue with v8 lowering non-byte shifts, which is very naive right now. I filed https://bugs.chromium.org/p/v8/issues/detail?id=10115 for this. I am assuming this can be fixed on the v8 side? In one of the loops I was tuning, I have 5 32-bit shifts by a constant; as a result, v8 generates 20 instructions to manage the shifts instead of 5 - my loop is ~75 instructions long so I am assuming that fixing this could improve the performance very noticeably.

If that's not the case, is it worth discussing an extra series of shift instructions with immediate operands? It's a large increase of the instruction space (we'll need 12 additional instructions), not sure if it's worthwhile if the engines can perform constant analysis.

abrown commented 4 years ago

I think it should be possible to determine that the shifts are by constant value so the extra shift + immediate instructions may not be necessary. I think the code you are referring to is https://github.com/v8/v8/blob/master/src/compiler/backend/x64/code-generator-x64.cc#L2983-L3000 (is it?) and has extra instructions due to having to check the shift value. I'm opened an issue to do the same type of improvement in cranelift: https://github.com/bytecodealliance/cranelift/issues/1341.

ngzhian commented 4 years ago

I commented on https://bugs.chromium.org/p/v8/issues/detail?id=10115, we are aware of this, and will fix this in future changes. Also see Deepti's comments in https://github.com/WebAssembly/simd/issues/117#issuecomment-562046318 Thanks for highlighting this!

zeux commented 4 years ago

Since this is fixed in v8 I think we can close this?

abrown commented 4 years ago

I've been saying in the other inefficiency issues that we should document these before closing. @dtig brought up that someone from Intel was going to look into creating such a document and I pinged @arunetm about it; I think we're going to discuss how things could look (taking into account your good work in your personal repo) and propose something.

WebAssembly / simd

Inefficient x64 codegen for 8x16 shifts #117