If it's possible, it would be nice to get a better understanding of what implementation challenges V8 is experiencing in implementing v128.const. If it worked efficiently rather than through workarounds, it would be one of the most significant performance improvements in the WebAssembly SIMD implementation.
It would also be nice to discuss feature detection (#356) and, perhaps as a side effect, the frequency of releases and versions. That might reduce the burden on #343, for instance by setting quarterly releases with semantic versioning, and allow the standard to evolve continuously.
Unfortunately it isn't possible for us to commit to quarterly releases (or any schedule in particular) because we are beholden to the WebAssembly standardization process. We also discussed feature detection at the last meeting, but I'd be happy to discuss it again if we have any new developments there (and hopefully we will).
> Unfortunately it isn't possible for us to commit to quarterly releases (or any schedule in particular) because we are beholden to the WebAssembly standardization process.
Usually there's a way that meets the working group requirements for standardization and still allows you to implement new features with "versioning". In the IETF, we regularly release versioned drafts so that someone can say they implement the xyz standard at the draft revision dated yyyymmdd. I can check what the standard practice is for the W3C if you'd like; this is generally a really common arrangement.
If possible, let's visit the items we were not able to get to at the previous meeting before revisiting feature detection 😄
I'd like to propose discussing our use case criteria: https://github.com/WebAssembly/simd/issues/203#issuecomment-706456418
Re: v128.const - I don't see an observable difference in my benchmark, but this is for somewhat subtle reasons:
Given C++ code that uses something like wasm_i32x4_splat(127) in a loop, I see the following codegen behavior (latest v8):
pre-v128.const: LLVM synthesizes i32.const & i32x4.splat, and v8 generates this:
```
000000E04B30303E  1e  b90000803f    movl rcx,000000003F800000
000000E04B303043  23  c5f96ec1      vmovd xmm0,rcx
000000E04B303047  27  c5f970c000    vpshufd xmm0,xmm0,0x0
```
post-v128.const: LLVM synthesizes v128.const with 4 identical lanes, and v8 generates this:
```
00000262FDC02FBE  1e  49ba0000803f0000803f  REX.W movq r10,3F8000003F800000
00000262FDC02FC8  28  c4c1f96ec2            vmovq xmm0,r10
00000262FDC02FCD  2d  4c8b15ecffffff        REX.W movq r10,[rip+0xffffffec]
00000262FDC02FD4  34  c4c3f922c201          vpinsrq xmm0,xmm0,r10,0x1
```
The second sequence takes 3 cycles whereas the first takes 2, so when running in the context of a tight loop (which my loop happens to be), I'd expect to see a measurable performance delta.
However, in both cases v8 actually lifts the computation outside of the loop... So the difference is nil, as the code above executes once in the loop prologue.
Manually transplanting the computation back into the loop body for the purposes of profiling with llvm-mc shows 4.68 cycles per iteration without v128.const and 4.95 cycles per iteration with v128.const. Some of the cost is hidden by other instructions, but the total impact would have been a ~6% degradation in throughput. However, because v8 lifts this outside of the loop, the impact isn't observable on my kernels.
I'm still concerned about the potential to lose performance here when v8 doesn't figure it out, but just wanted to close the loop (ha!) here.
llvm-mc study for my kernel (which runs at circa 12 GB/s): https://gcc.godbolt.org/z/YYWf1d
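For concreteness, here is a minimal sketch (my own illustration, not the actual benchmark kernel linked above; the clamp body and names are made up) of the shape of loop being discussed, using the wasm_simd128.h intrinsics:

```cpp
#include <stddef.h>
#include <stdint.h>
#include <wasm_simd128.h>

// Illustrative only: the splatted constant is needed on every iteration, so
// whether the engine rematerializes it inside the loop body or hoists it into
// the loop prologue directly affects throughput. Assumes n is a multiple of 4.
void clamp_to_127(int32_t* data, size_t n) {
  // Lowered by LLVM either to i32.const + i32x4.splat or to a v128.const
  // with four identical lanes, depending on the toolchain version.
  const v128_t limit = wasm_i32x4_splat(127);
  for (size_t i = 0; i < n; i += 4) {
    v128_t v = wasm_v128_load(&data[i]);
    wasm_v128_store(&data[i], wasm_i32x4_min(v, limit));
  }
}
```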
@zeux @tlively Here's a godbolt that shows how to force load a constant from memory. https://godbolt.org/z/d1rvfv
LLVM doesn't always get the cost modeling right (I've seen it decide to reload the constant a couple of times inside a loop), but it is possible to get this behavior if you know you want it to work that way.
Yeah, volatile is often a reasonable workaround for codegen issues, although you have to apply it carefully to avoid extra loads. I'm not an expert on C's volatile semantics, but I believe the compiler may be required to keep duplicate loads from a volatile rather than eliminate them.
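As a rough sketch of that volatile trick (my own illustration, not the code in the godbolt links above; the identifiers are made up), routing the constant through a volatile keeps LLVM from folding it into a v128.const, while hoisting the splat out of the loop by hand avoids repeated volatile reads:

```cpp
#include <stddef.h>
#include <wasm_simd128.h>

// Hypothetical example: because kScale is volatile, the compiler has to load
// its value from memory instead of folding it into an immediate v128.const.
static volatile float kScale = 0.5f;

// Assumes n is a multiple of 4.
void scale_in_place(float* data, size_t n) {
  // Read the volatile once, outside the loop, then splat it; a volatile read
  // inside the loop would force a reload on every iteration.
  const v128_t scale = wasm_f32x4_splat(kScale);
  for (size_t i = 0; i < n; i += 4) {
    v128_t v = wasm_v128_load(&data[i]);
    wasm_v128_store(&data[i], wasm_f32x4_mul(v, scale));
  }
}
```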
@ngzhian So I understand that RIP-relative loads require more work on the v8 side that may not be trivial, but would it be possible to, as part of v128.const codegen, identify cases where:
If v8 did this, it would at least equalize the performance and size of generated code between the cases where LLVM decides to emit v128.const and where it decides to synthesize it. It would still result in suboptimal lowering in some cases vs. a RIP-relative load, but this would be better than the status quo and hopefully easy to implement?
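Purely as an illustration of the kind of check being suggested (this is not V8 code; the helper name and signature are invented), the instruction selector could test whether a v128.const immediate has four identical 32-bit lanes and, if so, fall back to the shorter splat-style sequence:

```cpp
#include <stdint.h>
#include <string.h>

// Hypothetical helper (not V8 code): returns true if all four 32-bit lanes of
// a 16-byte v128.const immediate are identical. In that case a backend could
// emit the shorter scalar-move + vmovd + vpshufd splat sequence instead of
// building the constant with two 64-bit moves and vpinsrq.
bool AllI32LanesEqual(const uint8_t bytes[16], uint32_t* lane_out) {
  uint32_t lanes[4];
  memcpy(lanes, bytes, sizeof(lanes));
  if (lanes[0] == lanes[1] && lanes[1] == lanes[2] && lanes[2] == lanes[3]) {
    *lane_out = lanes[0];
    return true;
  }
  return false;
}
```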
> However, in both cases v8 actually lifts the computation outside of the loop... So the difference is nil, as the code above executes once in the loop prologue.
Good to know, I am not that familiar with the actual optimizations happening in the TurboFan engine :) (just checking: is v8 doing the lifting, or is it emscripten/binaryen?)
> So I understand that RIP-relative loads require more work on the v8 side that may not be trivial
I agree this will be useful and hope that we can dedicate time to properly work this out. Thank you for understanding 👍
> hopefully easy to implement?
Your suggestions are very reasonable; this sounds a lot like the shuffle matching we already did :) I have https://crbug.com/v8/10980 tracking loading constants from memory, and I have also filed https://crbug.com/v8/11033 to track your suggestion. Thanks!
We're moving to a biweekly schedule for these syncs, so the next meeting will be Friday, October 16, 9:00AM - 10:00AM PDT / 6:00PM - 7:00PM CEST. Please respond with agenda items you would like to discuss.
If this meeting doesn't already appear on your calendar, or you are a new attendee, please fill out this form to attend.
Carryover items from the last meeting include: