If it's possible, it would be nice to get a better understanding of what implementation challenges V8 is experiencing in implementing v128.const. If it worked efficiently rather than through workarounds, it would be one of the most significant performance improvements in the WebAssembly SIMD implementation.
It would also be nice to discuss feature detection (#356) and, perhaps as a side effect, the frequency of releases and versions. That might reduce the burden on #343, for instance by setting quarterly releases with semantic versioning, and allow the standard to evolve continuously.
Unfortunately it isn't possible for us to commit to quarterly releases (or any schedule in particular) because we are beholden to the WebAssembly standardization process. We also discussed feature detection at the last meeting, but I'd be happy to discuss it again if we have any new developments there (and hopefully we will).
> Unfortunately it isn't possible for us to commit to quarterly releases (or any schedule in particular) because we are beholden to the WebAssembly standardization process.
Usually there's a way that meets the working group requirements for standardization and still allows you to implement new features with "versioning". In the IETF, we regularly release versioned drafts so that someone can say they implement the xyz standard at the draft revision dated yyyymmdd. I can check what the standard practice is for the W3C if you'd like; this is generally a really common arrangement.
If possible, let's visit the items we were not able to get to at the previous meeting before revisiting feature detection 😄
I'd like to propose discussing our use case criteria: https://github.com/WebAssembly/simd/issues/203#issuecomment-706456418
Re: v128.const - I don't see an observable difference in my benchmark, but this is for somewhat subtle reasons:
Given C++ code that uses something like wasm_i32x4_splat(127) in a loop, I see the following codegen behavior (latest v8):
pre-v128.const: LLVM synthesizes i32.const & i32x4.splat, and v8 generates this:
```
000000E04B30303E  1e  b90000803f    movl rcx,000000003F800000
000000E04B303043  23  c5f96ec1      vmovd xmm0,rcx
000000E04B303047  27  c5f970c000    vpshufd xmm0,xmm0,0x0
```
post-v128.const: LLVM synthesizes v128.const with 4 identical lanes, and v8 generates this:
```
00000262FDC02FBE  1e  49ba0000803f0000803f  REX.W movq r10,3F8000003F800000
00000262FDC02FC8  28  c4c1f96ec2            vmovq xmm0,r10
00000262FDC02FCD  2d  4c8b15ecffffff        REX.W movq r10,[rip+0xffffffec]
00000262FDC02FD4  34  c4c3f922c201          vpinsrq xmm0,xmm0,r10,0x1
```
The second sequence takes 3 cycles whereas the first takes 2, so when running in the context of a tight loop (which my loop happens to be), I'd expect to see a measurable performance delta.
However, in both cases v8 actually lifts the computation outside of the loop... So the difference is nil, as the code above executes once in the loop prologue.
Manually transplanting the computation back into the loop body for the purposes of profiling with llvm-mc shows 4.68 cycles per iteration without v128.const and 4.95 cycles per iteration with v128.const. Some of the cost is hidden by other instructions, but the total impact would have been a ~6% degradation in throughput. However, because v8 lifts this outside of the loop, the impact isn't observable on my kernels.
I'm still concerned about the potential to lose performance here when v8 doesn't figure it out, but just wanted to close the loop (ha!) here.
llvm-mc study for my kernel (which runs at circa 12 GB/s): https://gcc.godbolt.org/z/YYWf1d
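For concreteness, here is a minimal sketch (my own illustration, not the actual benchmark kernel linked above; the clamp body and names are made up) of the shape of loop being discussed, using the wasm_simd128.h intrinsics:

```cpp
#include <stddef.h>
#include <stdint.h>
#include <wasm_simd128.h>

// Illustrative only: the splatted constant is needed on every iteration, so
// whether the engine rematerializes it inside the loop body or hoists it into
// the loop prologue directly affects throughput. Assumes n is a multiple of 4.
void clamp_to_127(int32_t* data, size_t n) {
  // Lowered by LLVM either to i32.const + i32x4.splat or to a v128.const
  // with four identical lanes, depending on the toolchain version.
  const v128_t limit = wasm_i32x4_splat(127);
  for (size_t i = 0; i < n; i += 4) {
    v128_t v = wasm_v128_load(&data[i]);
    wasm_v128_store(&data[i], wasm_i32x4_min(v, limit));
  }
}
```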
@zeux @tlively Here's a godbolt that shows how to force load a constant from memory. https://godbolt.org/z/d1rvfv
LLVM doesn't always get the cost modeling right (I've seen it decide to reload the constant a couple of times inside a loop), but it is possible to get this behavior if you know you want it to work that way.
Yeah, volatile is often a reasonable workaround for codegen issues, although you have to apply it carefully to avoid extra loads. I'm not an expert on C's volatile semantics, but I believe the compiler may be required to keep duplicate loads from a volatile rather than eliminate them.
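As a rough sketch of that volatile trick (my own illustration, not the code in the godbolt links above; the identifiers are made up), routing the constant through a volatile keeps LLVM from folding it into a v128.const, while hoisting the splat out of the loop by hand avoids repeated volatile reads:

```cpp
#include <stddef.h>
#include <wasm_simd128.h>

// Hypothetical example: because kScale is volatile, the compiler has to load
// its value from memory instead of folding it into an immediate v128.const.
static volatile float kScale = 0.5f;

// Assumes n is a multiple of 4.
void scale_in_place(float* data, size_t n) {
  // Read the volatile once, outside the loop, then splat it; a volatile read
  // inside the loop would force a reload on every iteration.
  const v128_t scale = wasm_f32x4_splat(kScale);
  for (size_t i = 0; i < n; i += 4) {
    v128_t v = wasm_v128_load(&data[i]);
    wasm_v128_store(&data[i], wasm_f32x4_mul(v, scale));
  }
}
```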
@ngzhian So I understand that RIP-relative loads require more work on the v8 side that may not be trivial, but would it be possible to, as part of v128.const codegen, identify cases where:
If v8 did this, it would at least equalize the performance and size of generated code between the cases where LLVM decides to emit v128.const and where it decides to synthesize it. It would still result in suboptimal lowering in some cases vs. a RIP-relative load, but this would be better than the status quo and hopefully easy to implement?
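Purely as an illustration of the kind of check being suggested (this is not V8 code; the helper name and signature are invented), the instruction selector could test whether a v128.const immediate has four identical 32-bit lanes and, if so, fall back to the shorter splat-style sequence:

```cpp
#include <stdint.h>
#include <string.h>

// Hypothetical helper (not V8 code): returns true if all four 32-bit lanes of
// a 16-byte v128.const immediate are identical. In that case a backend could
// emit the shorter scalar-move + vmovd + vpshufd splat sequence instead of
// building the constant with two 64-bit moves and vpinsrq.
bool AllI32LanesEqual(const uint8_t bytes[16], uint32_t* lane_out) {
  uint32_t lanes[4];
  memcpy(lanes, bytes, sizeof(lanes));
  if (lanes[0] == lanes[1] && lanes[1] == lanes[2] && lanes[2] == lanes[3]) {
    *lane_out = lanes[0];
    return true;
  }
  return false;
}
```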
> However, in both cases v8 actually lifts the computation outside of the loop... So the difference is nil, as the code above executes once in the loop prologue.
Good to know, I am not that familiar with the actual optimizations happening in the TurboFan engine :) (just checking: is v8 doing the lifting, or is it emscripten/binaryen?)
> So I understand that RIP-relative loads require more work on the v8 side that may not be trivial
I agree this will be useful and hope that we can dedicate time to properly work this out. Thank you for understanding 👍
> hopefully easy to implement?
Your suggestions are very reasonable; this sounds a lot like the shuffle matching we already did :) I have https://crbug.com/v8/10980 tracking loading constants from memory, and I have also filed https://crbug.com/v8/11033 to track your suggestion. Thanks!
We're moving to a biweekly schedule for these syncs, so the next meeting will be Friday, October 16, 9:00AM - 10:00AM PDT / 6:00PM - 7:00PM CEST. Please respond with agenda items you would like to discuss.
If this meeting doesn't already appear on your calendar, or you are a new attendee, please fill out this form to attend.
Carryover items from the last meeting include: