WebAssembly / design

WebAssembly Design Documents
http://webassembly.org
Apache License 2.0

Floating-point rounding mode control prototyping #1456

Open KloudKoder opened 2 years ago

KloudKoder commented 2 years ago

This is the issue for testing out approaches to floating-point rounding mode control, which is required for performant interval arithmetic. The original informal discussion is found here.

For an understanding of the practical ramifications of interval arithmetic including some use cases, I suggest this YouTube presentation by G. Bill Walster.

For starters, Thomas Lively identified the Boost library, boost::interval. An example of its rounding control approach is here.

Feel free to add references to practical use cases or libraries related to intervals. The ultimate output of this issue, hopefully, will be a performance test. Conrad Watt has suggested a switch-based approach (for testing, not actual implementation) for every floating-point instruction, wherein the subject of the switch is a rounding control (RC) global variable (round to nearest with ties to even, round toward negative infinity, round toward positive infinity, or chop toward zero). His hypothesis is that the Wasm compiler will eliminate many of the switch statements because the rounding mode (RM) at any given point is usually known at compile time. This might at least allow us to do some crude benchmarking.
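The switch-based idea can be modeled in a few lines. What follows is only a rough sketch, not the proposed implementation: Python stands in for generated Wasm, the names `rc`, `set_rc`, and `f64_add` are hypothetical, and directed rounding is emulated with exact rational arithmetic rather than a hardware rounding mode. It assumes finite inputs and finite results, and it does not demonstrate the compiler eliminating any switches.

```python
import math
from fractions import Fraction

# Hypothetical encodings of the four rounding-control (RC) values.
NEAREST, FLOOR, CEIL, CHOP = range(4)

rc = NEAREST  # the virtual RC "global variable"


def set_rc(mode: int) -> None:
    """Write the virtual RC global."""
    global rc
    rc = mode


def _round_up(nearest: float, exact: Fraction) -> float:
    # Smallest double >= exact, given the round-to-nearest result.
    return nearest if Fraction(nearest) >= exact else math.nextafter(nearest, math.inf)


def _round_down(nearest: float, exact: Fraction) -> float:
    # Largest double <= exact, given the round-to-nearest result.
    return nearest if Fraction(nearest) <= exact else math.nextafter(nearest, -math.inf)


def f64_add(a: float, b: float) -> float:
    """One floating-point add guarded by a switch on the RC global."""
    nearest = a + b                    # ordinary round-to-nearest add
    exact = Fraction(a) + Fraction(b)  # exact rational sum of the two doubles
    if rc == NEAREST:
        return nearest
    elif rc == FLOOR:
        return _round_down(nearest, exact)
    elif rc == CEIL:
        return _round_up(nearest, exact)
    else:  # CHOP: round toward zero
        return _round_down(nearest, exact) if exact > 0 else _round_up(nearest, exact)
```

With `set_rc(FLOOR)`, `f64_add(0.1, 0.2)` returns the double `0.3`, while the default mode returns `0.30000000000000004`. A real prototype would replace the emulation with native instructions under the corresponding hardware rounding mode.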

A more painful but performant approach would be to have the same virtualized RC register, but to reorder instructions where possible in order to minimize the frequency of touching the actual hardware RC. This would also compile seamlessly in architectures in which the RM is embedded in the floating-point instruction opcode itself (in which case the reordering would be redundant).
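To make the reordering idea concrete, here is a toy count of hardware RC writes before and after grouping operations by rounding mode. It is a sketch under a strong assumption: every operation in the list is independent of the others, so any reordering is legal. Real code has data dependencies that restrict what can move. The function name and the mode strings are illustrative only.

```python
def count_rc_writes(modes, initial="nearest"):
    """Count how often the hardware RC must change to run ops in this order."""
    writes = 0
    current = initial
    for mode in modes:
        if mode != current:
            writes += 1
            current = mode
    return writes


# Rounding mode required by each of six (assumed independent) operations,
# in program order, as interval code might request them.
program_order = ["floor", "ceil", "floor", "ceil", "floor", "ceil"]

# Grouping by mode is legal here only because of the independence assumption.
grouped = sorted(program_order)

print(count_rc_writes(program_order))  # 6
print(count_rc_writes(grouped))        # 2
```

On an architecture where the RM is part of the instruction encoding, both orders would cost the same, which is why the reordering would be redundant there.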

At this point, we're only concerned with RISC, but as Deepti Gandluri pointed out, any mature implementation would need to address the same issues with SIMD, wherein each parallel operand adheres to a common but custom RM.

@conrad-watt @tlively @jakobkummerow @dschuff @dtig

KloudKoder commented 10 months ago

@mglisse @Chris00 Thanks for your feedback. So it looks like ceil should be the favored way.

"The one thing that would suck would be for 2 compilers or browsers to optimize only one direction but not the same one."

Indeed. whirlicote was concerned about that as well. He also plans to include some reference code in public notes to the spec, in order to ensure that everyone is optimizing for the same expected compiler behavior, which I think will further serve to avoid your feared outcome.

For updates on the proposal's progress, see:

https://github.com/WebAssembly/rounding-mode-control/issues/2

jakobkummerow commented 10 months ago

I'm not sure I fully understand the implications of the last handful of messages, but if you're saying that you're adding (among others) f32.sqrt_ceil and f32.sqrt_floor such that only the _ceil version will be fast and everyone will be expected to use that (with user-space tricks to map any desired rounding behavior onto this one instruction), and the _floor version only exists for completeness but nobody should use it because it will be slow, then that seems like a very suboptimal outcome, because it would be a very surprising performance pitfall. So in that case, I'd postulate that the _floor version shouldn't exist at all, to prevent folks from accidentally using it.

Taking a step back, this would be an example of a very important effect of prototyping: if it turns out that a design that looked good on paper is either not feasible (e.g., because changing the control mode very frequently is too expensive on actual hardware) or not necessary in practice, then this feedback should definitely be used to make changes to the original design.

(If I'm misunderstanding what you're saying, then never mind and carry on.)

KloudKoder commented 10 months ago

@jakobkummerow It's my fault for not explaining better. What we're proposing is that add, subtract, multiply, and divide will not be optimizable for floor because their floor flavors will be implemented as negate+ceil+negate. In practice, this means that interval arithmetic will be fast because we avoid RM switching with these common operations. sqrt (and any future transcendental functions) will not be affected because there is no economical trick to pull; we will have to pay the full cost of RM switching if we want to change from ceil to floor and back. In the long term, there could be another proposal to reorder the code so as to elide as much of that switching as possible. Or, perhaps more realistically, the CPU market share pie will evolve in favor of those chips which feature per-instruction RM decorations (such as NVIDIA or RISC-V) rather than this global RM foolishness.
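The negate+ceil+negate trick rests on the identity floor(x) = -ceil(-x), applied to the exact result of the operation. A minimal sketch follows, with Python standing in for engine code and the ceil flavor emulated through exact rational arithmetic. The function names are hypothetical, and finite inputs and results are assumed.

```python
import math
from fractions import Fraction


def _round_up(nearest: float, exact: Fraction) -> float:
    # Smallest double >= exact, given the round-to-nearest result.
    return nearest if Fraction(nearest) >= exact else math.nextafter(nearest, math.inf)


def add_ceil(a: float, b: float) -> float:
    """a + b rounded toward +infinity (emulated)."""
    return _round_up(a + b, Fraction(a) + Fraction(b))


def mul_ceil(a: float, b: float) -> float:
    """a * b rounded toward +infinity (emulated)."""
    return _round_up(a * b, Fraction(a) * Fraction(b))


def add_floor(a: float, b: float) -> float:
    """Floor flavor of add built only from negation and add_ceil."""
    return -add_ceil(-a, -b)


def mul_floor(a: float, b: float) -> float:
    """Floor flavor of mul: negating one operand negates the product."""
    return -mul_ceil(-a, b)
```

For example, `add_floor(0.1, 0.2)` gives `0.3` and `add_ceil(0.1, 0.2)` gives `0.30000000000000004`, and neither call needs a rounding mode other than ceil. No such identity is available for sqrt, since sqrt(-x) is not a real number for positive x.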

jakobkummerow commented 10 months ago

I was just using sqrt as an example. My larger point still stands: if f32.add_floor will be far slower than f32.add_ceil and nobody is supposed/expected to use it, then it probably shouldn't exist in the first place.

KloudKoder commented 10 months ago

@jakobkummerow OK I see your point. So while there won't be any significant performance difference with sqrt (or future transcendentals), there will in fact be superior performance with basic arithmetic using ceil as opposed to floor (when you use appropriate hints to optimize for ceil). There will be no equivalent hint available for optimizing for floor. However, and I know this sounds counterintuitive, floor will be as fast as it can possibly be if you optimize for ceil. This is because, if we got rid of basic arithmetic with floor, then each developer would need to implement negate+ceil+negate at the source code level, resulting in performance that's at least as poor, combined with lots more verbosity and thus opportunity for error; or else drop the optimization entirely and let us change and restore the global RM on every RM-sensitive instruction, which would be way worse still.

You might rightly wonder why floor and ceil should not have symmetric performance. The reason is that the only way to enforce that symmetry is to change the RM to ceil, then back to floor, then back to ceil, etc. all over the place; whereas, it's actually faster just to choose one of them and stick with it. In other words, the negates end up being way cheaper than all the would-be RM switching. That's likely to be the case even if we use heuristics and instruction reordering in order to minimize the number of switches required, simply because the most common use case (interval arithmetic) needs both of them for each and every basic operation.
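As an illustration of why interval arithmetic needs both directions on every basic operation, here is a sketch of interval addition and subtraction that stays in the ceil flavor throughout. Intervals are plain `(lo, hi)` tuples, `add_ceil` is emulated with exact rational arithmetic, and all names are assumptions made for this example. Finite inputs and results are assumed.

```python
import math
from fractions import Fraction


def add_ceil(a: float, b: float) -> float:
    """a + b rounded toward +infinity (emulated)."""
    nearest = a + b
    exact = Fraction(a) + Fraction(b)
    return nearest if Fraction(nearest) >= exact else math.nextafter(nearest, math.inf)


def interval_add(x, y):
    """[x_lo, x_hi] + [y_lo, y_hi] with outward rounding."""
    lo = -add_ceil(-x[0], -y[0])  # lower bound: floor via negate + ceil + negate
    hi = add_ceil(x[1], y[1])     # upper bound: plain ceil
    return (lo, hi)


def interval_sub(x, y):
    """[x_lo - y_hi, x_hi - y_lo] with outward rounding."""
    lo = -add_ceil(-x[0], y[1])   # -( -x_lo + y_hi ) rounds x_lo - y_hi downward
    hi = add_ceil(x[1], -y[0])    # x_hi - y_lo rounded upward
    return (lo, hi)
```

Each operation computes one bound per direction, so a version that toggled a global RM would pay for two mode changes per operation, whereas this version pays only for sign flips.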

jakobkummerow commented 10 months ago

"each developer would need to implement negate+ceil+negate at the source code level"

Keep in mind that Wasm modules are toolchain-generated, not handwritten. It is perfectly acceptable that toolchains have to do some work.

IIUC you're suggesting that engines should implement add_floor as negate+add_ceil+negate. Which creates the problems mentioned above: all engines must do this the same way, and toolchain authors must be aware that that's what engines do -- it would be really unfortunate if some toolchain decided to express add_ceil as negate + add_floor + negate, in a well-meaning attempt to avoid costly RM switches on naive engines. This reinforces my thinking that a more robust way to ensure consistency across the ecosystem is to not have add_floor, thereby forcing toolchains to emit negate+ceil+negate when the original source code requests flooring, and hence making sure everyone is on the same fast path. As an added benefit, AOT optimizers in toolchains can afford to do more costly analysis (to e.g. reduce the number of required negations) than JIT compilers in engines.
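A toolchain-side lowering of this kind might look roughly like the sketch below. It uses a made-up tuple-based expression tree, not any real compiler IR: `add_floor` nodes are rewritten to negate + `add_ceil` + negate, and a peephole pass then cancels back-to-back negations, which is safe because floating-point negation only flips the sign bit. The operator names mirror the hypothetical instructions discussed in this thread.

```python
def lower(expr):
    """Rewrite every add_floor node as neg(add_ceil(neg x, neg y))."""
    if isinstance(expr, str):  # leaf: a variable name
        return expr
    op, *args = expr
    args = [lower(a) for a in args]
    if op == "add_floor":
        return ("neg", ("add_ceil", ("neg", args[0]), ("neg", args[1])))
    return (op, *args)


def simplify(expr):
    """Cancel neg(neg(x)) pairs, working bottom-up."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [simplify(a) for a in args]
    if op == "neg" and isinstance(args[0], tuple) and args[0][0] == "neg":
        return args[0][1]
    return (op, *args)


def count_neg(expr):
    """Number of neg nodes in the tree."""
    if isinstance(expr, str):
        return 0
    return (expr[0] == "neg") + sum(count_neg(a) for a in expr[1:])


# (a +floor b) +floor c, as a source program might request it.
source = ("add_floor", ("add_floor", "a", "b"), "c")
lowered = lower(source)
optimized = simplify(lowered)
print(count_neg(lowered), count_neg(optimized))  # 6 4
```

In this small chain the pass removes two of the six negations. A more thorough AOT analysis could go further, for instance by keeping lower bounds in negated form across a whole computation.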

whirlicote commented 10 months ago

"such that only the _ceil version will be fast and everyone will be expected to use that"

No. The difference in performance is small: something like two negate instructions, which is negligible compared to a pipeline flush.

"and the _floor version only exists for completeness but nobody should use it because it will be slow, then that seems like a very sub-optimal outcome"

No. I expect _floor to still be faster than negate-ceil-negate on hardware that supports AVX-512.

"because it would be a very surprising performance pitfall."

No. The performance penalty is only about two negations (on certain hardware).

"So in that case, I'd postulate that the _floor version shouldn't exist at all, to prevent folks from accidentally using it."

"Keep in mind that Wasm modules are toolchain-generated, not handwritten. It is perfectly acceptable that toolchains have to do some work."

"I was just using sqrt as an example. My larger point still stands: if f32.add_floor will be far slower than f32.add_ceil and nobody is supposed/expected to use it, then it probably shouldn't exist in the first place."

It is not far slower. You are expected to use _floor and _ceil. This is to make sure that the same code runs fast on different hardware.

"This reinforces my thinking that a more robust way to ensure consistency across the ecosystem is to not have add_floor, thereby forcing toolchains to emit negate+ceil+negate when the original source code requests flooring, and hence making sure everyone is on the same fast path."

This optimization is only relevant on certain hardware for certain use cases. It only works for interval-arithmetic-heavy code, for example. It is an easy but effective optimization for production engines. As far as this proposal is concerned, it only fits into a NOTE in the spec.

"and hence making sure everyone is on the same fast path."

I believe this problem can be solved with documentation, or simply by trying out both variants and measuring which is faster in one's own code.