Open sunfishcode opened 3 years ago
If these instructions were made into a proposal of their own and not tied into relaxed simd, it might make sense for a wasm implementation to provide them only on hardware that has fma natively, and let the wasm program provide its own fallback for other hardware?
If these instructions were made into a proposal of their own and not tied into relaxed simd, it might make sense for a wasm implementation to provide them only on hardware that has fma natively, and let the wasm program provide its own fallback for other hardware?
Possibly, however what would they do for a fallback? If they'd fall back to non-fused mul+add, that's really a qfma use case. If they'd fall back to a software fma, that's really an fma use case.
If they'd fall back to custom logic, that's a slightly different kind of qfma use case: do a qfma with carefully selected operands such that the result tells you whether it did a true fma or not, and if not, run the custom logic.
I think a deterministic FMA is quite nice, and can stand as a proposal on its own, and as time goes by, FMA coverage should get better (not too sure about low end phones/embedded space.)
However, I've heard that a slow FMA is not useful (@Maratyszcza to correct me, probably more so in the ML world?), so programs will choose an alternative: change algorithms, or tweak it to manage with reduced precision.
I think if we want include a deterministic FMA, we also need a way for programs to check if FMA support is available in hardware.
I think if we want include a deterministic FMA, we also need a way for programs to check if FMA support is available in hardware.
That's why I suggested that the instruction should not be available if the hardware does not support FMA, thus allowing a feature-test of the instruction to be the test for whether the hardware supports FMA. True, some implementations could choose to provide FMA anyway, but perhaps there's a spec-level fix for that?
The qfma instruction, as currently proposed, already supports the feature-detection use case. Users can run a qfma with certain operands, and test whether the result is the fma result. This is easier than managing multiple wasm modules in order to feature-test and then pick the module to instantiate.
To summarize the use cases we want to support:
qfma covers all 3 use cases, since qfma with specific operands can detect hardware support. Use case 1 can ship their own fma implementation if they really need it.
Deterministic FMA is a convenience for 1, instead of shipping their own, they rely on engine implementation.
Some floating-point algorithms depend on a true fma, which is a different use case from qfma.
@sunfishcode can you add links to some of these, I am unfamiliar with this domain.
I see a couple of issues with this proposal.
First, FMA is a not a SIMD-specific instruction, but a fundamental floating-point operation, and all processors which provide SIMD version of the FMA instruction provide a scalar one too. Thus, it would be more appropriate to have FMA as a standalone proposal featuring both scalar and SIMD versions of the instructions rather than a part of Relaxed SIMD proposal, especially given that it is not processor-specific.
Secondly, I'd like to see a full version of SIMD FMA polyfill using either WAsm SIMD128 or x86 SSE4.1 intrinsics. The polyfill suggested in WebAssembly/design#1391 omits too many details, and the complete polyfills in musl and Apple LibM linked from WebAssembly/design#1391 are scalar and clearly not SIMD-friendly. AFAIK, the main application of the real FMA is for simulating higher than double-precision computations using double-double and similar representations. However, given that WAsm SIMD is limited to 128 bits and WAsm natively supports 64-bit computations, I expect that soft-float emulation of quad-precision arithmetics would be faster on pre-FMA processors than double-double SIMD operations with emulated FMA, and thus there isn't really a convincing use-case for exposing guaranteed-fused multiply-add in WAsm.
None of qfma, relaxed/asymmetric min/max, or reciprocal sqrt approximation, are SIMD-specific either; should those instructions be moved out to a separate proposal as well? Instead, I imagine the "simd" in "relaxed-simd" isn't something important to be strict about. Scalar versions of SIMD operations are useful for remainder loops. As such, it would make sense for relaxed-simd to define scalar versions of qfma and the others as well.
It's true that fma isn't "relaxed" either, but my observation above is that there may be value in adding regular fma at the same time as qfma, and I'm interested in feedback on this.
A simple polyfill for simd fma would be to extract the elements and evaluate it in scalar. That's Not Fast, but it's already understood that FMA polyfills won't be Fast, so that might be good enough as a minimum-effort implementation technique.
C's fma
(7.12.13.1), Rust's mul_add
, C++'s fma
, Go's fma
, C#'s FusedMultiplyAdd
, and similar functions in other languages' standard libraries are specified to do a true fma, so there'd be many places where toolchains could use an fma instruction. Programmers may want a way to test if those are fast, and such tests can be implemented using the qfma
instruction, testing to see if it returns the same result as fma
(we could even document this relationship, non-normatively).
I believe another advantage of deterministic fma
is that it may allow us to relax the way this proposal handles non-determinism to align with the existing handling of floating-point NaN (i.e. no need for the new "column" approach). WRT the comment above, currently we need qfma
to respect "column" non-determinism in order to faciliate (1)
, but with fma
and specified-fully-non-deterministic qfma
, (1) can be handled by fma
, (2) can be handled by the heuristic test on qfma
sketched by @sunfishcode just above, and (3) can be handled by qfma
.
If possible, this would be a win for spec hygiene (no need for a new kind of non-determinism) and would fully address the concerns @rossberg and I raised at the phase 3 vote. Although, if there are other reasons we need the current "column" approach that I'm missing, please point these out!
I agree with @conrad-watt. Getting rid of the intrusive new form of non-determinism (a.k.a. implementation-dependent semantics) would make me much less concerned about this proposal.
Non-quasi, guaranteed-fused FMA:
This proposal is to add these in addition to
qfma
, not instead of it.IEEE 754
fusedMultiplyAdd
, and obvious subtract variant, with modifications to NaN and exception behavior as in other floating-point instructions in wasm.On x86-64 CPUs with FMA3 or FMA4, and ARM64, and other popular architectures, there is a single instruction that does this. On CPUs without an fma instruction, some options are discussed here.
Since wasm hides floating-point exception flags, and NaN bits are already nondeterministic, the only new differences across platforms are timings.
Some floating-point algorithms depend on a true fma, which is a different use case from qfma. And, some use cases want to be able to specify determinism in the wasm module, independently of whether the host implementation is enforcing determinism.
As discussed here, it seems to make more sense to add explicit instructions for these use cases, rather than using profiles to restrict qfma to work for these use cases.
I expect one of the big questions is whether these instructions belong in relaxed-simd or should go in a separate proposal. I'm open to suggestions here. I'm starting by proposing them here, because I expect it would be confusing to users if relaxed-simd is standardized with qfma before a true fma is standardized. If a user needs a true fma, they might be tempted to use qfma if they don't (think they) care about CPUs without fma support.