WebAssembly / relaxed-simd

Relax the strict determinism requirements of SIMD operations.

SIMD subgroup meeting on 2022-09-30 #86

Open ngzhian opened 2 years ago

ngzhian commented 2 years ago

This is a joint meeting with flexible-vectors, and any SIMD-related presentations are also welcome, e.g. new architectures or toolchains, or longer-term discussions that don't fit into flexible-vectors or relaxed-simd.

The meeting will be on Friday, September 30 at 9:00AM - 10:00AM PDT/ 5:00PM - 6:00PM CET.

If this meeting doesn't already appear on your calendar, or you are a new attendee, please fill out this form to attend.

ngzhian commented 2 years ago

@penzn fyi

ngzhian commented 2 years ago

@penzn I will be OOO on this day, can you please run it?

penzn commented 2 years ago

Yes, I will.

penzn commented 2 years ago

Seems like there are no agenda items, should we cancel? Keep in mind that I can't send cancellation notices.

dtig commented 2 years ago

Given that this is the last meeting before the in-person CG meeting and a phase 4 poll for relaxed SIMD, should we wrap up the discussion on https://github.com/WebAssembly/relaxed-simd/discussions/85? We could include a poll ratifying the final instruction set at that meeting, and continue async discussion too. I just wanted to make a note here that we won't have another subgroup meeting before the final vote.

penzn commented 2 years ago

Seems like a good idea, let's meet.

dtig commented 2 years ago

Sounds good. As @ngzhian is out, let's make sure that @Maratyszcza is able to attend, so that if we do take any votes, one of the champions will be present.

penzn commented 2 years ago

On that note, given that we are approaching phase 4, I am curious what the resolution of #71 (encoding non-determinism) is, though this might be a question for @ngzhian.

Maratyszcza commented 2 years ago

I will attend tomorrow

conrad-watt commented 2 years ago

I'd also like to attend tomorrow to discuss the non-determinism point that @penzn mentioned in advance of the phase 4 vote. I've just filled out the form linked above.

EDIT: commented here regarding a relevant point I'd like to discuss.

Maratyszcza commented 2 years ago

@conrad-watt Sent a meeting invite to the email in your GitHub profile.

conrad-watt commented 2 years ago

I'll post a better summation comment later, but for now here's a raw link to my slides.

penzn commented 2 years ago

The PR to add notes is open: WebAssembly/meetings#1122

I suggest we have one more meeting before the in-person meeting. @dtig, @ngzhian what do you think?

conrad-watt commented 2 years ago

This comment is to sum up my position and attempt to distill some conversations and points that were raised at the previous meeting.

I'm focussing on the fma discussion here since I believe it's the only example in this proposal where we're relying on the precision of the current "list" non-determinism in comparison to the more permissive "set" non-determinism. If this belief is wrong, please correct me!

big picture

At the big picture level, there seem to be three values in tension here - one might want to...

  1. ensure that "reasonable" programs and compilation schemes are semantically well-behaved (that is, staying within the envelope of guarantees provided by the specification).
  2. avoid the need for a new version of non-determinism in the specification which requires persistent implementation/platform "knowledge".
  3. avoid instructions with significantly inconsistent performance across platforms.

There seem to be two reasonable ways forward:

  (a) keep the current "list" non-determinism for qfma (i.e. the status quo), which compromises on (2).
  (b) my suggested "set" non-determinism (i.e. like Wasm's existing floating point NaN non-determinism) with single-rounding fma, which compromises on (3).

To be clear, while I'm advocating for solution (b), I think solution (a) is still ok. There's also a space of solutions I (currently) wouldn't be happy with, such as relaxing to set non-determinism without adding single-rounding fma, which AFAIU would completely torpedo value (1) above.

fma uses

There are three main scenarios for fma use that we care about, as noted by @ngzhian in this comment.

In the first scenario, a source program assumes the existence of a single-rounding fma intrinsic/primitive. To implement this primitive, solution (a) must compile two modules and dynamically determine which one to instantiate based on the rounding behaviour of qfma. If qfma is not single-rounding, the single-rounding fma primitive must be software-emulated at the Wasm level. In solution (b), the compilation scheme can just use the deterministic fma instruction, essentially deferring potential software emulation to the engine.

In the second scenario, a source program has two paths - one where the fma primitive is single-rounding and fast, and one where it is slow/absent. In this scenario, both solution (a) and solution (b) test qfma to see which path to take.

In the third "don't care" scenario, both solutions can just use qfma.
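For concreteness, a minimal scalar sketch of the kind of rounding probe those scenarios rely on (standard C++ only; std::fma is single-rounding per the C standard, and the plain mul+add stands in for a double-rounding qfma - the actual vector qfma intrinsic a toolchain would probe is not named here):

#include <cmath>
#include <cstdio>

// Inputs chosen so that single- and double-rounded a*b + c differ in binary32:
//   a = b = 1 + 2^-12,  c = -(1 + 2^-11)
// The exact product a*b = 1 + 2^-11 + 2^-24; rounding it to float drops the
// 2^-24 bit, so mul-then-add yields 0, while a fused (single-rounding) fma
// yields 2^-24.
int main() {
    const float a = 1.0f + 1.0f / 4096.0f;      // 1 + 2^-12
    const float b = a;
    const float c = -(1.0f + 1.0f / 2048.0f);   // -(1 + 2^-11)

    float fused = std::fma(a, b, c);  // single rounding: 2^-24
    float p = a * b;                  // first rounding
    float unfused = p + c;            // second rounding: 0
    // (compile with -ffp-contract=off so the compiler does not fuse p + c itself)

    // A Wasm-level probe would run qfma on these same inputs and compare the
    // result against `fused`: a non-zero result means qfma is single-rounding.
    std::printf("fused = %g, unfused = %g\n", fused, unfused);
}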

points during discussion

If I understand correctly, the point is that code using qfma can currently be compared against an oracle which checks both the "always single-rounding" and "always double-rounding" outcomes, instead of needing to worry about the spec's allowance of arbitrary choice. Pragmatically, such an oracle could still be used even if the spec allows set non-determinism - the oracle would technically be formally imprecise, but any discovered inconsistencies would still be valuable, as implementations would be expected to consistently apply one rounding choice (i.e. the inconsistency is likely a program bug). It's also worth noting that this approach doesn't scale well to multiple list non-determinism instructions, where all combinations of list choice across the instructions would need to be considered.
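As an illustration of that oracle idea (a sketch only - the real harness would run the Wasm qfma instruction under test, with the two reference computations below standing in for the two list choices):

#include <cmath>

// Accept an observed qfma result if it matches either the always-single-rounding
// or the always-double-rounding reference for the same inputs. NaN handling is
// omitted for brevity; compile with -ffp-contract=off so the unfused reference
// really rounds twice.
bool qfma_result_acceptable(float a, float b, float c, float observed) {
    float single_rounded = std::fma(a, b, c);  // fused reference
    float p = a * b;
    float double_rounded = p + c;              // unfused reference
    return observed == single_rounded || observed == double_rounded;
}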

It would be good to know the extent to which this latter point is true. If toolchains could find value in targeting single-rounding Wasm-level fma even if it may be software-emulated, that would be a point in favour of solution (b). It's also worth noting that solution (a) must also software-emulate single-rounding fma in the first scenario at the Wasm level, given @ngzhian's point that we want to support such code (i.e. instead of instruction-level performance inconsistency, we have the same performance inconsistency inserted at the compilation scheme level).

I view this instruction as helping the relaxed SIMD proposal to require a less ambitious extension to the specification's concept of non-determinism, since I don't think a design where we generalise to set non-determinism is acceptable without it. There is a small aesthetic wrinkle in lacking scalar fma, but we can always consider it as a separate proposal (if ever), and I'd be happy to make this argument to the wider group during the phase 4 vote if anyone complains.

other points regarding set non-determinism

tlively commented 2 years ago

Thanks for the excellent write up, @conrad-watt!

In solution (b), the compilation scheme can just use the deterministic fma instruction, essentially deferring potential software emulation to the engine.

My understanding (and @Maratyszcza, please correct me if I'm wrong) is that in all cases there is a better option than emulating single-rounding vector fma, so no program would actually want to use this proposed fma instruction.

In hypothetical code migration scenarios, code that starts on a fast-fma machine and goes into the code path using the fma instruction but then gets migrated to a slow-fma machine would fall off such a catastrophic performance cliff that the migration could not be seen as anything other than a bug. List non-determinism usefully formalizes this as a bug.

The reasonable fixes for that bug (no matter what the spec says) would be for the engine provider to 1) limit migration to pools of machines offering the same native semantics or 2) always use the reference implementation of (q)fma. Both of these options at least provide "predictable" performance (i.e. no performance cliffs) and both of them make qfma an acceptable alternative to fma.

I do agree that it would be nice if we didn't have to spec list non-determinism, but I also see it as a useful part of the proposal that carries its own weight. In contrast, I would consider a deterministic fma instruction to be a large footgun that is not useful in practice.

conrad-watt commented 2 years ago

My understanding (and @Maratyszcza, please correct me if I'm wrong) is that in all cases there is a better option than emulating single-rounding vector fma, so no program would actually want to use this proposed fma instruction.

I can believe this, but there seems to be some dissonance with scenario (1) in @ngzhian's comment - such code would either have to fail to compile in the current solution, or have a software emulation path predicated on testing qfma that's at least as slow as engine-emulated fma. So either we're already ok with such code having worse performance characteristics on architectures where qfma is double-rounding, or we're actually not intending to support such code (because the slow path is slow enough to be a "bug").

In hypothetical code migration scenarios, code that starts on a fast-fma machine and goes into the code path using the fma instruction but then gets migrated to a slow-fma machine would fall off such a catastrophic performance cliff that the migration could not be seen as anything other than a bug. List non-determinism usefully formalizes this as a bug.

With list non-determinism, if the code starts on a machine which has single-rounding qfma, any migrations must preserve this even if the migrated-to machine can't implement it efficiently, and even if the code doesn't care - so I think the performance cliffs in code migration are more hazardous with the current solution, which can't distinguish "care" vs "don't care" uses of qfma. With solution (b) qfma can be migrated and switch from single- to double-rounding, and uses of fma explicitly mean "care", at the risk of a performance cliff.

tlively commented 2 years ago

With list non-determinism, if the code starts on a machine which has single-rounding qfma, any migrations must preserve this even if the migrated-to machine can't implement it efficiently, and even if the code doesn't care - so I think the performance cliffs in code migration are more hazardous with the current solution, which can't distinguish "care" vs "don't care" uses of qfma.

This would be allowed according to the spec, but is user-hostile enough that I expect code-migrating engine providers would provide one of the two solutions I described instead, so I would be surprised if this becomes a concern in practice (assuming we ever get code migrating engines that would make any of this a concern in practice).

I don't want to get too deep in the weeds here since it's all hypothetical, but I do think it's reasonable to assume as a starting point that engine providers and their users are aligned in trying to prevent performance cliffs.

rossberg commented 2 years ago

tlively:

I do agree that it would be nice if we didn't have to spec list non-determinism, but I also see it as a useful part of the proposal that carries its own weight.

I have to wholeheartedly disagree. It's not a useful part of this proposal but a particularly harmful one. The technical term "list non-determinism" obscures what it really is: the introduction of implementation-dependent behaviour.

This destroys one of the basic value propositions of Wasm, that also happens to be one of its most popular features, namely that it's 100% portable, i.e., the (guaranteed) semantics is the same everywhere. As such, this extension has much broader (negative) impact beyond just SIMD, and I'm still highly concerned about it. If there is an even half-way decent alternative, then we should take that.

tlively commented 2 years ago

The entire point of this proposal (no matter how we spec it) is to introduce implementation-defined behavior in situations where the performance win is worth it, and embracing that in the spec better captures both its intent and how it will be used in practice.

Applications and engines for which this is a bad trade off can continue to not use or provide relaxed SIMD.

@rossberg, can you explain how it’s any better to specify this implementation-defined behavior as set non-determinism rather than list non-determinism if that doesn’t change the assumptions programmers make or the way the proposal is used? AFAIU, we were discussing switching to set non-determinism only because it would be cleaner and more convenient to have only one kind of non-determinism in the spec, not because it somehow avoided having implementation-defined behavior.

dtig commented 2 years ago

I think it'll be useful to have specifics to look at from the application perspective. @ngzhian / @Maratyszcza, I remember there were code links previously shared for how applications would use this - would you be able to link them here?

This proposal does expose some implementation-defined behavior regardless of how we decide to specify it, so the core of the discussion seems to be what exactly we want the spec to reflect. I have a mild preference for keeping the spec flexible, i.e. set non-determinism, as that seems to be the more flexible of the two approaches: we can always go towards option (a) later, but not the other way around. Concretely, I'd like to make sure we understand what's involved in having only option (b), with option (a) being the status quo, which means that we won't need additional work apart from what's already in the tools/engines/spec (it would probably also be useful to have data points from @ngzhian about where we landed on post-phase 3 feedback). The intent is to understand how much work is involved across the layers, and whether it would be worth supporting option (b) vs. pushing for option (a).

In the ideal case, applications know how to handle FMA/QFMA, which this proposal already assumes they would need to. In the case where these instructions aren't used correctly, my opinion is that working with terrible performance is still better than erroring out or producing unexpected results.

conrad-watt commented 2 years ago

something about different code in the tools that would target which FMA instruction to generate, perhaps I missed or misunderstood something?

For a toolchain currently built around option (a), it should be fairly simple to switch to option (b). Currently it's already necessary for the toolchain to condition on the behaviour of qfma both to implement a "guaranteed single-rounding fma" primitive and a "single-rounding fma is fast" check, and the structure of these checks doesn't change (or get eliminated) with option (b). If the toolchain is currently not implementing these primitives and just doing a "don't care" compilation to qfma, it can continue to do so.

I also think it's worth emphasising again that option (a) delivers no more performance to code that relies on a single-rounding fma primitive than option (b). Instead it's a case of whether the performance cliff is inserted as separate Wasm-level paths by the toolchain (option (a), testing qfma and falling back to software emulation), or at the engine/language level (option (b), implementing fma through emulation, but still countenanced by the toolchain emitting fma instead of qfma). This gets back to the "value 2" vs "value 3" tension I outlined above. If option (b) is too slow here, then so is option (a) (at least at a program level).

tlively commented 2 years ago

I will note that currently no toolchain reasons about any of this and that all relaxed SIMD usage (that I'm aware of) comes from users hand-writing either intrinsics that are opaque to the compiler or raw assembly. In my experience, SIMD programmers care deeply about what particular engines do in practice, so I'm skeptical of any solution that requires them to follow certain patterns to guarantee some level of portability to other engines. The more scalable solution is to get the spec to meet SIMD programmers where they are and make better portability guarantees about the code they naturally want to write.

conrad-watt commented 2 years ago

I will note that currently no toolchain reasons about any of this and that all relaxed SIMD usage (that I'm aware of) comes from users hand-writing either intrinsics that are opaque to the compiler or raw assembly ... I'm skeptical of any solution that requires them to follow certain patterns to guarantee some level of portability to other engines. The more scalable solution is to get the spec to meet SIMD programmers where they are and make better portability guarantees about the code they naturally want to write.

I agree - I'm assuming that the main relevant intrinsics are a "guaranteed single-rounding fma" primitive, a "don't care fma" primitive, and possibly an "is single-rounding fma supported/fast?" primitive. Both solution (a) and (b) need to do roughly the same things to support these, with roughly the same performance characteristics.

EDIT: if the point is that SIMD programmers may want to work with intrinsics that have a 1-to-1 correspondence with Wasm operations, then the difference between solution (a) and (b) is that in (a) the programmer must perform a qfma test to predict semantic behaviour (single-rounding), while in (b) the programmer must perform a qfma test to predict performance of the single-rounding fma primitive. One doesn't seem strictly preferable to the other.

ngzhian commented 2 years ago

I think it'll be useful to have specifics to look at from the application perspective. @ngzhian / @Maratyszcza, I remember there were code links previously shared for how applications would use this - would you be able to link them here?

https://hal.inria.fr/inria-00000895/document look for Fast2Mult (for posterity, mentioned in #71)

ngzhian commented 2 years ago

(b). my suggested "set" non-determinism (i.e. like Wasm's existing floating point NaN non-determinism) with single-rounding fma, which compromises on (3)

(Going to repeat some stuff here to make sure I understand things correctly.)

The purpose of "list" non-determinism is to enforce that 1 and 2 perform the same number of roundings:

if (qfma(x, y, z)) // 1
  qfma(x, y, z)    // 2
else
  // fall back to some other code path

"set" non-determinism does not guarantee this.

Reviewing your slides, I think this is covered in slide 15, "Scenario (2)", where there is a test for qfma single rounding; if the test succeeds, the assumption is that deterministic fma is fast.

bool b = <test if qfma is single-rounding>

if (b)
  instantiate module A - wants fma to be fast // assumption here that single rounding qfma == fast fma
else
  instantiate module B - avoids use of fma

This could be an unexpected performance cliff, as this assumption is not guaranteed. But I guess that with the original "set" non-deterministic qfma, we also have the assumption that single-rounding qfma is a fast fma. It would be nice if we could make this connection between speed and rounding explicit (but I know the spec doesn't say anything about performance, so perhaps this is not possible).

A follow-up question will then be: if we add an fma instruction, how will engines implement it? There is no fast emulation path for fma - will it be via math.h?

Re: fma is neither relaxed nor SIMD. It does feel weird adding them to this proposal, but my goal is to advance this proposal, and so I will do what it takes to reach this goal :) This might be an opportunity to try a "fast-track" proposal as well...

conrad-watt commented 2 years ago

@ngzhian

(Going to repeat some stuff here to make sure I understand things correctly.)

I think your summary is exactly right!

... [t]his could be an unexpected performance cliff, as this assumption is not guaranteed. But I guess that with the original "set" non-deterministic qfma, we also have the assumption that single-rounding qfma is a fast fma. It would be nice if we could make this connection between speed and rounding explicit (but I know the spec doesn't say anything about performance, so perhaps this is not possible).

(I'm assuming you mean 'original "list"' here) Yes, in both solutions the performance expectations are not normatively guaranteed by the spec abstract machine. It would be appropriate to add a fairly prominent note in the spec saying "if qfma is single-rounding, expect [q]fma to be fast".

Follow-up question will then be, if we add a fma instruction, how will engines implement it? There is no fast emulation path for an fma, will it be via math.h?

Yes, there would be some slow emulation path. A real performance cliff at the instruction level, but my belief (to be challenged!) is that anywhere this performance cliff is hit, an analogous program-level performance cliff (or correctness bug) would be hit in solution (a) (due to lack of single-rounding qfma). I don't have much of an informed opinion on emulation strategies, I'm just keying off the two previous comments (and associated links) here.
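For what it's worth, a rough sketch of what such a slow path could look like (purely illustrative, not a claim about any real engine): per-lane calls to the C library's correctly rounded fmaf.

#include <cmath>

// Hypothetical deterministic (single-rounding) emulation of an f32x4 fma; a real
// engine would operate on its own v128 representation.
struct F32x4 { float lane[4]; };

F32x4 emulated_f32x4_fma(F32x4 a, F32x4 b, F32x4 c) {
    F32x4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = std::fmaf(a.lane[i], b.lane[i], c.lane[i]);  // rounded once per lane
    return r;
}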

Re: fma is neither relaxed nor SIMD. It does feel weird adding them to this proposal, but my goal is to advance this proposal, and so I will do what it takes to reach this goal :) This might be an opportunity to try a "fast-track" proposal as well...

At least with respect to "SIMD-ness", presumably the same arguments apply to qfma in that there could also be scalar versions? fma is very linked to qfma in terms of performance expectations etc., so I'm personally comfortable with it fitting thematically into the "relaxed" proposal (also, single-rounding fma is "relaxing" a design principle of Wasm with respect to predictable performance, existing examples like popcnt notwithstanding).

ngzhian commented 2 years ago

It would be appropriate to add a fairly prominent note in the spec saying "if qfma is single-rounding, expect [q]fma to be fast".

Yes, we should definitely do this if we go down the add-fma path.

(I'm assuming you mean 'original "list"' here)

Yes, oops!

At least with respect to "SIMD-ness", presumably the same arguments apply to qfma in that there could also be scalar versions? fma is very linked to qfma in terms of performance expectations etc., so I'm personally comfortable with it fitting thematically into the "relaxed" proposal (also, single-rounding fma is "relaxing" a design principle of Wasm with respect to predictable performance, existing examples like popcnt notwithstanding).

Interesting thought. We can take this chance to get scalar fma and scalar qfma in as well. It's not a lot more work, represents a bit of scope creep, but with good reasons. We have a SIMD sync tomorrow, I will bring this up for discussion (if you won't be there). Thanks!

conrad-watt commented 2 years ago

Interesting thought. We can take this chance to get scalar fma and scalar qfma in as well. It's not a lot more work, represents a bit of scope creep, but with good reasons. We have a SIMD sync tomorrow, I will bring this up for discussion (if you won't be there). Thanks!

My inclination is that scalar fma and qfma could be "quick" proposals following on from relaxed SIMD - the incumbent proposal was only going to have vector qfma anyway so I think the argument is the same whether we go with solution (a) or (b). Vector fma is just the minimum addition to facilitate set non-determinism, given that vector qfma is part of this proposal.

In any case, I'll be at the meeting so happy to discuss this further.

zeux commented 2 years ago

I want to clarify the meaning and the implication of the non-determinism classes highlighted here as I couldn't find them defined anywhere.

Is this reasonably accurate?

conrad-watt commented 2 years ago

@zeux your "set" and "list" non-determinism should be reversed (although I should be clear that this is terminology I made up for this discussion!).

Wasm's existing non-determinism is "set" non-determinism - the spec provides a (mathematical, unordered) set of possible outputs for each operation, and each instance of the operation is free to pick a different member of the set.

This proposal's non-determinism is "list" non-determinism - the spec provides an "ordered" list of possible outputs, and each instance of the operation must pick the same index of that list.

I'm proposing a modification to this proposal with the aim of eliminating the need for the new kind of non-determinism.

zeux commented 2 years ago

Ok got it, thanks. Your proposal, then, is to replace "list" with "set" for all instructions including qfma, so the last two implications stand (with corrected terminology)?

conrad-watt commented 2 years ago

As a consequence of that, in any use of qfma the program that does two passes over the input data may compute internally inconsistent results, whereas in ["list"] non-determinism this can never happen

This is true - if you have examples where this is important (i.e. algorithms/programs where it doesn't matter whether single or double rounding is used, but the result must be consistent), that would be valuable. IIUC a program can still "bake in" unfused multiply-add to guarantee deterministic double-rounding.
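To make that last remark concrete, a minimal sketch (wasm32 clang/Emscripten target, standard SIMD128 intrinsics): keeping the multiply and add as separate operations means the toolchain emits plain f32x4.mul followed by f32x4.add, both of which are deterministically rounded on every engine.

#include <wasm_simd128.h>

// Deterministic, always double-rounded multiply-add: plain f32x4.mul and
// f32x4.add instead of a relaxed madd. Bit-for-bit portable, at the cost of
// never benefiting from a hardware fma.
static inline v128_t madd_unfused(v128_t a, v128_t b, v128_t c) {
    return wasm_f32x4_add(wasm_f32x4_mul(a, b), c);
}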

As an additional consequence of that, ["set"] non-determinism makes it impossible to reliably detect if the qfma is doing single or double rounding for the purpose of selecting a different algorithm at runtime

The expected workflow with my modification is the same in that you would still test qfma to switch between algorithms - the difference is that fma would be used in the fast path rather than qfma. The justification would be performance-based rather than semantic - if the program observes a double-rounding qfma, it's expected that the new fma primitive will be slow.

zeux commented 2 years ago

if the program observes a double-rounding qfma, it's expected that the new fma primitive will be slow

Got it, thank you. This is interesting, and I think it's a viable alternative for algorithms like this.

This is true - if you have examples where this is important

My concern with this is that it ends up making a lot of previously run-deterministic programs run-nondeterministic. For example, suppose I have a 3D engine that has a lot of small matrix/vector math in it. Classically you'd try to vectorize a set of internal math primitives like M*M or M*V. In cases like this you usually don't need precise double-rounding, and can tolerate either single-rounding or double-rounding with no significant impact on results if this is applied consistently.

What happens then is that a program switches the internal matrix/vector math ops to qfma. Under the current proposal, it's always beneficial for performance in practice: the engine will select mul+add if the hardware doesn't have fast FMA, and fma if the hardware has fast FMA (or could even select MULADD if the hardware implements that but not FMA for some reason...). The program will still see consistent output for any algorithm.

Under set non-determinism, there's a class of newly introduced bugs that may arise from an inconsistent lowering. For example, frame-to-frame subpixel jitter of objects that get their world transforms via a series of matrix multiplications, when in some code paths the engine randomly decides to pick a different lowering, or a tier-up optimization of the Wasm code changes the compilation approach. Or, some algorithm diverges because it has a conditional on e.g. dot(v1, v2) > 0 where the input vectors went through a series of matrix-vector transforms, and again the behavior is non-deterministic within the same module / run.

This is more or less equivalent to the implications of C's fast-math semantics, where any part of the program may get lowered differently, including the same functions getting lowered differently at different inline call sites, but even more damaging due to Wasm's dynamic compilation nature.

Now, as noted above, I'd generally expect engines to be de-facto list deterministic. I don't know if this makes this better ("the programs like this will be de-facto run-deterministic"), or worse ("applications will assume they are run-deterministic because they never see behavior suggesting otherwise, but their programs aren't actually well-behaved in theory").

zeux commented 2 years ago

(additional implications of the above include difficulty with debugging/reproducing bugs in programs that use qfma. I think the main difference between qfma and other instructions in this proposal is that the difference in behavior applies to basically every single input, whereas for a lot of other instructions the non-determinism is on the edge cases that an application is likely expecting to never hit, so set non-determinism is not an issue)

zeux commented 2 years ago

Another example I've brought up in live discussion: imagine you have a program that needs to sort objects by their Z depth relative to a perspective camera. Here's a C++ code sketch that does that:

std::sort(objects.begin(), objects.end(), [&camera](Object* a, Object* b) -> bool {
    const Vector3& aPos = a->getPosition();
    const Vector3& bPos = b->getPosition();
    const Matrix4& viewMatrix = camera.getViewMatrix();

    float aDepth = (viewMatrix * aPos).z;
    float bDepth = (viewMatrix * bPos).z;

    return aDepth < bDepth;
});

If the author of the linear algebra library decides to use qfma to speed up matrix-vector multiplication in the implementation of operator*, the resulting program is correct under list-determinism rules, but results in C++ UB under set-determinism rules, because different computations of depth in different invocations of the predicate are allowed to produce different results, which will result in the predicate not obeying strict ordering rules (which it otherwise does unless any computation produces NaN).

So that's a simple example of issues with set-nondeterminism that, when applied at scale, either makes qfma not as useful or, more pragmatically, results in a large class of programs that are theoretically-broken but pragmatically work just fine.

(compiling scalar implementation of linear algebra above with -ffast-math may also result in similar consequences, but I'd argue it's because -ffast-math is broken, not because this code is broken :stuck_out_tongue_winking_eye:)

Maratyszcza commented 2 years ago

I feel that we should rename relaxed_fma (FKA qfma) to relaxed_madd (and accordingly relaxed_fnma to relaxed_nmadd), as the current naming causes misunderstanding. The semantic meaning of this instruction is "fast Multiply+Add, potentially fused", but due to FMA in the name it may be mistakenly perceived as "Fused Multiply-Add, potentially broken".

penzn commented 2 years ago

If the author of the linear algebra library decides to use qfma to speed up matrix-vector multiplication in the implementation of operator*, the resulting program is correct under list-determinism rules, but results in C++ UB under set-determinism rules, because different computations of depth in different invocations of the predicate are allowed to produce different results, which will result in the predicate not obeying strict ordering rules (which it otherwise does unless any computation produces NaN).

Yes, this would also happen if operator* was defined in a header and the developer passed a fast-math flag, unless operators are excluded from this optimization 😉 (honestly, I am not sure if they are, but one can imagine a compiler that would do that; more on that below).

I think it is worth talking about fast math a little. The reason it exists is that compilers normally refuse to reorder floating point operations due to their non-associativity, for exactly the reasons you describe. Forgoing reordering makes compiled code more deterministic; for example, that is what would allow the sorting example to produce the same ordering results if the same data is fed in the same order. However, that is not entirely immune to non-determinism, as running the same data in different order can still produce different results.

Strict mode requires writing code in a way that would allow for out-of-order execution: for example, if a[0] + a[1] + a[2] + a[3] is treated by the compiler as (((a[0] + a[1]) + a[2]) + a[3]), then it naturally lowers to accumulation on a single register, which disables the instruction-level parallelism that we depend on for performance on modern CPUs. To break that, somebody would have to intentionally introduce intermediates or parentheses to break up the order (the latter only if the compiler preserves them), which, strictly speaking, changes the output the same way fast math does. Such changes, especially produced by indiscriminately applying fast math, can break the program if you are not careful, and at least some compilers have ways to mitigate fast math's effects, for example by honoring parentheses, or by otherwise limiting the areas where it gets applied.
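A small sketch of that trade-off (standard C++ only): the strict left-to-right sum serializes on one accumulator, while manually introduced partial sums expose instruction-level parallelism but round in a different order - the same kind of change fast math makes implicitly.

#include <cstddef>

// Strict evaluation order: (((a[0] + a[1]) + a[2]) + ...) - a single dependency
// chain on one accumulator, so no instruction-level parallelism.
float sum_strict(const float* a, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Manually reassociated: four independent partial sums the CPU can advance in
// parallel. The result can differ in the low bits because the additions are
// rounded in a different order.
float sum_reassociated(const float* a, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)
        s += a[i];
    return s;
}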

There are two ways to look at it: one is that floating point arithmetic precision and performance should be handled in the source code explicitly for the sake of repeatability, while the other is that, since sacrificing determinism is required for floating point performance (a manually reordered sequence is still different from the original one), you should embrace fast math, as the compiler would likely beat you at optimizations, especially for large programs.

zeux commented 2 years ago

you should embrace fast math, as the compiler would likely beat you at optimizations, especially for large programs.

I disagree with this. Again, as the example above shows, it's almost impossible to write correct programs under fast-math. fast-math completely breaks a set of standard functions like std::isnan, completely breaks IEEE semantics for a set of primitive operations (e.g. some compilers synthesize SSE sequences for comparisons that return incorrect results for NaN inputs), violates determinism across build regimes and inline call sites, etc., etc.

It's a tool that some people use. I don't think it's appropriate to use the existence of that tool as justification for a Wasm specification that tried to ensure strong determinism until relaxed SIMD.

However, that is not entirely immune to non-determinism, as running the same data in different order can still produce different results.

Strictly speaking, changing the order of the input data changes the data, so this doesn't suggest non-determinism - but maybe I wasn't very precise above.

The std::sort example above, in presence of set non-determinism or fast-math, isn't just allowed to produce different orderings.

It's allowed to do anything. If the set non-determinism is applied aggressively, you may, on a conformant STL implementation, get the program that writes outside of the bounds of the input vector, never terminates, crashes with stack overflow or really anything else.

which, strictly speaking, changes the output the same way fast math does.

Introducing parentheses changes the output from one deterministic order to another. The new deterministic order will produce bit-identical results on all modern compilers and architectures under any optimization level if the target architecture uses IEEE compliant single precision arithmetic (there are examples of architectures that don't do that, like x87).

fast math doesn't guarantee any of the above - we should not equate these two approaches.

zeux commented 2 years ago

One more note while this is fresh on my mind.

Consider the example from earlier, a 4x4 matrix multiply. It's a fundamental operation in 3D applications and can be implemented with 1 mul + 3 fmas per output row, for a total of 4 muls and 12 fmas (plus some shuffles to transpose). Applications may want to do this over using separate mul+add for performance, accepting variability in the resulting matrix.
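A sketch of that pattern (wasm32 clang target; the qfma wrapper below is a placeholder for whatever relaxed multiply-add intrinsic or builtin a given toolchain exposes, written here as plain mul+add only so the sketch compiles as-is):

#include <wasm_simd128.h>

// Placeholder for the relaxed multiply-add: a real build would map this onto the
// toolchain's qfma intrinsic/builtin rather than separate mul and add.
static inline v128_t qfma(v128_t a, v128_t b, v128_t c) {
    return wasm_f32x4_add(wasm_f32x4_mul(a, b), c);
}

// One row of a 4x4 matrix product: splat each element of the row of A and
// accumulate against the rows of B - 1 multiply plus 3 multiply-adds. Repeated
// for all 4 rows, that is the 4 muls and 12 (q)fmas mentioned above.
static inline v128_t matmul_row(v128_t a_row, const v128_t b_rows[4]) {
    v128_t r = wasm_f32x4_mul(wasm_f32x4_splat(wasm_f32x4_extract_lane(a_row, 0)), b_rows[0]);
    r = qfma(wasm_f32x4_splat(wasm_f32x4_extract_lane(a_row, 1)), b_rows[1], r);
    r = qfma(wasm_f32x4_splat(wasm_f32x4_extract_lane(a_row, 2)), b_rows[2], r);
    r = qfma(wasm_f32x4_splat(wasm_f32x4_extract_lane(a_row, 3)), b_rows[3], r);
    return r;
}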

Let’s look at the resulting function from the specification/theoretical point of view.

In list-nondeterminism world, the resulting function is pure within a module (any two inputs produce the same output in a given Wasm module instantiation), and produces one of two results for any specific input: either all intermediates double-round or single-round.

In set-nondeterminism world, the resulting function isn’t pure - from the specification point of view you must interpret its behavior as if it took 12 bits for every invocation from a random source and applied them to determine rounding direction. Consequently, for any specific input it may produce any of 2^12=4096 results (for some inputs the number of unique results is smaller, 4096 is the worst case bound). Unlike NaN payloads, the difference in results is not just easily observable - it’s almost guaranteed to change the results of any subsequent operations on the result.

In list-nondeterminism, it’s still the case that there are two possible outputs. The number, however, doesn’t grow as the program does more computations. In fact, an engine could provide a debug setting to developers to toggle the behavior, and any large composition of matrix ops would have exactly two valid results for a given input, that the developer could easily validate. (if other instructions than qfma are used, rigorous verification requires more testing)

By contrast, in set-nondeterminism a larger composition of matrix ops results in a steadily growing number of possible outputs that quickly becomes infeasible to validate.

From the pragmatic point of view, if engines are de-facto list-deterministic then the problem is merely theoretical. However since the arguments for set-nondeterminism are coming from the perspective of specification, I feel like specifying set-nondeterministic behavior is creating a situation under which any program of meaningful size is not testable. It would still be probably possible to prove error bounds on algorithms, but this feels profoundly less satisfactory from specification perspective vs list nondeterminism.

rossberg commented 2 years ago

[Sorry, I forgot to reply earlier.]

@tlively:

The entire point of this proposal (no matter how we spec it) is to introduce implementation-defined behavior

Of course, I understand the goal, and that some folks want to be able to probe non-determinism for implementation-dependent behaviour for the sake of some performance wins. But let's be honest, this is rather a niche. And it has to be weighed against the cost for the ecosystem and its goals and promises at large.

For example, implementation-dependent behaviour is an attack vector for privacy exploits via finger-printing. You may argue that that ship has already sailed on the web, but:

  1. Wasm is not just for the Web, and so far it has been designed more responsibly than the Web.
  2. Even for the Web, our industry should do everything in its power to bring that ship back to harbour, not knowingly blow additional wind in its sails.

With list non-determinism, I suspect we are opening this attack vector more widely than we would with just set non-determinism (though ideally, we would have neither). But what's worse, we would officially be blessing it.

And that may be doing serious damage to Wasm's reputation as a well-designed, portable language that learnt the lessons from the past – practically every time I mentioned the idea of introducing implementation-dependent behaviour to somebody not following CG discussions, they were flabbergasted that we're even considering it, given what Wasm stands for.

If folks want to probe for implementation-dependent behaviour, then fine, let them at their own risk, but I don't think the spec should go out of its way to encourage it. As far as I'm concerned, it should rather discourage it.

conrad-watt commented 2 years ago

@rossberg to push back a little, I'm a bit skeptical that the choice we make here in the spec (set vs list) is going to meaningfully influence the kind of programs that practically get written. I think fingerprinting and "implementation-sniffing" programs are hazards that come with this proposal as a whole, and in some sense the proposal is explicitly anticipating/facilitating them (e.g. we're expecting qfma testing, which I'm led to believe will be somewhat common).

If we're pragmatically accepting that this proposal is going to happen, then it makes sense to ask whether the kind of programs we expect to be written will fall within the "well-defined" semantic envelope of the spec. To me, the "set" side of the debate here is about shrinking the envelope to make the spec more editorially coherent, and the question is whether this leaves any desired programs outside the envelope in comparison to the "list" side.

In the long-term, we probably want to have a more joined-up philosophy for how we approach future "relaxed" proposals that may compromise on some aspect of determinism/predictable performance. We've talked before about how the future of Wasm may lie in feature subsetting at least at a spec editorial level (e.g. GC vs non-GC, concurrent vs non-concurrent). It may be that "relaxed" vs "strict" is also a meaningful distinction if we see more proposals like this (to be clear, I think the bar for making a proposal "relaxed" should be much higher, but I think this proposal is fairly well-motivated performance-wise).

penzn commented 2 years ago

@zeux I understand why you disagree with the "fast math is fine" camp, and I want to emphasize that I understand and sympathize with the arguments you are making. I am not trying to vindicate fast math optimizations per se; what I am trying to say is that there are two ways to view this: there is a substantial community that uses fast math, and most compilers consider it safe enough to support despite its known issues (some outliers even go as far as making it the default). More crucially, I think the list vs set distinction is less important than it might appear.

To illustrate that, there are a couple of points. One is that not only fast math breaks ordering, but also more inane things like vectorization (which in the case of Wasm SIMD turns a[0] + a[1] + ... into (a[0] + a[4] + a[8] + ...) + (a[1] + a[5] + a[9] + ...) + ...). ~This means list-non-determinism guarantees won't necessarily hold already with existing SIMD proposal, unless we can somehow ban autovectorization.~ Edit: however this doesn't change output within a run, or on runs over identical data in the same order. I guess what I am trying to say is that there are many ways to change output in a technically observable fashion, without aggressively sacrificing precision, vectorization being one of them. (A sketch of this vectorized reduction appears after these two points.)

Secondly, while strict ordering is a very illustrative example, a more common problem is convergence (the ability to produce better and better approximations); for that, strict ordering is not as critical, though it can depend on the situation. In practice it is hard to prove that an algorithm converges under such-and-such assumptions, as it is hard to quantify the exact range of outputs for a function, and that would be true for both the set and list approaches.
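The vectorized reduction shape described in the first point, as a sketch (wasm32 clang target, standard SIMD128 intrinsics; n is assumed to be a multiple of 4 for brevity):

#include <wasm_simd128.h>

// One v128 accumulator holds four interleaved partial sums: lane 0 accumulates
// a[0] + a[4] + a[8] + ..., lane 1 accumulates a[1] + a[5] + ..., and so on.
// They are only combined at the end - the reassociation described above.
float sum_f32x4(const float* a, int n) {
    v128_t acc = wasm_f32x4_splat(0.0f);
    for (int i = 0; i < n; i += 4)
        acc = wasm_f32x4_add(acc, wasm_v128_load(a + i));
    return (wasm_f32x4_extract_lane(acc, 0) + wasm_f32x4_extract_lane(acc, 1)) +
           (wasm_f32x4_extract_lane(acc, 2) + wasm_f32x4_extract_lane(acc, 3));
}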

Consequently, for any specific input it may produce any of 2^12=4096 results (for some inputs the number of unique results is smaller, 4096 is the worst case bound). Unlike NaN payloads, the difference in results is not just easily observable - it’s almost guaranteed to change the results of any subsequent operations on the result.

To illustrate, I think this is a substantial overcount even as an upper bound, as differences would be relatively far after the decimal point, and given that there are 12 ops, those differences would at least partially cancel each other (computations that can produce NaNs as intermediate results would feel this even less, though those are indifferent to the set-vs-list issue anyway).

zeux commented 2 years ago

One is that not only fast math breaks ordering, but also more inane things like vectorization

I think the more precise language here would be "fast math may unlock vectorization in some cases where it would not be profitable otherwise, while changing the result". This is true, but it doesn't mean auto-vectorization is useless without fast-math. What's more, fast-math is a choice of a developer using a particular compiler - it's not something that C/IEEE mandate. In other words, the computing platform has certain guarantees that the developer of a given application may choose to relax. As a computing platform it feels like Wasm shouldn't assume the weakest possible computational model, as that enforces this choice on all applications regardless of what they think.

To illustrate, I think this is a substantial overcount even as an upper bound, as differences would be relatively far after the decimal point

"relatively far after decimal point" is not a convincing argument because the results are going to be different - it's hard to estimate the impact of that. You're correct that the bound is an upper bound and many of the results may not be unique, as I noted in the original comment - in this specific case it's a "good" case because each argument to fma is the result of the previous fma and so the error doesn't get magnified excessively.

As an experimental result, I've written this program: https://gist.github.com/zeux/bfd9e02cc1ba7dea56b91c62ad6dd169

You can compile & run it with g++ -O2 matmul.cpp && ./a.out; doing so will print

max unique: 576

576 is indeed ~8x smaller than the upper bound of 4k (I'm not sure what the true upper bound is here, note that I'm not trying to specifically generate "bad" inputs - it's just random inputs in 0..1 range without denormals etc.), but it's still substantial. How that affects the code that ends up using these will depend on what the code is doing with the resulting matrix after that.