WebAssembly / design

WebAssembly Design Documents
http://webassembly.org
Apache License 2.0

We need persistent floating-point rounding mode control #1384

Open KloudKoder opened 3 years ago

KloudKoder commented 3 years ago

There's some history to this. Bear with me.

Typical floating-point rounding policies are: chop, round-negative, round-positive, and round-nearest-or-even. Intel made the dumb decision of encoding this policy in the floating-point control word, which means that if you want to change it frequently, you need to manipulate the control word (FLDCW) all the time, which is expensive.

nVidia realized that this was a nightmare for interval arithmetic, which has risen in prominence for scientific computing in the past several years, mainly because it requires constant changes in rounding policy (because, usually, the lower limit goes down while the upper limit goes up after every floating-point instruction). So nVidia did things the right way: they encoded the rounding policy inside the instruction itself.

Looking at your spec, it seems like round-nearest-or-even is the only supported mode in most cases. That makes interval math effectively impossible, which will only limit the utility of WASM for scientific applications.

Please allow us to prefix or otherwise alter floating-point instructions so as to specify the rounding mode. On Intel/AMD, you can just optimize the prefix away if it's the same one as before (and no intervening branch targets exist). If the prefix is never used, then everything is backward-compatible to round-nearest-or-even, so the control word never needs to be written; this is easily checked during code validation. But on nVidia, the prefix gets sucked into a bitfield in the instruction.

Please implement this. It will enable more, and more useful, applications than you might currently imagine.

tlively commented 3 years ago

The best way to fit new rounding behavior into WebAssembly would be as new instructions (rather than an instruction prefix or a persistent rounding mode or something like that). Could you list the new rounding instructions you would like to see?

KloudKoder commented 3 years ago

@tlively Fair enough. I've never been a huge fan of prefixes, either.

To be clear, there's rounding in the sense of "round an existing float to an integer", and then rounding in the sense of "nudge the unit in the last place (ULP) according to the infinitely precise result".

In the first case, we can get by with the usual suspects:

a. Round nearest-or-even.
b. Round toward negative infinity.
c. Round toward positive infinity.
d. Chop toward zero (truncate).

In the second case, we can apply the same rounding policies, but to the ULP instead of the integer part. Implicitly, we first of all have to know the maximum error bounds on the infinitely correct result, because if they're too big, then the ULP is still undetermined. For arithmetic operations, this is straightforward and already done right in hardware.

For operations which may produce irrational results, it's first necessary to know the error bounds on the truncated portion (due to series truncation). One hopes that the CPU designers did this right. I have serious doubts about that, but it might be correct for fsqrt, which AFAIK is your only such operation.

You probably can't ever support other transcendental functions in a well-defined and cross-platform way because the CPU designers probably didn't really nail the error terms. Said functions would need to be implemented by the programmer in terms of your existing arithmetic operations. Slow, but well-defined, cross-platform, and futureproof. Lots of cores, lots of free cycles, no worries. The only other option would be to hope that all CPU vendors implement the same transcendental functions in exactly the same way. Things get really complicated if that happens to be true because they're all wrong in the same way, which is why I think the software emulation approach is a necessary evil.

Obviously, the first case can be done after the fact because we just need to dispose of the fractional part. But in the second case, everything needs to happen in one instruction because once an instruction completes, any extra bits of precision have been forgotten. So if you want individual instructions, you need stuff like "multiply and round the ULP toward negative infinity" and "square root and truncate everything after the ULP". The good news is that, if you do it this way, then it will fit like a glove with nVidia, which is probably more relevant going forward than AMD64. The bad news is that you'll have to employ some smarts to avoid touching the control word too often on the latter. And by the way, all your existing arithmetic instructions would now be suffixed with "and round nearest-or-even".

I'll pause here. Are we on the same page?

tlively commented 3 years ago

Yes, the philosophy here sounds good. We've had similar discussions over in the SIMD proposal. The next part is making sure that the proposed instructions have (reasonably) efficient lowerings on relevant architectures (at least x86/x86-64/ARM). I'm not sure how nVidia fits into the picture, though. WebAssembly doesn't typically run on GPUs. It would also be good to mention what kind of languages or applications would benefit from having these instructions in order to motivate working on them.

KloudKoder commented 3 years ago

"WebAssembly doesn't typically run on GPUs." -- It certainly does, in 2029.

But allow me to suggest a potentially better solution for the second case. It seems to me that the only practical application for ULP rounding is interval math, full stop. Therefore, you could cut down the explosion of instructions and improve performance by providing native interval support. So "divide 64-bit floats" becomes "divide 64-bit float intervals". Yes, there are lots of corner cases, but by doing it this way, you would solve all those problems once and for all, allowing higher-level languages to translate interval types directly into WASM. I am entirely sympathetic to the need to avoid CISC nonsense, but intervals are so powerful and so universal that I think it's worth being a little CISCy in this case.

The other reason to implement native intervals is that higher-level languages do not, to my knowledge, have instructions for ULP manipulation. In many languages, it wouldn't even mean anything because floats are undefined garbage anyway. But it's easy to imagine adding macros to support basic interval functions in popular languages, which then filter straight to the WASM layer. No chance for creating super subtle bugs due to misguided ULP manipulation.

If I recall, an Ariane 5 rocket blew up back in the naughts due to floating-point imprecision which would have been prevented with intervals. And if I'm not mistaken, the safety of the Large Hadron Collider was proven with interval analysis as well. But it's not just about safety. Computation can be exponentially accelerated, in some cases, using intervals, simply because lower precision is often acceptable, and the time savings propagate to many levels of the function tree. This facilitates much faster raytracing, for instance. Fluid dynamics stands to benefit as well. (This is all abstractly related to the Shrinking Bullseye algorithm.)

I'm intrigued by your comments regarding SIMD, but I'll leave that for another thread.

tlively commented 3 years ago

It's not clear to me why interval arithmetic should be a part of the WebAssembly ISA as opposed to a software library that compiles to existing WebAssembly instructions. Do native architectures support interval arithmetic directly? If not, there's probably no benefit to building it into WebAssembly.

KloudKoder commented 3 years ago

These are the reasons that intervals should be supported in WASM:

  1. It's almost impossible to support them in higher-level languages due to the lack of ULP manipulation primitives. What support does exist tends to be extremely slow: the correct result can still be obtained, but only through convoluted means involving type casting.

  2. WASM is the missing link we need in order to support widespread computing-on-demand, precisely because it's well-defined down to the bit level, so it's easy to validate that outputs from untrusted nodes are identical and thus credible. Without well-defined behavior, we end up with approaches which require the programmer to do manual result comparison in a code-dependent way. It just doesn't scale.

  3. Good science and engineering needs interval math, but it's not being exploited much because it's been bolted onto higher-level languages instead of built in at the bottom. nVidia solved this problem, but there's a lack of decent IR code to access it from higher-level languages. WASM could give us that missing IR.

  4. The future of distributed computing is WASM, but only if you can make it scalable. I realize that most people think of it as a browser extension just for running faster web pages, but it could rule the distributed computing world by 2030 if you play your cards right. I can't emphasize this enough. Without intervals, we get scalable ambiguity rather than scalable clarity.

If the only acceptable solution is 4 different rounding flavors of your existing primitives, then I'll take it, but the path to fewer bugs is just to implement intervals at the bottom. This could be extended to SIMD at some point, as well.

titzer commented 3 years ago

Wasm is generally designed as a minimal abstraction over hardware, so the design pressure is to expose (standardized) versions of what is in hardware, and not build the next higher-level concept unless it is impossible not to. In this case it would seem that rounding modes would be both closer to hardware and the more minimal addition. It's my understanding that the rounding mode control design in IEEE 754 is meant for this purpose, and has other purposes too, outside of interval arithmetic.

KloudKoder commented 3 years ago

If you do it that way, then at least the compilation process would be straightforward, along with extension to SIMD. This also means that interval implementation will need to be done by porting these rounding-aware arithmetic primitives to higher-level languages, which is more of an uphill battle. But OK, what's the next step?

bvibber commented 3 years ago

If I understand correctly, you have the same problem with high-level languages for native compilation, so I don't think that's a Wasm-specific problem that Wasm needs to solve.

KloudKoder commented 3 years ago

@brion Well, as I said above, I'm happy enough with the idea of just implementing rounding-aware arithmetic primitives in WASM.

The general problem with ULP manipulation, whether you solve it with such primitives or full blown interval support, is that WASM is likely to become a method whereby distributed computing is facilitated for HLLs. So in effect, it is the hardware. You can't go directly from HLL to CPU/GPU machine code and have it run on anonymous targets all over the world. (Imagine trying to run a distributed computing network which mandates that you need nVidia XYZ GPU in order to even participate.) So you have to go to middleware first, of which the obvious (and perhaps only) choice is WASM. So if the HLL programmer wants a feature that is efficiently implementable in hardware, but WASM doesn't support it, then he's stuck having to do nonperformant emulation.

Yaffle commented 3 years ago

@KloudKoder , can f64.fma help with the performance of interval arithmetic libraries at least for f64?

KloudKoder commented 3 years ago

@Yaffle Can you explain your reasoning as to why multiply-add would help? For instance, how would this accelerate the process of adding 2 intervals, given that presumably it's still going to do nearest-or-even?

Yaffle commented 3 years ago

It helps to find the error of the addition, multiplication, or division of two f64 numbers:

sum = a + b; error = -sum + absmax(a, b) + absmin(a, b)
product = a * b; error = fma(a, b, -product)
quotient = a / b; error = -fma(quotient, b, -a)

If you know the error, you can round the value by incrementing (for example, when error > 0 and rounding to positive infinity) or decrementing (for example, when error < 0 and rounding to negative infinity) the value by the smallest amount. This way, the resulting interval will be tightest. Although, if you don't need the tightest interval, you could just increment/decrement both ends... Btw, there is a website with some benchmarks: http://verifiedby.me/kv/rounding/index-e.html Hm...

KloudKoder commented 3 years ago

@Yaffle Wait a second, are you talking about fma (which I just realized isn't actually in the spec, if the search function is to be believed) or fmax/fmin?

Just starting with addition, we have:

(a, b) + (c, d) = (RoundNegative(a + c), RoundPositive(b + d))

How that would involve a multiplication, I don't presently see. And multiply-add takes 3 parameters, not 1. Clearly, I'm misinterpreting your notation.

Yaffle commented 3 years ago

@KloudKoder , fma is fused multiply-add. I have updated my notation.

KloudKoder commented 3 years ago

@Yaffle I think I finally understand your point. What you mean is that Intel's various FMA instructions compute:

a * b + c

to infinite precision, and then round nearest-or-even. (They don't round after just computing the product.)

This means, for one thing, that:

product = FMUL(a, b)
error = FMA(a, b, -product)

is equivalent to:

error = NearestOrEven(a*b - FMUL(a, b))   (a*b being the exact product)

which by definition is off by no more than half a ULP, so you need only nudge the result by 1 ULP in order to obtain an upper bound on the error.

This is a valid approach, but ULP nudging requires intensive overhead: you need to first treat the result like an integer, then add your ULP (as an unsigned 1), then see if you wrapped into the exponent field, or perhaps created a NaN, etc., then potentially do a rollback and update the result to infinity or whatever. There are other ways to do this, but AFAIK also involving painful overhead. And do note that I haven't demanded the tightest interval -- just a "tight enough" one for practical precision requirements. So do you have a fast way to nudge? If not, then unfortunately it's back to my original point, which is that intervals are nonperformant under the current ABI, relative to what's actually possible on real hardware.

Also, note that I've written FMA above, which alludes to the Intel instruction set. WebAssembly's fma, by contrast, doesn't seem to exist. Do you have a link? It doesn't come up in the doc search on the website.

Finally, from your link:

"addition/subtraction/multiplication/division/sqrt including the rounding mode specification to the command itself"

That's precisely what this proposal is all about. It's great to have that capability, but we need a direct way to exploit it.

I'm happy to have my pessimism disproven...

Yaffle commented 3 years ago

@KloudKoder , https://github.com/mskashi/kv/blob/2ec8bac6216d25b226c09e1740487837a4fc5fd5/kv/rdouble-nohwround.hpp#L97

f64.fma is mentioned at https://github.com/WebAssembly/design/blob/master/FutureFeatures.md#:~:text=fma but if you don't need the "tightest interval", you could just nudge by 1 ulp both ends... although, this could be bad (?).

Anyway, you are right that the overhead could be big.

KloudKoder commented 3 years ago

@Yaffle I now see f64.fma under "Additional floating point operators". I'm willing to assume that it will be adopted as an instruction. And definitely I wouldn't demand the tightest interval, which is unlikely to offer any practical advantages. Now, looking at your alternative nudge:

x + ABS(x) * (2^-53 + 2^-105)

it's clear that the point of this is to add the result (x) to itself shifted down so that its leading 1 appears in the ULP. I'm not entirely sure why we need the (2^-105) -- maybe to break the tie case of nearest-or-even in favor of a nonzero nudge -- but in any event, here's what we would need to do:

  1. Given f64 result x.

  2. If x is +/- 0.0, then nudge it up/down to the smallest positive/negative number and quit.

  3. Subtract 53 from the exponent field. Call the result y. (In some architectures, this might require transport between float and integer registers. Let's just assume there's no cost for that.)

  4. Use a conditional set (flag-to-integer) instruction to remember if #3 resulted in a wrap of the exponent (usually not). This involves several integer instructions, however you do it, so far as I can see.

  5. OR the result of #4 to a something-failed flag, f, which is extremely likely to end up as 0.

  6. Let z, the final result, be the sum of x and y. Plain nearest-or-even rounding will ensure that z is a valid nudge of x.

  7. At some future checkpoint prior to commit, verify that f is 0, and fail out if not.

This does in fact work. It also disrupts the pipeline badly, resulting in a long series of essentially serialized instructions. This could be partially overcome by weaving together a whole bunch of interval operations which can occur in parallel.

Thus it still seems that supporting instructions with builtin rounding modes is far cheaper. One could also just ignore the trend toward interval arithmetic, and thereby make WebAssembly less useful in scientific computing. That would be a shame, though.

Thanks for helping to demonstrate the scope of this problem so thoroughly. Where to go from here?

KloudKoder commented 3 years ago

What if I write the Intel/AMD support for you? Would that offer you maintainers enough motivation to port it to other platforms and make it part of the spec? We can argue over the specifics and corner cases, but there's no point in even discussing them if my proposal here isn't sufficiently compelling.

tlively commented 3 years ago

@KloudKoder thanks for offering to help implement :) Unfortunately there's a lot more than implementation that goes into getting a new proposal standardized, so this is unlikely to advance without more of the community showing interest. See one of my previous comments on why nice-to-have features often don't make the cut, even if they don't seem to have any technical downsides.

KloudKoder commented 3 years ago

@tlively I read your previous comment and I understand why new opcodes are a hard sell.

In this case, that's particularly vexing because intervals have suffered this chicken-and-egg adoption problem since the day they were first proposed for use with IEEE floats. Basically there's a great temptation to avoid using them in high level languages because they're so slow, but no one wants to do the dirty work of optimizing their performance by creating intrinsic support because (surprise!) no one finds them appealing to work with (unless they're desperate for error containment). At the end of the day, making intervals efficient and trivially accessible is a matter of ensuring data security -- just in the analog instead of the digital realm. We're wasting nVidia's (and perhaps others') great contribution of interval-ready floating-point instructions.

One glimmer of hope is that there are already public specs on the expected behavior, available both as a comprehensive body of axioms and also a "diet" version for basic arithmetic. So at least some aspects of their expected behavior are already defined for us.

tlively commented 3 years ago

In this case adding interval arithmetic to WebAssembly wouldn't help that chicken-and-egg problem because it would not be any faster than an interval arithmetic library compiled to WebAssembly unless there is underlying hardware support. I recommend going the easier route of writing such a library for now, and if common hardware natively supports interval arithmetic in the future, it will make sense to revisit adding it to WebAssembly as well.

KloudKoder commented 3 years ago

@tlively I get that, but it's not accurate. nVidia has hardware interval primitives, in the sense of floating-point instructions with builtin rounding modes. They aren't accessible via WebAssembly.

The Intel way involves setting a control word first, which means you need to do a bunch of high limits in series, then a bunch of low limits, in order to minimize control word manipulation. Writing a library doesn't help because this is something that needs to occur on the compiler level. For example, I could create macros in a library which do interval operations in AVX2 or X87, but unless I control the compiler, I can't get it to do those reorderings so as to avoid all the crippling control word manipulation.

You may see WebAssembly as a way to accelerate websites. While I don't disagree, I also see it as a global distributed computing platform just waiting to happen. Golem, for instance, has already enabled that in some hacky way, but it could be so much more if it ensured analog security by including this type of support. This is particularly important when attempting to ensure that results obtained via different platforms (including those not using WebAssembly) are consistent with one another, as floating-point results of complicated computations are seldom exactly equal. No intervals means this effectively can't happen and WebAssembly never reaches its true potential. So there's a lot more to this than just "please make this fast".

tlively commented 3 years ago

Oh I see, I didn't realize this was already supported by Intel hardware. How widespread is the support?

penzn commented 3 years ago

It is pretty much ubiquitous on CPUs, not only on x86, it is what fegetround and fesetround library functions are for. The tricky part is that rounding mode is a "global" (per-thread) state which affects FP operations.

KloudKoder commented 3 years ago

@tlively

Oh I see, I didn't realize this was already supported by Intel hardware. How widespread is the support?

@penzn is correct. This type of rounding support has been present on Intel for decades. As he indicated, it's a terrible design because changing the rounding mode requires a floating-point pipeline stall. nVidia had the foresight to integrate rounding mode directly into its instructions, so for instance you can "add and round toward positive infinity" or "multiply and chop toward zero".

What I'm advocating here -- if not comprehensive interval primitives -- is at least a set of opcodes and/or prefixes which will allow efficient access to the relevant nVidia instructions. Those opcodes and/or prefixes can then be at least somewhat efficiently mapped to Intel by grouping same-rounding-type instructions into dense sequences, so as to minimize rounding mode manipulation. However, this could be done as a performance enhancement; the first release could just change rounding modes literally as specified by the WASM code.

penzn commented 3 years ago

As he indicated, it's a terrible design because changing the rounding mode requires a floating-point pipeline stall.

I don't think I indicated that ;) It is the reality of CPU FP environment, though maybe it could have been designed better.

This type of rounding support has been present on Intel for decades.

If I understand things correctly, it exists on Arm as well, at least the same fenv functions work there. As usual, there might be discrepancies between the two on how different instruction groups interact with rounding mode.

There is a discussion about better IEEE 754-2019 support in Future Features:

To support exceptions and alternate rounding modes, one option is to define an alternate form for each of add, sub, mul, div, sqrt, and fma. These alternate forms would have extra operands for rounding mode, masked traps, and old flags, and an extra result for a new flags value.

This is similar to the original idea for "relaxed SIMD" fpenv: https://github.com/WebAssembly/design/issues/1401#issuecomment-789073168.

A disadvantage of first-class interval support is that it would require runtime optimization on CPU, while we don't yet have a viable GPU implementation where it could be organically supported. However, the idea in "Future Features" suffers from the same issue.

An advantage of interval math is that it can reduce instruction count (two for the price of one), but to be useful it might need to be relatively high-level. For example, a usable interval math library requires more operations than listed above, and the additional operations might not be easy to implement without them also becoming instructions. Another issue is scheduling: to do a few low or high limits in a row, parts of different interval instructions would need to be scheduled together, which requires even more optimization in the runtime or a new execution model.

KloudKoder commented 3 years ago

@penzn Yeah my wording wasn't good. I meant that you mentioned the control word architecture, which in my opinion is a terrible design. The bottom line is that a modest percentage increase in native Intel instruction count due to changing the rounding mode results in a multifold explosion in execution time due to the pipeline stalls. As you clearly understand, the only partial mitigation for that would be to group instructions by rounding mode to the extent possible.

As much as I'd be thrilled to see complete interval support in WASM, it's unnecessary. The nVidia building blocks would be enough for practicality. (I wonder how Apple M1 does it, for that matter.) I understand that there's no GPU support yet, but I would expect that to be in WASM's future evolution. Why do you think that interval support would need to be more high-level? Even transcendental functions (e.g. log) can be straightforwardly computed from basic interval arithmetic.

As to the Future Features proposal, it would be a great help toward efficient interval implementation. I am, however, opposed to anything that would set result flags (e.g. zero, infinity, NaN, etc.) unless that step can be bypassed somehow, such as by writing the flags result to a write-only trash register. The reason is that it adds major overhead to evaluate those flags, even on a hardware level. I suspect that modern FPU pipelines know when they can skip this step, such as when 2 floating-point instructions occur in a row, eliminating the need to set flag bits based on the results of the first one.

penzn commented 3 years ago

What do you think is the minimal set of interval operations which would be implementable and beneficial on CPU as well?

KloudKoder commented 3 years ago

@penzn I can hardly think of a more relevant question for this thread.

First of all, there are really 2 basic architecture choices:

  1. Look like Intel. In other words, have a "set rounding mode" instruction. ("Get rounding mode" isn't necessary because competent software should know what it already set. But no harm if people want that.)

  2. Look like nVidia. In other words, have 4 forms of each instruction: (a) round nearest-or-even (which is what we already have in WASM), (b) chop (inwards toward zero), (c) round toward negative infinity, and (d) round toward positive infinity.

I would gravitate strongly toward option #1 because (a) it will be simpler to optimize for Intel (even though theoretical performance will be identical in both cases) and (b) I don't think anyone wants a massive zoo of similar animals with slight differences in their stripes.

So assuming that we go with this option, then I would strongly caution against the use of any "set floating-point environment" (SFPE) instruction which simultaneously affects exception masking. This will burden rounding mode changes with lots of other logic. Separate these functions across different instructions. (SFPE instructions are OK for the sake of compatibility, but they should be clearly marked as nonperformant.) Critically, neither SFPE nor set-rounding-mode should accept variable inputs. (Imagine having to ask "What's the rounding mode?" prior to every instruction so you know what to execute on nvidia!)

Then the question becomes: what is the minimal set of instructions to be worth discussing here? Probably:

  1. Set round nearest-or-even.
  2. Set chop (round inward).
  3. Set round negative infinity.
  4. Set round positive infinity.
  5. Load ULP.
  6. Nudge inward.
  7. Nudge outward.
  8. Get sign bit (slightly different than "get sign").
  9. Xor sign bits, output to integer or flag (or maybe just "branch if sign bits (don't) differ").
  10. Copy sign bit to another floating-point value.

Just for sake of consistency, we might want the first 4 opcodes above to be assigned in a manner consistent with Intel's RC bits in the FPU control word.

Instruction 5 takes a floating-point (FP) value, then returns the smallest-magnitude FP value which, when added to the input with chop rounding, would change its value. (Implicitly, ULP(X) has the same sign as X.) If the input was either sign of zero, then the output shall be FLOAT_MIN (the smallest denormal) of the same sign. If the input was either sign of infinity, then the output shall be unchanged. ULP comes up all over the place in interval arithmetic, so there's a fundamental justification for this instruction. For example, it would be helpful in case one needs to multiply ULP by another FP value in order to compute a compound error term, as would occur when efficiently computing (Y log X) via (truncated) Taylor series. It's also used all the time in order to know when to stop computing such a series.

Instruction 6 decrements the mantissa (without changing the sign). It decrements the exponent as well if the mantissa wraps, and may result in a denormal value. If the input was either sign of FLOAT_MIN, then the output shall be zero of the same sign. If the input was either sign of zero or infinity, then the result shall be unchanged.

Instruction 7 increments the mantissa (without changing the sign). It increments the exponent as well if the mantissa wraps. If the input was either sign of zero, then the result shall be FLOAT_MIN of the same sign. If the input was either sign of infinity, then the result shall be the same. If the input was either sign of FLOAT_MAX, then the result shall be infinity of the same sign.

Instruction 8 returns the sign bit, which is important if there's a semantic difference between negative and positive zero, which would not be captured by a traditional "get sign of FP operand" instruction. Note that this deals with the sign bit, which in the aforementioned case actually differs from the (arithmetic) sign.

Instruction 9 is important for interval division. For example, the reciprocal of [2, 3] is [(1/3), (1/2)] (at least (1/3) and at most (1/2)). But the reciprocal of [-2, 3] is [(1/3), (-1/2)], meaning at most (-1/2) OR at least (1/3). (Inverted limits imply an open interval, wherein a finite segment is missing from the real line.) The reason is that [-2, 3] contains zero. Moreover, it contains points both to the left and to the right of zero, so the reciprocal approaches infinity of either sign. We need to know if the signs are opposite so we can determine whether or not to exchange the limits. Without this instruction, it's lots of branching with sign extraction just for the basic arithmetic function of division.

Instruction 10 comes up all the time, although good examples escape me at the moment. Without this instruction, it's a lot of pipeline stalls.

I don't mind if the above instructions are undefined for NaN operands, or if they behave as though the NaN is a normal FP value. No point in accelerating broken math. Of course, any undefined operation will potentially leak information about the underlying hardware, and will defeat the possibility of result verification through comparison.

penzn commented 3 years ago

I would gravitate strongly toward option #1 because (a) it will be simpler to optimize for Intel (even though theoretical performance will be identical in both cases) and (b) I don't think anyone wants a massive zoo of similar animals with slight differences in their stripes.

I am all for setting rounding mode instead of first-class interval math. I am also personally OK with persistent rounding mode, though I think in Wasm we have been trying to avoid global state like this - the alternative would be to add variants of math ops with "rounding mode" flag instead of instructions 1-4 on your list.

> So assuming that we go with this option, then I would strongly caution against the use of any "set floating-point environment" (SFPE) instruction which simultaneously affects exception masking. This will burden rounding mode changes with lots of other logic. Separate these functions across different instructions. (SFPE instructions are OK for the sake of compatibility, but they should be clearly marked as nonperformant.) Critically, neither SFPE nor set-rounding-mode should accept variable inputs. (Imagine having to ask "What's the rounding mode?" prior to every instruction so you know what to execute on nVidia!)

+1 on that as well.

>   1. Set round nearest-or-even.
>   2. Set chop (round inward).
>   3. Set round negative infinity.
>   4. Set round positive infinity.
>   5. Load ULP.
>   6. Nudge inward.
>   7. Nudge outward.
>   8. Get sign bit (slightly different than "get sign").
>   9. Xor sign bits, output to integer or flag (or maybe just "branch if sign bits (don't) differ").
>   10. Copy sign bit to another floating-point value.

This looks like a good start, those are standard operations. However denormals can be tricky, since they are not consistent across platforms.

KloudKoder commented 3 years ago

@penzn If the only way to sell this is to have 4 flavors of each FP instruction (one for each rounding mode), then I'd agree to that. nVidia would be pleased (and who knows, might throw down some financial support).

I'm willing to forget about denormals and just replace all above references to them with "the FP values with minimum normal exponent" provided that there's an inexpensive way to know when results are untrustworthy because a would-be denormal was generated. The reason is that higher-level software needs to know that result verification is failing between machine X and machine Y because they handle denormals differently, and not because a physical error occurred on one of them. Otherwise, we would either get into an infinite verification loop or, worse, conclude that a consensus had been achieved when in fact many machines got the wrong answer in the same way. I assume that the way to detect the difference would be to either unmask the denormal exception, or to leave it masked but inspect its (presumably sticky) status prior to issuing the final result. However, if that adds too much overhead, then we might have to rethink this. (Again, to me, "WASM" means "global distributed computing intermediate representation bytecode", so I'm not worried about single-threaded browser operations in this case.)

penzn commented 3 years ago

> @penzn If the only way to sell this is to have 4 flavors of each FP instruction (one for each rounding mode), then I'd agree to that.

That can be just one variant, but taking a constant rounding mode parameter.

About denormals - wasm ultimately supports them, so after a bit of thinking this might be fine the way it is.

I personally think this can be turned into a proposal. What does everybody else think?

KloudKoder commented 3 years ago

@penzn "just one variant"? By "variant" do you mean "alternative proposed means of encoding rounding mode" or redundant supported means of the same?

While denormals were a terrible design decision that exploded complexity in order to support a trivial extension to the representable domain, I suppose we should support them if: (the hardware support is ubiquitous and deterministically compatible among major vendors) or (detecting their appearance, even if they have since disappeared, is trivial and universally supported). It's not clear to me whether the foregoing predicate is true.

As to a proposal, obviously, I'm in favor. I would suggest that any approval involve review from at least one expert involved in IEEE1788 or IEEE1788.1, the full and partial interval arithmetic standards, respectively. In particular, we should see to it that IEEE1788 is efficiently implementable using the proposed primitives. (I believe this to be the case, but I'm not certain.) Not that we're proposing to implement either standard here, but that it should be possible to do so with tolerably modest overhead.

KloudKoder commented 3 years ago

Speaking of experts in interval arithmetic, this is the video that got me thinking about the importance of analog security way back in 2007. The issues raised in the video are even more relevant today because, frankly, so little progress has been made. It's time to change that. The speaker is G. Bill Walster. I'll cut right to the chase, but it's worth watching the full video:

https://www.youtube.com/watch?v=U_Cq_h1vrD0&t=680s

I would reach out to him but am not able to due to security constraints on my end. Perhaps one of you would kindly inform him of this thread. He would make an ideal reviewer.

https://www.researchgate.net/profile/G-Walster

penzn commented 3 years ago

Disregard my previous comment about denormals, I think we can fully embrace them here, since that is what Wasm does overall (also see WebAssembly/simd#2).

The operations you've listed map to standard math operations, we should probably just support them. I suspect CPU-side interval arithmetic libraries build on those today anyway.

> @penzn "just one variant"? By "variant" do you mean "alternative proposed means of encoding rounding mode" or redundant supported means of the same?

I mean new instructions in addition to existing instructions in the spec. Let's say for multiplication we will need to add a new instruction (say) mul_round, taking a rounding mode parameter; then repeat that for the rest of the instructions dependent on rounding. This would be the most spec-neutral way to do it, though it is obviously redundant, as the new instructions would include support for the existing rounding mode. I personally see value in a sticky rounding mode since that is what happens for native targets, though that may be a hard sell. Another hypothetical option is encoding an optional rounding parameter within existing instructions, but (1) it is unclear how to do that and (2) I don't think we have any precedent for that.

I think there is enough material for at least a CG discussion, to go over the operations in https://github.com/WebAssembly/design/issues/1384#issuecomment-810695841 and decide which way of encoding rounding mode is preferable. @KloudKoder can you attend one of the meetings (see CG ones here)? You will need to create a Community Group account for that and dial in via Zoom - the meetings are open to people with interest to participate.

> Speaking of experts in interval arithmetic, this is the video that got me thinking about the importance of analog security way back in 2007. The issues raised in the video are even more relevant today because, frankly, so little progress has been made. It's time to change that. The speaker is G. Bill Walster. I'll cut right to the chase, but it's worth watching the full video:
>
> https://www.youtube.com/watch?v=U_Cq_h1vrD0&t=680s

Thank you for the link - the video is very informative! I am not sure what is W3C process for requesting external reviews, we can bring this up as well.

KloudKoder commented 3 years ago

@penzn OK on denormals. Horrible legacy, but let's continue with it if it already works in WASM.

> Another hypothetical option is encoding an optional rounding parameter within existing instructions

How about this: Instead of creating 3 new instructions for every existing roundable instruction, we just create 3 instruction prefixes (round negative, round positive, and chop). No prefix implies nearest-or-even, so it's backward compatible. Redundant prefixes, prefixes before nonroundable instructions, attempts to define return or jump targets immediately following a prefix, or code which terminates in a solitary prefix will all get busted by the linter at compile time. (I'm just wary of the current and prospective explosion in instruction definitions implied by 4 different rounding modes, especially when we consider the incessant flood of new SIMD that seems to be a reliable industry dynamic.) But this is not a red line for me. If the team wants to handle multiple instruction flavors in lieu of implementing prefixes, then so be it.

> You will need to create a Community Group account

Link, please. Zoom is a giant security hole, but I'll see what I can do. I do understand the value proposition, though.

KloudKoder commented 3 years ago

I just realized that we also need:

  1. Nudge positive.
  2. Nudge negative.

The reason is that, when computing a Taylor series for the sake of estimating a transcendental function, one often encounters a situation wherein the error due to truncation is known to be smaller than a ULP in magnitude, but of unknown sign. In such cases, we need to subtract a ULP from the low limit and add a ULP to the high limit. (By definition, we don't need to consider rounding implications when adding or subtracting a ULP because the result is exact, although some corner cases exist, e.g. the ULP of zero is epsilon, the smallest denormal.) This requires signed nudges rather than zeroward nudges.

To those who, like me, hate adding instruction definitions just for the sake of aesthetics, I will preemptively admit that all nudges are technically unnecessary because one can simply load a ULP (instruction #5), then add it to or subtract it from some other float. But that requires explicit use of a precious register and several instructions. Furthermore, nudges are so fundamental that I believe they deserve their own instructions. It would not at all surprise me to see them in hardware as the semiconductor industry finally wakes up to the relevance of intervals in physical security. That would speed things up and also get rid of the need to waste time and a register evaluating the ULP.

Note also that nudges usually come in pairs. This means that, under the hood, there's an implicit "Load ULP" occurring over and over again. Be careful! This isn't necessarily one and the same ULP, so despite appearances, they cannot in general be elided away.

penzn commented 3 years ago

Link to join CG: https://www.w3.org/community/webassembly/

You need to add an agenda item (open a PR) to one of the CG meetings; 4/27 seems to be quite open. There are also instructions on getting an invite to the call in the meeting files: https://github.com/WebAssembly/meetings/tree/master/main/2021

Let me know if you have any questions, I'd be glad to help.

> To those who, like me, hate adding instruction definitions just for the sake of aesthetics, I will preemptively admit that all nudges are technically unnecessary because one can simply load a ULP (instruction #5), then add it to or subtract it from some other float. But that requires explicit use of a precious register and several instructions. Furthermore, nudges are so fundamental that I believe they deserve their own instructions.

Separate operations allow for additional flexibility in how they can be implemented, though there are other considerations as well, for example performance gains (or lack thereof). At any rate, this can probably be turned into a proposal first and then we can sort out the exact instruction list.

KloudKoder commented 3 years ago

@penzn Thanks. I'm waiting on a reply to my CG signup request. I will post back to this issue when I have a meeting date.

KloudKoder commented 3 years ago

@penzn FYI still no reply to my request.

penzn commented 3 years ago

We might need to ask @dschuff about it.

KloudKoder commented 3 years ago

@penzn Thank you, it's in process now, pending the resolution of a technical issue with signing up. Will report back.

KloudKoder commented 3 years ago

@penzn The person that was assisting me seems to have disappeared. As of now, I finally have a W3C account, but I'm wondering how to schedule a meeting. Do I make an edit to one of the MD files here, or what?

https://github.com/WebAssembly/meetings/tree/master/main/2021

This has got to be the most awkward way of scheduling a meeting I've ever seen, but I get it: we're dealing with a standards body and have to be painfully formal.

penzn commented 3 years ago

@KloudKoder yep, that is correct - edit one of the files and submit a PR with changes. Did you join the community group? The link is https://www.w3.org/community/webassembly/join (it can be found on the CG page). I don't remember if you get added to CG meetings automatically or still need to email the chair.

@dtig, is there space for this in April 27 meeting or would it be completely consumed by discussions on layering and linking? Also, if you have access to meeting invites, can you please take a look at @KloudKoder's situation?

KloudKoder commented 3 years ago

@penzn Looks like we're set for May 11, 2021, assuming I did this PR properly?

penzn commented 3 years ago

I think you need to update the PR with how much time you are planning to take to present, or comment on it; then it will be official: https://github.com/WebAssembly/meetings/pull/756#issuecomment-822616912

KloudKoder commented 3 years ago

Notes for the CG meeting today (nothing etched in stone):

LOAD ULP (LULP)

Load the unit in the last place (ULP) of an f32 or f64. (This instruction works analogously on f32 and f64.) It's important for interval operations whose error bounds are a function of the last term computed in a Taylor series. So for instance one could multiply that last term by the ULP, then widen the interval by that much on each side. It's very expensive to compute with logic instructions and appears all over the place in routine interval calculations.

There are a few different cases:

  1. +-0.0 -> +-epsilon (smallest-magnitude nonzero float).

  2. +-infinity -> +-infinity. There's really no choice about this because the next logical option is NaN, which is worse.

  3. +-small -> +-epsilon where small is anything less than 1.0x2^(-125) for f32 or 1.0x2^(-1021) for f64.

  4. +-other -> +-X where X is the smallest positive float that, when added to other under chop rounding, would change its value.

LOAD HALF ULP (LHULP)

There are distinctly different roles for LULP and LHULP. LULP is used to compute final error bounds, but LHULP is used to test the current Taylor term for triviality. Once the current term becomes smaller than half a ULP, it's often known that the maximum error is either a full ULP, or the product of a full ULP and the current term. It has different corner cases due to discontinuity of exponential domains, which is why it's very expensive to compute from the programmer's level of abstraction.

  1. +-0.0 -> +-epsilon.

  2. +-infinity -> +-infinity.

  3. +-small -> +-epsilon where small is anything less than 1.0x2^(-124) for f32 or 1.0x2^(-1020) for f64.

  4. +-other -> +-(X/2) where X is the smallest positive float that, when added to other under chop rounding, would change its value.

NUDGE INWARD (NIN)

This instruction is used to expand an open interval by moving one of its bounds inward toward zero by a ULP. (Open intervals are implied by their bounds being listed in reverse. They mean "everything on the real line except for this interval".) Note that [-0.0, -0.0], [+0.0, +0.0], and [-0.0, +0.0] all mean "only zero" while [+0.0, -0.0] means "everything except zero".

  1. +-infinity -> +-infinity.

  2. +other -> max(+0.0, +other -LULP(+other)) where (-0.0)<(+0.0).

  3. -other -> min(-0.0, -other -LULP(-other)) where (-0.0)<(+0.0).

As you can see, it's quite expensive.

NUDGE OUTWARD (NOUT)

This instruction is used to expand a (closed) interval by moving one of its bounds outward from zero by a ULP.

  1. +-infinity -> +-infinity.

  2. +-other -> +-other +LULP(+-other).

NUDGE POSITIVE (NPOS)

This instruction is used to increase an interval bound by the upper bound of an error term known to be positive but less than the absolute value of a ULP. This instruction may seem trivial, but it's not, because there are fast ways to add or subtract a ULP without going to the trouble of creating a ULP in its own register.

  1. +-infinity -> +-infinity.

  2. +other -> +other +LULP(+other).

  3. -other -> -other -LULP(-other).

NUDGE NEGATIVE (NNEG)

This instruction does the opposite of NPOS.

  1. +-infinity -> +-infinity.

  2. +other -> +other -LULP(+other).

  3. -other -> -other +LULP(-other).

GET SIGN BIT (GSB)

This is equivalent to returning the arithmetic sign of a float, except that it considers (-0.0) as a negative number. This is intended to help interval code figure out which way it needs to nudge. It's debatable whether this bit should be sent to an integer register or a flag. Perhaps a branch version is as good or better.
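In C terms this is signbit(), which inspects the raw bit and therefore sees -0.0 as negative where an arithmetic comparison does not:

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* GSB sketch: the raw sign bit as a boolean. Unlike (x < 0.0), this
 * distinguishes -0.0 from +0.0. */
static bool get_sign_bit(double x) { return signbit(x) != 0; }
```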

SET SIGN BIT (SSB)

Copies the sign bit from an integer's bit zero, or perhaps the carry flag.

XOR SIGN BITS (XSB)

Xor the sign bit of one float to the sign bit of another. Useful for avoiding lots of branches during interval inversion, such as can occur during interval division.
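Sketched as a branch predicate in C (hypothetical name):

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* XSB sketch: true when the raw sign bits of two floats differ - the
 * test interval division needs in order to decide whether a quotient's
 * limits must be exchanged. */
static bool sign_bits_differ(double a, double b) {
    return signbit(a) != signbit(b);
}
```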

COPY SIGN BIT (CSB)

Copy the sign bit of one float to the sign bit of another. (Is this the same as CopySign? I think we need an instruction for the sign bit, and another for the arithmetic sign.)
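For what it's worth, this does appear to line up with C's copysign() - and with the f32.copysign/f64.copysign instructions Wasm already has - both of which transfer the raw sign bit, -0.0 included:

```c
#include <assert.h>
#include <math.h>

/* CSB sketch via copysign(): the magnitude of the first operand with
 * the raw sign bit of the second, so a -0.0 sign source yields a
 * negative result. */
static double copy_sign_bit(double magnitude, double sign_source) {
    return copysign(magnitude, sign_source);
}
```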

ROUND NEAREST-OR-EVEN, CHOP, ROUND NEGATIVE, ROUND POSITIVE

Are these prefixes, instruction flavors, or part of a global variable describing rounding policy? Prefixes provide most backward compatibility with least future instruction expansion. Need to discuss.

conrad-watt commented 3 years ago

I may not be able to make the meeting today, so I'm just getting my opinions out.

KloudKoder commented 3 years ago

For the record, the presentation of this issue to the community group is being rescheduled, tentatively for June 8. This is due to cascading technical failures on my part which prevented me from participating today, despite a valiant attempt by the hosts to assist. I know how to prevent this from occurring next time.