WebAssembly / design

WebAssembly Design Documents
http://webassembly.org
Apache License 2.0

Floating-point rounding mode control prototyping #1456

Open KloudKoder opened 1 year ago

KloudKoder commented 1 year ago

This is the issue for testing out approaches to floating-point rounding mode control, which is required for performant interval arithmetic. The original informal discussion is found here.

For an understanding of the practical ramifications of interval arithmetic including some use cases, I suggest this YouTube presentation by G. Bill Walster.

For starters, Thomas Lively identified the Boost library, boost::interval. An example of its rounding control approach is here.

Feel free to add references to practical use cases or libraries related to intervals. The ultimate output of this issue, hopefully, will be a performance test. Conrad Watt has suggested a switch-based approach (for testing, not actual implementation) for every floating-point instruction, wherein the subject of the switch is a rounding control (RC) global variable (nearest-or-even, round toward negative infinity, round toward positive infinity, or chop toward zero). His hypothesis is that the WASM compiler will eliminate many of the switch statements because the rounding mode (RM) at any given point is usually known at compile time. This might at least allow us to do some crude benchmarking.

A more painful but performant approach would be to have the same virtualized RC register, but to reorder instructions where possible in order to minimize the frequency of touching the actual hardware RC. This would also compile seamlessly in architectures in which the RM is embedded in the floating-point instruction opcode itself (in which case the reordering would be redundant).

At this point, we're only concerned with RISC, but as Deepti Gandluri pointed out, any mature implementation would need to address the same issues with SIMD, wherein each parallel operand adheres to a common but custom RM.

@conrad-watt @tlively @jakobkummerow @dschuff @dtig

tlively commented 1 year ago

Since the actual proposal is to add more control of rounding modes to Wasm, it would probably make sense to broaden the scope from interval arithmetic to any use of different rounding modes. Casting a wider net will help find more motivating use cases.

Feel free to add references to practical use cases or libraries related to intervals. The ultimate output of this issue, hopefully, will be a performance test.

I'm not as concerned with trying to do more performance tests at this point, since you convincingly explained how slow emulating different rounding modes would be. I'm more concerned right now about having concrete motivating use cases. Ideally we would find someone with something (an application, a library, a language, etc) they would like to port to Wasm for whom the lack of rounding mode control is a significant hurdle. Even with a perfect design and convincing benchmarks, it would be hard to motivate engine and tool implementers to work on this (or anything else) without such a concrete use case.

KloudKoder commented 1 year ago

I hear that.

For now, some history of the Boost interval package going back to 1999 or so (beware, not HTTPS):

http://www.mscs.mu.edu/~georgec/IFAQ/maurer1.html

"Originally, I only intended the package to assess rounding errors in lengthy computations. Since then, I've learned that there are other application domains where interval arithmetic is useful" <- And with this quaint understatement, the analog security era was born!

The point being simply that this is hardly a new innovation (not that anyone here seems to think that). Note that G. Bill Walster was deeply involved. More to come by way of current examples.

KloudKoder commented 1 year ago

To address your specific comment about broadening the scope, I certainly don't oppose that. I just don't personally know of any practical uses for RM tweaking apart from interval arithmetic. The one sort of exception would be integer rounding. On x86, for one: "The FRNDINT instruction returns a floating-point value that is the integral value closest to the source value in the direction of the rounding mode specified in the RC field". I'm not sure how float-to-int is implemented in WASM, but in theory it would be amenable to RC flavors just like any other floating-point instruction, though with a very different effect (nudging by up to 1.0 as opposed to one unit in the last place (ULP)).

KloudKoder commented 1 year ago

TL;DR: Some algorithms are polynomially slower without interval arithmetic; others simply don't work.

Alright, I promised to list some practical examples. On the one hand, you're obviously not going to find people ranting that they can't get intervals to be performant (or even correct, due to the lack of compliant libraries) in WASM because, like, what purpose would that serve without going through the entire process that we're literally going through here and now? From the public discourse I've read on various forums, it's clear that people just assume they need a specialized library to run intervals in a performant manner on a particular piece of hardware. Pending any change to WASM, that seems to be just how it is.

Therefore, in the absence of any published rant about this, I'll attempt to provide the next best thing, which is a list of compelling applications that I've gathered from public sources. Here goes:

Is that helpful?

jakobkummerow commented 1 year ago

Is that helpful?

It's a list of situations where interval arithmetic may be useful. Nobody argued that interval arithmetic is never useful. (I think some of the specific entries on this list are quite debatable, but that's beside the point.)

What folks were asking for is a concrete example of an application/library/etc that would like to adopt Wasm but currently can't because of lack of control over rounding modes.

To state it very clearly, there are two issues: (1) People across the ecosystem are busy. There are many possible things to work on (just look at the list of in-flight proposals!). In order to convince anyone to work on this, some evidence is needed why this is impactful. "I think it could be useful for X, Y, Z" does not qualify as sufficient evidence, neither does "I think this will be useful once Wasm runs on GPUs". (2) Assuming someone (maybe yourself?) starts working on this, to make a proposal advance through the phases, feedback will be required: is it working correctly? Is it as helpful as expected? Is it performing well enough? If all you have is a proposed specification and a prototype implementation, you can't really answer these questions. There'll have to be a project that says "we previously couldn't adopt Wasm, but now we can thanks to this" or "our previous Wasm port was slow, but adopting this feature yielded great improvements", or similar. (Of course, it could also happen that the feedback is "we thought this would unblock us, but then we experimented with the prototype and it turned out that it doesn't actually solve our problem", which would be very useful feedback as well, and could cause either redesign or abandoning of the proposal.)

To reiterate: nobody has said that we should never add support for control over rounding modes to Wasm; on the contrary, people were generally supportive of this feature. But to actually start working on it, there needs to be evidence of concrete usefulness. And to decide what the details should look like, there needs to be a collaboration with actual users of the feature.

If your statement:

you're obviously not going to find people ranting that they can't get intervals to be performant (or even correct, due to the lack of compliant libraries) in WASM

is another way of saying "yeah, admittedly, nobody is asking for this", then realistically that means that nobody is going to work on it either, unless and until someone does ask for it.

KloudKoder commented 1 year ago

I have a few progress updates on this. Basically I've been trying to contact the team leaders or authors of various interval applications across a variety of languages. Based on the feedback I've received, the common themes are as follows:

For my own part, there are a few interval-dependent projects that I've wanted to deploy in WASM for a long time, but of course I realize that the point here is to surface third parties with a similar problem. (I haven't complained about this on any forum, so my own problems in this regard are no more searchable than anyone else's.)

It's a tough slog but I haven't given up.

Chris00 commented 1 year ago

I'd like to chime in to support this proposal. I'm a collaborator to inari, a Rust interval arithmetic library. We would definitely like to have this library work on WASM. I use inari to do computer assisted proofs and it would be great to be able to make some of them available to colleagues and students as web apps. I am not the only one in need of this: someone ported inari to WASM but, due to the lack of rounding primitives, the author of this library chose to forego the inclusion guarantees that are essential to any valid interval arithmetic library. 😬

The author of inari agrees to port the library to WASM once the required primitives are in core::arch::wasm32, and I agree to perform some testing. That implies that you will have to collaborate with the developers of Rust so we can make use of the rounding primitives.

I hope that directed rounding primitives will materialize in WASM. That will be good for the web, but also for sending executables to colleagues who don't program, so they can play with them safely, whatever their operating system (executing them with Wasmtime).

titzer commented 1 year ago

Thanks for that info @Chris00.

I support the addition of a rounding immediate to all floating-point bytecodes that do rounding. It turns out to also be useful in situations short of full-blown interval arithmetic, e.g. comparing a floating-point number to an integer that cannot be represented exactly. If we compare f < i, then we want to round i toward positive infinity before doing a float comparison against f, and if we have i < f, we want to round i toward negative infinity before comparison.

Unfortunately because of the binary encoding, that means encoding rounding modes in immediates requires new, prefixed bytecodes. (in hindsight, I wish we had reserved a 0 byte for floating point opcodes).

KloudKoder commented 1 year ago

@Chris00 "That implies that you will have to collaborate with the developers of Rust so we can make use of the rounding primitives." It's possible to port Inari to WASM even without Rust support (although Rust would be the next obvious target for the inclusion of rounding primitives). You could do this by manually coding a few lines of WASM instructions in order to obtain the desired behavior (once primitives are supported), then calling the resulting WASM module from Rust as needed with some sort of wrapper that works on the library or LLVM level. Then at some point, if Rust agrees to nativize support, you can remove the wrapper. Granted, this approach is a bit awkward, but you could have your library available on the Web before the Rust team even decides what they want to do. Right?

@titzer Yes indeed. Totally forgot about that. It happens if, say, your 64-bit integer won't fit into a 64-bit float because the mantissa is only 53 bits. (This didn't happen on the 8087 math coprocessor from 1982 because it converted everything internally from 64-bit to 80-bit precision, unless you manually downshifted to 64-bit precision through the PC bits. But in SSE2, which I assume is how WASM implements its floating point, it's really truly just 64 bits, so you have the problem. Ditto on any other modern architecture I'm aware of.) At that point, the comparison result is sort of random because you can't control the rounding policy. Speaking of which, if this feature request gets implemented, then we should look at the existing integer-to-float and float-to-integer functionality in WASM and ensure that it, too, ends up with the same 4 modes as all the other floating-point instructions.

KloudKoder commented 1 year ago

Anyone still not satisfied with the use cases on the table?

@conrad-watt Can we get something off the ground here so that the Inari team can prototype their library, probably along the lines of the switch statement macro you proposed? Let me know what you need.

tlively commented 1 year ago

I think we have enough motivation here to spin up a phase 1 proposal for per-instruction rounding mode control. That's just a matter of writing down the proposed instructions and putting a presentation and phase 1 poll on a CG meeting agenda. The tricky part will be after that when you have to find folks (possibly including yourself) with the time and motivation to work on implementations.

KloudKoder commented 1 year ago

@tlively Sorry for the slow reply. If you have any link to a previous phase 1 proposal that you think is well written, please share it. I'll try to make mine conform to the same format so as to avoid confusion. Once it's done I'll schedule a review and poll at a future CG meeting.

I'm willing to help on the implementation (with the caveat that I'm hardly a WASM expert).

tlively commented 1 year ago

There's no specific format you need to match, and the exact content will depend on the proposal. Here's an example of a phase 1 proposal presentation for a change to validation: https://docs.google.com/presentation/d/1-ajjGZpjAiGYOJlwswij9Mq6YGltmuELg3tbhu30VrI. That presentation was given at this meeting: https://github.com/WebAssembly/meetings/blob/main/main/2020/CG-09-29.md.

Basically you want to cover 1) the motivation(s) for the proposal, 2) an initial design for what would be implemented, and 3) an explanation of how it would be used and by whom.

KloudKoder commented 1 year ago

@Chris00 @tlively (and all concerned). Working on this. I'm trying to keep the required instructions as simple as possible. The 4 standard RMs are a must, but there also needs to be sign inspection (mainly for multiplication and division of intervals). I notice that WASM already has f32/64.copysign, but when I click on the link on this page, it redirects to a page about float instructions which doesn't discuss copysign itself. (The whole WASM site seems to have this weird CSS setting where you can't search for text within a page because the hits are not highlighted, making it very difficult to search for specific definitions.)

In any event, f32/f64.copysign isn't sufficient because the output is just another float. What's needed is a way to efficiently read the high bit of a float. I guess this can be done via f32/64.reinterpret_i32/64, but that sort of idiomatic casting makes it hard for the compiler to do its job efficiently, so "Get Logical Sign" and "Get Arithmetic Sign", both from float to integer, would be useful to have.

The other issue concerns -0.0 vs +0.0. There are ways of redefining the floats so that this distinction is meaningful, but in the interval world, they're identical because they amount to the same boundary. Allowing the distinction to persist throughout a given computation can create the appearance of unequal bounds, which can result in improper branching. (For example, the reciprocal of (-0.0, +0.0) is the reals, whereas the reciprocal of (+0.0, +0.0) is +infinity.) Therefore, we need a way to flush them both to +0.0. This would likely occur often because multiplication (and division) have radically different results depending on the particular signs of the respective upper and lower bounds. Therefore I'm leaning toward the need for a "Unify Zeroes" instruction, but again, it could be done through either (1) more branching on every single multiply or (2) hacking -0.0 to +0.0 by passing through the integer world.

One small piece of good news is that I now suspect we can avoid (1) any sort of "Get ULP" instruction and (2) the need for outward rounding (antichop). This is because you need to inspect the signs of the bounds of each interval that you're multiplying before doing so, so therefore you already know how to round each of the resultant bounds in light of their signs.

Any thoughts on this are welcome. Barring that, I'll just partition my proposal into "necessary" vs. "nice-to-have".

sunfishcode commented 1 year ago

In any event, f32/f64.copysign isn't sufficient because the output is just another float. What's needed is a way to efficiently read the high bit of a float. I guess this can be done via f32/64.reinterpret_i32/64, but that sort of idiomatic casting makes it hard for the compiler to do its job efficiently, so "Get Logical Sign" and "Get Arithmetic Sign", both from float to integer, would be useful to have.

Another option is to copysign the sign bit onto a 1.0 value, and then compare that. That said, I wouldn't necessarily object to a "read the sign bit into an integer" instruction. I'd guess that comparing as less-than-zero using the lt instruction suffices for the "arithmetic sign".

The other issue concerns -0.0 vs +0.0. There are ways of redefining the floats so that this distinction is meaningful, but in the interval world, they're identical because they amount to the same boundary. Allowing the distinction to persist throughout a given computation can create the appearance of unequal bounds, which can result in improper branching. (For example, the reciprocal of (-0.0, +0.0) is the reals, whereas the reciprocal of (+0.0, +0.0) is +infinity.) Therefore, we need a way to flush them both to +0.0. This would likely occur often because multiplication (and division) have radically different results depending on the particular signs of the respective upper and lower bounds. Therefore I'm leaning toward the need for a "Unify Zeroes" instruction, but again, it could be done through either (1) more branching on every single multiply or (2) hacking -0.0 to +0.0 by passing through the integer world.

An efficient way to unify zeros is to add 0.0 to a number, since -0.0 + 0.0 evaluates to 0.0.

That said, I'm not very familiar with interval arithmetic, but this approach to interpreting negative zero surprises me. In the reciprocal function f(x) = 1/x, the limit as x approaches zero isn't just positive infinity. It's positive infinity or negative infinity, depending on which direction you approach zero from. I would have guessed the interval arithmetic would want to represent a conservative bound, which would mean you wouldn't it want to ignore the possibility that a value with an interval at zero might represent a zero approached from the negative side, or even just a finite negative value which got rounded to zero.

KloudKoder commented 1 year ago

@sunfishcode You raised some important considerations.

First of all, at least in AMD64, you can't just do comparisons on floats. There needs to be a signal propagation to the RFLAGS register (in the integer unit) in order to conduct a subsequent branch. There are lots of ways to do this, but the point is that it takes a number of instructions and, unfortunately, some heavy pipeline synchronization. (This is hidden to some extent by branch prediction, but it's still costly.) That's why I think we're stuck with setting an integer based on the result of the float comparison. Unless I'm mistaken, WASM doesn't have any equivalent of RFLAGS, so producing an integer result is the next most efficient approach:

"Get Logical Sign" means: "Copy sign bit from float into integer, and clear all but the LSB of the integer".

"Get Arithmetic Sign" means: "If float is -0.0 or +0.0, then set integer to 0. Otherwise, if float is negative (anything, even NaN), then set integer to -1. Otherwise set integer to 1."

I actually tested (-0.0)+(+0.0) and got +0.0. I'm not in a position to expediently verify whether this is standard in IEEE754, but if it is, then you deserve major credit for eliminating the need for "Unify Zeroes"! (If nobody else knows then I'll have to dig in.)

As to the 1/x limits thing, yeah, it's a serious problem. (Namely, math doesn't have a notion of -0.0, but IEEE754 does because it's used idiomatically as (-epsilon).) The fundamental question is whether or not you want to retain sign information. In other words, should the reciprocals of (-0.0, -0.0), (-0.0, +0.0), and (+0.0, +0.0) all be one and the same, or unique? We could just say "It's whatever the RM says it is under IEEE754". And we'll have to. But does that then leave the library providers with cases that they can't efficiently handle? If so, then we may need another instruction. (I'm leaning towards that being unnecessary, but I'm not super confident that I haven't overlooked something.)

I guess one way to resolve this impasse is to say that if you do want those intervals to be handled as one and the same, then you're going to have to first zero-unify the limits by adding +0.0 to both of them.

What do you think?

Ping @Chris00

sunfishcode commented 1 year ago

"Get Logical Sign" means: "Copy sign bit from float into integer, and clear all but the LSB of the integer".

f64.copysign(1.0, x) < 0 implements this. But I wouldn't necessarily be opposed to adding a dedicated opcode for this if there's a performance case for it.

"Get Arithmetic Sign" means: "If float is -0.0 or +0.0, then set integer to 0. Otherwise, if float is negative (anything, even NaN), then set integer to -1. Otherwise set integer to 1."

Ah, ok. This looks a little more involved, but it probably also comes down to how much a dedicated opcode would help performance in realistic use cases.

I actually tested (-0.0)+(+0.0) and got +0.0. I'm not in a position to expediently verify whether this is standard in IEEE754, but if it is, then you deserve major credit for eliminating the need for "Unify Zeroes"! (If nobody else knows then I'll have to dig in.)

It is standard in IEEE 754, and it's in the testsuite.

Also, I figured out the error in my question about negative zeros. If there's a non-zero negative value which rounds up to -0, interval arithmetic would already have a non-zero lower bound that is less than it. So whenever you'd have a -0 lower bound, it really is the same as +0. There still are some tricky situations with approaching zero from the negative direction, but I expect these are just part of the trickiness that comes with treating "infinity" like a number.

KloudKoder commented 1 year ago

First of all, I guess you've really found a way to avoid having "Unify Zeroes". Kudos!

Secondly, what about your f64.copysign(1.0, x)? According to the list of instructions, wouldn't that only work if x were a float? I see only one definition for f32/64.copysign, which is "[f32/64 f32/64] -> [f32/64]" (which really doesn't make sense because it has 2 inputs). But when I put "copysign" in the search dialog I get a whole bunch of nonspecific links to nothing in particular about copysign. Someone will hopefully make the website easier to navigate at some point, but for now I guess I'm stuck unless I can dig out a more navigable spec somewhere (or even a lousy PDF). Anyways, let's sync this issue first so we understand what value a new opcode would or would not offer.

Third, back to -0.0. Consider this:

(-1.0, -0.9)/(+infinity) = (-0.0, -0.0)
(-1.0, +0.9)/(+infinity) = (-0.0, +0.0)

But as you effectively pointed out, both of these intervals are just points, namely zeroes. This is literally correct interval arithmetic even after rounding (and IEEE754 behaves as such, I believe). The problem is that now your code thinks the outputs are different because, hey look, the bits don't match (even if you f32/64.reinterpret)! So you branch off the wrong way and it's all downhill from there. However, I now think the solution is simply to use your +0.0 addition trick whenever and wherever it matters; we don't need any special instruction to fix this, even if the code has to worry about it all the time.

Unless maybe @unageek (or anyone else lurking) has some problem with it.

tlively commented 1 year ago

The semantics of copysign are given here: https://webassembly.github.io/spec/core/exec/numerics.html#op-fcopysign

copysign_n(z1, z2)

  • if z1 and z2 have the same sign, then return z1
  • else return z1 with negated sign.
titzer commented 1 year ago

It's also worthwhile noting that fN.copysign is not typically a hardware instruction; at least not on any of the ISAs that V8 supports. It's done with a couple of instructions of bitwise logic.

KloudKoder commented 1 year ago

@tlively @titzer Thanks. Definitely copysign isn't going to fit the bill, unfortunately. It's necessary to branch based on the state of signs prior to interval multiplication, for instance. Because this is going to be done as often as interval multiplication itself, this would support the creation of the aforementioned pair of "get sign" type of instructions.

I'll be creating a proposal for CG review soon. Meanwhile, anyone with further concerns is invited to speak up.

KloudKoder commented 1 year ago

I've been reading through the WASM instruction set reference (thanks @tlively) so as to try and not mess this up. Here's what I'm thinking:

There are a few ways to do this, but ultimately the simplest is probably with "metaops", i.e. instructions that tell the compiler how to compile. (This could be used for lots of other stuff in the future like branch hints, cache proximity hints, reentrant compilation, atomic transactions, secure enclave implementation, etc.) We might as well get started with a generic approach that could be repurposed for all those other potential uses down the road.

Rather than have a single opcode for all metaops, which would just result in another level of indirection, we can assign one for each specific purpose. By definition, metaops don't do anything visible in and of themselves; they have side effects which may manifest on subsequent instruction boundaries, or only in the analog world, e.g. power consumption or latency effects.

In this case, we need 4 opcodes: round nearest-or-even (implied in effect for backward compatibility), round negativeward, round positiveward, and chop. In the interest of simplicity, it doesn't matter if you use these as prefixes (i.e. immediately prior to a float instruction) or not. You could even be silly and put lots of them in a long chain after a return instruction, if you like. (Why tax the validator with such innocuous concerns?) In any case, their effect should persist until either a different mode is selected, or one of the following instructions is encountered: br, br_if, br_table, return, call, or call_indirect. Those instructions are guaranteed to reset back to nearest-or-even because we don't want to deal with weird cases where a branch target ends up with a superposition of RMs, or a call instruction ends up bouncing off the OS and suffers some resulting reset of the float pipes (can that even happen?). Ultimately, as I've said before, the optimizer should rearrange instructions so as to minimize the number of RM control register touches (but this is not required).

And BTW the RMs affect not only arithmetic instructions, but the sort of conversions mentioned above, wherein the target lacks sufficient precision to fully represent the source. It's even possible that one RM will produce a finite approximation while another RM will generate an infinity in response to the same input.

There are also the aforementioned sign extraction instructions, but those are straightforward.

Thoughts?

tlively commented 1 year ago

@KloudKoder, the direction sounds good! As far as terminology goes, I would describe these new instructions as normal instructions that modify new rounding mode state made explicit in the store rather than as "metaops." We want to avoid the impression that the behavior of these instructions is meant to be outside the formally specified semantics.

For features that actually are outside the specified semantics, there is a separate effort being led in the branch hinting proposal to develop a framework for code annotations. I want to emphasize that rounding mode control should not be using that framework because it does affect semantics.

I believe @titzer previously expressed a preference for adding individual new arithmetic and conversion instructions with explicit rounding behavior rather than introducing new rounding mode state and instructions to modify it and I share that preference. At least for optimizing engine tiers, I wouldn't expect this to have a performance difference, especially since you envision the rounding mode being reset at the end of blocks anyway. It would be great if you could comment on the trade offs you see between these two designs as part of your presentation.

KloudKoder commented 1 year ago

@tlively It sounds like you're advocating for 2 different schemes. (Sorry if I'm just misinterpreting your comment.) To clarify, do you actually want many "new arithmetic and conversion instructions with explicit rounding behavior" or 4 new instructions with "the rounding mode being reset at the end of blocks"? (One of the foregoing plus the 2 sign extractors, actually.)

Yep, I got your point about metaops, which is fine.

titzer commented 1 year ago

It's an unfortunate encoding mistake, in retrospect, that we didn't reserve a 0 byte for floating point ops. But as it is, I think we could carve out a prefix page for floating point ops that include an explicit rounding mode, and then put (actually, repeat) all ops that use a rounding mode under that prefix, with the rounding mode explicit in them.

My preference against the implicit rounding-mode state option is that it makes otherwise pure instructions dependent on that global state. I get it that most hardware works this way, but my impression is that hardware architects don't like that either. RISC-V actually has both: the 3-bit rounding-mode field encodes the static rounding modes (including all four discussed here) plus a dynamic rounding mode. AFAICT they did this because of the software dependency on managing rounding-mode state inherited from previous architectures.

rossberg commented 1 year ago

Strongly agree with what @titzer and others have said: an ambient mode state should be avoided, not least because it has terrible composability, i.e., gets in the way of transformations like inlining.

KloudKoder commented 1 year ago

@titzer @rossberg Fine by me, with the understanding that your validator now has to worry about the following:

  1. No control transfers allowed to the byte after the RM prefix. So you can't jump into the middle of "chop multiply" in order to just "multiply" for instance.
  2. No redundant prefixes, e.g. "nearest-or-even chop divide".
  3. No prefixes to nonfloat instructions, e.g. "round-positive return".
  4. No prefixes to float instructions with no RM sensitivity, e.g. "round-negative f32.min".
  5. Functions (and even just scopes) cannot terminate with prefixes.

Implicitly, prefixes have no lifetime beyond the prefixed instruction itself; it all reverts to nearest-or-even thereafter.

OK?

rossberg commented 1 year ago

@KloudKoder, technically, none of this affects validation, these are solely questions of binary format parsing, i.e., syntax. In particular, note that (1) isn't an issue at all, since branch targets are given by block labels, not random code positions.

Also, when @titzer says prefix, I think he doesn't mean mode prefixes, but the likes of group prefixes 0xfc, 0xfd, 0xfe, which we already use for all instructions related to GC or SIMD etc. Not sure if new float instructions would even need their own prefix, they could easily go under the existing 0xfc, which already hosts saturating truncations, for example.

In short, this is neither new nor a complication. :)

KloudKoder commented 1 year ago

@rossberg Thanks, I get what you mean. So in fact just 3 additional flavors of each RM-sensitive instruction, all probably living under the 0xFC opcode group, along with the 2 sign extractors wherever we can find the space. Alright, then, I'll compose a proposal and report back.

KloudKoder commented 1 year ago

Alright, I think I've identified every single instruction whose behavior would be altered by a change in RM. They're as follows:

91 f32.sqrt
92 f32.add
93 f32.sub
94 f32.mul
95 f32.div
9F f64.sqrt
A0 f64.add
A1 f64.sub
A2 f64.mul
A3 f64.div
B2 f32.convert_i32_s
B3 f32.convert_i32_u
B4 f32.convert_i64_s
B5 f32.convert_i64_u
B6 f32.demote_f64
B7 f64.convert_i32_s
B8 f64.convert_i32_u
B9 f64.convert_i64_s
BA f64.convert_i64_u
FD 5E f32x4.demote_f64x2_zero
FD E3 f32x4.sqrt
FD E4 f32x4.add
FD E5 f32x4.sub
FD E6 f32x4.mul
FD E7 f32x4.div
FD EF f64x2.sqrt
FD F0 f64x2.add
FD F1 f64x2.sub
FD F2 f64x2.mul
FD F3 f64x2.div
FD FA f32x4.convert_i32x4_s
FD FB f32x4.convert_i32x4_u
FD FE f64x2.convert_low_i32x4_s
FD FF f64x2.convert_low_i32x4_u

I couldn't find any documentation on the last 2, so I need to confirm that they're actually impacted by RM. (Are they taking 2 products of 2 i32s? Or concatenating them or what?) If I've neglected or wrongly included something, please speak up.

In keeping with the KISS principle, and considering that we're in no desperate shortage of root-level opcodes, I'd like to propose the creation of 3 additional prefixes (tentatively F9, FA, and FB), which serve as RM indicators for round-positive, round-negative, and chop, respectively. This would match the opcode order of the existing "ceil", "floor", "trunc", and "nearest" instructions in the spec (but FC would be skipped). This would, in turn, result in minimal complexity growth for upstream compilers. It would also minimize the pain of adding additional float or SIMD instructions in the future. The rules would be simple:

  1. If you're using nearest-or-even with any instruction in the list, then nothing changes.

  2. If you're using round-positive, round-negative, or chop, then use prefix opcode F9, FA, or FB, respectively. Follow this with the equivalent nearest-or-even encoding, but skip the leading FD if it's present.

In the interest of simplicity, I'll post about sign extraction separately.

KloudKoder commented 1 year ago

Now for the sign extractors. While such instructions would be useful for a wide variety of applications other than interval arithmetic, I'm just going to focus on the latter. To summarize, there are 8 cases from which to choose upon each multiply:

(+a, +b) (+c, +d) = (+ac, +bd)
(-a, +b) (+c, +d) = (-ad, +bd)
(-a, -b) (+c, +d) = (-ad, -bc)
(-a, -b) (-c, +d) = (-ad, +ac)
(-a, +b) (-c, -d) = (-bd, +ad)
(+a, +b) (-c, -d) = (-bc, -ad)
(-a, -b) (-c, -d) = (+bc, +ad)
(-a, +b) (-c, +d) = (min(-ad, -bc), max(+ac, +bd))

(There are 16 cases if you include antiintervals (+a, -b), which is sometimes necessary.) As shown, selecting a case requires extracting sign bits. (Arithmetic comparison against +0.0 would be overkill, and it doesn't always behave like sign-bit extraction anyway.) Alternatively, you could simply compute ac, ad, bc, and bd, then apply both min and max to obtain the resulting interval. That's really expensive, though, not least because there's no "min/max of all lanes" SIMD instruction, so you need to serialize 2 min/max instructions. (Granted, you need to do so anyway in the last case, but it's rare.) To make matters worse, the 8 cases above all feature each variable with both signs, so it might be most efficient to use indirect branches from a sparsely populated table of 16 labels. Which brings us back to "really expensive". Even if the programmer knows that some of the cases are impossible, the compiler might not know that, so the best it can do is optimize for common cases such as the first one. Which is why, in practice, it's probably going to be fastest to just nest 4 predicates. And which, finally, is why sign extractors matter.

I should emphasize, again, that division is even worse, because dividing by any interval including +/-0.0 will give rise to an antiinterval, hence the need for an arithmetic sign extractor to complement the bitwise one. (Insert debate about conflicting limits from the left and right here.) And this time, we can't even avoid the problem by using min/max. It's just ugly.

I would just add that all of the above explains why I think actual interval instructions would be advantageous. I know that's not a popular opinion because doing so would not expose anything which is currently atomic on the CPU level (but it might be in the future). The problem is that there's no good idiom for "hey I'm going to multiply 2 intervals, so do what I mean, not what I say". So the compiler has to take everything quite literally and cause loads of superfluous communication among floats, integers, flags, and the branch predictor. I thought I should make this clear, even if such instructions are categorically off the table for the moment, as I don't want anyone to be surprised when native assembly language is still multiples faster than even the RM-aware WASM equivalent.

OK so here are the sign extractors that I'd like to propose, absent full-blown native interval support:

C5 f32.sgn (32-Bit Get Sign)

This is a conventional branch preparation instruction, so to be consistent, it has this name instead of the more verbose "i32.sgn_f32". This instruction just copies the sign bit of f32 into i32 without regard to any other bits in f32.

C6 i32.asgn_f32 (32-Bit Get Arithmetic Sign)

This instruction sets i32 to zero if and only if f32 is +/-0. Otherwise it's equivalent to (1-2*f32.sgn).

C7 f64.sgn (64-Bit Get Sign)
C8 i32.asgn_f64 (64-Bit Get Arithmetic Sign)
C9 i32x4.sgn_f32x4 (4x32-Bit Get Sign)
CA i32x4.asgn_f32x4 (4x32-Bit Get Arithmetic Sign)
CB i64x2.sgn_f64x2 (2x64-Bit Get Sign)
CC i64x2.asgn_f64x2 (2x64-Bit Get Arithmetic Sign)

And the following instructions would help immensely, and not just for interval arithmetic, because lots of code branches depending on signs:

CD i32.psgn_f32x4 (Packed 4x32-Bit Get Sign)

This indirect branch preparation instruction applies f32.sgn to lane N and deposits the result into bit N of i32.

CE i32.pasgn_f32x4 (Packed 4x32-Bit Get Arithmetic Sign)

This indirect branch preparation instruction applies i32.asgn_f32 to lane N and compresses the results into the 2 successive corresponding bits of i32.

CF i32.psgn_f64x2 (Packed 2x64-Bit Get Sign)
D0 i32.pasgn_f64x2 (Packed 2x64-Bit Get Arithmetic Sign)

tlively commented 1 year ago

> In keeping with the KISS principle, and considering that we're in no desperate shortage of root-level opcodes, I'd like to propose the creation of 3 additional prefixes (tentatively F9, FA, and FB), which serve as RM indicators for (round-positive, round-negative, and chop, respectively)

We generally try to be conservative about using new prefix opcodes because we expect WebAssembly to be around for a long time and because they come out of the same space we could use for new single-byte opcodes. It would be better to put all the new instructions behind prefix FC, as @rossberg suggested.

It might also be a good idea to leave out new SIMD instructions for now, unless you can show that there are efficient lowerings for the alternative semantics on common hardware and libraries that would use these instructions.

KloudKoder commented 1 year ago

@tlively Understood. I'll rehash everything under FC. And when you say to leave out the SIMDs, are you referring only to the sign extractors? I assume so, because there are definitely efficient hardware instructions for ordinary SIMD arithmetic in all RMs, at least on AMD64; these are also helpful for combining parallel interval arithmetic operations at the source code level into a single hardware operation (to great effect, since 32 bits (x4) goes a long way when you have error containment). But please clarify.

tlively commented 1 year ago

In general there is a high bar to add SIMD instructions. You'll need to show that there are efficient lowerings for all the new SIMD instructions on Arm and x86, and ideally RISC-V as well, with details about which ISA extensions would be necessary. If you can show that, then including the SIMD instructions should be fine.

KloudKoder commented 1 year ago

@tlively Well, it turns out that the SIMD arithmetic instructions don't have efficient ARMv8 equivalents. So I'll pull those out, but what about the sign extractors? Those are all integer operations and definitely have efficient (but not necessarily single-instruction) equivalents on all modern architectures.

EDIT: On further thought, the SIMD sign extractors aren't useful enough to justify, absent custom RMs, so I'll kill them too. I'll come back later with an updated proposal.

KloudKoder commented 1 year ago

OK how about this: The following RM-aware float instructions would all be prefixed with FC.

20 f32.sqrt_ceil
21 f32.sqrt_floor
22 f32.sqrt_trunc
23 f32.add_ceil
24 f32.add_floor
25 f32.add_trunc
26 f32.sub_ceil
27 f32.sub_floor
28 f32.sub_trunc
29 f32.mul_ceil
2A f32.mul_floor
2B f32.mul_trunc
2C f32.div_ceil
2D f32.div_floor
2E f32.div_trunc
2F f64.sqrt_ceil
30 f64.sqrt_floor
31 f64.sqrt_trunc
32 f64.add_ceil
33 f64.add_floor
34 f64.add_trunc
35 f64.sub_ceil
36 f64.sub_floor
37 f64.sub_trunc
38 f64.mul_ceil
39 f64.mul_floor
3A f64.mul_trunc
3B f64.div_ceil
3C f64.div_floor
3D f64.div_trunc
3E f32.convert_i32_ceil_s
3F f32.convert_i32_floor_s
40 f32.convert_i32_trunc_s
41 f32.convert_i32_ceil_u
42 f32.convert_i32_floor_u
43 f32.convert_i32_trunc_u
44 f32.convert_i64_ceil_s
45 f32.convert_i64_floor_s
46 f32.convert_i64_trunc_s
47 f32.convert_i64_ceil_u
48 f32.convert_i64_floor_u
49 f32.convert_i64_trunc_u
4A f32.demote_f64_ceil
4B f32.demote_f64_floor
4C f32.demote_f64_trunc
4D f64.convert_i32_ceil_s
4E f64.convert_i32_floor_s
4F f64.convert_i32_trunc_s
50 f64.convert_i32_ceil_u
51 f64.convert_i32_floor_u
52 f64.convert_i32_trunc_u
53 f64.convert_i64_ceil_s
54 f64.convert_i64_floor_s
55 f64.convert_i64_trunc_s
56 f64.convert_i64_ceil_u
57 f64.convert_i64_floor_u
58 f64.convert_i64_trunc_u

And likewise with these sign extractors with no need for RM:

1C f32.sgn (32-Bit Get Sign (to i32))
1D i32.asgn_f32 (32-Bit Get Arithmetic Sign)
1E f64.sgn (64-Bit Get Sign (to i32))
1F i32.asgn_f64 (64-Bit Get Arithmetic Sign)

tlively commented 1 year ago

Nice, looks good to me.

One more note is that typically instruction name prefixes correspond to the result type of the instruction, so I would expect f32.sgn to start with i32. I would also be prepared for further bikeshedding on names once this is presented to the full group.

sunfishcode commented 1 year ago

It's worth considering making the rounding mode an immediate field rather than being folded into the opcode. Something like this:

FC 20 00 f32.sqrt_ceil
FC 20 01 f32.sqrt_floor
FC 20 02 f32.sqrt_trunc
FC 21 00 f32.add_ceil
FC 21 01 f32.add_floor
FC 21 02 f32.add_trunc

and so on.

Upsides: This makes it cleaner to extend the feature with new rounding modes in the future, if we ever want to do that. There are algorithms that use round-to-odd, round-away-from-zero, and other things. We don't have popular hardware for them today, but who knows what the future may hold. And this leaves more of the one-byte space under the FC prefix available for future features.

Downside: This makes the instructions one byte longer. However, floating-point operations don't tend to be the most frequent opcodes even in floating-point-heavy code, so it should have a low overall impact on code size.

KloudKoder commented 1 year ago

@tlively Maybe I'm confused. So that's why I stated above that "[f32.sgn] is a conventional branch preparation instruction, so to be consistent, it has this name instead of the more verbose "i32.sgn_f32"." Are you saying that f32.sgn and friends should not be consistent with, say, f32.eq? Fine either way but I want to make sure.

@sunfishcode Yeah, I thought of this but I figured that we would never see all the major silicon vendors supporting these other modes. And the opcode map is already sort of chaotic, so if antichop suddenly appears tomorrow, then we could just add the same instructions elsewhere in their own contiguous region of FC or whatever map. If there's a new float instruction (without a new RM), then we can just inject it after the last triplet.

But if y'all still want a broken-out RM byte, then fine by me. And would you want to wrap the precision bit into the RM byte as well, so we don't need separate 32/64?

tlively commented 1 year ago

Oh yeah, you're right that f32.eq and the other comparison instructions are already exceptions to the rule I described. The names you have seem fine for now, and we can discuss with the full group whether we should make any changes.

sunfishcode commented 1 year ago

> @sunfishcode Yeah, I thought of this but I figured that we would never see all the major silicon vendors supporting these other modes. And the opcode map is already sort of chaotic, so if antichop suddenly appears tomorrow, then we could just add the same instructions elsewhere in their own contiguous region of FC or whatever map. If there's a new float instruction (without a new RM), then we can just inject it after the last triplet.

You're not wrong :-), but my instinct is that the immediate field is worth doing.

> But if y'all still want a broken-out RM byte, then fine by me. And would you want to wrap the precision bit into the RM byte as well, so we don't need separate 32/64?

Splitting out 32/64 isn't necessarily a bad idea, but since we didn't do that in the earlier float opcodes, my inclination would be to keep them separate here too.

KloudKoder commented 1 year ago

@sunfishcode OK we can do that. I don't see anyone complaining as of yet. Let me get a final presentation together and schedule for a CG meeting date.

KloudKoder commented 1 year ago

I've submitted a request to debate this proposal at the December 20 CG meeting. I also have an updated synopsis of the proposed changes based on the above discussion (but I'm not presently sure which document format to post where).

KloudKoder commented 1 year ago

Summarizing the current proposal for those in the CG meeting who want something in writing:

This proposal is for the compiler prototyping of floating-point instructions which mirror those already in the spec, but with distinct rounding modes (RMs) which differ from the universally assumed nearest-or-even policy. The RMs involved are the 3 others which all major CPUs seem to support, namely "ceil" (round toward positive infinity), "floor" (round toward negative infinity), and "trunc" (chop toward zero without changing the sign bit). SIMD instructions aren't included because they don't yet have efficient equivalents on ARM.

All of these instructions will use the FC opcode prefix so as to conserve space in the main map. The prefix is followed by the secondary opcode, and finally by an RM byte indicating one of the aforementioned RMs (or perhaps yet another one in the future). The opcode map ordering is identical to that of the corresponding progenitors so as to minimize compiler complexity. As before, a suffix of "s" or "u" implies signed or unsigned operation, respectively.

The proposed map is thus as follows:

20 00 f32.sqrt_ceil
20 01 f32.sqrt_floor
20 02 f32.sqrt_trunc
21 00 f32.add_ceil
21 01 f32.add_floor
21 02 f32.add_trunc
22 00 f32.sub_ceil
22 01 f32.sub_floor
22 02 f32.sub_trunc
23 00 f32.mul_ceil
23 01 f32.mul_floor
23 02 f32.mul_trunc
24 00 f32.div_ceil
24 01 f32.div_floor
24 02 f32.div_trunc
25 00 f64.sqrt_ceil
25 01 f64.sqrt_floor
25 02 f64.sqrt_trunc
26 00 f64.add_ceil
26 01 f64.add_floor
26 02 f64.add_trunc
27 00 f64.sub_ceil
27 01 f64.sub_floor
27 02 f64.sub_trunc
28 00 f64.mul_ceil
28 01 f64.mul_floor
28 02 f64.mul_trunc
29 00 f64.div_ceil
29 01 f64.div_floor
29 02 f64.div_trunc
2A 00 f32.convert_i32_ceil_s
2A 01 f32.convert_i32_floor_s
2A 02 f32.convert_i32_trunc_s
2B 00 f32.convert_i32_ceil_u
2B 01 f32.convert_i32_floor_u
2B 02 f32.convert_i32_trunc_u
2C 00 f32.convert_i64_ceil_s
2C 01 f32.convert_i64_floor_s
2C 02 f32.convert_i64_trunc_s
2D 00 f32.convert_i64_ceil_u
2D 01 f32.convert_i64_floor_u
2D 02 f32.convert_i64_trunc_u
2E 00 f32.demote_f64_ceil
2E 01 f32.demote_f64_floor
2E 02 f32.demote_f64_trunc
2F 00 f64.convert_i32_ceil_s
2F 01 f64.convert_i32_floor_s
2F 02 f64.convert_i32_trunc_s
30 00 f64.convert_i32_ceil_u
30 01 f64.convert_i32_floor_u
30 02 f64.convert_i32_trunc_u
31 00 f64.convert_i64_ceil_s
31 01 f64.convert_i64_floor_s
31 02 f64.convert_i64_trunc_s
32 00 f64.convert_i64_ceil_u
32 01 f64.convert_i64_floor_u
32 02 f64.convert_i64_trunc_u

Considering that all of these instructions are directly relevant to interval arithmetic, the following primitives should be included as well, which would be frequently used in that context, e.g. upon every multiply operation. They are also prefixed with FC but don't contain (and don't need) an RM byte.

In the tradition of branch preparation instructions, f32.sgn and f64.sgn have abbreviated names rather than "i32.sgn_f32" and "i32.sgn_f64", respectively. (The arithmetic sign instructions technically aren't branch preparation instructions because they return ternary outputs, so they conform to the usual notation.)

1C f32.sgn (32-Bit Get Sign Bit (to i32))
1D i32.asgn_f32 (32-Bit Get Arithmetic Sign (-1/0/1))
1E f64.sgn (64-Bit Get Sign Bit (to i32))
1F i32.asgn_f64 (64-Bit Get Arithmetic Sign (-1/0/1))

KloudKoder commented 1 year ago

Sorry this has taken so long. Anyhow, we now have a sandbox repo (big thanks to @dschuff).

https://github.com/WebAssembly/rounding-mode-control

I've opened an issue over there for discussion about how to hack up a prototype. All, feel free to dive in and post your thoughts. I assure you that I have absolutely no idea what I'm doing with respect to the WASM codebase. At least, I can help with formal revisions to the PDF specification, as well as testing.

https://github.com/WebAssembly/rounding-mode-control/issues/2

mglisse commented 1 year ago

In case you are still looking for potential users, CGAL (computational geometry, meshing) is a C++ library that heavily relies on interval arithmetic (and exceptions) for performance: the fallback is bignum rationals. We have some slow emulation so the code can at least run, but that's really not good. We have had several users ask for wasm support for demos or web applications, and several told us they had to switch to a different (inferior) solution because of this. We do not mind if it is necessary to write a bit of code specifically for wasm, we already have plenty of platform-specific code in that area. On x86_64, we even use rounding for SIMD operations, but for wasm, rounding of scalar operations would already be great.

(sorry, I won't have time to help much, all I can offer is to try porting CGAL's Interval_nt once there is support for both compiling and evaluating, although I have no experience with wasm)

KloudKoder commented 1 year ago

@mglisse Cool! Sounds like CGAL would make a great test case and eventual use case. Ping us here again when we have a prototype working. I've got my work cut out for me in the meantime.

KloudKoder commented 9 months ago

@mglisse @Chris00 Guys, do you have any preference between rounding toward positive and negative infinity? I ask because @whirlicote is going to be writing some optimized implementations for common use cases, and it all works faster if he sticks to a single rounding mode. (When rounding positive, rounding negative is sometimes implementable by just negating before and after the operation, so you can do that cheaply instead of switching the RM (expensively). And similarly in the opposite case.) This will, at least initially, be done via optimization hints in a JavaScript context.

We don't want to offer optimization for both cases because that's begging to create a divergence of coding practice which serves no constructive purpose. So it's either favoring ceil or favoring floor. My personal preference is ceil because it's slightly more useful (on account of positive-only quantities which require bounding), but I don't have a strong opinion either way. To be clear, functional support for both will be provided, but we'll only offer optimization for one of them in addition to nearest-or-even.

mglisse commented 9 months ago

I can work with either. Currently in CGAL for intervals we use only rounding towards +inf (and the negation trick you mention), so it would be slightly easier to stick to that. The one thing that would suck would be for 2 compilers or browsers to optimize only one direction but not the same one...

Chris00 commented 9 months ago

In inari, rounding towards +∞ is most often used, so the library will be simpler to adapt if this is available.

CC: @unageek