WebAssembly / half-precision

Proposal to introduce half precision operations

How to portably use this proposal? #4

Open alexcrichton opened 1 month ago

alexcrichton commented 1 month ago

Hello! I had some comments during the CG meeting today but I wanted to elaborate on them in an issue here. My concern stems from not knowing how consumers will end up using this proposal in a bit more of an abstract sense beyond "well you just use it". My understanding is that "if you know what you're doing" you would first have a target platform in mind (e.g. arm64 or some specific machine), you'd then ensure that there's an engine to use there (e.g. all your own users use one browser or something like that), and then you'd get the instructions into a wasm binary and measure to make sure everything works. What I'm curious about is what to do if you don't fall into the bucket of "if you know what you're doing".

Put more concretely the current trajectory of this proposal looks like it's going to have two key properties that users would need to consider:

  1. A wide performance gap between machines that natively support this proposal and machines that require emulation for the proposal.
  2. Depending on the final level of complexity for engines, it may also not be reasonable to expect to be able to run this proposal on older cpus which don't even have enough support for "ok we can at least mostly emulate things" (for example only sse2 on x64 or pre-8.2 on arm64 -- I'm not certain of the complexity here on either s390x or riscv64).

The rest of my comments here are under the assumption that (1) and (2) are true. Ideally I think that those would both be fixed, but given the nature of this proposal I don't think it's easy to do that. Personally I like the idea of exposing hardware capabilities in wasm, but my thinking here is exploring the consequences of (1) and (2) and adoption in ecosystem libraries.

Personally I don't have any experience with 16-bit floats or the applications that use them. One hypothetical use case, which may not be reasonable, would be something along the lines of: a library has the ability to use f16x8 but also has the ability to use f32x4. On supporting hardware f16x8 is fastest, but f32x4 is faster on hardware that doesn't support f16x8 natively (or via emulation of f16x8). In this situation a library has no means of internally selecting which algorithm to use. This means that the implementation detail of float width would need to be bubbled up to users of the library, and that can increase complexity in the sense that every library with this decision would have to bubble up knobs for choosing which to use at runtime.

Another possible concern for a library like this is whether or not to even include f16x8 instructions. For example let's assume that this proposal is phase 4, implemented in all engines, and has been stable for a year or so. Even in this situation should this hypothetical library use f16x8 by default? Or should it provide a compile-time flag to disable support? This is relevant, for example, if users want to use the library in engines on platforms where the engine does not have support for this proposal (e.g. the fallback to old hardware is too complex).

Overall an answer to all of this could be: "it's expected all users of this proposal know what they're doing" or something like that. My concerns go away in this situation because library authors would "know what they're doing", applications using these libraries would "know what they're doing", and engines would be communicating with these "in-the-know" users about which platforms need to support things. For example if it turns out no one has a use case for supporting this proposal on sse2-only cpus then that would just never happen and no one would ever see the error of "your cpu is too old to run this module". This of course touches to a degree on the concerns of wasm being "universally portable", but that matters less if the users of this proposal are expected to be only a few critical ones rather than seeing widespread adoption.

Personally though my naive way to answer these concerns of mine would be to (a) add the ability to detect whether these instructions are "fast" and (b) allow engines to implement these instructions with a trap. That would provide libraries the ability to internally detect which algorithm to use and everything could always be turned on by default. Engines that don't want to implement the proposal could implement only the validation parts and then trap at runtime (and the detection instruction would say "no"). I realize though that this touches on other concerns about the determinism of wasm, but I at least personally feel like that ship has sailed with the relaxed-simd proposal and would prefer to lean into exposing the intrinsic nature of how hardware is different and modules get run on different hardware.
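To make the library scenario concrete, here is a rough Rust-flavored sketch of what I have in mind. Nothing here exists today: `f16_is_fast`, `dot_f16`, and `dot_f32` are hypothetical stand-ins for whatever surface a toolchain would eventually expose for the detection instruction and for the two kernels.

```rust
/// Hypothetical stand-in for the proposed detection instruction; a real toolchain
/// would presumably expose it as an intrinsic or import. Hard-coded here.
fn f16_is_fast() -> bool {
    false
}

/// Placeholder for an f16x8-based kernel. With trapping semantics this path is only
/// taken when `f16_is_fast()` returns true, so a well-behaved library never traps.
fn dot_f16(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Placeholder for the portable f32x4-based fallback kernel.
fn dot_f32(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// The library selects a path internally; nothing has to be bubbled up to its users.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    if f16_is_fast() {
        dot_f16(a, b)
    } else {
        dot_f32(a, b)
    }
}
```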


Apologies for the long issue, happy to discuss here more too! Also I realize I'm not a key stakeholder in this proposal nor in simd-related things. I've done various bits and pieces of wasm simd in Rust and Wasmtime for various architectures but I don't use simd day-in-and-day-out nor am I aware of users of Wasmtime, for example, that are cpu-bound and heavily reliant on simd things.

lukewagner commented 3 weeks ago

+1 to the analysis of the problem and to the suggested solution of adding a non-deterministic f16.is_fast instruction with well-defined traps if you call an f16 instruction when f16.is_fast is false.

Importantly, this doesn't run into most of the earlier concerns with the more generic open-ended feature-testing proposals because it still requires the engines to eagerly validate (and, more generally, "know about") all the code: it decouples the "engine hasn't implemented yet" problem from the "hardware can't do it (efficiently)" problem, focusing only on the latter. Non-determinism isn't great, of course, but assuming that the additional CPU functionality here is proven to be sufficiently valuable, I think it could be justified (just like it was with NaN payloads) and within the original WebAssembly theme of "limited, local non-determinism".

rossberg commented 3 weeks ago

-1 to having non-deterministic failure — which would be a whole new level compared to everything we have so far — and to using feature detection as a permanent core-level solution.

lukewagner commented 3 weeks ago

Sure, nondeterminism is bad and shouldn't be added unless necessary, but assuming the CG as a whole isn't content with simply abandoning this feature (and any feature that has similar cliff-y performance across all CPUs), I think it's our least-bad option. To be clear, I don't think the CG has made that decision, but I assume that that can come to a future CG meeting and vote.

SPY commented 3 weeks ago

@alexcrichton thanks for the detailed analysis!

After numerous discussions during the F2F meeting I'm fully convinced the FP16 proposal should be paired with a way to detect whether it is supported in hardware or emulated. I'm not sure if that should be done internally via an f16.is_fast non-deterministic instruction or externally via a Wasm.Features.FP16 == 'emulated' | 'native' flag in JS. Both approaches have their own advantages and disadvantages. I suggest we discuss them at the coming CG meetings. I believe emulation detection will satisfy WebAssembly's predictable-performance promise, and practically it is exactly what our partners want from fp16: a switch for using it when it is fast.

Also, I don't think f16x8 instructions should trap if the fast path is not available. I believe that is an unnecessary introduction of non-determinism, and it contradicts the portability argument. For example, pure software emulation in V8 is not free, but it is pretty manageable. I believe any engine supporting the F32x4 instruction set can emulate these at an acceptable cost.

sunfishcode commented 3 weeks ago

@SPY Do you know if the f16 emulation implementation that uses F32x4 performs correct rounding at each instruction? I know that many f16 use cases aren't sensitive to rounding, but it is something we should consider if we're talking about determinism.

alexcrichton commented 3 weeks ago

Personally I would advocate for an instruction rather than an external JS-specific construct. If feature detection can't be done within wasm itself, it wouldn't solve the library use case I outlined above, because a library has no built-in way of determining which path it should take and instead requires external input, which is typically significantly more difficult to thread through.

Also, one part I can try to explain in a bit more detail from above is the rationale for having a trapping implementation. If the pure software emulation of these instructions is minimal then it seems fine to not allow the instructions to trap. My point is that if the software emulation is complex enough that some engines don't want to support it, then I think trapping should be allowed. This could be considered a design constraint to ensure that pure software emulation is easy enough that all engines can add it, but at the same time I also personally think it's ok to have something complex enough that software emulation isn't feasible. In that situation, though, I think that engines should be allowed to trap on the instruction, and well-behaved libraries would use the feature-detection instruction to figure out what to do and would never hit the trap.


Currently pure determinism is not a part of WebAssembly as a whole due to NaN bit patterns, resource limits like memory, and relaxed-simd instructions. I think it's important to be able to run WebAssembly deterministically, but I see that as different from requiring that all new features added to WebAssembly be deterministic (especially when preexisting features are not deterministic).

Dealing with the non-determinism of WebAssembly is already something that's nontrivial to work with. In Wasmtime we maintain passes and configuration options to make instructions deterministic (e.g. canonical NaN payloads, deterministic behavior for relaxed-simd instructions, etc). We also have to handle non-determinism when performing differential fuzzing. Even on the same architecture v8 and Wasmtime can produce different NaN payloads without canonicalization enabled.

Personally, and again this is just my opinion, I think that WebAssembly should fully lean in to feature detection at the core level. That would make paving the path for proposals like this, and other future proposals, much easier. I also don't personally think it would sacrifice much from the current status of WebAssembly given the non-determinism that already exists and the knobs that are used to control that if desired.

rossberg commented 3 weeks ago

@alexcrichton:

I think that WebAssembly should fully lean in to feature detection at the core level. That would make paving the path for proposals like this, and other future proposals, much easier.

Isn't that essentially advocating for slippery slopes and exploiting them as an active strategy for continuously lowering the bar?

If you want to convince folks who care about the integrity of Wasm to be strongly opposed to adding feature detection, then you've given the perfect argument. What you describe is exactly what I fear will happen once we cross that line. ;)

cfallin commented 3 weeks ago

Is there a reason that a user program that wants to behave non-deterministically -- that is, use a fast FP16-based algorithm kernel when the underlying platform favors that, and use an F32 fallback or whatever otherwise -- can't do some sort of quick test at startup, given appropriate imports for timing? (*)

Basically what this is proposing is: on systems where execution time matters and we want to optimize strategies for it, let's lean into that one kind of nondeterminism, "execution time", because it can be reasoned about and decisions can be made at the user level, above the Wasm semantics.

This does chip away at the predictable-performance pillar of Wasm but it seems there's no option if the hardware landscape fundamentally has performance cliffs and we want a union rather than intersection of capabilities. At least the emulation would be "predictably slow"!

(*) One reason perhaps is a very fast overall execution time that would make this timing loop a meaningful part of total runtime, but I would guess (?) many use-cases where FP16 is considered are CPU-intensive kernels already, not quick one-off computations.
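For what it's worth, here is a minimal sketch of the kind of startup check I mean, with `std::time::Instant` standing in for whatever timing import the module actually has available, and trivial placeholder kernels standing in for the real f16x8/f32x4 implementations:

```rust
use std::time::Instant;

// Placeholder kernels; in practice these would be the f16x8-based and the
// f32x4-based implementations of the same hot loop.
fn kernel_f16(xs: &[f32]) -> f32 { xs.iter().sum() }
fn kernel_f32(xs: &[f32]) -> f32 { xs.iter().sum() }

/// Time both kernels once on a representative input at startup and keep the winner.
fn pick_kernel(sample: &[f32]) -> fn(&[f32]) -> f32 {
    let start = Instant::now();
    let _ = kernel_f16(sample);
    let t_f16 = start.elapsed();

    let start = Instant::now();
    let _ = kernel_f32(sample);
    let t_f32 = start.elapsed();

    if t_f16 < t_f32 { kernel_f16 } else { kernel_f32 }
}
```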

conrad-watt commented 3 weeks ago

This does chip away at the predictable-performance pillar of Wasm but it seems there's no option if the hardware landscape fundamentally has performance cliffs and we want a union rather than intersection of capabilities.

I think this is the "if" that some of our more ambitious SIMD proposals have hinged on. I personally think of Wasm as being closer to an intersection of capabilities. I think it's harder to argue for the "union" PoV while still thinking of Wasm as a virtual machine in itself.

rossberg commented 3 weeks ago

FWIW, Wasm was very much intended as an intersection of common CPU instruction sets originally, and I remember it being phrased exactly that way in early design meetings (though I can't find this in writing). That approach was key to its clean and simple design, which arguably has been an enabler of its success.

alexcrichton commented 3 weeks ago

I want to clarify that I had two basic assumptions in this issue: (1) that there's a wide performance gap between supporting and non-supporting CPUs, even with emulation, and (2) that the emulation support is so complex some engines won't want to add it. Wasm has historically not had to deal with this, as both assumptions have always been false: the performance gap has not been too large and emulation has not been that complex. My impression is that this proposal is changing that and proposing something that has a wide performance gap and a pretty complex fallback (complexity still TBD I think, depending on things like how accurate rounding modes need to be, as @sunfishcode mentioned, or the precise set of instructions included in this proposal and allowed behaviors).

My point then is that I think the wasm CG needs to grapple directly with these questions in the abstract rather than for this proposal specifically. I think it would be best either to decide that no proposal will ever be accepted if (1) and (2) are problems or to instead develop a framework by which to accept proposals where (1) and (2) are problems. If the CG decides it never wants to deal with (1) and (2), then I think that puts a strong design constraint on this proposal to guarantee that the emulation is reasonable in performance or that the platform support is broad enough. If the CG wants to accept (1) and (2) as problems, then that's where the feature-detection and/or trapping behavior comes in.

Personally I'm in camp "accept proposals with problems (1) and (2) with feature detection". I am well aware this is not a universally held opinion and I'm aware that it's unlikely anyone will change their minds reading this. From my perspective there is not obvious-and-clear consensus in the CG about this problem and I believe it would be beneficial to get alignment on this.

dschuff commented 3 weeks ago

FWIW, Wasm was very much intended as an intersection of common CPU instruction sets originally, and I remember it being phrased exactly that way in early design meetings (though I can't find this in writing). That approach was key to its clean and simple design, which arguably has been an enabler of its success.

I would like to argue with that, at least a little :) Clearly keeping the design simpler is desirable, and it has been key in enabling a large variety of implementations to exist and in making research and specification easier. But success in the sense of adoption by users and developers, who can decide whether or not they want to use wasm at all, is much more dependent on the capabilities of wasm compared to the other options they have.

cfallin commented 3 weeks ago

To expand on Alex's taxonomy, I think that there may be features that fit category (1) -- wide perf gap -- but not (2) -- i.e. emulation is not so bad. I'm still curious, regarding this proposal in particular, whether the fully-accurate emulation for any given f16 op is more on the order of "these 10 f32 ops / bit-ops" or "call this softfloat library". The reason I want to make this distinction is that the approach to (1)-but-not-(2) proposals I think has less of a hard dependence on feature detection as a core language feature and can instead lean on "let the user measure the timing and decide" proposed above (if needed).

dtig commented 3 weeks ago

FWIW, Wasm was very much intended as an intersection of common CPU instruction sets originally, and I remember it being phrased exactly that way in early design meetings (though I can't find this in writing). That approach was key to its clean and simple design, which arguably has been an enabler of its success.

The definition of success will probably vary based on who you ask. :)

IMO continued adoption is a significant contributor to success of a platform, and the discussion around maintaining the status quo seems to ignore that to a large extent. There is a philosophical argument here about what we expose to WebAssembly, and what tradeoffs we make. Going back to the original discussions, I'd like to argue that keeping the instruction set frozen to MVP or MVP + minor additions wasn't the goal, and we expected Wasm to evolve (as it has with several post-MVP proposals) with the needs of the ecosystems evolving around it. I expect we will continue to have this discussion going forward as well, but I do want to bring this discussion thread to focus a little bit more on the specifics.

The goal of this proposal isn't to deviate from the clean and simple design, but to expose a fundamental type + set of operations that is increasingly seeing common use, and is not performant to use with Wasm today. There's an inherent assumption here that this set of operations will be slow on one common architecture, and I'd like to question that assumption. It's quite interesting in this context to figure out when to fall back to software-emulated code - there's still more work to be done to define that, but I would guess the fallback would only be needed pre-AVX/AVX2, and AVX/AVX2 should be fairly common across consumer hardware. This is, again, a small set of ops, enabling key use cases that are currently bottlenecked by a significant performance gap.

The reason I want to make this distinction is that the approach to (1)-but-not-(2) proposals I think has less of a hard dependence on feature detection as a core language feature and can instead lean on "let the user measure the timing and decide" proposed above (if needed).

I think we're firmly in the world where users do already measure and decide, just given the varying levels of support for different features across engines, and between features specifically. Though fundamentally, having feature detection as a core language feature or as a standardized API feature (if we're able to agree on one) would be a significantly better developer experience.

SPY commented 3 weeks ago

@sunfishcode

Do you know if the f16 emulation implementation that uses F32x4 performs correct rounding at each instruction?

I'm only aware of the V8 implementation, and both the software and soft-hardware (F16C+AVX2) paths maintain correct rounding by going through the F16 -> F32 -> op -(rounding happens here)-> F16 chain. I can imagine an optimization pass fusing several F16x8 operations into one F16 -> F32 conversion + several F32x8 operations and then going back F32 -> F16, but in this case the operations will be 'more precise' and diverge from native F16 computations. I think that is more than acceptable for most use cases of fp16, but it will bring non-deterministic behavior.
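As a minimal sketch of that chain (not V8's actual code), assuming the Rust `half` crate for the f16 <-> f32 conversions, one scalar lane of an emulated f16 add looks roughly like this:

```rust
use half::f16; // the `half` crate provides an IEEE binary16 type

/// One lane of an emulated f16 add: widen to f32 (lossless, since every f16 value
/// is exactly representable in f32), operate in f32, then round back to f16.
fn f16_add_via_f32(a: f16, b: f16) -> f16 {
    let wide = a.to_f32() + b.to_f32();
    f16::from_f32(wide) // rounding back to f16 happens here
}
```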

@alexcrichton

I want to clarify that I had two basic assumptions in this issue: (1) that there's a wide performance gap between supporting and non-supporting CPUs, even with emulation, and (2) that the emulation support is so complex some engines won't want to add it.

I believe this proposal is on (1)-but-not-(2) ground. For example, here I wrote down the lowering of the fp16 instruction set for Arm64 + FEAT_FP16, AVX10/512, and F16C+AVX1/2. I believe that covers the majority of consumer hardware in use right now. Also, adding software emulation via float in V8 wasn't a big deal (~1.2 KLOC), and I believe that is a maintainable cost for any Wasm VM. Despite that, I still believe FP16 should be accompanied by an emulation-detection feature to allow users to make an informed decision at runtime.

Overall, I agree this thread has drifted toward a more philosophical discussion. That is important, but it definitely goes beyond the scope of the proposal.

sunfishcode commented 3 weeks ago

I'm only aware of the V8 implementation, and both the software and soft-hardware (F16C+AVX2) paths maintain correct rounding by going through the F16 -> F32 -> op -(rounding happens here)-> F16 chain.

If the (rounding happens here) step is the usual round-to-nearest-ties-to-even rounding, this would exhibit double rounding, where the real mathematical result of the op is first rounded to F32, and then rounded to F16, which produces different (and worse) results from rounding it to F16 in a single step.

I can imagine an optimization pass fusing several F16x8 operations into one F16 -> F32 conversion + several F32x8 operations and then going back F32 -> F16, but in this case the operations will be 'more precise' and diverge from native F16 computations.

It may often be more accurate, but may still be less accurate sometimes if there's double rounding on the result.

Edit: The double-rounding concern turns out not to be relevant in this context. See my comment below.

Separately, if we chose to accept fusing optimizations, it raises an interesting question about how we might spec the nondeterminism. Normally when we talk about nondeterminism, it's about having a set of permitted behaviors for an instruction. However with fusing, instructions conceptually return more information than their signatures describe. The behavior of an instruction becomes partly determined by the sequence of instructions it might be fused with.
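As a small illustration of that divergence (a sketch using the Rust `half` crate, not any engine's actual code), compare rounding back to f16 after every operation against keeping the intermediate in f32:

```rust
use half::f16;

/// Non-fused: round the intermediate back to f16 after each operation.
fn add_then_mul_strict(a: f16, b: f16, c: f16) -> f16 {
    let sum = f16::from_f32(a.to_f32() + b.to_f32());
    f16::from_f32(sum.to_f32() * c.to_f32())
}

/// Fused: keep the intermediate in f32 and round only once at the end. This can
/// produce a different final f16 value than the non-fused version above.
fn add_then_mul_fused(a: f16, b: f16, c: f16) -> f16 {
    f16::from_f32((a.to_f32() + b.to_f32()) * c.to_f32())
}
```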

rossberg commented 3 weeks ago

@alexcrichton:

I want to clarify that I had two basic assumptions in this issue: (1) that there's a wide performance gap between supporting and non-supporting CPUs, even with emulation, and (2) that the emulation support is so complex some engines won't want to add it. [...] My point then is that I think the wasm CG needs to grapple directly with these questions in the abstract rather than for this proposal specifically.

Thanks for summing this up, I agree with your analysis.

Personally, I think we should be conservative enough to avoid features for which both (1) and (2) apply. If that's the case, then it's clearly premature to add them to Wasm. There is more leeway if only one of them applies, and I could be convinced on a case-by-case basis.

In addition, the discussion about these problems tends to be overly narrowly focussed on Intel and Arm hardware only. Wasm has more customers than that with more diverse characteristics. That is part of its success story as well.

Hence, I also think that we need to make better use of profiles. With SIMD in particular, it is unlikely that many of its features will be available (or relevant) on certain hardware in any foreseeable future — or ever. If we had a SIMD-less profile, then it would be less problematic to progress faster, because it would help avoid the performance gap and avoid piling up massive complexity for customers of the Wasm standard for which the feature doesn't even matter.

@dtig:

The definition of success will probably vary based on who you ask. :)

Agreed. The use cases for SIMD are a relatively small but important niche, and it is difficult to decide how to weigh these factors.

alexcrichton commented 3 weeks ago

In the interest of being as concrete as I can, I would personally like to see some benchmarks of this proposal on various systems before seeing it advance to phase 2. To me phase 2 is where it's agreed that the proposal's shape is good and what's left is toolchains, implementations, and relatively minor details. One key aspect of this to me is that this proposal currently assumes that software emulation is not that slow and it's portable enough to implement everywhere. I think it would be best to have a more concrete benchmark to be more confident in these two assertions.

For example, what I would like to see is this: take a reasonable kernel from some application that wants to use f16x8 and translate it to roughly the WebAssembly instructions it would compile to. Then hand-translate that wasm to C for aarch64 and x64 using native intrinsics. That would then be benchmarked on a number of systems.

This seems like it would exercise the proposed lowerings to give a sense of what the performance gap is. Personally I would like to see further benchmarking on systems such as aarch64 without ARMv8-A and x64 without F16C as well. My current impression (I'm no expert on these CPU extensions) is that these CPUs are 10+ years old, however, so I realize that this benchmarking may not be seen as useful to some.
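For the x64-with-F16C column, my assumption (this is not from the proposal text) is that the hand translation of a single f16x8.add would be roughly the widen/operate/narrow pattern below, written here with Rust's std::arch intrinsics rather than C; the aarch64 + FEAT_FP16 version would presumably be a single native half-precision vector add instead.

```rust
#[cfg(target_arch = "x86_64")]
mod f16c_sketch {
    use std::arch::x86_64::*;

    /// f16x8.add lowered for x64 with F16C+AVX: widen 8 x f16 to 8 x f32,
    /// add in f32, then convert back to f16 with round-to-nearest-even.
    #[target_feature(enable = "f16c", enable = "avx")]
    pub unsafe fn f16x8_add(a: __m128i, b: __m128i) -> __m128i {
        let wa = _mm256_cvtph_ps(a);
        let wb = _mm256_cvtph_ps(b);
        let sum = _mm256_add_ps(wa, wb);
        _mm256_cvtps_ph::<{ _MM_FROUND_TO_NEAREST_INT }>(sum)
    }
}
```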

To me this would inform the viability of the proposal. For example if all of the above combinations are reasonably performant then portability basically isn't a concern for this proposal. If x64 without F16C is 100x slower than x64 with F16C then I think the CG should agree "we don't care about portability to 10 year old CPUs" or something along those lines. If x64 with F16C and AVX is pretty abysmal compared to aarch64 with ARMv8-A, that's another discussion as well.

Basically I would like to see a concrete gut-check on what the performance is going to look like, at least as a rough ballpark. That would help inform at least me personally about the viability of this proposal in terms of portability to various systems.

alexcrichton commented 3 weeks ago

(I apologize for hitting send on the previous comment too soon, I edited it in-place so for those of you reading only email notifications you'll need to visit this thread on the web UI to see the latest version)

SPY commented 2 weeks ago

@sunfishcode

If the (rounding happens here) step is the usual round-to-nearest-ties-to-even rounding, this would exhibit double rounding, where the real mathematical result of the op is first rounded to F32, and then rounded to F16, which produces different (and worse) results from rounding it to F16 in a single step.

Thank you for bringing up the double rounding issue. Do I understand correctly that we can be more emulation-friendly (via F32) if we pick the round-to-zero rounding mode as the default?

sunfishcode commented 2 weeks ago

Yes, round-toward-zero would avoid the double rounding issue, though it would come at the cost of reduced accuracy and, on CPUs where rounding modes are controlled by a control register, dynamic rounding-mode switching overhead and complexity.

sunfishcode commented 6 days ago

I was mistaken about the double rounding here. This paper explains that double rounding is innocuous in cases like implementing f16 in terms of f32. (For historical context, this is the same paper that showed that the double rounding involved when using Math.fround to implement an f32 type using f64 operations in asm.js is innocuous.)

My concern above about fusion still applies though. If we allow implementations to fuse f16 operations and avoid intermediate rounding to f16, instructions would need to be able to nondeterministically produce values outside the range of their result types.