WebAssembly / design

WebAssembly Design Documents

Proposal: FP16 value type and operations #1497

Open SPY opened 9 months ago

SPY commented 9 months ago

Motivation

ML has grown from a lab toy into something used day-to-day and is now integrated into numerous web applications. To unlock the full potential of AI-augmented applications, several initiatives have recently come to the Web platform. The WebGPU proposal lets GPU-equipped machines perform better on the AI front. Half-precision floating point is a common choice for ML workloads because it offers better memory bandwidth and performance, and the reduced precision matters less there. The JS Float16Array proposal improves the integration of JS with the WebGPU API. The Wasm memory control proposal aims to make GPU <-> Wasm interaction more efficient by reducing memory traffic. Modern hardware also brings more native support for FP16: ARMv8-A NEON FP16 and x86 F16C, for example. I believe that introducing native support for half-precision floating-point computation to WebAssembly would extend what can be achieved in this area and would match and complement the trends on the hardware side.

Potential solutions

Second-class support

We can mimic the JS approach and introduce only two memory instructions, for reading and writing f32 values stored in memory in binary16 format.

This is easy for a VM to implement, but its only benefit is more efficient communication with memory regions shared with the GPU.
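
To make this concrete, here is a sketch in portable C of the decode step such a load instruction would perform, widening IEEE 754 binary16 bits to an f32. The function name and shape are mine, purely for illustration; nothing here is specified by the proposal.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Illustrative only: decode IEEE 754 binary16 bits into a float, i.e. the
// conversion a hypothetical "load f32 from binary16" instruction would do.
static float f16_bits_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t frac = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        if (frac == 0) {
            bits = sign;                                // signed zero
        } else {
            uint32_t e = 113;                           // subnormal: renormalize
            while ((frac & 0x400u) == 0) { frac <<= 1; e--; }
            bits = sign | (e << 23) | ((frac & 0x3FFu) << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (frac << 13);       // infinity or NaN
    } else {
        bits = sign | ((exp + 112u) << 23) | (frac << 13);  // normal: rebias 15 -> 127
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

int main(void) {
    printf("%f\n", f16_bits_to_f32(0x3E00));  // 1.500000
    return 0;
}
```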

First-class support

For full-scale support, I suggest referring to the dedicated explainer for more details.

Briefly,

Although it is a more invasive change, it unblocks not only better interaction with GPU-originated memory, but could also provide a fallback for devices without a GPU available for web usage. It could also be used for smaller ML models: text processing, context inference, etc.

Conclusion

I believe the second approach, first-class support, is more beneficial for the ecosystem. Everything said above also applies to non-ML graphics applications.

bakkot commented 7 months ago

People pursuing this may wish to follow along at https://github.com/tc39/proposal-float16array/issues/12: x86 prior to Sapphire Rapids does not have a native way to do casts from float64 to float16, which means it would need to be done in software (though it can probably be done fairly cheaply, depending on your definition of "cheaply").
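As a point of reference, the f32 <-> f16 direction is already covered by F16C on most x86-64 chips; below is a minimal sketch of that path (my own example, not from the thread). There is no analogous pre-Sapphire-Rapids instruction taking an f64 operand, which is why a correctly rounded f64 -> f16 cast needs a software helper there.

```c
// Minimal F16C example (build with: cc -mf16c demo.c). These intrinsics
// convert f32 <-> f16 only; x86 before Sapphire Rapids has no equivalent
// for an f64 operand.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float f = 1.5f;
    unsigned short h = _cvtss_sh(f, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    float back = _cvtsh_ss(h);
    printf("binary16 bits: 0x%04x, round-trip: %f\n", h, back);  // 0x3e00, 1.500000
    return 0;
}
```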

Not relevant if there are only casts from f32, but do note that f64 -> f32 -> f16 can give different results than f64 -> f16 (because of double rounding), so it may make sense to have both, particularly as languages like C, C++, and Swift add native support for f16 and have casts from f64 -> f16.
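
A quick way to see the double-rounding issue (a sketch assuming a compiler with _Float16 support, e.g. recent clang or GCC on x86-64 or AArch64):

```c
// Double rounding: f64 -> f16 directly vs. f64 -> f32 -> f16.
#include <stdio.h>

int main(void) {
    // Just above the midpoint between the adjacent f16 values
    // 1.0 and 1.0009765625 (1 + 2^-10).
    double x = 1.0 + 0x1p-11 + 0x1p-40;

    _Float16 direct  = (_Float16)x;         // one rounding: goes up
    _Float16 twostep = (_Float16)(float)x;  // rounding to f32 first drops the
                                            // 2^-40 bit, leaving an exact tie
                                            // that then rounds to even, i.e. down

    printf("direct : %.10f\n", (double)direct);   // 1.0009765625
    printf("twostep: %.10f\n", (double)twostep);  // 1.0000000000
    return 0;
}
```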

syg commented 7 months ago

x86 prior to Sapphire Rapids

It's worth repeating that Sapphire Rapids is Xeon. There's nothing on Intel roadmaps AFAICT to bring this AVX512 extension to consumer chips.