penzn opened 1 year ago
This is a partial answer to @titzer's question about what the alternatives to the "union" approach are. I haven't looked into the newer operations as closely as the older ones.
Looked into this as a side effect of a different project.
A true FMA can only be emulated via integer ops: the inputs need to be broken up into components, the multiply and add performed on the pieces, and the result rounded and stored back into a float. It should take about 5 additions and 5 multiplications to get the result. This is expensive, though some existing SIMD instructions have even worse lowerings (unsigned int conversions, for example).
Edit: removed a couple paragraphs describing emulation of x86 floating-point min and max, since we already have those in the standard. Thanks to @abrown for pointing this out.
We have both deterministic variants in the spec already:

- `f32x4.relaxed_min` is either `f32x4.min` or `f32x4.pmin`
- `f32x4.relaxed_max` is either `f32x4.max` or `f32x4.pmax`
- `f64x2.relaxed_min` is either `f64x2.min` or `f64x2.pmin`
- `f64x2.relaxed_max` is either `f64x2.max` or `f64x2.pmax`
A bit of backstory for the discussion, some of this is opinion, but hopefully at least somewhat helpful.
I think it is useful to think about the operations as belonging to two categories: one dealing with floating point semantics and the other with other platform specifics (mostly integer). What this allows is separating questions regarding acceptable floating point output from other, arguably less tricky ones, like encoding invalid values when converting floats to ints. This division is somewhat subjective, but might become clearer with more concrete examples below.
Relaxed versions of existing 'integer' SIMD operations

- `i8x16.swizzle`, different treatment of out-of-bounds lane indices
- `laneselect`, different lane encoding in the mask

Swizzle, laneselect, and float-to-int conversions in the existing SIMD spec have Arm semantics; the new operations match them on Arm while having different output on x86. Unlike floating point, the differences here are much more subjective (for example, should the invalid value be all zeros or all ones), and it might even be possible to imagine a world where both flavors coexist. Emulating such operations is likely to be less tedious than trying to emulate an operation with better FP accuracy, plus they generally don't deviate from semantics already established for scalar operations.
Relaxed versions of existing floating point SIMD operations

- `fmin`, different treatment of +/- 0.0 and NaN inputs
- `fmax`, different treatment of +/- 0.0 and NaN inputs

The gist is that x86 operations, unlike Arm operations, "short circuit" on NaN and disregard the sign of zero. Code that cannot rule out NaN inputs would likely expect more symmetric variants than what x86 provides natively, and there are well-known instruction sequences that bring the behavior up to, say, the C++ spec or one or the other IEEE standard. Obviously, the proposed operations have vastly better performance on x86 than the strict ones, but code that doesn't rule out NaNs needs some mitigation (along the lines of what native libraries do), which still might be worth it from a performance point of view.

New operations
Just to summarize:
I think in general these have the same FP vs non-FP considerations as above, with a few extras (like single-rounding FMA). The fact that these are new may not be an advantage.