Open abrown opened 4 years ago
I'm not sure the benefit of having wasm instruction counts match native instruction counts is worth all the the extra work of making engines pattern match instruction sequences. In general, the higher-level the operation, the more information is available to the engine to enable it to make the best codegen choices possible. It is better to have an operation that takes three instructions to implement than to replace that operation everywhere with three operations if that would make codegen harder for some platform.
FYI AVX-512 and NEON have fabs() instructions.
(sorry, I know you're all tired of me and my "constants are awesome" comments)
fabsf/fneg are single instruction (two uops) with a memory load. You only need a single constant for all of these, which has a high bit set and low bits unset (sign mask).
fneg(v) = xor(v, -0.f)
fabs(v) = andnot(v, -0.f)
So this adds at most one 16b constant to the function that needs to perform this math. -0.f is amazingly useful in other cases, often it can be present in the user code as well to do operations like copysign.
If you don’t like constants, alternative lowering options:
fneg: xorps (to produce zero) + subps; 2 insn if you can’t find a ready to use zero register, 1 if you can fabs: fneg + maxps = 3 insn (2 if you already have 0 reg).
I am not sure of the exact NaN and negative zero semantics that the spec mandates for these so this may or may not be compliant.
Unfortunately, the alternative lowering options won't work since fneg and fabs are specced as purely binary operations, see https://webassembly.github.io/spec/core/exec/numerics.html#op-fneg so the nans need to be preserved.
In both cranelift (ignore the bitcasts) and v8, floating-point absolute value and floating-point negation are 3-instruction lowerings. I don't believe there is any better lowering than these (is there?) for this type of instruction, but I opened this issue to discuss the inefficiency: I would guess @dtig would say that these are in the class of useful instructions to have (and I agree) but predicting performance across all platforms might be more transparent if Emscripten, e.g., generated Wasm without these instructions. In other words, if Emscripten generated a sequence like
const + shift + and
instead offabs
it would be more obvious what the cost would be and Wasm compilers could still use a peephole optimization to convertconst + shift + and
to a single instruction on platforms that support this.