Open Maratyszcza opened 3 years ago
This causes issues in performance-sensitive numerical software, as it can rarely guarantee that no denormalized numbers occur during computations.
Could you give specific examples of such software?
A commonly discussed example is audio processing. Around the time that wasm's MVP floating-point semantics were being designed, I asked a codec developer about this and they said it wasn't a major problem for them; they had a few specific places in their code where they handled it. And they said they preferred handling it that way because that meant they weren't dependent on platform-specific configuration mechanisms, and that they don't have to worry about breaking other software they happen to be sharing a thread with.
Between then and now, I'm not aware of anyone having raised this issue for wasm.
Additional cons of exposing this as a control word, either opaque or virtualized, include:
I think I like the virtualized control word better; the potential performance overhead will be minimal, since changing or reading the control word should happen outside of a hot loop.
However, exposing the FPU control word (either virtualized or opaque) will require us to change all existing SIMD floating-point operators to read the current FPU control word to determine how to handle denormals. Engines can definitely optimize this away, but spec-wise we would have to make this change.
Also, can you elaborate on how this fits into relaxed-simd: how do the FPU control word and your suggested instructions introduce non-determinism or expose differences between platforms?
The combination of these two means that this proposal would silently change the behavior of existing deployed wasm modules.
With some restrictions on when we can change the control word, we might be able to avoid changing existing behavior, e.g. with a block-level syntax for the control word (similar to a context manager and the with statement in Python):
(with disable_denormals
  (code)
  (call f))
In that case f will have a different behavior than if it weren't called inside the disable_denormals block.
We could also require that the FPU control word be reset on function calls, which is a bit weird. (Code-generation-wise, resetting the FPU control word on function entry is more complicated than just generating a hardware read/write of the FPU control word at block entry and exit. But this reset logic will be needed anyway for integration with JS.)
As this proposal is not isolated to SIMD operations and changes the semantics of existing instructions, transferring this issue to the main design repository with agreement from both @Maratyszcza and @ngzhian.
The conversation above focuses largely on performance. One of the features Unity develops is Deterministic Physics, which addresses the problem that physics simulations are traditionally not guaranteed to be portable across hardware and OS platforms: in large software stacks and build systems it is easy for different platforms to compile floating-point simulation code differently (and for the simulation timing loops not to match).
This deterministic simulation feature is standardized around DAZ/FTZ being active in the FPU control word on other platforms. This is because ARM NEON does not support denormals, so we standardize on the lowest common denominator.
This enables Unity developers for example to run physics simulations on a wide range of devices/OSes, executing simulations in parallel, being guaranteed that networked clients won't desynchronize; or they can serialize-deserialize state to continue simulations on different clients, or be guaranteed that e.g. physics-related research results will be reproducible outside the developer's machine.
Now with Wasm not supporting controlling FPU flags, the Deterministic Physics feature excludes the web platform, and people do semi-regularly report bugs that physics simulations do not work "properly" in WebAssembly. It would be great to achieve parity here.
The points that @sunfishcode lists as cons are quite striking for performance however. I think it would be ideal if, in addition to fine-grained control, Wasm Modules could define on the Module level what granularity they will need/use for the FPU control word support within the module. This way "pay-only-for-what-you-use" would be achieved and modules that can align with the JS defaults and don't need to control the FPU would not be impacted?
Denormalized floating-point numbers are the non-zero floating-point numbers that are smallest in magnitude. They were introduced to guarantee that if a != b then a - b != 0.0. For denormalized numbers the biased exponent has its minimal value, and the mantissa is interpreted differently, practically as a fixed-point number. Because of the difference in formats between normalized and denormalized numbers, it is typical for processors to handle denormals in microcode or at greatly reduced performance. This causes issues in performance-sensitive numerical software, as it can rarely guarantee that no denormalized numbers occur during computations. For this reason, SIMD instruction sets also offer mechanisms for disabling hardware support for denormalized numbers and treating them as zero instead. I suggest that we expose this capability in the Relaxed SIMD specification.
Native instruction sets offer the capability to disable denormalized numbers via FPU control word registers: the program reads the current value of the control word, modifies a few bits in it, and writes the new value into the FPU control word. When the floating-point computation is finished, the program restores the original state by writing back the original value of the FPU control word.
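To make the a != b guarantee concrete, here is a small Python sketch using doubles. The `flush_to_zero` helper is a software model written for this illustration only (it is not a real FPU setting), and the sketch assumes the interpreter runs with denormals enabled, which is the usual default.

```python
import math
import sys

# Smallest positive normalized double (2**-1022).
MIN_NORMAL = sys.float_info.min


def flush_to_zero(x: float) -> float:
    """Software model of treating denormals as zero; keeps the sign of x."""
    if x != 0.0 and abs(x) < MIN_NORMAL:
        return math.copysign(0.0, x)
    return x


a = 1.5 * MIN_NORMAL
b = 1.0 * MIN_NORMAL
diff = a - b  # 0.5 * MIN_NORMAL, representable only as a denormal

# With denormals, distinct values have a non-zero difference.
keeps_guarantee = (a != b) and (diff != 0.0)

# Once denormals are treated as zero, a != b but a - b == 0.0.
loses_guarantee = (a != b) and (flush_to_zero(diff) == 0.0)

print(keeps_guarantee, loses_guarantee)
```

Both `a` and `b` are normalized, so the loss only shows up in their difference; this is the trade-off that disabling denormals accepts in exchange for speed.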
I envision three possible ways to expose control over treating denormalized numbers as zeroes:
Expose opaque FPU control word
read_fpu_control(void) -> control_word: reads the FPU control word and returns it as a special opaque data type.
disable_denormals(control_word) -> control_word: modifies the bits of the control word to disable denormal processing.
write_fpu_control(control_word): writes the control word value stored in the opaque data type into the FPU control word register.
Pros
Cons
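For the opaque control word option, the intended usage pattern (save, modify, write, compute, restore) might look like the following Python model. The three function names come from the proposal text; the `ControlWord` class, the simulated `_register`, the `_DENORMALS_OFF` bit, and the `run_kernel` and `denormals_disabled` helpers are assumptions made for this sketch only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlWord:
    """Opaque token: callers pass it around but are not meant to inspect it."""
    _bits: int


# Hypothetical simulated register; a real engine would use the hardware register.
_DENORMALS_OFF = 0x1  # arbitrary bit chosen for this sketch only
_register = 0


def read_fpu_control() -> ControlWord:
    return ControlWord(_register)


def disable_denormals(word: ControlWord) -> ControlWord:
    return ControlWord(word._bits | _DENORMALS_OFF)


def write_fpu_control(word: ControlWord) -> None:
    global _register
    _register = word._bits


def denormals_disabled() -> bool:
    """Helper for the sketch: inspect the simulated register."""
    return bool(_register & _DENORMALS_OFF)


def run_kernel(kernel):
    """Save the control word, disable denormals, run the kernel, then restore."""
    saved = read_fpu_control()
    write_fpu_control(disable_denormals(saved))
    try:
        return kernel()
    finally:
        write_fpu_control(saved)  # restore even if the kernel raises


seen_inside = run_kernel(denormals_disabled)
```

Because the token is opaque, the program never depends on the native bit layout; it can only hand back a value it previously obtained from the engine.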
Expose virtualized FPU control word
WAsm could provide a read_fpu_control() -> u32 instruction to read the FPU control word and a write_fpu_control(u32) instruction to write it back. The FPU control word is virtualized and has the denormal control bits in a fixed position. The V8 engine would have to translate between this virtualized control word and the native control word of the architecture.
Pros
Cons
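For the virtualized control word option, the translation an engine would perform could be sketched as follows. The virtual bit layout (`VIRT_FTZ`, `VIRT_DAZ`) is invented for this illustration and is not part of any proposal; the native positions are the x86 MXCSR FTZ and DAZ bits and the AArch64 FPCR.FZ bit, and folding both virtual bits into the single FZ bit is a simplifying assumption.

```python
# Virtual layout (assumption of this sketch):
# bit 0 = flush denormal results to zero, bit 1 = treat denormal inputs as zero.
VIRT_FTZ = 1 << 0
VIRT_DAZ = 1 << 1

# Native bit positions.
X86_MXCSR_DAZ = 1 << 6      # MXCSR.DAZ
X86_MXCSR_FTZ = 1 << 15     # MXCSR.FTZ
AARCH64_FPCR_FZ = 1 << 24   # FPCR.FZ, one bit for flush-to-zero mode


def virtual_to_x86(virt: int, native: int) -> int:
    """Return the MXCSR value `native` with its denormal bits taken from `virt`."""
    native &= ~(X86_MXCSR_DAZ | X86_MXCSR_FTZ)
    if virt & VIRT_FTZ:
        native |= X86_MXCSR_FTZ
    if virt & VIRT_DAZ:
        native |= X86_MXCSR_DAZ
    return native


def x86_to_virtual(native: int) -> int:
    """Extract only the denormal bits of an MXCSR value into the virtual layout."""
    virt = 0
    if native & X86_MXCSR_FTZ:
        virt |= VIRT_FTZ
    if native & X86_MXCSR_DAZ:
        virt |= VIRT_DAZ
    return virt


def virtual_to_aarch64(virt: int, native: int) -> int:
    """Return the FPCR value `native` with FZ set if either virtual bit is set."""
    native &= ~AARCH64_FPCR_FZ
    if virt & (VIRT_FTZ | VIRT_DAZ):
        native |= AARCH64_FPCR_FZ
    return native


# 0x1F80 is the power-on default of MXCSR (all exceptions masked).
mxcsr = virtual_to_x86(VIRT_FTZ | VIRT_DAZ, 0x1F80)
print(hex(mxcsr))
```

All other native bits (rounding mode, exception masks) pass through untouched, so the wasm program can only observe and change the denormal bits.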
Specify handling of denormalized numbers as a function attribute
WAsm could provide a function attribute to disable handling of denormalized numbers within the function. The V8 engine would disable handling of denormalized numbers in the function prologue and restore it in the function epilogue. If the function calls other functions, it would also have to restore handling of denormalized numbers before each call and disable it again after the call returns.
Pros
Cons