Open ngzhian opened 3 years ago
IIRC @zeux had a use-case for these instructions.
It would be useful to consider f64x2 variants in the same proposal.
For i32x4.trunc_sat_f32x4_u, it will depend on implementation choice on x86/64:
- if AVX512 is available, same as above, x86/64 will return 0x80000000 in lanes for out of range or NaNs, ARM/ARM64 will return 0
AVX512 version returns 0xFFFFFFFF
Corrected, thanks!
It would be useful to consider f64x2 variants in the same proposal.
relaxed i64x2.trunc_sat_f64x2_{s,u}? We don't have these instructions in Simd128, so I think it is neater to separate them out.
The WebAssembly/simd#383 instructions i32x4.trunc_sat_f64x2_u_zero and i32x4.trunc_sat_f64x2_s_zero?
Yes
Yeah, this one is pretty fundamental for many workflows. E.g. in rendering domains it's common to store data as fixed-point integers for GPU consumption, but to prepare this data you do some math in floating point and then convert to integer via something like int(v * 65535.0f + 0.5f) (assuming the value is known to be positive); the float->int truncation can be pretty hot depending on the amount of other computation.
It would be nice to also include the rounding variants. On x64, assuming the default rounding mode setup, you can use cvtps2dq for a rounding conversion and cvttps2dq for a truncating one. I'm unsure what floating-point environment is typically used in a browser context; if it's undefined, then rounding would require vroundps before cvttps2dq.
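To make the fixed-point use case above concrete, here is a minimal sketch of the `int(v * 65535.0f + 0.5f)` pattern in scalar Python. The function name `quantize_unorm16` is hypothetical, and Python computes in double precision rather than f32, so this only illustrates the truncate-after-bias idea, not the SIMD code path.

```python
def quantize_unorm16(v: float) -> int:
    """Map v in [0, 1] to a 16-bit fixed-point value (hypothetical helper)."""
    # int() truncates toward zero, so adding 0.5 first rounds
    # non-negative inputs to the nearest integer.
    return int(v * 65535.0 + 0.5)


if __name__ == "__main__":
    for v in (0.0, 0.5, 1.0):
        print(v, quantize_unorm16(v))
```

Because the caller already guarantees the input range, the out-of-range behavior of the conversion never matters here, which is the situation the relaxed instructions target.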
What will be the exact recipe for relaxed i32x4.trunc_sat_f32x4_u for x86/64 without AVX512? The comment at #247 suggests a somewhat long version.
Is the following acceptable, or does a shorter version exist?
y = relaxed i32x4.trunc_sat_f32x4_u(x) is lowered to:
MOVAPD xmm_y, xmm_x
MOVAPD xmm_tmp, [wasm_i32x4_splat(0x4f000000)]
CMPLTPS xmm_tmp, xmm_x
PAND xmm_tmp, xmm_x
PXOR xmm_y, xmm_tmp
CVTTPS2DQ xmm_y, xmm_y
PSLLD xmm_tmp, 7
PADDD xmm_y, xmm_tmp
It will be CVTTPS2DQ. The relaxed version only guarantees the output when inputs are < INT32_MAX and not NaN, which is exactly what CVTTPS2DQ provides, and it has been available since SSE2.
@ngzhian The question was about the unsigned version, and IIUC we don't expect the unsigned version to use just CVTTPS2DQ alone.
Oh oops, sorry I missed that. Hm, then we should reconsider if we want the unsigned version in this. AVX512 is not supported by V8 yet.
IMO it is worth having the unsigned version, both for symmetry and because it is still faster on SSE4.1 than the non-relaxed unsigned version.
Should these instructions have _sat in the name? In the SIMD MVP _sat stands for saturating, but these instructions don't specify exact behavior for out of range inputs.
What is PSLLD xmm_tmp, 7 for? I think it doesn't work for all cases. Consider the input 2147483904.0: this is larger than MAX_INT32 but fits in UINT32, so the result should be 2147483904, or 0x80000100.
The hex representation of 2147483904.0 is https://float.exposed/0x4f000001 and if we shift left by 7 it becomes 0x80000080, which is wrong.
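The counterexample above can be checked with a few lines of Python. This is only a sketch of the bit arithmetic on a single lane; the helper name `f32_bits` is an assumption for illustration, and the 32-bit mask stands in for the lane width of PSLLD.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


x = 2147483904.0                      # 2^31 + 256, exactly representable in f32
bits = f32_bits(x)                    # raw f32 encoding of x
shifted = (bits << 7) & 0xFFFFFFFF    # what PSLLD xmm_tmp, 7 leaves in the lane
expected = int(x)                     # the value the unsigned conversion should give

if __name__ == "__main__":
    print(hex(bits), hex(shifted), hex(expected))
```

The shifted value differs from the expected one in the low bits because a shift by 7 places the mantissa one bit too low.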
Agree, there was a mistake 😞 One more operation is needed to make PSLLD work: ADDPS xmm_tmp, xmm_tmp ; PSLLD xmm_tmp, 8.
Perfect, what a neat trick :) thanks!
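For readers who want to see why doubling and then shifting by 8 works, here is a single-lane Python model of the corrected sequence. It is a sketch under assumptions: the helper names are hypothetical, inputs are assumed to be f32-representable values in [0, 2^32), and `cvttps2dq_lane` only models the documented 0x80000000 result of CVTTPS2DQ for out-of-range or NaN inputs.

```python
import struct

MASK32 = 0xFFFFFFFF


def f32_bits(x: float) -> int:
    """IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def cvttps2dq_lane(x: float) -> int:
    """Truncating f32 -> i32; 0x80000000 for NaN or out-of-range input."""
    if x != x or not (-2147483648.0 <= x < 2147483648.0):
        return 0x80000000
    return int(x) & MASK32


def relaxed_trunc_u_lane(x: float) -> int:
    """Model of one lane; assumes 0 <= x < 2^32 and x representable in f32."""
    big = 2147483648.0 < x            # CMPLTPS against splat(0x4f000000)
    tmp = x if big else 0.0           # PAND: keep x only in the big lanes
    y = 0.0 if big else x             # PXOR: zero the big lanes of y
    lo = cvttps2dq_lane(y)            # CVTTPS2DQ
    # ADDPS tmp, tmp bumps the exponent from 158 to 159, so its lowest bit is 1;
    # PSLLD by 8 then moves that bit to bit 31 and the mantissa to bits 8..30.
    hi = (f32_bits(tmp + tmp) << 8) & MASK32
    return (lo + hi) & MASK32         # PADDD


if __name__ == "__main__":
    for v in (1.5, 2147483648.0, 2147483904.0, 4294967040.0):
        print(v, hex(relaxed_trunc_u_lane(v)))
```

Note that an input of exactly 2^31 is not caught by the strict comparison, but CVTTPS2DQ's 0x80000000 result happens to be the right unsigned answer for it.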
Note: RISC-V V saturates for same width conversions. For f64x2->i32x4 it changes the vector type, and I think there's no guarantee that the top are zeroed.
On PowerPC VSX xscvdpsxws and xscvdpuxds perform trunc sat
I think I got the out of range results wrong in this description, ARM/ARM64 doesn't return 0, it saturates.
Relaxed versions of:
i32x4.trunc_sat_f32x4_s
i32x4.trunc_sat_f32x4_u
i32x4.trunc_sat_f64x2_s_zero
i32x4.trunc_sat_f64x2_u_zero
from Simd128. (Names undecided)
Convert f32x4/f64x2 to i32x4 with truncation (signed/unsigned). If the inputs are out of range or NaNs, the result is implementation-defined.
x86/64
- relaxed i32x4.trunc_sat_f32x4_s = CVTTPS2DQ
- relaxed i32x4.trunc_sat_f32x4_u = VCVTTPS2UDQ (AVX512), Simd128 i32x4.trunc_sat_f32x4_u otherwise (can be slightly optimized to ignore NaNs)
- relaxed i32x4.trunc_sat_f64x2_s_zero = CVTTPD2DQ
- relaxed i32x4.trunc_sat_f64x2_u_zero = VCVTTPD2UDQ (AVX512), Simd128 i32x4.trunc_sat_f64x2_u_zero
ARM64
- relaxed i32x4.trunc_sat_f32x4_s = FCVTZS
- relaxed i32x4.trunc_sat_f32x4_u = FCVTZU
- relaxed i32x4.trunc_sat_f64x2_s_zero = FCVTZS + SQXTN
- relaxed i32x4.trunc_sat_f64x2_u_zero = FCVTZU + UQXTN
ARM NEON
- relaxed i32x4.trunc_sat_f32x4_s = vcvt.S32.F32
- relaxed i32x4.trunc_sat_f32x4_u = vcvt.U32.F32
- relaxed i32x4.trunc_sat_f64x2_s_zero = vcvt.S32.F64 + vcvt.S32.F64 + vmov
- relaxed i32x4.trunc_sat_f64x2_u_zero = vcvt.U32.F64 + vcvt.U32.F64 + vmov
Note: On ARM MVE, double precision conversions require Armv8-M Floating-point Extension (FPv5), MVE can be implemented with or without such an extension.
simd128
respective non-relaxed versions i32x4.trunc_sat_f32x4_s, i32x4.trunc_sat_f32x4_u, i32x4.trunc_sat_f64x2_s_zero, i32x4.trunc_sat_f64x2_u_zero.
- For i32x4.trunc_sat_f32x4_s: 0x80000000 in lanes for out of range or NaNs
- For i32x4.trunc_sat_f32x4_u: 0xFFFFFFFF in lanes for out of range or NaNs if AVX512 is available, 0 otherwise (but requires more instructions)
- For i32x4.trunc_sat_f64x2_s_zero: 0x80000000 for out of range or NaNs
- For i32x4.trunc_sat_f64x2_u_zero: 0xFFFFFFFF for out of range or NaNs if AVX512 is available, 0 otherwise
Conversion instructions are common; if the application can guarantee the input range, we can get good performance on all architectures.
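As an illustration of what "implementation-defined" means for the signed f32 conversion, the following Python sketch models one lane on two kinds of hardware. It assumes the x86/64 result listed above (0x80000000 for out of range or NaN) and the correction made in this thread that ARM saturates, with NaN converting to 0 as FCVTZS does; the function names are hypothetical.

```python
import math

MASK32 = 0xFFFFFFFF


def trunc_s_x86_like(x: float) -> int:
    """One lane modeled on CVTTPS2DQ: 0x80000000 for NaN or out of range."""
    if math.isnan(x) or not (-2147483648.0 <= x < 2147483648.0):
        return 0x80000000
    return int(x) & MASK32


def trunc_s_arm_like(x: float) -> int:
    """One lane modeled on FCVTZS: saturate to the int32 range, NaN gives 0."""
    if math.isnan(x):
        return 0
    clamped = max(-2147483648.0, min(x, 2147483647.0))
    return int(clamped) & MASK32


if __name__ == "__main__":
    for v in (1.9, -1.9, 3e9, float("nan")):
        print(v, hex(trunc_s_x86_like(v)), hex(trunc_s_arm_like(v)))
```

The two models agree for every in-range input and differ only where the relaxed instruction leaves the result unspecified, which is why an application that guarantees its input range sees the same values everywhere.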