WebAssembly / relaxed-simd

Relax the strict determinism requirements of SIMD operations.

relaxed i32x4.trunc_sat_f32x4_{s,u} i32x4.trunc_sat_f64x2_{s,u}_zero #21

Open ngzhian opened 3 years ago

ngzhian commented 3 years ago
  1. What are the instructions being proposed?

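Relaxed versions of i32x4.trunc_sat_f32x4_s, i32x4.trunc_sat_f32x4_u, i32x4.trunc_sat_f64x2_s_zero, and i32x4.trunc_sat_f64x2_u_zero from Simd128. (Names undecided)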

  1. What are the semantics of these instructions?

Convert f32x4/f64x2 to i32x4 with truncation (signed/unsigned). If the inputs are out of range or NaNs, the result is implementation-defined.
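To make "implementation-defined" concrete, here is a rough scalar model in C of a single signed f32 lane on the two main targets. The function names are invented for illustration. The out-of-range and NaN results follow the documented behavior of CVTTPS2DQ (0x80000000, the "integer indefinite" value) and FCVTZS (saturation, NaN converts to 0), which matches the correction at the end of this thread.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical model of one lane of x86-64 CVTTPS2DQ:
   NaN and out-of-range inputs produce 0x80000000. */
static int32_t trunc_s_x86_model(float v) {
    if (isnan(v) || v >= 2147483648.0f || v < -2147483648.0f)
        return INT32_MIN;
    return (int32_t)v; /* in range: plain truncation toward zero */
}

/* Hypothetical model of one lane of ARM64 FCVTZS:
   out-of-range inputs saturate, NaN produces 0. */
static int32_t trunc_s_arm64_model(float v) {
    if (isnan(v)) return 0;
    if (v >= 2147483648.0f) return INT32_MAX;
    if (v < -2147483648.0f) return INT32_MIN;
    return (int32_t)v;
}

/* In-range inputs agree; out-of-range and NaN inputs do not. */
void check_lane_models(void) {
    assert(trunc_s_x86_model(-1.9f) == trunc_s_arm64_model(-1.9f));
    assert(trunc_s_x86_model(3e9f) != trunc_s_arm64_model(3e9f));
    assert(trunc_s_x86_model(NAN) != trunc_s_arm64_model(NAN));
}
```

A relaxed instruction would be allowed to behave like either model, so portable code can only rely on the in-range case.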

  1. How will these instructions be implemented? Give examples for at least x86-64 and ARM64. Also provide reference implementation in terms of 128-bit Wasm SIMD.

x86/64

  • relaxed i32x4.trunc_sat_f32x4_s = CVTTPS2DQ
  • relaxed i32x4.trunc_sat_f32x4_u = VCVTTPS2UDQ (AVX512), Simd128 i32x4.trunc_sat_f32x4_u otherwise (can be slightly optimized to ignore NaNs)
  • relaxed i32x4.trunc_sat_f64x2_s_zero = CVTTPD2DQ
  • relaxed i32x4.trunc_sat_f64x2_u_zero = VCVTTPD2UDQ (AVX512), Simd128 i32x4.trunc_sat_f64x2_u_zero

ARM64

  • relaxed i32x4.trunc_sat_f32x4_s = FCVTZS
  • relaxed i32x4.trunc_sat_f32x4_u = FCVTZU
  • relaxed i32x4.trunc_sat_f64x2_s_zero = FCVTZS + SQXTN
  • relaxed i32x4.trunc_sat_f64x2_u_zero = FCVTZU + UQXTN

ARM NEON

  • relaxed i32x4.trunc_sat_f32x4_s = vcvt.S32.F32
  • relaxed i32x4.trunc_sat_f32x4_u = vcvt.U32.F32
  • relaxed i32x4.trunc_sat_f64x2_s_zero = vcvt.S32.F64 + vcvt.S32.F64 + vmov
  • relaxed i32x4.trunc_sat_f64x2_u_zero = vcvt.U32.F64 + vcvt.U32.F64 + vmov

Note: On ARM MVE, double precision conversions require the Armv8-M Floating-point Extension (FPv5); MVE can be implemented with or without that extension.

simd128

respective non-relaxed versions i32x4.trunc_sat_f32x4_s, i32x4.trunc_sat_f32x4_u, i32x4.trunc_sat_f64x2_s_zero, i32x4.trunc_sat_f64x2_u_zero.

  1. How does behavior differ across processors? What new fingerprinting surfaces will be exposed?

For i32x4.trunc_sat_f32x4_s:

For i32x4.trunc_sat_f32x4_u:

For i32x4.trunc_sat_f64x2_s_zero:

For i32x4.trunc_sat_f64x2_u_zero:

  1. What use cases are there?

Conversion instructions are common. If the application can guarantee the input range, we can get good performance on all architectures.

Maratyszcza commented 3 years ago

IIRC @zeux had a use-case for these instructions.

It would be useful to consider f64x2 variants in the same proposal.

Maratyszcza commented 3 years ago

For i32x4.trunc_sat_f32x4_u, it will depend on implementation choice on x86/64:

  • if AVX512 is available, same as above, x86/64 will return 0x80000000 in lanes for out of range or NaNs, ARM/ARM64 will return 0

AVX512 version returns 0xFFFFFFFF

ngzhian commented 3 years ago

AVX512 version returns 0xFFFFFFFF

Corrected, thanks!

ngzhian commented 3 years ago

It would be useful to consider f64x2 variants in the same proposal.

relaxed i64x2.trunc_sat_f64x2_{s,u}? We don't have these instructions in Simd128, so I think it is neater to separate them out.

Maratyszcza commented 3 years ago

relaxed i64x2.trunc_sat_f64x2_{s,u}? We don't have these instructions in Simd128, so I think it is neater to separate them out.

The WebAssembly/simd#383 instructions

ngzhian commented 3 years ago

i32x4.trunc_sat_f64x2_u_zero and i32x4.trunc_sat_f64x2_s_zero?

Maratyszcza commented 3 years ago

Yes

zeux commented 3 years ago

Yeah, this one is pretty fundamental for many workflows. E.g. in rendering domains it's common to store data as fixed-point integers for GPU consumption, but to prepare this data you do some math in floating point and then convert to integer via something like int(v * 65535.0f + 0.5f) (assuming the value is known to be positive); the float->int truncation can be pretty hot depending on the amount of other computation.
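As a small illustration of that pattern, here is a scalar C sketch of quantizing a value to a 16-bit unsigned normalized integer. The function name and the assumed input range [0, 1] are mine, not from the comment above. Because the caller guarantees the range, the conversion never hits the out-of-range or NaN cases, which is the situation the relaxed instructions target.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: quantize v to unorm16.
   Assumes the caller guarantees 0.0f <= v <= 1.0f and v is not NaN. */
static uint16_t quantize_unorm16(float v) {
    /* Adding 0.5 and truncating rounds to nearest for non-negative values. */
    return (uint16_t)(int32_t)(v * 65535.0f + 0.5f);
}

void check_quantize(void) {
    assert(quantize_unorm16(0.0f) == 0);
    assert(quantize_unorm16(0.5f) == 32768);
    assert(quantize_unorm16(1.0f) == 65535);
}
```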

It would be nice to also include the rounding variants. On x64, assuming the default rounding mode setup, you can use cvtps2dq for the rounding conversion and cvttps2dq for truncating. I'm unsure what floating point environment is typically used in a browser context; if it's undefined, then rounding would require vroundps before cvttps.

yurydelendik commented 3 years ago

What will be the exact recipe for relaxed i32x4.trunc_sat_f32x4_u for x86/64 without AVX512? The comment at #247 suggests a somewhat long version.

Is the following acceptable, or does a shorter version exist?

ngzhian commented 3 years ago

It will be CVTTPS2DQ. The relaxed version only guarantees the output when inputs are < INT32_MAX and not NaN, which is exactly what CVTTPS2DQ provides; it has been available since SSE2.

Maratyszcza commented 3 years ago

@ngzhian The question was about the unsigned version, and IIUC we don't expect the unsigned version to use just CVTTPS2DQ alone.

ngzhian commented 3 years ago

Oh oops, sorry I missed that. Hm, then we should reconsider if we want the unsigned version in this. AVX512 is not supported by V8 yet.

Maratyszcza commented 3 years ago

IMO it is worth having the unsigned version, both for symmetry and because it is still faster on SSE4.1 than the non-relaxed unsigned version.
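For context on why the unsigned case needs more than a single signed conversion, here is a scalar C sketch of one generic way to build an unsigned truncation from a signed one. This is not the instruction sequence discussed in this thread, only the usual bias trick, and the function name is hypothetical. It assumes the input is in [0, 2^32) and not NaN.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar sketch: unsigned truncation built from signed
   truncation. Assumes 0 <= v < 2^32 and v is not NaN. */
static uint32_t trunc_u_via_signed(float v) {
    if (v < 2147483648.0f)
        return (uint32_t)(int32_t)v; /* fits in the signed range */
    /* Bias into the signed range, convert, then add the bias back.
       The subtraction is exact for floats in [2^31, 2^32). */
    return (uint32_t)(int32_t)(v - 2147483648.0f) + 0x80000000u;
}

void check_trunc_u(void) {
    assert(trunc_u_via_signed(1.5f) == 1u);
    assert(trunc_u_via_signed(2147483904.0f) == 0x80000100u);
    assert(trunc_u_via_signed(4294967040.0f) == 0xFFFFFF00u);
}
```

A vector version has to do the compare, subtract, and merge per lane, which is where the extra instructions come from.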

zeux commented 3 years ago

Should these instructions have _sat in the name? In the SIMD MVP _sat stands for saturating, but these instructions don't specify exact behavior for out of range inputs.

ngzhian commented 3 years ago

What is PSLLD xmm_tmp, 7 for? I think it doesn't work for all cases. Consider the input 2147483904.0: this is larger than MAX_INT32, but fits in UINT32, so the result should be 2147483904, or 0x80000100. The hex representation of 2147483904.0 is https://float.exposed/0x4f000001 and if we shift left by 7 it becomes 0x80000080, which is wrong.

yurydelendik commented 3 years ago

Agree, there was a mistake 😞 One more operation is needed to make PSLLD work: ADDPS xmm_tmp, xmm_tmp ; PSLLD xmm_tmp, 8.
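A scalar C sketch of just this step, checked against the example input from the previous comment. The helper names are hypothetical, and this only models the bit manipulation for a single lane holding a value in [2^31, 2^32); it is not the full sequence. Doubling the float bumps the exponent from 158 to 159, so its lowest exponent bit is 1, and shifting the raw bits left by 8 puts that bit at bit 31 with the 23 mantissa bits directly below it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer. */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Original attempt: PSLLD by 7 on the raw bits (wrong mantissa scaling). */
static uint32_t shift7(float v) {
    return float_bits(v) << 7;
}

/* Fixed version: ADDPS v, v then PSLLD by 8.
   Assumes 2^31 <= v < 2^32. */
static uint32_t double_then_shift8(float v) {
    return float_bits(v + v) << 8;
}

void check_shift_trick(void) {
    assert(float_bits(2147483904.0f) == 0x4f000001u);
    assert(shift7(2147483904.0f) == 0x80000080u);             /* wrong */
    assert(double_then_shift8(2147483904.0f) == 0x80000100u); /* right */
}
```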

ngzhian commented 3 years ago

Agree, there was a mistake 😞 One more operation is needed to make PSLLD work: ADDPS xmm_tmp, xmm_tmp ; PSLLD xmm_tmp, 8.

Perfect, what a neat trick :) thanks!

ngzhian commented 2 years ago

Note: RISC-V V saturates for same-width conversions. For f64x2->i32x4 it changes the vector type, and I think there's no guarantee that the top lanes are zeroed.

ngzhian commented 2 years ago

On PowerPC VSX, xscvdpsxws and xscvdpuxds perform trunc sat.

ngzhian commented 2 years ago

I think I got the out-of-range results wrong in the description: ARM/ARM64 doesn't return 0; it saturates.