Open yurydelendik opened 1 year ago
I wonder if it will be easier to (formally?) define algorithms and then use that as a base.
More analysis, V8 and SpiderMonkey somewhat similar algorithms for i32x4.relaxed_trunc_f32x4_u :
i32x4.relaxed_trunc_f32x4_u (v8):
0xed74b2397a0 20 c44178c23a01 vcmpps xmm15,xmm0,[r10], (lt)
0xed74b2397a8 28 c501dbf8 vpand xmm15,xmm15,xmm0
0xed74b2397ac 2c c4c179efc7 vpxor xmm0,xmm0,xmm15
; xmm0 keeps all values that are NaNs or >= 2^31, xmm15 the rest
0xed74b2397b1 31 c4417a5bff vcvttps2dq xmm15,xmm15
0xed74b2397b6 36 c5f858c0 vaddps xmm0,xmm0,xmm0
0xed74b2397ba 3a c5f972f008 vpslld xmm0,xmm0,8
0xed74b2397bf 3f c4c179fec7 vpaddd xmm0,xmm0,xmm15
i32x4.relaxed_trunc_f32x4_u (sm):
0000001A 44 0f 28 3d 2e 00 00 00 movapsx 0x0000000000000050, %xmm15
00000022 44 0f c2 f8 01 cmpps $0x01, %xmm0, %xmm15
00000027 66 44 0f db f8 pand %xmm0, %xmm15
0000002C 66 41 0f ef c7 pxor %xmm15, %xmm0
; xmm15 keeps all values that are non-NaNs and >= 2^31, xmm0 the rest
00000031 c5 fa 5b c0 vcvttps2dq %xmm0, %xmm0
00000035 45 0f 58 ff addps %xmm15, %xmm15
00000039 66 41 0f 72 f7 08 pslld $0x08, %xmm15
0000003F 66 41 0f fe c7 paddd %xmm15, %xmm0
00
The i32x4.relaxed_trunc_f64x2_u_zero pretty much identical:
i32x4.relaxed_trunc_f64x2_u_zero (v8):
0x301da9b00796 16 c4e37909c00b vroundpd xmm0,xmm0,0xb
0x301da9b0079c 1c 49bab0b7d50e01000000 REX.W movq r10,0x10ed5b7b0
0x301da9b007a6 26 c4c1795802 vaddpd xmm0,xmm0,[r10]
0x301da9b007ab 2b c4c178c6c788 vshufps xmm0,xmm0,xmm15,0x88
i32x4.relaxed_trunc_f64x2_u_zero (sm):
00000019 c4 e3 79 09 c0 0b vroundpd $0x0B, %xmm0, %xmm0
0000001F 44 0f 28 3d 29 00 00 00 movapsx 0x0000000000000050, %xmm15
00000027 66 41 0f 58 c7 addpd %xmm15, %xmm0
0000002C 41 0f c6 c7 88 shufps $0x88, %xmm15, %xmm0
The 0xFFFFFFFE
comes from float64 add operation:
> new Float64Array([-1 + 4503599627370496]).buffer
ArrayBuffer {
[Uint8Contents]: <fe ff ff ff ff ff 2f 43>,
byteLength: 8
}
For i32x4.relaxed_trunc_f64x2_u_zero, are we missing some instructions?
Marat's suggested codegen is:
VXORPD xmm_tmp, xmm_tmp, xmm_tmp
VMAXPD xmm_y, xmm_x, xmm_tmp
VMINPD xmm_y, xmm_y, [wasm_f64x2_splat(4294967295.0)]
VROUNDPD xmm_y, xmm_y, 0x0B
VADDPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
VSHUFPS xmm_y, xmm_y, xmm_xmp, 0x88
I don't see xorpd, vmaxpd, vminpd, in your analysis.
The vmaxpd should get rid of the -1
, so you won't get -1 + 4503599627370496
I don't see xorpd, vmaxpd, vminpd, in your analysis.
Oh yes, thanks for the pointer. Was looking at the saturated one. I think that algorithm is wrong. The way we spec i32x4.trunc_sat_f64x2_u_zero, it should either be 0 or 0xFFFFFF for out of range or NaNs, this is the AVX512F instruction VCVTTPD2UDQ (and FCVTZU + UQXTN on AArch64). Otherwise, it should fallback to SIMD trunc+saturate semantics. If we want to use that algorithm, we would have to change the allowed list of values. @Maratyszcza wdyt?
The algorithm used in V8 is wrong, it should fall back to SIMD trunc + saturate semantics. So really, the implementation is the same pre AVX512F. @dtig https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:v8/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.h;l=750;drc=2450f2f5d0ce0da9b8cf493c533f9528ff17bab6 will need to use the SIMD trunc implementation. @Maratyszcza fyi
Edit: had offline discussion with Marat, he prefers to add allow these constants, as long as they don't open up more fingerprinting.
Let's wait for https://github.com/WebAssembly/relaxed-simd/pull/144 to land (spec changes) then we can merge this.
Additional variants for i32x4.relaxed_trunc_f32x4_u and i32x4.relaxed_trunc_f64x2_u_zero based on algorithms implemented by SpiderMonkey and v8.