Alternative output for i32x4_relaxed_trunc.wast tests

yurydelendik commented 1 year ago

Additional variants for i32x4.relaxed_trunc_f32x4_u and i32x4.relaxed_trunc_f64x2_u_zero, based on the algorithms implemented by SpiderMonkey and V8.

yurydelendik commented 1 year ago

I wonder if it will be easier to (formally?) define algorithms and then use that as a base.

yurydelendik commented 1 year ago

More analysis: V8 and SpiderMonkey use somewhat similar algorithms for i32x4.relaxed_trunc_f32x4_u:

i32x4.relaxed_trunc_f32x4_u (v8):

0xed74b2397a0    20  c44178c23a01         vcmpps xmm15,xmm0,[r10], (lt)
0xed74b2397a8    28  c501dbf8             vpand xmm15,xmm15,xmm0
0xed74b2397ac    2c  c4c179efc7           vpxor xmm0,xmm0,xmm15
; xmm0 keeps all values that are NaNs or >= 2^31, xmm15 the rest
0xed74b2397b1    31  c4417a5bff           vcvttps2dq xmm15,xmm15
0xed74b2397b6    36  c5f858c0             vaddps xmm0,xmm0,xmm0
0xed74b2397ba    3a  c5f972f008           vpslld xmm0,xmm0,8
0xed74b2397bf    3f  c4c179fec7           vpaddd xmm0,xmm0,xmm15

i32x4.relaxed_trunc_f32x4_u (sm):

0000001A  44 0f 28 3d 2e 00 00 00               movapsx 0x0000000000000050, %xmm15
00000022  44 0f c2 f8 01                        cmpps $0x01, %xmm0, %xmm15
00000027  66 44 0f db f8                        pand %xmm0, %xmm15
0000002C  66 41 0f ef c7                        pxor %xmm15, %xmm0
; xmm15 keeps all values that are non-NaNs and >= 2^31, xmm0 the rest
00000031  c5 fa 5b c0                           vcvttps2dq %xmm0, %xmm0
00000035  45 0f 58 ff                           addps %xmm15, %xmm15
00000039  66 41 0f 72 f7 08                     pslld $0x08, %xmm15
0000003F  66 41 0f fe c7                        paddd %xmm15, %xmm0
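To make the difference between the two sequences easier to see, here is a minimal per-lane sketch in JavaScript. It is only a model under stated assumptions: the function names are made up for illustration, the comparison constant is taken to be 2^31 as the listing comments say (the actual constants loaded from memory are not shown above), and NaN is assumed to have the quiet pattern 0x7FC00000.

```javascript
// Per-lane model of the two x64 sequences above. All names here are
// illustrative and are not engine APIs.
const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer);

// Raw IEEE-754 bits of a value after rounding it to float32.
function f32Bits(x) {
  f32[0] = x;
  return u32[0];
}

// cvttps2dq on one lane: truncate toward zero; NaN and values outside
// the int32 range produce the "integer indefinite" value 0x80000000.
function cvttps2dq(x) {
  if (Number.isNaN(x) || x >= 2147483648 || x < -2147483648) {
    return 0x80000000;
  }
  return Math.trunc(x) >>> 0;
}

// addps x,x followed by pslld 8 on the resulting bit pattern. For x in
// [2^31, 2^32) the doubling sets the lowest exponent bit, which the
// shift moves into bit 31 while the mantissa lands in bits 8..30.
function doubleAndShift(x) {
  return (f32Bits(x + x) << 8) >>> 0;
}

// V8 ordering: the "high" half keeps NaNs and values >= 2^31.
function relaxedTruncV8(x) {
  x = Math.fround(x);
  const below = x < 2147483648; // false for NaN
  const lo = below ? x : 0;
  const hi = below ? 0 : x;
  return (cvttps2dq(lo) + doubleAndShift(hi)) >>> 0;
}

// SpiderMonkey ordering: the "high" half keeps only non-NaN values
// >= 2^31, so NaN goes through cvttps2dq instead.
function relaxedTruncSM(x) {
  x = Math.fround(x);
  const big = x >= 2147483648; // false for NaN
  const hi = big ? x : 0;
  const lo = big ? 0 : x;
  return (cvttps2dq(lo) + doubleAndShift(hi)) >>> 0;
}
```

Under these assumptions the two models agree on in-range inputs (3000000000 maps to 3000000000) and differ for NaN: the SpiderMonkey model returns 0x80000000, while the V8 model returns the NaN bit pattern shifted left by 8, i.e. 0xC0000000.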

The i32x4.relaxed_trunc_f64x2_u_zero sequences are pretty much identical:

i32x4.relaxed_trunc_f64x2_u_zero (v8):

0x301da9b00796    16  c4e37909c00b         vroundpd xmm0,xmm0,0xb
0x301da9b0079c    1c  49bab0b7d50e01000000 REX.W movq r10,0x10ed5b7b0
0x301da9b007a6    26  c4c1795802           vaddpd xmm0,xmm0,[r10]
0x301da9b007ab    2b  c4c178c6c788         vshufps xmm0,xmm0,xmm15,0x88

i32x4.relaxed_trunc_f64x2_u_zero (sm):

00000019  c4 e3 79 09 c0 0b                     vroundpd $0x0B, %xmm0, %xmm0
0000001F  44 0f 28 3d 29 00 00 00               movapsx 0x0000000000000050, %xmm15
00000027  66 41 0f 58 c7                        addpd %xmm15, %xmm0
0000002C  41 0f c6 c7 88                        shufps $0x88, %xmm15, %xmm0

The 0xFFFFFFFE comes from the float64 add operation:

> new Float64Array([-1 + 4503599627370496]).buffer
ArrayBuffer {
  [Uint8Contents]: <fe ff ff ff ff ff 2f 43>,
  byteLength: 8
}
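The same effect can be reproduced per lane with a short JavaScript sketch. This is a model only: the helper names are hypothetical, and it assumes the constant added by vaddpd/addpd is 2^52 (4503599627370496), as in the REPL line above.

```javascript
// Hypothetical helper: low 32 bits of a float64's IEEE-754 encoding.
function low32(d) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, d, true); // little-endian, as on x64
  return view.getUint32(0, true);
}

// One lane of the sequence: vroundpd 0x0B (round toward zero), add 2^52,
// then keep only the low dword of the double (what the shuffle selects).
function relaxedTruncF64U(x) {
  return low32(Math.trunc(x) + 4503599627370496);
}
```

For non-negative integers below 2^32 the low dword is the integer itself, e.g. relaxedTruncF64U(3.7) gives 3. For -1 the sum is 2^52 - 1, which lies below 2^52 and therefore has a smaller exponent and a one-bit-shifted mantissa, so the low dword is 0xFFFFFFFE.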

ngzhian commented 1 year ago

For i32x4.relaxed_trunc_f64x2_u_zero, are we missing some instructions?

Marat's suggested codegen is:

VXORPD xmm_tmp, xmm_tmp, xmm_tmp
VMAXPD xmm_y, xmm_x, xmm_tmp
VMINPD xmm_y, xmm_y, [wasm_f64x2_splat(4294967295.0)]
VROUNDPD xmm_y, xmm_y, 0x0B
VADDPD xmm_y, xmm_y, [wasm_f64x2_splat(0x1.0p+52)]
VSHUFPS xmm_y, xmm_y, xmm_tmp, 0x88

I don't see xorpd, vmaxpd, vminpd, in your analysis.

The vmaxpd should get rid of the -1, so you won't get -1 + 4503599627370496.
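As a rough per-lane illustration of why the clamp matters, here is a JavaScript sketch of the suggested sequence. The names are hypothetical, and the max/min are written the way x64 MAXPD/MINPD select their second operand when the comparison is false, which is what sends NaN to 0 here.

```javascript
// Hypothetical helper: low 32 bits of a float64's IEEE-754 encoding.
function low32(d) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, d, true); // little-endian, as on x64
  return view.getUint32(0, true);
}

// One lane of the clamped sequence.
function relaxedTruncF64UClamped(x) {
  // VMAXPD with zero: negative inputs and NaN become 0.
  const atLeastZero = x > 0 ? x : 0;
  // VMINPD with 4294967295.0: large inputs saturate.
  const clamped = atLeastZero < 4294967295 ? atLeastZero : 4294967295;
  // VROUNDPD 0x0B (truncate), VADDPD with 2^52, then keep the low dword.
  return low32(Math.trunc(clamped) + 4503599627370496);
}
```

With the clamp in place, -1 and NaN both map to 0 and anything above 4294967295 maps to 0xFFFFFFFF, so the 0xFFFFFFFE result never appears.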

yurydelendik commented 1 year ago

> I don't see xorpd, vmaxpd, vminpd, in your analysis.

I'm reading https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:v8/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.h;l=750;drc=2450f2f5d0ce0da9b8cf493c533f9528ff17bab6

ngzhian commented 1 year ago

Oh yes, thanks for the pointer. I was looking at the saturated one. I think that algorithm is wrong. The way we spec i32x4.trunc_sat_f64x2_u_zero, the result should be either 0 or 0xFFFFFFFF for out-of-range inputs or NaNs; this matches the AVX512F instruction VCVTTPD2UDQ (and FCVTZU + UQXTN on AArch64). Otherwise, it should fall back to SIMD trunc+saturate semantics. If we want to use that algorithm, we would have to change the allowed list of values. @Maratyszcza wdyt?

ngzhian commented 1 year ago

The algorithm used in V8 is wrong; it should fall back to SIMD trunc + saturate semantics. So really, the implementation is the same pre-AVX512F. @dtig https://source.chromium.org/chromium/chromium/src/+/refs/heads/main:v8/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.h;l=750;drc=2450f2f5d0ce0da9b8cf493c533f9528ff17bab6 will need to use the SIMD trunc implementation. @Maratyszcza fyi

Edit: had an offline discussion with Marat; he prefers to allow these constants, as long as they don't open up more fingerprinting.

ngzhian commented 1 year ago

Let's wait for https://github.com/WebAssembly/relaxed-simd/pull/144 to land (spec changes) then we can merge this.

WebAssembly / relaxed-simd

Alternative output for i32x4_relaxed_trunc.wast tests #140