WebAssembly / simd

Branch of the spec repo scoped to discussion of SIMD in WebAssembly

#372 Integer Sign/Zero Extension for {8,16}->{32,64} #395

Closed omnisip closed 3 years ago

omnisip commented 3 years ago


This proposal mirrors #290: it adds new variants of the existing widen instructions and extends the 32-bit and 64-bit widen instructions to accept 16-bit and 8-bit integer inputs. The practical use case is signal processing, specifically audio and image processing, but the instructions are broadly useful. Outside image processing, they help any time someone wants to convert an 8-bit value to a floating-point number. Currently that requires multiple integer conversion steps before the conversion to float, even though modern architectures provide operations to convert from just about any integer size to another. Because an 8-bit source no longer maps onto just two halves of the destination (8 to 32 bits yields four groups, 8 to 64 bits yields eight), these instructions replace the high/low terminology with a constant immediate parameter. This PR supersedes #372 and provides the implementation guidelines for the proposal.
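As an illustration of the intended semantics, here is a minimal Python reference model of the two 8 to 32-bit instructions. It is a sketch under an assumption read off the masks later in this post: the immediate `c` selects source bytes `4c..4c+3`. The function name is hypothetical and not part of the proposal text.

```python
def widen_i8x16_to_i32x4(a: bytes, c: int, signed: bool) -> list[int]:
    """Model of i32x4.widen_i8x16_{s,u}: widen bytes 4c..4c+3 of `a` to four 32-bit lanes.

    Assumption: the immediate c picks one quarter of the 16-byte input.
    """
    if len(a) != 16 or not 0 <= c < 4:
        raise ValueError("expected a 16-byte vector and an immediate in 0..3")
    group = a[4 * c : 4 * c + 4]
    if signed:
        # Sign extension: bytes with the top bit set become negative values.
        return [b - 256 if b >= 128 else b for b in group]
    # Zero extension: the byte value is kept as-is.
    return list(group)


v = bytes([0x01, 0x80, 0xFF, 0x7F] + list(range(4, 16)))
print(widen_i8x16_to_i32x4(v, 0, signed=False))
print(widen_i8x16_to_i32x4(v, 0, signed=True))
```

With the high/low naming, only two groups could be addressed; the immediate makes all four quarters reachable in one step.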

Use Cases

Withdrawn instructions

  • i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128
  • i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128
  • i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128
  • i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

Performance and Portability Considerations

The principal implementation is a shuffle/swizzle and shift for signed data and merely a shuffle/swizzle for unsigned data. Analysis of the efficacy of this proposal is described here and is demonstrated here for 8 to 32 bit and here for 8 to 64 bit.

There's a lot of room for compiler optimization depending on how the subsequent code operates. For instance, the primary advantage of the tbl approach (on ARM64) comes when a mask already exists and doesn't require a load from memory. In other cases, it may make more sense to go the ushll or sshll routes. Whether or not a benefit is achieved depends on the port utilization of the subsequent code and on how much out-of-order and instruction-level parallelism can be obtained.

This does not appear to be the case with x64 chips, which appear to gain a benefit so long as the number of shuffles is reduced. In such cases, if a compiler detects a load followed by a convert, it can immediately optimize it upstream with movzx**** or movsx**** directly to the target register. This should provide the maximum instruction-level parallelism and minimal port usage.

In any case where performance with this method does not exceed that of incremental conversions, the incremental conversion method may be used in its place. Similarly, any system or architecture that benefits from this conversion method over the incremental conversion method can use any of the masks described herein as if they were constants provided to the existing v128.swizzle operation.

Mapping To Common Instruction Sets

This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. However, these patterns are provided only for convenience. Compliant WebAssembly implementations do not have to follow the same code generation patterns.

Masks or Tables relevant to x64 and ARM Implementations

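    mask_i32x4_i8x16_u0 = [0,255,255,255, 1,255,255,255, 2,255,255,255, 3,255,255,255]
    mask_i32x4_i8x16_u1 = [4,255,255,255, 5,255,255,255, 6,255,255,255, 7,255,255,255]
    mask_i32x4_i8x16_u2 = [8,255,255,255, 9,255,255,255, 10,255,255,255, 11,255,255,255]
    mask_i32x4_i8x16_u3 = [12,255,255,255, 13,255,255,255, 14,255,255,255, 15,255,255,255]
    mask_i32x4_i8x16_s0 = [255,255,255,0, 255,255,255,1, 255,255,255,2, 255,255,255,3]
    mask_i32x4_i8x16_s1 = [255,255,255,4, 255,255,255,5, 255,255,255,6, 255,255,255,7]
    mask_i32x4_i8x16_s2 = [255,255,255,8, 255,255,255,9, 255,255,255,10, 255,255,255,11]
    mask_i32x4_i8x16_s3 = [255,255,255,12, 255,255,255,13, 255,255,255,14, 255,255,255,15]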

Withdrawn lowerings

    mask_i64x2_i8x16_u0 = [0,255,255,255,255,255,255,255, 1,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u1 = [2,255,255,255,255,255,255,255, 3,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u2 = [4,255,255,255,255,255,255,255, 5,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u3 = [6,255,255,255,255,255,255,255, 7,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u4 = [8,255,255,255,255,255,255,255, 9,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u5 = [10,255,255,255,255,255,255,255, 11,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u6 = [12,255,255,255,255,255,255,255, 13,255,255,255,255,255,255,255]
    mask_i64x2_i8x16_u7 = [14,255,255,255,255,255,255,255, 15,255,255,255,255,255,255,255]

    mask_i64x2_i8x16_s0 = [255,255,255,255,255,255,255,0, 255,255,255,255,255,255,255,1]
    mask_i64x2_i8x16_s1 = [255,255,255,255,255,255,255,2, 255,255,255,255,255,255,255,3]
    mask_i64x2_i8x16_s2 = [255,255,255,255,255,255,255,4, 255,255,255,255,255,255,255,5]
    mask_i64x2_i8x16_s3 = [255,255,255,255,255,255,255,6, 255,255,255,255,255,255,255,7]
    mask_i64x2_i8x16_s4 = [255,255,255,255,255,255,255,8, 255,255,255,255,255,255,255,9]
    mask_i64x2_i8x16_s5 = [255,255,255,255,255,255,255,10, 255,255,255,255,255,255,255,11]
    mask_i64x2_i8x16_s6 = [255,255,255,255,255,255,255,12, 255,255,255,255,255,255,255,13]
    mask_i64x2_i8x16_s7 = [255,255,255,255,255,255,255,14, 255,255,255,255,255,255,255,15]

    mask_i64x2_i16x8_u0 = [0,1,255,255,255,255,255,255, 2,3,255,255,255,255,255,255]
    mask_i64x2_i16x8_u1 = [4,5,255,255,255,255,255,255, 6,7,255,255,255,255,255,255]
    mask_i64x2_i16x8_u2 = [8,9,255,255,255,255,255,255, 10,11,255,255,255,255,255,255]
    mask_i64x2_i16x8_u3 = [12,13,255,255,255,255,255,255, 14,15,255,255,255,255,255,255]

    mask_i64x2_i16x8_s0 = [255,255,255,255,255,255,0,1, 255,255,255,255,255,255,2,3]
    mask_i64x2_i16x8_s1 = [255,255,255,255,255,255,4,5, 255,255,255,255,255,255,6,7]
    mask_i64x2_i16x8_s2 = [255,255,255,255,255,255,8,9, 255,255,255,255,255,255,10,11]
    mask_i64x2_i16x8_s3 = [255,255,255,255,255,255,12,13, 255,255,255,255,255,255,14,15]

    mask_i64x2_i8x16_condensed_s0 = [255,255,255,0,255,255,255,1,255,255,255,2,255,255,255,3]
    mask_i64x2_i8x16_condensed_s1 = [255,255,255,4,255,255,255,5,255,255,255,6,255,255,255,7]
    mask_i64x2_i8x16_condensed_s2 = [255,255,255,8,255,255,255,9,255,255,255,10,255,255,255,11]
    mask_i64x2_i8x16_condensed_s3 = [255,255,255,12,255,255,255,13,255,255,255,14,255,255,255,15]

    mask_i64x2_i16x8_condensed_s0 = [255,255,0,1,255,255,2,3,255,255,4,5,255,255,6,7]
    mask_i64x2_i16x8_condensed_s1 = [255,255,8,9,255,255,10,11,255,255,12,13,255,255,14,15]

255 can be replaced with 128 where necessary or reasonable.
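The mask tables above follow a regular pattern, so they can be generated instead of typed out. The sketch below is one way to do that in Python, together with a model of `v128.swizzle` (indices of 16 or more select zero). The helper names `make_mask` and `swizzle` are hypothetical; the layout (data bytes first for unsigned, data bytes last for signed, 255 as filler) is taken from the tables.

```python
def make_mask(src_bits: int, dst_bits: int, c: int, signed: bool) -> list[int]:
    """Build the 16-entry byte-selection mask for widening group c."""
    src_bytes, dst_bytes = src_bits // 8, dst_bits // 8
    lanes = 16 // dst_bytes                      # output lanes per 128-bit vector
    mask: list[int] = []
    for lane in range(lanes):
        first = (c * lanes + lane) * src_bytes   # first source byte for this lane
        data = list(range(first, first + src_bytes))
        fill = [255] * (dst_bytes - src_bytes)   # out-of-range index selects zero
        # Unsigned: data in the low bytes. Signed: data in the high bytes,
        # ready for an arithmetic shift right.
        mask += fill + data if signed else data + fill
    return mask


def swizzle(a: bytes, mask: list[int]) -> bytes:
    """Model of v128.swizzle: indices >= 16 produce a zero byte."""
    return bytes(a[i] if i < 16 else 0 for i in mask)


a = bytes(range(100, 116))
widened = swizzle(a, make_mask(8, 32, 1, signed=False))
print([int.from_bytes(widened[i : i + 4], "little") for i in range(0, 16, 4)])
```

For unsigned inputs the swizzle alone is the whole conversion; the signed masks only position the bytes and still need the shift.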

x86/x86-64 processors with AVX instruction set

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128

# When c=0
        vpmovzxbd xmm_out, xmm_a # for mask c=0
# When c=1
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_u1 
# When c=2
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_u2 
# When c=3
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_u3 

i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128

# When c=0
        vpmovsxbd xmm_out, xmm_a # for mask c=0
# When c=1
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_s1
        vpsrad   xmm_out, xmm_out, 24 
# When c=2
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_s2
        vpsrad   xmm_out, xmm_out, 24 
# When c=3
        vpshufb  xmm_out, xmm_a, mask_i32x4_i8x16_s3
        vpsrad   xmm_out, xmm_out, 24 
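To make the signed lowering concrete, the following Python sketch models the two steps used above for c=1..3: a byte shuffle that places each source byte in the top byte of a 32-bit lane, followed by an arithmetic shift right by 24. The `pshufb` and `psrad` functions are simplified models of the x86 instructions (register operands only), not bindings to them.

```python
def pshufb(a: bytes, mask: list[int]) -> bytes:
    """Model of PSHUFB: a selector with its top bit set yields zero,
    otherwise its low 4 bits index into the source."""
    return bytes(0 if m & 0x80 else a[m & 0x0F] for m in mask)


def psrad(v: bytes, shift: int) -> list[int]:
    """Model of PSRAD: arithmetic shift right on four signed 32-bit lanes."""
    lanes = [int.from_bytes(v[i : i + 4], "little", signed=True) for i in range(0, 16, 4)]
    return [x >> shift for x in lanes]  # Python's >> is arithmetic on negative ints


# Signed mask for c=1, as listed in the mask table.
mask_i32x4_i8x16_s1 = [255, 255, 255, 4, 255, 255, 255, 5,
                       255, 255, 255, 6, 255, 255, 255, 7]

a = bytes([0, 0, 0, 0, 0x80, 0xFF, 0x7F, 0x01, 0, 0, 0, 0, 0, 0, 0, 0])
print(psrad(pshufb(a, mask_i32x4_i8x16_s1), 24))  # [-128, -1, 127, 1]
```

The shift replicates the sign bit of the source byte across the upper 24 bits, which is why the signed masks put the data byte last in each lane.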

Withdrawn lowerings

i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128

# When c=0
        vpmovzxbq xmm_out, xmm_a
# When c=1..7
        vpshufb xmm_out, xmm_a, mask_i64x2_i8x16_u$c

i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128

# When c=0
        vpmovsxbq xmm_out, xmm_a
# When c=1
        vpsrad xmm_out, xmm_a, 16
        vpmovsxbq xmm_out, xmm_out
# When c=[1,3,5,7]
        vpshufb xmm_tmp, xmm_a, mask_i64x2_i8x16_condensed_s$(c-1)
        vpsrad xmm_out, xmm_tmp, 24
        vpsrad xmm_tmp, xmm_tmp, 31
        vpunpckhdq xmm_out, xmm_out, xmm_tmp
# When c=[2,4,6]
        vpshufb xmm_tmp, xmm_a, mask_i64x2_i8x16_condensed_s$(c)
        vpsrad xmm_out, xmm_tmp, 24
        vpsrad xmm_tmp, xmm_tmp, 31
        vpunpckldq xmm_out, xmm_out, xmm_tmp

i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128

# When c=0
        vpmovzxwq xmm_out, xmm_a
# When c=1..3
        vpshufb xmm_out, xmm_a, mask_i64x2_i16x8_u$c

i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

# When c=0
        vpmovsxwq xmm_out, xmm_a
# When c=[1,3]
        vpshufb xmm_tmp, xmm_a, mask_i64x2_i8x16_condensed_s$(c-1)
        vpsrad xmm_out, xmm_tmp, 16
        vpsrad xmm_tmp, xmm_tmp, 31
        vpunpckhdq xmm_out, xmm_out, xmm_tmp
# When c=[2]
        vpshufb xmm_tmp, xmm_a, mask_i64x2_i8x16_condensed_s$(c)
        vpsrad xmm_out, xmm_tmp, 16
        vpsrad xmm_tmp, xmm_tmp, 31
        vpunpckldq xmm_out, xmm_out, xmm_tmp
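The withdrawn 64-bit signed lowerings work around the lack of a 64-bit arithmetic shift by building each result from two 32-bit halves: one shift by 24 produces the value, one shift by 31 produces a word of sign bits, and an unpack interleaves them. A minimal Python sketch of that idea for two source bytes follows. It assumes the value dword lands in the low half and the sign dword in the high half of each 64-bit lane; the function name is hypothetical.

```python
def sign_extend_8_to_64_pair(b0: int, b1: int) -> list[int]:
    """Model of shuffle + psrad 24 + psrad 31 + punpck{l,h}dq for two bytes."""
    def as_i32(x: int) -> int:
        # Reinterpret an unsigned 32-bit pattern as a signed value.
        return x - (1 << 32) if x & 0x80000000 else x

    # After the condensed shuffle, each byte sits in the top byte of a dword.
    lanes = [b0 << 24, b1 << 24]
    value = [as_i32(x) >> 24 for x in lanes]   # psrad 24: sign-extended 32-bit value
    sign = [as_i32(x) >> 31 for x in lanes]    # psrad 31: all ones or all zeros

    out = []
    for v, s in zip(value, sign):
        # Unpack: low dword = value, high dword = sign word (assumed order).
        raw = (v & 0xFFFFFFFF) | ((s & 0xFFFFFFFF) << 32)
        out.append(raw - (1 << 64) if raw >> 63 else raw)
    return out


print(sign_extend_8_to_64_pair(0x80, 0x7F))  # [-128, 127]
```

Because one shuffle feeds both the low and high unpack, two adjacent values of c can share the shuffle and both shifts, which is what the "condensed" masks are for.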

x86/x86-64 processors with SSE4 instruction set

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128

        movdqa   xmm_out, xmm_a
# when c=0
       pmovzxbd  xmm_out, xmm_out
# when c=1..3
        pshufb  xmm_out, mask_i32x4_i8x16_u$c

i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128

        movdqa   xmm_out, xmm_a
# when c=0
       pmovsxbd  xmm_out, xmm_out
# when c=1..3
        pshufb  xmm_out, mask_i32x4_i8x16_s$c
        psrad   xmm_out, 24

Withdrawn lowerings

i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128

        movdqa xmm_out, xmm_a
# when c=0
        pmovzxbq xmm_out, xmm_out
# when c=1..7
        pshufb xmm_out, mask_i64x2_i8x16_u$c

i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128

# Option 1 (best performance in most cases)
        pmovsxbq xmm_out, mem_argument(base + 2*c)
# Option 2
# when c=0
        pmovsxbq xmm_out, xmm_a
# when c=1
        movdqa xmm_out, xmm_a
        psrld xmm_out, 16
        pmovsxwq xmm_out, xmm_a
# When c=[1,3,5,7]
        movdqa xmm_tmp, xmm_a
        vpshufb xmm_tmp, mask_i64x2_i8x16_condensed_s$(c-1)
        movdqa xmm_out, xmm_tmp
        vpsrad xmm_out, xmm_tmp, 24
        vpsrad xmm_tmp, xmm_tmp, 31
        vpunpckhdq xmm_out, xmm_out, xmm_tmp
# When c=[2,4,6]
        movdqa xmm_tmp, xmm_a
        vpshufb xmm_tmp, mask_i64x2_i8x16_condensed_s$(c-1)
        movdqa xmm_out, xmm_tmp
        vpsrad xmm_out, xmm_tmp, 24
        vpsrad xmm_tmp, xmm_tmp, 31
        vpunpckldq xmm_out, xmm_out, xmm_tmp
# Option 3 Spill and Load
# This may provide better performance than Option 2 if you're iterating through the whole register
# and you can't optimize for reuse of the original shuffle -- punpck{l,h}dq
        movdqa xmmword ptr[rsp+XXXX], xmm_a
        pmovsxwq xmm_out, dword ptr[rsp+XXXX+2*c]

i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128

        movdqa xmm_out, xmm_a
        pshufb xmm_out, mask_i64x2_i16x8_u$c

i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

# Option 1 (best performance in most cases)
        pmovsxwq xmm_out, mem_argument(base + 2*c)
# Option 2
# When c=0
        pmovsxwq xmm_out, xmm_a
# When c=[1,3]
        movdqa xmm_tmp, xmm_a
        vpshufb xmm_tmp, mask_i64x2_i16x8_condensed_s$(c-1)
        movdqa xmm_out, xmm_tmp
        vpsrad xmm_out, xmm_tmp, 16
        vpsrad xmm_tmp, xmm_tmp, 31
        vpunpckhdq xmm_out, xmm_out, xmm_tmp
# When c=[2]
        movdqa xmm_tmp, xmm_a
        vpshufb xmm_tmp, mask_i64x2_i16x8_condensed_s$(c)
        movdqa xmm_out, xmm_tmp
        vpsrad xmm_out, xmm_tmp, 16
        vpsrad xmm_tmp, xmm_tmp, 31
        vpunpckldq xmm_out, xmm_out, xmm_tmp
# when c=1..3 (with just pure random access and no other conversions needed)
        movdqa xmm_out, xmm_a
        psrldq xmm_out, c*4
        pmovsxwq xmm_out, xmm_a
# Option 3 Spill and Load
# This may provide better performance than Option 2 if you're iterating through the whole register
# and you can't optimize for reuse of the original shuffle -- punpck{l,h}dq
        movdqa xmmword ptr[rsp+XXXX], xmm_a
        pmovsxwq xmm_out, dword ptr[rsp+XXXX+2*c]

x86/x86-64 processors with SSE2 instruction set

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128

# all cases
       pxor         xmm_tmp, xmm_tmp
       movdqa   xmm_out, xmm_a
# case c=0
        punpcklbw       xmm_out, xmm_tmp        
        punpcklwd       xmm_out, xmm_tmp              
 # case c=1
        punpcklbw       xmm_out, xmm_tmp        
        punpckhwd       xmm_out, xmm_tmp 
# case c=2
        punpckhbw       xmm_out, xmm_tmp        
        punpcklwd       xmm_out, xmm_tmp        
# case c=3
        punpckhbw       xmm_out, xmm_tmp        
        punpckhwd       xmm_out, xmm_tmp       

i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128

# all cases
# xmm_out can be uninitialized since we're discarding the values anyway
# case c=0
        punpcklbw       xmm_out, xmm_a         
        punpcklwd       xmm_out, xmm_out
        psrad               xmm_out, 24
 # case c=1
        punpcklbw       xmm_out, xmm_a         
        punpckhwd       xmm_out, xmm_out
        psrad               xmm_out, 24
# case c=2
        punpckhbw       xmm_out, xmm_a         
        punpcklwd       xmm_out, xmm_out
        psrad               xmm_out, 24
# case c=3
        punpckhbw       xmm_out, xmm_a         
        punpckhwd       xmm_out, xmm_out
        psrad                xmm_out, 24
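The SSE2 sequences above rely only on byte and word interleaves. The Python sketch below models both variants so the choice of low/high unpack per value of c can be checked: the unsigned path interleaves with zero, and the signed path interleaves the source into the high byte of each word, duplicates the words, and shifts right arithmetically by 24. The helper names are hypothetical, and `junk` stands in for the uninitialized register contents mentioned in the comments.

```python
def punpck_bw(dst: bytes, src: bytes, high: bool) -> bytes:
    """Model of PUNPCKLBW/PUNPCKHBW: interleave 8 bytes of dst with 8 bytes of src."""
    off = 8 if high else 0
    return bytes(x for i in range(8) for x in (dst[off + i], src[off + i]))


def punpck_wd(dst: bytes, src: bytes, high: bool) -> bytes:
    """Model of PUNPCKLWD/PUNPCKHWD: interleave 4 words of dst with 4 words of src."""
    off = 8 if high else 0
    out = b""
    for i in range(0, 8, 2):
        out += dst[off + i : off + i + 2] + src[off + i : off + i + 2]
    return out


def lanes_i32(v: bytes) -> list[int]:
    return [int.from_bytes(v[i : i + 4], "little", signed=True) for i in range(0, 16, 4)]


def widen_u_sse2(a: bytes, c: int) -> list[int]:
    zero = bytes(16)                                # pxor xmm_tmp, xmm_tmp
    out = punpck_bw(a, zero, high=c >= 2)           # bytes -> zero-extended words
    out = punpck_wd(out, zero, high=c % 2 == 1)     # words -> zero-extended dwords
    return lanes_i32(out)


def widen_s_sse2(a: bytes, c: int, junk: bytes = bytes([0xAA] * 16)) -> list[int]:
    out = punpck_bw(junk, a, high=c >= 2)           # source byte becomes the high byte of each word
    out = punpck_wd(out, out, high=c % 2 == 1)      # duplicate words: source byte is the top byte
    return [x >> 24 for x in lanes_i32(out)]        # psrad 24 discards the junk bytes


a = bytes([1, 0x80, 0xFF, 0x7F, 5, 6, 7, 0x90, 9, 10, 11, 12, 0xFE, 14, 15, 16])
print(widen_u_sse2(a, 1))
print(widen_s_sse2(a, 3))
```

The signed path never reads the initial contents of the output register in a way that survives the final shift, which is why the original comment allows it to be uninitialized.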

Withdrawn lowerings

i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128

# all cases
        movdqa xmm_out, xmm_a
        pxor xmm_tmp, xmm_tmp
# case c=0
        punpcklbw xmm_out, xmm_tmp
        punpcklwd xmm_out, xmm_tmp
        punpckldq xmm_out, xmm_tmp
# case c=1
        punpcklbw xmm_out, xmm_tmp
        punpcklwd xmm_out, xmm_tmp
        punpckhdq xmm_out, xmm_tmp
# case c=2
        punpcklbw xmm_out, xmm_tmp
        punpckhwd xmm_out, xmm_tmp
        punpckldq xmm_out, xmm_tmp
# case c=3
        punpcklbw xmm_out, xmm_tmp
        punpckhwd xmm_out, xmm_tmp
        punpckhdq xmm_out, xmm_tmp
# case c=4..7
# repeat c=0..3 with punpckhbw instead of punpcklbw

i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128

# LLVM-MCA seems to suggest a spill for these cases if the chip really only supports SSE2
        movaps xmmword ptr [safe_memory_location], xmm_a
        movsx r8, byte ptr [safe_memory_location+c*2] # use which ever 64 bit register makes sense
        movsx rcx, byte ptr [safe_memory_location+c*2+1] # use which ever 64 bit register makes sense
        movq xmm_tmp, rcx
        movq xmm_out, r8
        punpcklqdq xmm_out, xmm_tmp

i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128

# all cases
        movdqa xmm_out, xmm_a
        pxor xmm_tmp, xmm_tmp
# case c=0
        punpcklwd xmm_out, xmm_tmp
        punpckldq xmm_out, xmm_tmp
# case c=1
        punpcklwd xmm_out, xmm_tmp
        punpckhdq xmm_out, xmm_tmp
# case c=2
        punpckhwd xmm_out, xmm_tmp
        punpckldq xmm_out, xmm_tmp
# case c=3
        punpckhwd xmm_out, xmm_tmp
        punpckhdq xmm_out, xmm_tmp

i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

# case c=0
        punpcklwd xmm_out, xmm_a
        movdqa xmm_high, xmm_out
        psrad xmm_out, 16
        psrad xmm_high, 31
        punpckldq xmm_out, xmm_high
# case c=1
        punpcklwd xmm_out, xmm_a
        movdqa xmm_high, xmm_out
        psrad xmm_out, 16
        psrad xmm_high, 31
        punpckhdq xmm_out, xmm_high
# case c=2
        punpckhwd xmm_out, xmm_a
        movdqa xmm_high, xmm_out
        psrad xmm_out, 16
        psrad xmm_high, 31
        punpckldq xmm_out, xmm_high
# case c=3
        punpckhwd xmm_out, xmm_a
        movdqa xmm_high, xmm_out
        psrad xmm_out, 16
        psrad xmm_high, 31
        punpckhdq xmm_out, xmm_high

on ARM64

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, mask_i32x4_i8x16_u$c.16B
### Option 2
        ushll{2}    vOut.8H,  vA.{8B,16B}, #0
        ushll{2}    vOut.4S,  vOut.{4H,8H}, #0

i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, mask_i32x4_i8x16_s$c.16B
        sshr    vOut.4S, vOut.4S, #24
### Option 2
        sshll{2}    vOut.8H, vA.{8B,16B}, #0
        sshll{2}    vOut.4S,  vOut.{4H,8H}, #0

Withdrawn lowerings

i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, vMask_i64x2_i8x16_u$c.16B
### Option 2
        ushll{2} vOut.8H, vA.{8B,16B}, #0
        ushll{2} vOut.4S, vOut.{4H,8H}, #0
        ushll{2} vOut.2D, vOut.{2S,4S}, #0

i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, vMask_i64x2_i8x16_s$c.16B
        sshr vOut.2D, vOut.2D, #56
### Option 2
        sshll{2} vOut.8H, vA.{8B,16B}, #0
        sshll{2} vOut.4S, vOut.{4H,8H}, #0
        sshll{2} vOut.2D, vOut.{2S,4S}, #0

i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, vMask_i64x2_i16x8_u$c.16B
### Option 2
        ushll{2} vOut.4S, vA.{4H,8H}, #0
        ushll{2} vOut.2D, vOut.{2S,4S}, #0

i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

### Option 1
        tbl vOut.16B, { vA.16B }, vMask_i64x2_i16x8_s$c.16B
        sshr vOut.2D, vOut.2D, #48
### Option 2
        sshll{2} vOut.4S, vA.{4H,8H}, #0
        sshll{2} vOut.2D, vOut.{2S,4S}, #0

on ARMv7 with NEON

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128

# first lower 64 of mask and input vector
# assuming dLow/DHigh correspond to a Q
        tbl dOutLow, { dALow } (mask_i32x4_i8x16_u$c & 0xffffffffffffffff)
# second upper 64 of mask and input vector
        tbl dOutHigh, { dAHigh } (mask_i32x4_i8x16_u$c >> 64)

i32x4.widen_i8x16_s(v128: a, ImmLaneIdx4: c) -> v128

# assuming dLow/DHigh correspond to a Q
# lower 64
        tbl dOutLow, { dALow } (mask_i32x4_i8x16_s$c & 0xffffffffffffffff)
# upper 64
        tbl dOutHigh, { dAHigh } (mask_i32x4_i8x16_s$c >> 64)
        vshr.s32        qOut, qOut, #24

Withdrawn lowerings

i64x2.widen_i8x16_u(v128: a, ImmLaneIdx8: c) -> v128

# assuming dLow/DHigh correspond to a Q
# lower 64
        tbl dOutLow, { dALow } (mask_i64x2_i8x16_u$c & 0xffffffffffffffff)
# upper 64
        tbl dOutHigh, { dAHigh } (mask_i64x2_i8x16_u$c >> 64)

i64x2.widen_i8x16_s(v128: a, ImmLaneIdx8: c) -> v128

# assuming dLow/DHigh correspond to a Q
# lower 64
        tbl dOutLow, { dALow } (mask_i64x2_i8x16_s$c & 0xffffffffffffffff)
# upper 64
        tbl dOutHigh, { dAHigh } (mask_i64x2_i8x16_s$c >> 64)
        vshr.s64 qOut, qOut, #56

i64x2.widen_i16x8_u(v128: a, ImmLaneIdx4: c) -> v128

# assuming dLow/DHigh correspond to a Q
# lower 64
        tbl dOutLow, { dALow } (mask_i64x2_i16x8_u$c & 0xffffffffffffffff)
# upper 64
        tbl dOutHigh, { dAHigh } (mask_i64x2_i16x8_u$c >> 64)

i64x2.widen_i16x8_s(v128: a, ImmLaneIdx4: c) -> v128

# assuming dLow/DHigh correspond to a Q
# lower 64
        tbl dOutLow, { dALow } (mask_i64x2_i16x8_s$c & 0xffffffffffffffff)
# upper 64
        tbl dOutHigh, { dAHigh } (mask_i64x2_i16x8_s$c >> 64)
        vshr.s64 qOut, qOut, #48

omnisip commented 3 years ago

Extra notes about performance and implementation (left for posterity)

ARM64 with int64

On ARM: According to llvm-mca, tbl/sshr should have identical performance to two sshll instructions, since they use the exact same ports with the same latency and the same number of instructions. This suggests there's a potential benefit for the 8 to 64-bit case with signed integers.

Signed Data on x64 without SSE4

On architectures that don't support SSE4, it can make sense to spill the vector to memory, load the values into individual registers, move them back to vectors, and unpack. Since machines lacking SSE4 seem to be such an edge case, this should provide reasonably good fallback behavior. Example:

        movaps xmmword ptr [rsp - 128], xmm0
        movsx r8, byte ptr [rsp - 128]
        movsx rcx, byte ptr [rsp - 127]
        movq xmm0, rcx
        movq xmm1, r8
        punpcklqdq xmm1, xmm0 # xmm1 = xmm1[0],xmm0[0]

omnisip commented 3 years ago

Updated the assembly above to provide comments on spill and load options as well for using pmovsxbq since x64 lacks psraq without AVX512.

omnisip commented 3 years ago

And another option which has the potential to double the signed 64bit output depending on how it's used:

        vpshufb xmm0, xmm0, xmmword ptr [rip + .LCPI9_7] # xmm0 = zero,zero,zero,xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,zero,zero,zero,zero,zero
        vpsrad  xmm1, xmm0, 24
        vpsrad  xmm0, xmm0, 31
        vpunpckldq      xmm0, xmm0, xmm1 
omnisip commented 3 years ago

The instruction set checks for ARMv7 with NEON are done. The method seems to port nicely.

Example from Godbolt:

        vtbl.8  d3, {d1}, d16
        vtbl.8  d2, {d0}, d17
        vshr.s64        q0, q1, #56

Will update the rest of the documentation for this PR later today. Should cover ARM64, ARMv7+Neon, x64/SSE4 (including SSSE3), AVX, and SSE2.

Updated: This is done.

omnisip commented 3 years ago

hey @ngzhian,

On our call @Maratyszcza had a question about v8 preserving constants with respect to this proposal. This behavior appears to exist on x64 for any constant parameter, and v8 will pregenerate them at the beginning of the code block. Will this also apply for ARM7 and ARM64? The biggest benefit with respect to this proposal is making sure that the masks that are used are only loaded once.

ngzhian commented 3 years ago

These constant masks will remain in registers and be reused as long as they are not spilled. Same for ARM7 and ARM64.

omnisip commented 3 years ago

These constant masks will remain in registers and be reused as long as they are not spilled. Same for ARM7 and ARM64.

That's awesome and should make this really efficient.

omnisip commented 3 years ago

@ngzhian @Maratyszcza

It turns out that the TBL approach with SSHR can be more efficient than SSHLL when the algorithm is adjusted, such that SSHR is only called once. According to the ARM Cortex-A76 software optimization guide shift operations can only occur in one instruction per cycle, but TBL operations with two table vectors can occur twice per cycle. Whether Cortex-A76 is an accurate testbed is to be decided -- however -- it gave me an idea for a new implementation that uses fewer instructions for signed conversion and leverages the performances of TBL for these integer conversions. The biggest difference in the implementation is the mask that's required for signed conversion.

Here's a Godbolt example.

ngzhian commented 3 years ago

Thanks for your suggestion and the detailed implementation guide. Couple of notes:

omnisip commented 3 years ago


First and foremost, thanks for looking at this. I know this is a doozy of a proposal. There are 24 variants masquerading as 6 instructions even if most of them are masks. I'm going to take your questions a bit out of order, so you can understand how this came to be, and what the benefits will be.

  • Which instruction do you think will see the greatest speed up from having a dedicated instruction, rather than composing existing ones?

As it stands today, every integer conversion requires stepwise conversion in the WASM SIMD instruction set. Thus the initial premise reduces the minimum number of instructions needed to go from 8 to 64 bits for 8 results from 14 WASM SIMD instructions to 8. For 8 to 32, it's 4 instead of 6. This proposal can neatly do that for unsigned values with PSHUFB/TBL equivalents, assuming masks are present. For signed data types, the underlying implementation is equally efficient on x86/x64 even though there are more instructions, by virtue of completely different port usage. And, if there's even a remote possibility that the ARM support can be implemented like this for signed data types, ARM will receive all of the same benefits as well. While all of those cases get clear direct benefits when the data is already in vectors, the largest benefits come from operations that read directly from memory. V8 leverages this functionality for x64 and can do an in-flight LoadTransform (see here) for single-step integer type conversion, but can't do it for multi-step conversion. With these new instructions, load transformation could apply universally for x86/x64 without any interaction from the programmer and without the need for masks, while still giving a very performant solution for ARM.
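The instruction counts quoted above (14 versus 8, and 6 versus 4) can be reproduced with a small calculation. The sketch below assumes that each stepwise widening stage needs one low and one high instruction per input vector, and that the proposed instructions need one per output vector; the function names are hypothetical.

```python
def stepwise_ops(src_bits: int, dst_bits: int) -> int:
    """Count widen_low/widen_high instructions to fully widen one input vector."""
    ops, vectors, bits = 0, 1, src_bits
    while bits < dst_bits:
        vectors *= 2      # every vector splits into a low and a high result
        ops += vectors    # one instruction per result at this stage
        bits *= 2
    return ops


def direct_ops(src_bits: int, dst_bits: int) -> int:
    """Count single-step widen instructions: one per output vector."""
    return dst_bits // src_bits


print(stepwise_ops(8, 64), direct_ops(8, 64))  # 14 8
print(stepwise_ops(8, 32), direct_ops(8, 32))  # 6 4
```

The gap grows with the widening ratio because the stepwise path pays for every intermediate vector, not just the final ones.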

  • These instructions (especially the 8->32 unsigned ones), look a lot like swizzle/shuffles (the signed ones need another arithmetic shift right). V8 has support for shuffles and pattern matching shuffle immediates.

This is correct (mostly) with a couple of caveats. It doesn't take advantage of any of the underlying LoadTransform stuff listed above, and it has to deal with the less than efficient swizzle implementation that doesn't recognize that the input parameters themselves are constant. If we can come up with an optimization like proposed in #403, swizzle wouldn't be a bad option, but it'll never be as good as a load and shuffle or the loadtransform above.

  • A lot of the masks will not be easy to generate, it will likely end up like eor x, y + replace lanes as we don't have load constants from memory.

I have some ideas on how to make loading memory constants work nicely inside the current architecture of v8 with minimal changes to the code. I just need some time to flesh them out a bit. For runtime generation, there's a bunch of ways to do it that are better than individual inserts. If you need some samples, please let me know. Even the insert strategy isn't so bad as long as the masks are only generated once and reused by subsequent calls.

  • Can you add more specific links to projects that will benefit from this instruction? Linking to specific snippet of code will make it more clear.

Yes. I'll update this thread with some examples when I have a minute.

omnisip commented 3 years ago
penzn commented 3 years ago

Are any of those compiling to wasm or on the way to compiling?

omnisip commented 3 years ago

Are any of those compiling to wasm or on the way to compiling?

Yes sir. Simdpp (header-only) is up first. @tlively is there a preprocessor macro to detect an emscripten / wasm implementation?

tlively commented 3 years ago

Yep, you can check for the __wasm_simd128__ macro.

omnisip commented 3 years ago


This looks ideal for the constants needed for this proposal on x64 -- https://source.chromium.org/chromium/chromium/src/+/master:v8/src/codegen/external-reference.cc;drc=8b5f6ef28dd93e62fc1a75bc7a812af1b33777ec;bpv=1;bpt=1;l=479?gsn=address_of_double_neg_constant&gs=kythe%3A%2F%2Fchromium.googlesource.com%2Fchromium%2Fsrc%3Flang%3Dc%252B%252B%3Fpath%3Dsrc%2Fv8%2Fsrc%2Fcodegen%2Fexternal-reference.cc%236229zYZpWqH4shLqc-Pfle4euv19xiK0cKdZt79NW6k&gs=kythe%3A%2F%2Fchromium.googlesource.com%2Fchromium%2Fsrc%3Flang%3Dc%252B%252B%3Fpath%3Dsrc%2Fv8%2Fsrc%2Fcodegen%2Fexternal-reference.h%239x9xoVRhgNiMm6TjRisDaj9Z7o4x-2L0d9zXYL5Quj8

I'm assuming there's a similar mechanism for ARM?

ngzhian commented 3 years ago

Yup external references (the link you sent) are arch-independent.

ngzhian commented 3 years ago

@omnisip you mentioned you will get some numbers if this is prototyped. Which instructions are you planning to make use of, and for which architecture? This is a lot to prototype.

ngzhian commented 3 years ago

Also, simdpp is a SIMD header library; I wouldn't consider it a use case according to our inclusion criteria (since as a library it necessarily includes more instructions). AOM and Xiph use 8->32 and 16->32 AFAICT; I did not find any X->64 usages there.

omnisip commented 3 years ago

@omnisip you mentioned you will get some numbers if this is prototyped. Which instructions are you planning to make use of, and for which architecture? This is a lot to prototype.

The most interesting instructions to me are the 8 to 32s. I added the 64 bit variants for orthogonality. The 8 to 32 cases for unsigned stands out most since on x64 I have to use swizzle four times yielding at least 8 shuffles, 4 movs and 4 adds. With the shuffle method assuming I'm using a second vector with zeros, it doesn't look much better.

I have a prefix sum / scan calculation that leverages quite a bit of this with simdpp even if it's not posted yet. This will be a WASM first library for ssim calculation.

ngzhian commented 3 years ago

Before any further action, I would like to see more support for this set of instructions, e.g. community members saying that this is useful for them. It would also be better if existing use cases, rather than new developments, could immediately benefit from this set of instructions. For the reasons above, I suggest we mark this set of instructions as post-MVP and focus on locking down our instruction set.

omnisip commented 3 years ago

@ngzhian -- Please see the meeting notes from 11/13/2020 where this was discussed in detail. Specifically, this proposal is necessary because the conversions on x64 from 8 to 32 bit are difficult and expensive to perform with our existing instruction set. No option exists without at least two shuffle ops for any conversion, and all of the widen high variants require at least 2 (alignr/psrlq + pmovsx...). For every conversion from 8 to 32 on x64, it takes a minimum of 3 shuffle ops to get from 1x8x16 to 2x16x8, then another 6 to go from 2x16x8 to 4x32x4 yielding 9 instructions and 9 shuffle ops.

The other options -- swizzle and shuffle -- are worse since no pattern matches will occur for these. If that weren't problematic enough, it gets really messy with signed conversion cases, where you end up with a shuffle like this: shuffle(0,16,16,16,1,17,17,17,2,18,18,18,3,19,19,19); (the second vector would be the result of determining if the first vector was less than 0). This turns out to be okay on ARM, where TBL can span 2 vectors and perform that in 1 op -- but it's lousy on x64.
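For readers who want to see where the signed shuffle pattern comes from, here is a small Python sketch of the two-vector approach described above. It models `i8x16.shuffle` over the concatenation of two vectors, with the second vector holding the per-byte result of a signed less-than-zero comparison. The function names are hypothetical.

```python
def shuffle(a: bytes, b: bytes, idx: list[int]) -> bytes:
    """Model of i8x16.shuffle: indices 0..15 pick from a, 16..31 pick from b."""
    both = a + b
    return bytes(both[i] for i in idx)


def signed_widen_indices(c: int) -> list[int]:
    """Shuffle immediate for sign-extending bytes 4c..4c+3 to 32 bits."""
    idx: list[int] = []
    for lane in range(4):
        src = 4 * c + lane
        # One data byte followed by three copies of its sign byte.
        idx += [src, 16 + src, 16 + src, 16 + src]
    return idx


def widen_s_via_shuffle(a: bytes, c: int) -> list[int]:
    # Models i8x16.lt_s(a, 0): 0xFF where the byte is negative, else 0x00.
    sign = bytes(0xFF if x >= 0x80 else 0x00 for x in a)
    v = shuffle(a, sign, signed_widen_indices(c))
    return [int.from_bytes(v[i : i + 4], "little", signed=True) for i in range(0, 16, 4)]


print(signed_widen_indices(0))
a = bytes([1, 0x80, 0xFF, 0x7F] + [0] * 12)
print(widen_s_via_shuffle(a, 0))  # [1, -128, -1, 127]
```

This costs a compare plus a two-source shuffle per output vector, which is the overhead the dedicated signed instructions are meant to avoid on x64.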

All of that said -- ARM's performance improvement should be as good as the performance improvement for x64 on all of the unsigned cases today. It'll be even better once the proposal for lifting reused constant intermediates is finished. This turns the signed cases into a net 5 instruction solution -- instead of 8.

Here are some extra use cases that show how these are used elsewhere: https://github.com/dkfrankandersen/ITU_ResearcProject_Scann/blob/eaba125ccbaa78a6a21bcb7400c9a10321d5a6cf/scann/scann/distance_measures/one_to_one/dot_product_sse4.cc#L180


ngzhian commented 3 years ago

I looked at the meetings notes, main takeaways:

It'll be even better once the proposal for lifting reused constant intermediates is finished.

this is going to take a while, until then we have to live with the performance cliffs

For every conversion from 8 to 32 on x64, it takes a minimum of 3 shuffle ops to get from 1x8x16 to 2x16x8, then another 6 to go from 2x16x8 to 4x32x4 yielding 9 instructions and 9 shuffle ops.

I don't understand this part, as above:

i32x4.widen_i8x16_u(v128: a, ImmLaneIdx4: c) -> v128
        movdqa   xmm_out, xmm_a
# when c=0
       pmovzxbd  xmm_out, xmm_out
# when c=1..3
        pshufb  xmm_out, mask_i32x4_i8x16_u$c

Is a single shuffle. How are you getting 9 instructions?

If we were to ignore X->64 for a second, all the 8->32 instructions look like convenience wrappers or groupings around instructions we already have.

My main point is that our existing use cases don't benefit from this set of instructions (especially ->64). Pushing this to post-MVP will help reduce the surface area we need to work on to get to Phase 4, which makes SIMD more useful because we get it closer to the hands of all users.

omnisip commented 3 years ago

For every conversion from 8 to 32 on x64, it takes a minimum of 3 shuffle ops to get from 1x8x16 to 2x16x8, then another 6 to go from 2x16x8 to 4x32x4 yielding 9 instructions and 9 shuffle ops.

The 9 instructions and 9 shuffles is what it takes without these proposed instructions.

omnisip commented 3 years ago

If you prototype these for me, you can ditch the 64 bit ones. This was drafted to be fully complete by the submission deadline.

That leaves only two instructions. With the external reference support in v8 making it possible to do aligned loads, we can (and probably should) implement these with memory arguments. The performance should be excellent and it'll provide good support for 8 bit to float conversions which is often a subsequent step for these.

@Maratyszcza are there any outstanding proposals that would justify keeping the 16 to 64 variants? What would stand out would be something that allowed conversion of i64s to doubles.

Maratyszcza commented 3 years ago

i64x2->f64x2 conversion is not supported on x86 until AVX512, so it is not in WAsm SIMD.

ngzhian commented 3 years ago

The 9 instructions and 9 shuffles is what it takes without these proposed instructions.

Instead of the stepwise conversion, you can emit the single pshufb you need to get from 8x16 -> 32x4. Does that not work?

omnisip commented 3 years ago

The 9 instructions and 9 shuffles is what it takes without these proposed instructions.

Instead of the stepwise conversion, you can emit the single pshufb you need to get from 8x16 -> 32x4. Does that not work?

How? The underlying implementations with swizzle and shuffle for this end up regenerating the constants or intermediate constants each time and perform more than one shuffle plus an add and/or a blend.

ngzhian commented 3 years ago

How? The underlying implementations with swizzle and shuffle for this end up regenerating the constants or intermediate constants each time and perform more than one shuffle plus an add and/or a blend.

I see what you mean. Though in this case it's not as bad as 9 instructions. Swizzle would be 4 (I can probably get it down to 3 by loading the mask we are adding from memory), and shuffle will also be about 4, with the immediates being regenerated.

Wasn't there a feature request about checking swizzle's input for v128.const, and eliding the adds for that? That would bring it down to a single instruction for your use case right?

omnisip commented 3 years ago

The 9 instructions was for a full set of 1x8x16 to 4x32x4 stepwise using the regular conversion operators. It's a lot better than 3-4 per swizzle/shuffle which is 12 or 16.

Not sure about the feature request for swizzle with v128.const. I looked at implementing something like that once upon a time, but found that it wouldn't conform with the spec's definition for shuffle or swizzle.

ngzhian commented 3 years ago

full set of 1x8x16 to 4x32x4 stepwise

Oh, a full set. I missed that, sorry.

Not sure about the feature request for swizzle with v128.const. I looked at implementing something like that once upon a time, but found that it wouldn't conform with the spec's definition for shuffle or swizzle.

We can probably discuss this more; it shouldn't violate the spec definition. For swizzle(v128, v128.const), we can pattern match on the mask being a v128.const and look at the underlying u8 values: if they are all either within bounds or have the top bit set, then we don't need to emit the add.
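The check described above can be sketched in a few lines. This is only an illustration, not V8's code: the function name `can_elide_swizzle_fixup` is hypothetical. It relies on two facts: Wasm `i8x16.swizzle` returns 0 for any index of 16 or more, while x86 `pshufb` returns 0 only when the top bit of the index byte is set and otherwise uses the low 4 bits. A constant mask whose bytes are all below 16 or at least 0x80 therefore behaves identically under both definitions.

```python
def can_elide_swizzle_fixup(mask: bytes) -> bool:
    """Return True if a constant swizzle mask can be handed to pshufb unchanged.

    Hypothetical helper: every byte must be an in-bounds index (< 16) or
    already have its top bit set (>= 0x80), so pshufb zeroes the same lanes
    that Wasm swizzle would.
    """
    if len(mask) != 16:
        raise ValueError("a v128 mask has exactly 16 bytes")
    return all(b < 16 or b >= 0x80 for b in mask)


# Zero-extension mask for the first four bytes: index, then three "zero" bytes.
zero_extend_mask = bytes(
    [0, 0x80, 0x80, 0x80, 1, 0x80, 0x80, 0x80,
     2, 0x80, 0x80, 0x80, 3, 0x80, 0x80, 0x80]
)

# Index 16 is out of bounds for Wasm but pshufb would read lane 0 instead.
needs_fixup_mask = bytes([16] * 16)
```

With masks like `zero_extend_mask`, the extra add could be skipped; `needs_fixup_mask` shows a case where it could not.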

omnisip commented 3 years ago

The biggest challenge with swizzle/shuffle optimization is that it wouldn't cover any of the signed cases. It is probably worth optimizing them for the unsigned cases.
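To illustrate why the signed cases need more than a swizzle, here is a small emulation of the approach described in the proposal (swizzle only for unsigned, swizzle plus shift for signed). It is a sketch under assumptions: the helper names and the quarter selector `c` (0 to 3) are illustrative and are not taken from any engine's implementation.

```python
def swizzle(a: bytes, s: bytes) -> bytes:
    """Emulate Wasm i8x16.swizzle: out-of-range indices select 0."""
    return bytes(a[i] if i < 16 else 0 for i in s)


def widen_mask(c: int, byte_pos: int) -> bytes:
    """Mask moving source byte 4*c + k into byte `byte_pos` of 32-bit lane k."""
    mask = bytearray([0x80] * 16)  # 0x80 selects zero
    for k in range(4):
        mask[4 * k + byte_pos] = 4 * c + k
    return bytes(mask)


def widen_u(a: bytes, c: int) -> list:
    """Unsigned 8 -> 32: one swizzle puts each byte in the low byte of a lane."""
    out = swizzle(a, widen_mask(c, 0))
    return [int.from_bytes(out[4 * k:4 * k + 4], "little") for k in range(4)]


def widen_s(a: bytes, c: int) -> list:
    """Signed 8 -> 32: swizzle into the top byte, then arithmetic shift by 24."""
    out = swizzle(a, widen_mask(c, 3))
    lanes = [int.from_bytes(out[4 * k:4 * k + 4], "little", signed=True)
             for k in range(4)]
    return [lane >> 24 for lane in lanes]  # Python's >> is arithmetic


sample = bytes([0xFF, 0x01, 0x80, 0x7F] + [0] * 12)
```

For `sample`, the unsigned path yields 255, 1, 128, 127 and the signed path yields -1, 1, -128, 127. The extra shift in the signed path is the part a constant-mask swizzle optimization alone would not cover.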

Is there any risk to prototyping just the two instructions? If you look at the agenda ticket (#410), it looks like a lot of people wanted to see the benchmarks for this in today's meeting.

ngzhian commented 3 years ago

I'm starting to prototype i32x4.widen_i8x16_s (0xfd67) and i32x4.widen_i8x16_u (0xfd68). @tlively maybe we can get the tools prototyping started too?

ngzhian commented 3 years ago

https://crrev.com/c/2617389 has x64 prototype, you should see it in canary soon (EOD or tomorrow).

tlively commented 3 years ago

@omnisip I have an LLVM patch up for the i8x16 to i32x4 variants of these instructions: https://reviews.llvm.org/D95557. If you patch it into llvm and point a tot Emscripten installation at it, it should just work. Alternatively you can wait for it to land upstream (hopefully very soon) and emsdk install tot to get it without building anything yourself.

Edit: I forgot to mention that there is no Binaryen implementation right now, but if you do not pass optimization flags at link time and you do pass -sWASM_BIGINT, that shouldn't be a problem.

omnisip commented 3 years ago

I'm going to do everything in my power to have this tested and ready for the next meeting. Do you have any suggestions on how to test the ARM variants?

omnisip commented 3 years ago

Side note: @ngzhian the lowerings look really good on x64.

ngzhian commented 3 years ago

Can you try the x64 benchmarks first? Can the results be extrapolated to ARM? Otherwise, I can prototype it on arm64.

omnisip commented 3 years ago

On x64, we get the benefit of the aligned load arguments which are going to produce a significant performance benefit over the multiple shuffles. On A64, I'm not totally sure how it's going to perform relative to stepwise expansion in real workflows.

omnisip commented 3 years ago


I'm getting: [parse exception: invalid code after SIMD prefix: 103 (at 0:216332)]


env -     PATH=/opt/emsdk/upstream/emscripten:/opt/emsdk:/opt/emsdk/node/12.18.1_64bit/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/dan/.rvm/bin     PWD=/proc/self/cwd   /opt/emsdk/upstream/emscripten/emcc -o bazel-out/wasm-opt/bin/elu_bench bazel-out/wasm-opt/bin/_objs/elu_bench/elu.o bazel-out/wasm-opt/bin/libXNNPACK.a bazel-out/wasm-opt/bin/libmemory_planner.a bazel-out/wasm-opt/bin/liboperator_run.a bazel-out/wasm-opt/bin/liboperators.a bazel-out/wasm-opt/bin/libindirection.a bazel-out/wasm-opt/bin/libpacking.a bazel-out/wasm-opt/bin/libscalar_ukernels.a bazel-out/wasm-opt/bin/libwasm_ukernels.a bazel-out/wasm-opt/bin/libtables.a bazel-out/wasm-opt/bin/libasm_ukernels.a bazel-out/wasm-opt/bin/libbench_utils.a bazel-out/wasm-opt/bin/external/com_google_benchmark/libbenchmark.a bazel-out/wasm-opt/bin/external/cpuinfo/libcpuinfo_impl.a bazel-out/wasm-opt/bin/external/clog/libclog.a bazel-out/wasm-opt/bin/external/pthreadpool/libpthreadpool.a -s 'ASSERTIONS=1' -s 'ERROR_ON_UNDEFINED_SYMBOLS=1' -s 'EXIT_RUNTIME=1' -s 'ALLOW_MEMORY_GROWTH=1' -s 'TOTAL_MEMORY=268435456' --pre-js ./preamble.js.lds -pthread -msimd128 -g -sWASM_BIGINT -msimd128 -s 'USE_PTHREADS=0' -s 'ERROR_ON_UNDEFINED_SYMBOLS=0' '-Wl,--export=__heap_base' '-Wl,--export=__data_end'

tlively commented 3 years ago

That's the expected opcode for i32x4.widen_i8x16_s. Is your V8 new enough?
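For readers matching the error message to the opcode: the `103` in the parse exception is decimal for 0x67, the second byte of the prototype encoding 0xfd67 mentioned earlier in the thread. A minimal sketch of the decoding follows, assuming the standard SIMD encoding of a 0xfd prefix byte followed by a LEB128-encoded opcode; the function name is hypothetical and this is not Binaryen's parser.

```python
SIMD_PREFIX = 0xFD


def decode_simd_opcode(code: bytes) -> int:
    """Decode the LEB128 opcode that follows the 0xfd SIMD prefix byte."""
    if not code or code[0] != SIMD_PREFIX:
        raise ValueError("not a SIMD-prefixed instruction")
    result, shift = 0, 0
    for b in code[1:]:
        result |= (b & 0x7F) << shift  # low 7 bits carry the payload
        if b & 0x80 == 0:              # continuation bit clear: last byte
            return result
        shift += 7
    raise ValueError("truncated LEB128 opcode")


widen_s_opcode = decode_simd_opcode(bytes([0xFD, 0x67]))  # i32x4.widen_i8x16_s prototype
widen_u_opcode = decode_simd_opcode(bytes([0xFD, 0x68]))  # i32x4.widen_i8x16_u prototype
```

Here `widen_s_opcode` is 103, the value reported in the error, which suggests the failing tool simply does not know the new prototype opcode yet.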

omnisip commented 3 years ago

Yeah. I built it from source today (git pull && gclient sync && tools/dev/gm.py x64.release) -- but this is at a build step... why would v8 be involved here?

tlively commented 3 years ago

Oh, sorry, I didn't read your command line carefully enough and assumed the error was coming from V8. You're using -sWASM_BIGINT and not using optimization flags, so I don't know why it's trying to invoke Binaryen. Maybe if you add -s ERROR_ON_WASM_CHANGES_AFTER_LINK it will give you more information?

omnisip commented 3 years ago

No dice.

n@dl360:~/applications/wrapper/xnnpack$ /opt/emsdk/upstream/emscripten/emcc -o bazel-out/wasm-dbg/bin/elu_bench bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o bazel-out/wasm-dbg/bin/libXNNPACK.a bazel-out/wasm-dbg/bin/libmemory_planner.a bazel-out/wasm-dbg/bin/liboperator_run.a bazel-out/wasm-dbg/bin/liboperators.a bazel-out/wasm-dbg/bin/libindirection.a bazel-out/wasm-dbg/bin/liblogging_utils.a bazel-out/wasm-dbg/bin/libpacking.a bazel-out/wasm-dbg/bin/libscalar_ukernels.a bazel-out/wasm-dbg/bin/libwasm_ukernels.a bazel-out/wasm-dbg/bin/libtables.a bazel-out/wasm-dbg/bin/libasm_ukernels.a bazel-out/wasm-dbg/bin/libbench_utils.a bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a bazel-out/wasm-dbg/bin/external/clog/libclog.a bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a -s 'ASSERTIONS=1' -s 'ERROR_ON_UNDEFINED_SYMBOLS=1' -s 'EXIT_RUNTIME=1' -s 'ALLOW_MEMORY_GROWTH=1' -s 'TOTAL_MEMORY=268435456' --pre-js ./preamble.js.lds -pthread -msimd128 -g -sWASM_BIGINT -sERROR_ON_WASM_CHANGES_AFTER_LINK  -msimd128 -s 'USE_PTHREADS=0' -s 'ERROR_ON_UNDEFINED_SYMBOLS=0' '-Wl,--export=__heap_base' '-Wl,--export=__data_end'
emcc: warning: LLVM version appears incorrect (seeing "13.0", expected "12.0") [-Wversion-check]
[parse exception: invalid code after SIMD prefix: 103 (at 0:222206)]
Fatal: error in parsing input
emcc: error: '/opt/emsdk/upstream/bin/wasm-emscripten-finalize --detect-features --minimize-wasm-changes -g --bigint --no-dyncalls --no-legalize-javascript-ffi --dwarf bazel-out/wasm-dbg/bin/elu_bench.wasm' failed (1)

tlively commented 3 years ago

Can you send me the logs with environment variable EMCC_DEBUG=1?

tlively commented 3 years ago

Also, emcc: warning: LLVM version appears incorrect (seeing "13.0", expected "12.0") doesn't look good. Are you using the latest version of Emscripten from emsdk?

omnisip commented 3 years ago

Yep. I just did emsdk install tot, an hour ago.

I assumed I had to compile LLVM from source, so I pointed Emscripten at my new LLVM build. Is that wrong?

penzn commented 3 years ago

I don't know if that is what's going on, but LLVM rolled its version from 12 to 13 very recently (maybe even yesterday), so Emscripten's version detection may not have caught up yet.

tlively commented 3 years ago

Weird, it looks like that expectation was updated yesterday. @omnisip, did you do emsdk update-tags before installing tot?

omnisip commented 3 years ago

Weird, it looks like that expectation was updated yesterday. @omnisip, did you do emsdk update-tags before installing tot?

I don't recall doing emsdk update-tags, but I think I did emsdk update. I'll redo it if that helps.

omnisip commented 3 years ago

Same issue, but the warning flags are gone now -- so that's a plus.

dan@dl360:~/applications/wrapper/xnnpack$ EMCC_DEBUG=1 /opt/emsdk/upstream/emscripten/emcc -o bazel-out/wasm-dbg/bin/elu_bench bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o bazel-out/wasm-dbg/bin/libXNNPACK.a bazel-out/wasm-dbg/bin/libmemory_planner.a bazel-out/wasm-dbg/bin/liboperator_run.a bazel-out/wasm-dbg/bin/liboperators.a bazel-out/wasm-dbg/bin/libindirection.a bazel-out/wasm-dbg/bin/liblogging_utils.a bazel-out/wasm-dbg/bin/libpacking.a bazel-out/wasm-dbg/bin/libscalar_ukernels.a bazel-out/wasm-dbg/bin/libwasm_ukernels.a bazel-out/wasm-dbg/bin/libtables.a bazel-out/wasm-dbg/bin/libasm_ukernels.a bazel-out/wasm-dbg/bin/libbench_utils.a bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a bazel-out/wasm-dbg/bin/external/clog/libclog.a bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a -s 'ASSERTIONS=1' -s 'ERROR_ON_UNDEFINED_SYMBOLS=1' -s 'EXIT_RUNTIME=1' -s 'ALLOW_MEMORY_GROWTH=1' -s 'TOTAL_MEMORY=268435456' --pre-js ./preamble.js.lds -pthread -msimd128 -g -sWASM_BIGINT -sERROR_ON_WASM_CHANGES_AFTER_LINK -msimd128 -s 'USE_PTHREADS=0' -s 'ERROR_ON_UNDEFINED_SYMBOLS=0' '-Wl,--export=__heap_base' '-Wl,--export=__data_end'
tools.filelock:DEBUG: Attempting to acquire lock 139907621271968 on /tmp/emscripten_temp/emscripten.lock
tools.filelock:DEBUG: Lock 139907621271968 acquired on /tmp/emscripten_temp/emscripten.lock
emcc:WARNING: invocation: /opt/emsdk/upstream/emscripten/emcc.py -o bazel-out/wasm-dbg/bin/elu_bench bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o bazel-out/wasm-dbg/bin/libXNNPACK.a bazel-out/wasm-dbg/bin/libmemory_planner.a bazel-out/wasm-dbg/bin/liboperator_run.a bazel-out/wasm-dbg/bin/liboperators.a bazel-out/wasm-dbg/bin/libindirection.a bazel-out/wasm-dbg/bin/liblogging_utils.a bazel-out/wasm-dbg/bin/libpacking.a bazel-out/wasm-dbg/bin/libscalar_ukernels.a bazel-out/wasm-dbg/bin/libwasm_ukernels.a bazel-out/wasm-dbg/bin/libtables.a bazel-out/wasm-dbg/bin/libasm_ukernels.a bazel-out/wasm-dbg/bin/libbench_utils.a bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a bazel-out/wasm-dbg/bin/external/clog/libclog.a bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a -s ASSERTIONS=1 -s ERROR_ON_UNDEFINED_SYMBOLS=1 -s EXIT_RUNTIME=1 -s ALLOW_MEMORY_GROWTH=1 -s TOTAL_MEMORY=268435456 --pre-js ./preamble.js.lds -pthread -msimd128 -g -sWASM_BIGINT -sERROR_ON_WASM_CHANGES_AFTER_LINK -msimd128 -s USE_PTHREADS=0 -s ERROR_ON_UNDEFINED_SYMBOLS=0 -Wl,--export=__heap_base -Wl,--export=__data_end  (in /home/dan/applications/wrapper/xnnpack)
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/clang --version
cache:DEBUG: PID 208108 acquiring multiprocess file lock to Emscripten cache at /opt/emsdk/upstream/emscripten/cache
tools.filelock:DEBUG: Attempting to acquire lock 139907621272112 on /opt/emsdk/upstream/emscripten/cache/cache.lock
tools.filelock:DEBUG: Lock 139907621272112 acquired on /opt/emsdk/upstream/emscripten/cache/cache.lock
cache:DEBUG: done
shared:DEBUG: sanity file up-to-date but check forced: /opt/emsdk/upstream/emscripten/cache/sanity.txt
shared:DEBUG: successfully executed /opt/emsdk/node/12.18.1_64bit/bin/node --version
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/llc --version
shared:INFO: (Emscripten: Running sanity checks)
shared:DEBUG: successfully executed /opt/emsdk/node/12.18.1_64bit/bin/node -e console.log("hello")
tools.filelock:DEBUG: Attempting to release lock 139907621272112 on /opt/emsdk/upstream/emscripten/cache/cache.lock
tools.filelock:DEBUG: Lock 139907621272112 released on /opt/emsdk/upstream/emscripten/cache/cache.lock
cache:DEBUG: PID 208108 released multiprocess file lock to Emscripten cache at /opt/emsdk/upstream/emscripten/cache
diagnostics:DEBUG: disabled warning: use of legacy setting: TOTAL_MEMORY (setting renamed to INITIAL_MEMORY) [-Wlegacy-settings]
emcc:DEBUG: compiling to bitcode
emcc:DEBUG: emcc step "parse arguments and setup" took 0.12 seconds
emcc:DEBUG: using object file: bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libXNNPACK.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libmemory_planner.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/liboperator_run.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/liboperators.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libindirection.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/liblogging_utils.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libpacking.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libscalar_ukernels.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libwasm_ukernels.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libtables.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libasm_ukernels.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/libbench_utils.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/clog/libclog.a
emcc:DEBUG: using static library: bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a
emcc:DEBUG: emcc step "compile inputs" took 0.00 seconds
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libXNNPACK.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libmemory_planner.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/liboperator_run.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/liboperators.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libindirection.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/liblogging_utils.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libpacking.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libscalar_ukernels.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libwasm_ukernels.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libtables.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libasm_ukernels.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/libbench_utils.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/clog/libclog.a
shared:DEBUG: executed /opt/emsdk/upstream/bin/llvm-nm /home/dan/applications/wrapper/xnnpack/bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a
system_libs:DEBUG: adding dependency on malloc due to deps-info on realloc
system_libs:DEBUG: adding dependency on free due to deps-info on realloc
system_libs:DEBUG: adding dependency on malloc due to deps-info on getenv
system_libs:DEBUG: adding dependency on free due to deps-info on getenv
system_libs:DEBUG: adding dependency on malloc due to deps-info on gmtime_r
system_libs:DEBUG: adding dependency on free due to deps-info on gmtime_r
system_libs:DEBUG: adding dependency on _get_tzname due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on _get_daylight due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on _get_timezone due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on malloc due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on free due to deps-info on localtime_r
system_libs:DEBUG: adding dependency on malloc due to deps-info on pthread_create
system_libs:DEBUG: adding dependency on free due to deps-info on pthread_create
system_libs:DEBUG: adding dependency on emscripten_main_thread_process_queued_calls due to deps-info on pthread_create
system_libs:DEBUG: adding dependency on malloc due to deps-info on calloc
system_libs:DEBUG: adding dependency on free due to deps-info on calloc
system_libs:DEBUG: including libgl (libgl.a)
system_libs:DEBUG: including libal (libal.a)
system_libs:DEBUG: including libhtml5 (libhtml5.a)
system_libs:DEBUG: including libc (libc.a)
system_libs:DEBUG: including libcompiler_rt (libcompiler_rt.a)
system_libs:DEBUG: including libc++ (libc++-noexcept.a)
system_libs:DEBUG: including libc++abi (libc++abi-noexcept.a)
system_libs:DEBUG: including libmalloc (libdlmalloc.a)
system_libs:DEBUG: including libc_rt_wasm (libc_rt_wasm.a)
system_libs:DEBUG: including libsockets (libsockets.a)
emcc:DEBUG: emcc step "calculate system libraries" took 0.46 seconds
emcc:DEBUG: linking: ['bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o', 'bazel-out/wasm-dbg/bin/libXNNPACK.a', 'bazel-out/wasm-dbg/bin/libmemory_planner.a', 'bazel-out/wasm-dbg/bin/liboperator_run.a', 'bazel-out/wasm-dbg/bin/liboperators.a', 'bazel-out/wasm-dbg/bin/libindirection.a', 'bazel-out/wasm-dbg/bin/liblogging_utils.a', 'bazel-out/wasm-dbg/bin/libpacking.a', 'bazel-out/wasm-dbg/bin/libscalar_ukernels.a', 'bazel-out/wasm-dbg/bin/libwasm_ukernels.a', 'bazel-out/wasm-dbg/bin/libtables.a', 'bazel-out/wasm-dbg/bin/libasm_ukernels.a', 'bazel-out/wasm-dbg/bin/libbench_utils.a', 'bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a', 'bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a', 'bazel-out/wasm-dbg/bin/external/clog/libclog.a', 'bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a', '--export=__heap_base', '--export=__data_end', '-L/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libgl.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libal.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libhtml5.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libcompiler_rt.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++-noexcept.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++abi-noexcept.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libdlmalloc.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc_rt_wasm.a', '/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libsockets.a']
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/wasm-ld -o bazel-out/wasm-dbg/bin/elu_bench.wasm bazel-out/wasm-dbg/bin/_objs/elu_bench/elu.o bazel-out/wasm-dbg/bin/libXNNPACK.a bazel-out/wasm-dbg/bin/libmemory_planner.a bazel-out/wasm-dbg/bin/liboperator_run.a bazel-out/wasm-dbg/bin/liboperators.a bazel-out/wasm-dbg/bin/libindirection.a bazel-out/wasm-dbg/bin/liblogging_utils.a bazel-out/wasm-dbg/bin/libpacking.a bazel-out/wasm-dbg/bin/libscalar_ukernels.a bazel-out/wasm-dbg/bin/libwasm_ukernels.a bazel-out/wasm-dbg/bin/libtables.a bazel-out/wasm-dbg/bin/libasm_ukernels.a bazel-out/wasm-dbg/bin/libbench_utils.a bazel-out/wasm-dbg/bin/external/com_google_benchmark/libbenchmark.a bazel-out/wasm-dbg/bin/external/cpuinfo/libcpuinfo_impl.a bazel-out/wasm-dbg/bin/external/clog/libclog.a bazel-out/wasm-dbg/bin/external/pthreadpool/libpthreadpool.a --export=__heap_base --export=__data_end -L/opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libgl.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libal.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libhtml5.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libcompiler_rt.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++-noexcept.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc++abi-noexcept.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libdlmalloc.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libc_rt_wasm.a /opt/emsdk/upstream/emscripten/cache/sysroot/lib/wasm32-emscripten/libsockets.a -mllvm -combiner-global-alias-analysis=false -mllvm -enable-emscripten-sjlj -mllvm -disable-lsr --allow-undefined --export main --export emscripten_stack_get_end --export emscripten_stack_get_free --export 
emscripten_stack_init --export stackSave --export stackRestore --export stackAlloc --export __wasm_call_ctors --export fflush --export __errno_location --export malloc --export free --export _get_tzname --export _get_daylight --export _get_timezone --export emscripten_main_thread_process_queued_calls --export-table -z stack-size=5242880 --initial-memory=268435456 --no-entry --max-memory=2147483648 --global-base=1024
emcc:DEBUG: emcc step "link" took 0.12 seconds
emcc:DEBUG: emscript
building:DEBUG: saving debug copy /tmp/emscripten_temp/emcc-0-base.wasm
shared:DEBUG: successfully executed /opt/emsdk/upstream/bin/wasm-opt --version
[parse exception: invalid code after SIMD prefix: 103 (at 0:222205)]
Fatal: error in parsing input
emcc: error: '/opt/emsdk/upstream/bin/wasm-emscripten-finalize --detect-features --minimize-wasm-changes -g --bigint --no-dyncalls --no-legalize-javascript-ffi --dwarf bazel-out/wasm-dbg/bin/elu_bench.wasm' failed (1)
tools.filelock:DEBUG: Attempting to release lock 139907621271968 on /tmp/emscripten_temp/emscripten.lock
tools.filelock:DEBUG: Lock 139907621271968 released on /tmp/emscripten_temp/emscripten.lock