RyuJIT32 performs unnecessary round-trip through double when converting float to byte

saucecontrol commented 6 years ago

Given the following method

void AdjustBrightness(byte[] pixelsin, byte[] pixelsout, float amount)
{
    if ((uint)pixelsin.Length >= (uint)pixelsout.Length)
    {
        for (int i = 0; i < pixelsout.Length; i++)
            pixelsout[i] = (byte)(pixelsin[i] * amount);
    }
}

RyuJIT32 produces this:

07999168 AdjustBrightness(Byte[], Byte[], Single)
0799917a 8b4a04          mov     ecx,dword ptr [edx+4]
0799917d 8b7004          mov     esi,dword ptr [eax+4]
07999180 3bce            cmp     ecx,esi
07999182 7234            jb      079991b8
07999184 33ff            xor     edi,edi
07999186 85f6            test    esi,esi
07999188 7e2e            jle     079991b8
0799918a 3bf9            cmp     edi,ecx
0799918c 7331            jae     079991bf
0799918e 0fb65c3a08      movzx   ebx,byte ptr [edx+edi+8]
07999193 c4e17057c9      vxorps  xmm1,xmm1,xmm1
07999198 c4e1722acb      vcvtsi2ss xmm1,xmm1,ebx
0799919d c4e17259c8      vmulss  xmm1,xmm1,xmm0
079991a2 c4e1725ac9      vcvtss2sd xmm1,xmm1,xmm1
079991a7 c4e17b2cd9      vcvttsd2si ebx,xmm1
079991ac 0fb6db          movzx   ebx,bl
079991af 885c3808        mov     byte ptr [eax+edi+8],bl
079991b3 47              inc     edi
079991b4 3bf7            cmp     esi,edi
079991b6 7fd2            jg      0799918a
079991b8 5b              pop     ebx

Note the vcvtss2sd followed by vcvttsd2si. RyuJIT64 produces the expected code:

00007ffc`2eb52910 AdjustBrightness(Byte[], Byte[], Single)
00007ffc`2eb52917 8b4208          mov     eax,dword ptr [rdx+8]
00007ffc`2eb5291a 418b4808        mov     ecx,dword ptr [r8+8]
00007ffc`2eb5291e 3bc1            cmp     eax,ecx
00007ffc`2eb52920 723a            jb      00007ffc`2eb5295c
00007ffc`2eb52922 4533c9          xor     r9d,r9d
00007ffc`2eb52925 85c9            test    ecx,ecx
00007ffc`2eb52927 7e33            jle     00007ffc`2eb5295c
00007ffc`2eb52929 443bc8          cmp     r9d,eax
00007ffc`2eb5292c 7333            jae     00007ffc`2eb52961
00007ffc`2eb5292e 4d63d1          movsxd  r10,r9d
00007ffc`2eb52931 460fb65c1210    movzx   r11d,byte ptr [rdx+r10+10h]
00007ffc`2eb52937 c4e17857c0      vxorps  xmm0,xmm0,xmm0
00007ffc`2eb5293c c4c17a2ac3      vcvtsi2ss xmm0,xmm0,r11d
00007ffc`2eb52941 c4e17a59c3      vmulss  xmm0,xmm0,xmm3
00007ffc`2eb52946 c4617a2cd8      vcvttss2si r11d,xmm0
00007ffc`2eb5294b 450fb6db        movzx   r11d,r11b
00007ffc`2eb5294f 47885c1010      mov     byte ptr [r8+r10+10h],r11b
00007ffc`2eb52954 41ffc1          inc     r9d
00007ffc`2eb52957 413bc9          cmp     ecx,r9d
00007ffc`2eb5295a 7fcd            jg      00007ffc`2eb52929
00007ffc`2eb5295c 4883c428        add     rsp,28h

That makes for a 13% speed difference in this method.

I also just noticed the bounds check is not elided here, just moved. I can't seem to get it to elide the check. Any ideas @mikedn?

saucecontrol commented 6 years ago

I'm still learning how to read JITDumps, but I can see here where morph introduces the float -> double cast. Can't tell why, but I'm learning something, I think...

Morphing BB03 of 'GreyConverter:AdjustBrightness(ref,ref,float):this'

fgMorphTree BB03, stmt 3 (before)
               [000032] ---XG-------              /--*  CAST      int <- ubyte <- float
               [000030] ------------              |  |  /--*  LCL_VAR   float  V03 arg3         
               [000031] ---XG-------              |  \--*  MUL       float 
               [000029] ---XG-------              |     \--*  CAST      float <- int
               [000027] ------------              |        |  /--*  LCL_VAR   int    V04 loc0         
               [000028] ---XG-------              |        \--*  INDEX     ubyte 
               [000026] ------------              |           \--*  LCL_VAR   ref    V01 arg1         
               [000034] -A-XG-------              *  ASG       int   
               [000033] D------N----              \--*  LCL_VAR   int    V05 tmp0         
GenTreeNode creates assertion:
               [000050] ---X--------              *  ARR_LENGTH int   
In BB03 New Local Constant Assertion: V01 != null index=#01, mask=0000000000000001
GenTreeNode creates assertion:
               [000034] -A-XG-------              *  ASG       int   
In BB03 New Local Subrange Assertion: V05 in [0..255] index=#02, mask=0000000000000002

fgMorphTree BB03, stmt 3 (after)
               [000032] ---XG+------              /--*  CAST      int <- ubyte <- int
               [000047] ---XG+------              |  \--*  CAST      int <- double
               [000046] ---XG+------              |     \--*  CAST      double <- float
               [000030] -----+------              |        |  /--*  LCL_VAR   float  V03 arg3         
               [000031] ---XG+------              |        \--*  MUL       float 
               [000029] ---XG+------              |           \--*  CAST      float <- int
               [000028] a--XG+------              |              |  /--*  IND       ubyte 
               [000053] -----+------              |              |  |  |  /--*  CNS_INT   int    8 Fseq[#FirstElem]
               [000054] -----+------              |              |  |  \--*  ADD       byref 
               [000049] i----+------              |              |  |     |  /--*  LCL_VAR   int    V04 loc0         
               [000052] -----+------              |              |  |     \--*  ADD       byref 
               [000048] -----+------              |              |  |        \--*  LCL_VAR   ref    V01 arg1         
               [000055] ---XG+------              |              \--*  COMMA     ubyte 
               [000051] ---X-+------              |                 \--*  ARR_BOUNDS_CHECK_Rng void  
               [000027] -----+------              |                    +--*  LCL_VAR   int    V04 loc0         
               [000050] ---X-+------              |                    \--*  ARR_LENGTH int   
               [000026] -----+------              |                       \--*  LCL_VAR   ref    V01 arg1         
               [000034] -A-XG+------              *  ASG       int   
               [000033] D----+-N----              \--*  LCL_VAR   int    V05 tmp0

Setting EnableAVX to 0, I get the following asm, so nothing to do with the VEX encoding

; Assembly listing for method GreyConverter:AdjustBrightness(ref,ref,float):this
; Emitting BLENDED_CODE for generic X86 CPU
; optimized code
; ebp based frame
; fully interruptible
; Final local variable assignments
;
;* V00 this         [V00    ] (  0,  0   )     ref  ->  zero-ref    this class-hnd
;  V01 arg1         [V01,T03] (  7, 13   )     ref  ->  edx         class-hnd
;  V02 arg2         [V02,T05] (  5, 11   )     ref  ->  eax         class-hnd
;  V03 arg3         [V03,T06] (  2,  8   )   float  ->  mm0
;  V04 loc0         [V04,T00] ( 13, 48   )     int  ->  edi
;  V05 tmp0         [V05,T01] (  4, 32   )     int  ->  ebx
;  V06 cse0         [V06,T02] ( 13, 23   )     int  ->  esi
;  V07 cse1         [V07,T04] (  7, 12.50)     int  ->  ecx
;
; Lcl frame size = 0

G_M46214_IG01:
       55           push     ebp
       8BEC         mov      ebp, esp
       57           push     edi
       56           push     esi
       53           push     ebx
       8B450C       mov      eax, gword ptr [ebp+0CH]
       F30F104508   movss    xmm0, dword ptr [ebp+08H]

G_M46214_IG02:
       8B4A04       mov      ecx, dword ptr [edx+4]
       8B7004       mov      esi, dword ptr [eax+4]
       3BCE         cmp      ecx, esi
       722E         jb       SHORT G_M46214_IG04
       33FF         xor      edi, edi
       85F6         test     esi, esi
       7E28         jle      SHORT G_M46214_IG04

G_M46214_IG03:
       3BF9         cmp      edi, ecx
       732B         jae      SHORT G_M46214_IG05
       0FB65C3A08   movzx    ebx, byte  ptr [edx+edi+8]
       0F57C9       xorps    xmm1, xmm1
       F30F2ACB     cvtsi2ss   xmm1, ebx
       F30F59C8     mulss    xmm1, xmm0
       F30F5AC9     cvtss2sd  xmm1, xmm1
       F20F2CD9     cvttsd2si  ebx, xmm1
       0FB6DB       movzx    ebx, bl
       885C3808     mov      byte  ptr [eax+edi+8], bl
       47           inc      edi
       3BF7         cmp      esi, edi
       7FD8         jg       SHORT G_M46214_IG03

G_M46214_IG04:
       5B           pop      ebx
       5E           pop      esi
       5F           pop      edi
       5D           pop      ebp
       C20800       ret      8

G_M46214_IG05:
       E8EE59540A   call     CORINFO_HELP_RNGCHKFAIL
       CC           int3

; Total bytes of code 83, prolog size 14 for method GreyConverter:AdjustBrightness(ref,ref,float):this
; ============================================================

AndyAyersMS commented 6 years ago

There are various notes in fgMorphCast indicating that x86 needs to do a two-step conversion, and logic to make it so.

This may be an artifact of the days (not so long ago) when x86 floating point used x87 instructions -- so it may no longer be true.

AndyAyersMS commented 6 years ago

cc @dotnet/jit-contrib

saucecontrol commented 6 years ago

This may be an artifact of the days (not so long ago) when x86 floating point used x87 instructions -- so it may no longer be true.

Oh, right... Legacy JIT does this

07e89678 0fb6443108      movzx   eax,byte ptr [ecx+esi+8]
07e8967d 8945ec          mov     dword ptr [ebp-14h],eax
07e89680 db45ec          fild    dword ptr [ebp-14h]
07e89683 d84d08          fmul    dword ptr [ebp+8]
07e89686 dd5de4          fstp    qword ptr [ebp-1Ch]
07e89689 f20f1045e4      movsd   xmm0,mmword ptr [ebp-1Ch]
07e8968e f20f2cd8        cvttsd2si ebx,xmm0
07e89692 81e3ff000000    and     ebx,0FFh
07e89698 8b450c          mov     eax,dword ptr [ebp+0Ch]
07e8969b 885c3008        mov     byte ptr [eax+esi+8],bl

tannergooding commented 6 years ago

For SSE/SSE2 code, converting values less than 2^23 directly from float to the respective integer type should be exact.

There may be some edge case for values greater than that, but it should be trivial to validate for all binary32 inputs in either case.

saucecontrol commented 6 years ago

RyuJIT already does the direct conversion on x64, so wouldn't it be safe to use the same cvttss2si on x86? Seems like that extra conversion is purely a holdover from jit32 and could just be removed.

https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L154-L156 and https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L213-L216

CarolEidt commented 6 years ago

I agree; it seems that those should be removed.

pentp commented 6 years ago

The movzx after the conversion and before the store seems redundant also - is this tracked already somewhere or should I create a new issue?

saucecontrol commented 6 years ago

I've been meaning to ask about that.

@mikedn, is that something that would be addressed by your fix for https://github.com/dotnet/coreclr/issues/12595?

mikedn commented 6 years ago

I need to update my local build and double check but from the posted assembly and dumps it looks like the first movzx is needed (you're loading a byte from memory) and the second is not (since it's made redundant by the byte store). However, from the dump it appears that the value is computed in a temporary variable and that makes the removal of the second movzx more difficult.

It's not related to the stuff I'm working on, that deals with memory loads that can be combined with a subsequent cast.

dotnet / runtime

RyuJIT32 performs unnecessary round-trip through double when converting float to byte #10609