Closed saucecontrol closed 4 years ago
I'm still learning how to read JITDumps, but I can see here where morph introduces the float -> double cast. Can't tell why, but I'm learning something, I think...
Morphing BB03 of 'GreyConverter:AdjustBrightness(ref,ref,float):this'
fgMorphTree BB03, stmt 3 (before)
[000032] ---XG------- /--* CAST int <- ubyte <- float
[000030] ------------ | | /--* LCL_VAR float V03 arg3
[000031] ---XG------- | \--* MUL float
[000029] ---XG------- | \--* CAST float <- int
[000027] ------------ | | /--* LCL_VAR int V04 loc0
[000028] ---XG------- | \--* INDEX ubyte
[000026] ------------ | \--* LCL_VAR ref V01 arg1
[000034] -A-XG------- * ASG int
[000033] D------N---- \--* LCL_VAR int V05 tmp0
GenTreeNode creates assertion:
[000050] ---X-------- * ARR_LENGTH int
In BB03 New Local Constant Assertion: V01 != null index=#01, mask=0000000000000001
GenTreeNode creates assertion:
[000034] -A-XG------- * ASG int
In BB03 New Local Subrange Assertion: V05 in [0..255] index=#02, mask=0000000000000002
fgMorphTree BB03, stmt 3 (after)
[000032] ---XG+------ /--* CAST int <- ubyte <- int
[000047] ---XG+------ | \--* CAST int <- double
[000046] ---XG+------ | \--* CAST double <- float
[000030] -----+------ | | /--* LCL_VAR float V03 arg3
[000031] ---XG+------ | \--* MUL float
[000029] ---XG+------ | \--* CAST float <- int
[000028] a--XG+------ | | /--* IND ubyte
[000053] -----+------ | | | | /--* CNS_INT int 8 Fseq[#FirstElem]
[000054] -----+------ | | | \--* ADD byref
[000049] i----+------ | | | | /--* LCL_VAR int V04 loc0
[000052] -----+------ | | | \--* ADD byref
[000048] -----+------ | | | \--* LCL_VAR ref V01 arg1
[000055] ---XG+------ | \--* COMMA ubyte
[000051] ---X-+------ | \--* ARR_BOUNDS_CHECK_Rng void
[000027] -----+------ | +--* LCL_VAR int V04 loc0
[000050] ---X-+------ | \--* ARR_LENGTH int
[000026] -----+------ | \--* LCL_VAR ref V01 arg1
[000034] -A-XG+------ * ASG int
[000033] D----+-N---- \--* LCL_VAR int V05 tmp0
Setting EnableAVX to 0, I get the following asm, so this has nothing to do with the VEX encoding:
; Assembly listing for method GreyConverter:AdjustBrightness(ref,ref,float):this
; Emitting BLENDED_CODE for generic X86 CPU
; optimized code
; ebp based frame
; fully interruptible
; Final local variable assignments
;
;* V00 this [V00 ] ( 0, 0 ) ref -> zero-ref this class-hnd
; V01 arg1 [V01,T03] ( 7, 13 ) ref -> edx class-hnd
; V02 arg2 [V02,T05] ( 5, 11 ) ref -> eax class-hnd
; V03 arg3 [V03,T06] ( 2, 8 ) float -> mm0
; V04 loc0 [V04,T00] ( 13, 48 ) int -> edi
; V05 tmp0 [V05,T01] ( 4, 32 ) int -> ebx
; V06 cse0 [V06,T02] ( 13, 23 ) int -> esi
; V07 cse1 [V07,T04] ( 7, 12.50) int -> ecx
;
; Lcl frame size = 0
G_M46214_IG01:
55 push ebp
8BEC mov ebp, esp
57 push edi
56 push esi
53 push ebx
8B450C mov eax, gword ptr [ebp+0CH]
F30F104508 movss xmm0, dword ptr [ebp+08H]
G_M46214_IG02:
8B4A04 mov ecx, dword ptr [edx+4]
8B7004 mov esi, dword ptr [eax+4]
3BCE cmp ecx, esi
722E jb SHORT G_M46214_IG04
33FF xor edi, edi
85F6 test esi, esi
7E28 jle SHORT G_M46214_IG04
G_M46214_IG03:
3BF9 cmp edi, ecx
732B jae SHORT G_M46214_IG05
0FB65C3A08 movzx ebx, byte ptr [edx+edi+8]
0F57C9 xorps xmm1, xmm1
F30F2ACB cvtsi2ss xmm1, ebx
F30F59C8 mulss xmm1, xmm0
F30F5AC9 cvtss2sd xmm1, xmm1
F20F2CD9 cvttsd2si ebx, xmm1
0FB6DB movzx ebx, bl
885C3808 mov byte ptr [eax+edi+8], bl
47 inc edi
3BF7 cmp esi, edi
7FD8 jg SHORT G_M46214_IG03
G_M46214_IG04:
5B pop ebx
5E pop esi
5F pop edi
5D pop ebp
C20800 ret 8
G_M46214_IG05:
E8EE59540A call CORINFO_HELP_RNGCHKFAIL
CC int3
; Total bytes of code 83, prolog size 14 for method GreyConverter:AdjustBrightness(ref,ref,float):this
; ============================================================
There are various notes in fgMorphCast indicating that x86 needs to do a two-step conversion, and logic to make it so.
This may be an artifact of the days (not so long ago) when x86 floating point used x87 instructions -- so it may no longer be true.
cc @dotnet/jit-contrib
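To make the two-step conversion visible outside a JIT dump, here is a toy model of the rewrite seen in the "after" tree, where `CAST int <- float` becomes `CAST int <- double` over `CAST double <- float`. This is only a sketch: `Node`, `morph_cast`, `describe` and the `two_step` flag are hypothetical names invented for illustration and are not the actual fgMorphCast code.

```python
class Node:
    """Minimal stand-in for a JIT tree node with at most one operand (hypothetical)."""

    def __init__(self, op, type_, operand=None):
        self.op = op            # e.g. "CAST", "MUL"
        self.type = type_       # result type: "int", "float", "double"
        self.operand = operand  # child node, or None for a leaf


def morph_cast(node, two_step=True):
    """Insert an intermediate double cast under CAST int <- float when two_step is set."""
    if (two_step and node.op == "CAST" and node.type == "int"
            and node.operand is not None and node.operand.type == "float"):
        widened = Node("CAST", "double", node.operand)
        return Node("CAST", "int", widened)
    return node


def describe(node):
    """List the chain outermost first, loosely mimicking the dump's CAST lines."""
    parts = []
    while node is not None:
        if node.op == "CAST":
            parts.append(f"CAST {node.type} <- {node.operand.type}")
        else:
            parts.append(f"{node.op} {node.type}")
        node = node.operand
    return parts


tree = Node("CAST", "int", Node("MUL", "float"))
print(describe(morph_cast(tree)))                  # two-step form, as in the x86 dump
print(describe(morph_cast(tree, two_step=False)))  # direct form
```

With `two_step=False` the tree is left alone, which corresponds to the direct conversion being asked about.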
> This may be an artifact of the days (not so long ago) when x86 floating point used x87 instructions -- so it may no longer be true.
Oh, right... Legacy JIT does this
07e89678 0fb6443108 movzx eax,byte ptr [ecx+esi+8]
07e8967d 8945ec mov dword ptr [ebp-14h],eax
07e89680 db45ec fild dword ptr [ebp-14h]
07e89683 d84d08 fmul dword ptr [ebp+8]
07e89686 dd5de4 fstp qword ptr [ebp-1Ch]
07e89689 f20f1045e4 movsd xmm0,mmword ptr [ebp-1Ch]
07e8968e f20f2cd8 cvttsd2si ebx,xmm0
07e89692 81e3ff000000 and ebx,0FFh
07e89698 8b450c mov eax,dword ptr [ebp+0Ch]
07e8969b 885c3008 mov byte ptr [eax+esi+8],bl
For SSE/SSE2 code, converting values less than 2^23 directly from float to the respective integer type should be exact.
There may be some edge case for values greater than that, but it should be trivial to validate for all binary32 inputs in either case.
RyuJIT already does the direct conversion on x64, so wouldn't it be safe to use the same cvttss2si on x86? Seems like that extra conversion is purely a holdover from jit32 and could just be removed.
https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L154-L156 and https://github.com/dotnet/coreclr/blob/master/src/jit/morph.cpp#L213-L216
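A rough way to sanity-check that claim without touching the JIT is to compare truncation computed straight from the binary32 bit fields against truncation of the same value after widening to double. The sketch below does that in Python; the helper names and the sampling stride are made up for illustration, it only covers finite inputs with magnitude below 2^31, and it models the arithmetic rather than the actual cvttss2si / cvttsd2si instructions.

```python
import struct


def f32_from_bits(bits):
    """Decode a binary32 bit pattern. Python floats are doubles, so this is the widened value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def trunc_direct(bits):
    """Truncate a finite binary32 value toward zero using only its bit fields."""
    sign = -1 if bits >> 31 else 1
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0:
        return 0                # zero and subnormals have magnitude below 1
    mant = frac | 0x800000      # restore the implicit leading one
    shift = exp - 127 - 23      # position of the binary point
    mag = mant << shift if shift >= 0 else mant >> -shift
    return sign * mag


def trunc_via_double(bits):
    """Widen to double first, then truncate toward zero (int() truncates)."""
    return int(f32_from_bits(bits))


def mismatches(stride=65521):
    """Sample bit patterns and collect any where the two routes disagree."""
    bad = []
    for bits in range(0, 1 << 32, stride):
        exp = (bits >> 23) & 0xFF
        if exp >= 158:          # skip NaN, infinities and magnitudes of 2^31 or more
            continue
        if trunc_direct(bits) != trunc_via_double(bits):
            bad.append(bits)
    return bad


print(len(mismatches()))
```

Setting `stride=1` walks every bit pattern, which is the exhaustive validation suggested above, though in pure Python that takes a long time.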
I agree; it seems that those should be removed.
The movzx after the conversion and before the store also seems redundant. Is this already tracked somewhere, or should I create a new issue?
I've been meaning to ask about that.
@mikedn, is that something that would be addressed by your fix for https://github.com/dotnet/coreclr/issues/12595?
I need to update my local build and double check, but from the posted assembly and dumps it looks like the first movzx is needed (you're loading a byte from memory) and the second is not (since it's made redundant by the byte store). However, from the dump it appears that the value is computed in a temporary variable, and that makes the removal of the second movzx more difficult.
It's not related to the stuff I'm working on, that deals with memory loads that can be combined with a subsequent cast.
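For anyone following along, the reason the second movzx is redundant can be shown with a tiny model: a byte store only writes the low 8 bits of the register, so zero-extending those bits first cannot change what lands in memory. The function names below are hypothetical and the registers are modelled as plain Python integers; it is a sketch of the reasoning, not of the JIT's codegen.

```python
def movzx_low_byte(reg):
    """Model of `movzx ebx, bl`: keep the low 8 bits and clear the upper 24."""
    return reg & 0xFF


def store_byte(memory, index, reg):
    """Model of `mov byte ptr [...], bl`: only the low 8 bits reach memory."""
    memory[index] = reg & 0xFF


def stores_agree(reg):
    """Compare the stored byte with and without the preceding movzx."""
    with_movzx = bytearray(1)
    without_movzx = bytearray(1)
    store_byte(with_movzx, 0, movzx_low_byte(reg))
    store_byte(without_movzx, 0, reg)
    return with_movzx == without_movzx


print(all(stores_agree(r) for r in (0, 255, 256, 0x12345678, 0xFFFFFFFF)))
```

The movzx only matters if something reads the full 32-bit register afterwards, which matches the point above about the value living in a temporary.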
Given the following method
RyuJIT32 produces this:
Note the vcvtss2sd followed by vcvttsd2si. RyuJIT64 produces the expected code:
That makes for a 13% speed difference in this method.
I also just noticed the bounds check is not elided here, just moved. I can't seem to get it to elide the check. Any ideas @mikedn?