dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Performance: JIT is emitting multiple conversion instructions when using float-math #77658

Open hopperpl opened 2 years ago

hopperpl commented 2 years ago
```c#
public float GetComp5(uint color) {
    return color * (255 / 31f);
}
```

converted to


```asm
C.GetComp5(UInt32)
    L0000: vzeroupper
    L0003: mov eax, edx
    L0005: vxorps xmm0, xmm0, xmm0
    L0009: vcvtsi2sd xmm0, xmm0, rax                 ; <-- 2x conversion, should use cvtsi2ss
    L000e: vcvtsd2ss xmm0, xmm0, xmm0
    L0012: vmulss xmm0, xmm0, [0x7ff9f52004a0]
    L001a: ret
```

There shouldn't be two conversions here; converting an integer directly to single precision is possible.


for reference, clang does this (in C++):

```asm
GetComp5(unsigned int):
        mov       edi, edi
        pxor      xmm0, xmm0
        cvtsi2ss  xmm0, rdi
        mulss     xmm0, DWORD PTR .LC2[rip]
        ret
```
ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Author: hopperpl
Assignees: -
Labels: `tenet-performance`, `area-CodeGen-coreclr`
Milestone: -
EgorBo commented 2 years ago

Should be pretty straightforward to fix, marking as help-wanted


tannergooding commented 2 years ago

Noting this can only be done on 64-bit. On 32-bit the required REX.W bit to force a 64-bit input isn't available.

hopperpl commented 2 years ago

Beware, that code is not the same. The multiplication factor has to be 255.0 / 31.0 ≈ 8.2258, not truncated down to 8.

You have to change the code to `return color * (255 / (float) 31);` and make sure there is no temporary double math. The term in parentheses avoids an expensive floating-point division.

YoniFeng commented 1 year ago

@hopperpl @EgorBo

I see ryujit overview for learning about the JIT and workflow as a general getting started guide, and viewing jit dumps seems relevant as a "JIT-specific" workflow.

Any other docs you think are worth mentioning for a first timer?

(not taking this issue yet) edit: missed than issue was self-assigned

YoniFeng commented 1 year ago

@TIHan are you working on this? If not, want to give it to a first timer? (I'll probably need a little help though)

EgorBo commented 1 year ago

@YoniFeng sorry for the delayed response, I'm on vacation.

I'd start from https://github.com/dotnet/runtime/blob/main/docs/workflow/debugging/coreclr/debugging-runtime.md

Build the repo, build the VS solution for the runtime, then configure it to run a simple hello-world app and try to set a breakpoint in the JIT at some random place, e.g. the `fgOptimizeCast` function

YoniFeng commented 1 year ago

Thanks, I got a working setup. It isn't immediately obvious to me how I can navigate/breakpoint for exactly the method I'm interested in (even though my example program is just a Main and the example method), but I only played around for a few minutes.

Now I'm trying to figure out what we even want to do here (I realize it's probably super obvious to you). Sharing my learning/thoughts here "live", so I can be corrected on anything I've misunderstood (helping you help me). For TL;DR and order's sake, questions are enumerated and in bold.

The IL for the example code given is:

```
IL_0000: ldarg.0
IL_0001: conv.r.un
IL_0002: conv.r4
IL_0003: ldc.r4 8.225806
IL_0008: mul
IL_0009: ret
```

The short main descriptions for `conv.r.un` and `conv.r4` make it seem like there's a wasteful double conversion here, but on reading the longer description it's necessary, because `conv.r.un` might return a float64. (Does the short main description for `conv.r.un` require correction?)

Looking at the JITDump, we get the uint -> long -> double -> float chain right off the bat:

```
Importing BB01 (PC=000) of 'ConsoleApplication.Program:GetComp5(uint):float'
    [ 0]   0 (0x000) ldarg.0
    [ 1]   1 (0x001) conv.r.un
    [ 1]   2 (0x002) conv.r4
    [ 1]   3 (0x003) ldc.r4 8.2258062362670898
    [ 2]   8 (0x008) mul
    [ 1]   9 (0x009) ret

STMT00000 ( 0x000[E-] ... ??? )
               [000005] -----------                         *  RETURN    float
               [000004] -----------                         \--*  MUL       float
               [000002] -----------                            +--*  CAST      float <- double
               [000001] ---------U-                            |  \--*  CAST      double <- uint
               [000000] -----------                            |     \--*  LCL_VAR   int    V00 arg0
               [000003] -----------                            \--*  CNS_DBL   float  8.2258062362670898
```

This seems a little odd to me. While it's true there's no direct uint -> float32 x86 instruction (I had to google that; I'm not even vaguely familiar with x86/SSE/AVX, etc.), I'd expect the conversion to be represented abstractly here, with only a later phase (lowering? maybe that's too late) turning it into a concrete chain of instructions that gets the job done. Naively, it feels like "implementation details" shouldn't matter at this phase, so why the granularity?

1. Is it OK/expected that HIR is like this?

The tree remains as-is until rationalization, so no "real" changes:

```
------------ BB01 [000..00A) (return), preds={} succs={}
               [000007] -----------                            IL_OFFSET void   INLRT @ 0x000[E-]
N001 (  3,  2) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0
                                                            /--*  t0     int
N002 (  4,  4) [000006] ---------U-                    t6 = *  CAST      long <- uint
                                                            /--*  t6     long
N003 ( 10, 10) [000001] -----------                    t1 = *  CAST      double <- long
                                                            /--*  t1     double
N004 ( 16, 16) [000002] -----------                    t2 = *  CAST      float <- double
N005 (  3,  2) [000003] -----------                    t3 =    CNS_DBL   float  8.2258062362670898
                                                            /--*  t2     float
                                                            +--*  t3     float
N006 ( 24, 22) [000004] -----------                    t4 = *  MUL       float
                                                            /--*  t4     float
N007 ( 25, 23) [000005] -----------                         *  RETURN    float

-------------------------------------------------------------------------------------------------------------------
```

Although earlier I wondered whether the representation isn't "wrong"/"wasteful" and whether we couldn't take care of things at lowering, from reading the JIT overview it seems that by the lowering phase things are almost "set in stone", i.e. we're going to emit machine code based on the LIR. So maybe it really is the HIR we need to be changing, at one of the earlier stages (morphing?).

2. Does the IR need to have been changed to uint -> long -> f32 (if applicable) before lowering is reached?

@tannergooding your comment seems generally applicable, but assuming what's needed here is a morph (removing the redundant long -> double), I don't see if/how it's relevant. In `CodeGen::genIntToFloatCast`, I see the following check:

```cpp
#if !defined(TARGET_64BIT)
    // We expect morph to replace long to float/double casts with helper calls
    noway_assert(!varTypeIsLong(srcType));
#endif // !defined(TARGET_64BIT)
```

Therefore, a morph optimization should be safe, right? (If I add the morph before the helper-calls morph, the leftover long -> float will get replaced. If I add it after, we're fine, because finding the long -> double -> float pattern implicitly means we're not on 32-bit.)

YoniFeng commented 1 year ago

@EgorBo @TIHan would love to continue working on this, need some guidance :)

(@TIHan I'll re-iterate that you can tell me to drop working on this, given you already self-assigned it)

aturnbul commented 2 months ago

@TIHan, @YoniFeng Is this still open, and is anyone working on it right now? I'd be happy to work on it if a first-time open-source contributor might be helpful.