dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.97k stars 4.66k forks source link

JIT: PGO: fgOptimizeUncondBranchToSimpleCond corrupts profile data #50419

Open EgorBo opened 3 years ago

EgorBo commented 3 years ago

It looks like fgOptimizeUncondBranchToSimpleCond (as part of the Optimize Layouts phase) produces invalid weights in some cases. E.g.

Before Optimize layout: image

^ all edges look valid and satisfy Kirchhoff's Law.

After Optimize Layouts:

image

^ invalid edges: BB06 (in: 0 + 0.5, out: 1) BB09 (in: 0.25, out: 0)

I guess invalid edges are fine and will be re-calculated, but there are also invalid blocks such as BB09 which is a predecessor of a cold block but has weight 0.5 (BB05 should also be cold then?)

As the result, in the end I have:

; with IBC profile data, edge weights are invalid, and fgCalledCount is 36

But if I disable fgOptimizeUncondBranchToSimpleCond the final weights will be reported as valid:

image

^ "After Optimize Layout" without fgOptimizeUncondBranchToSimpleCond, all blocks look valid again (edges aren't though).

/cc @AndyAyersMS

category:correctness theme:profile-feedback skill-level:beginner cost:small impact:small

AndyAyersMS commented 3 years ago

optOptimizeLayout as a whole has similar problems -- don't recall if I tracked down which bits were responsible. See https://github.com/dotnet/runtime/pull/48364#issuecomment-780129816 and #48476.

EgorBo commented 3 years ago

Simple repro:

using System.Runtime.CompilerServices;
using System.Threading;

class Program
{
    static void Main()
    {
        for (int i = 0; i < 100; i++)
        {
            Test(new object());
            Thread.Sleep(16);
        }
    }

    static object someObj;

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Test(object o)
    {
        if (o != someObj)
            o = null;

        if (o != null)
            Test1();
    }

    [MethodImpl(MethodImplOptions.NoInlining)] static void Test1() { }
}

codegen for Test (tier1 with TieredPGO):

; Assembly listing for method Program:Test(System.Object)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; partially interruptible
; with IBC profile data, edge weights are invalid, and fgCalledCount is 36
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )     ref  ->  rcx         class-hnd
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;
; Lcl frame size = 40

G_M3485_IG01:              ;; offset=0000H
       4883EC28             sub      rsp, 40
                        ;; bbWeight=1    PerfScore 0.25
G_M3485_IG02:              ;; offset=0004H
       48B8402D9FBF06020000 mov      rax, 0x206BF9F2D40
       483B08               cmp      rcx, gword ptr [rax]
       7505                 jne      SHORT G_M3485_IG04
                        ;; bbWeight=1    PerfScore 3.25
G_M3485_IG03:              ;; offset=0013H
       4885C9               test     rcx, rcx
       7505                 jne      SHORT G_M3485_IG05
                        ;; bbWeight=1    PerfScore 1.25
G_M3485_IG04:              ;; offset=0018H
       4883C428             add      rsp, 40
       C3                   ret      
                        ;; bbWeight=1    PerfScore 1.25
G_M3485_IG05:              ;; offset=001DH
       E8CEEAAFFF           call     Program:Test1()
       EBF4                 jmp      SHORT G_M3485_IG04
                        ;; bbWeight=0    PerfScore 0.00

image

EgorBo commented 3 years ago

A minor issue, doesn't affect perf/correctness so moving to 7.0