RyuJIT: Fasta benchmark: hot method random() is not in-lined by legacy policy into SelectRandom()

sivarv commented 7 years ago

CqPerf version of this benchmark is very sensitive to in-lining. Forcing inline of random() would cause RyuJIT to beat Legacy Jit64 by 7.4% in execution perf. Model policy in-lines random() method into SelectRandom(). category:cq theme:inlining skill-level:expert cost:large

AndyAyersMS commented 7 years ago

The method in question is

    const int IM = 139968;
    const int IA = 3877;
    const int IC = 29573;
    static int seed = 42;

    static double random (double max)
    {
        return max * ((seed = (seed * IA + IC) % IM) * (1.0 / IM));
    }

This ends up being 22 bytes of IL because CSC turns 1.0 / IM into a double precision literal 7.144490169181527e-006. So the method size is greater than the legacy policy's always inline threshold.

AndyAyersMS commented 3 years ago

The code noted above still appears in the fasta-2 variant. The caller is SelectRandom. Current PGO work (w/ default policy) still does not inline the call to random -- the method call site has weight 1.0.

Also note profile data suggests the subsequent loop is not very hot, with weight 1.73 so tends not to iterate much (not clear yet if this profile data is accurate late in the jit pipeline, so will drill into that and the inlining heuristics).

; Assembly listing for method Fasta_2:SelectRandom(ref):ubyte
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 4297046
; invoked as altjit
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  5,  4.73)     ref  ->  rsi         class-hnd
;  V01 loc0         [V01,T04] (  2,  2.73)  double  ->  mm0        
;  V02 loc1         [V02,T01] (  5,  4.92)     int  ->  rax        
;  V03 OutArgs      [V03    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V04 cse0         [V04,T02] (  3,  4.46)     ref  ->  rcx         "CSE - aggressive"
;  V05 cse1         [V05,T03] (  6,  2.73)     int  ->  rdx         "CSE - aggressive"
;
; Lcl frame size = 32

G_M9962_IG01:              ;; offset=0000H
       56                   push     rsi
       4883EC20             sub      rsp, 32
       C5F877               vzeroupper 
       488BF1               mov      rsi, rcx
                        ;; bbWeight=1    PerfScore 2.50
G_M9962_IG02:              ;; offset=000BH
       C5FB100555000000     vmovsd   xmm0, qword ptr [reloc @RWD00]
       E8206BD9FF           call     Fasta_2:random(double):double
       33C0                 xor      eax, eax
       8B5608               mov      edx, dword ptr [rsi+8]
       85D2                 test     edx, edx
       7E1C                 jle      SHORT G_M9962_IG05
                        ;; bbWeight=1    PerfScore 6.50
G_M9962_IG03:              ;; offset=0021H
       4863C8               movsxd   rcx, eax
       488B4CCE10           mov      rcx, gword ptr [rsi+8*rcx+16]
       C5FB104908           vmovsd   xmm1, qword ptr [rcx+8]
       C5F92EC8             vucomisd xmm1, xmm0
       7723                 ja       SHORT G_M9962_IG07
                        ;; bbWeight=1.73 PerfScore 10.81
G_M9962_IG04:              ;; offset=0034H
       FFC0                 inc      eax
       3BD0                 cmp      edx, eax
       7FE7                 jg       SHORT G_M9962_IG03
                        ;; bbWeight=0.73 PerfScore 1.09
G_M9962_IG05:              ;; offset=003AH
       8D42FF               lea      eax, [rdx-1]
       3BC2                 cmp      eax, edx
       731E                 jae      SHORT G_M9962_IG09
       FFCA                 dec      edx
       4863C2               movsxd   rax, edx
       488B44C610           mov      rax, gword ptr [rsi+8*rax+16]
       0FB64010             movzx    rax, byte  ptr [rax+16]
                        ;; bbWeight=0    PerfScore 0.00
G_M9962_IG06:              ;; offset=004FH
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      
                        ;; bbWeight=0    PerfScore 0.00
G_M9962_IG07:              ;; offset=0055H
       0FB64110             movzx    rax, byte  ptr [rcx+16]
                        ;; bbWeight=1    PerfScore 2.00
G_M9962_IG08:              ;; offset=0059H
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret      
                        ;; bbWeight=1    PerfScore 1.75
G_M9962_IG09:              ;; offset=005FH
       E84C3FA45F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     
                        ;; bbWeight=0    PerfScore 0.00
RWD00   dq  3FF0000000000000h   ;            1

; Total bytes of code 101, prolog size 11, PerfScore 35.06, instruction count 34 (MethodHash=a6f4d915) for method Fasta_2:SelectRandom(ref):ubyte

[edit: clarified the call site is not in a loop so no extra boost expected with PGO]

AndyAyersMS commented 3 years ago

Importer profile data shows loop is indeed not very hot and 1.73 is the right weight.

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight       IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                            4976k  4975866    [000..013)-> BB05 (always)                     i IBC 
BB02 [0001]  1                            8574k  8574455    [013..01E)-> BB04 ( cond )                     i idxlen bwd bwd-target IBC 
BB03 [0002]  1                            4976k  4976439    [01E..027)        (return)                     i idxlen IBC 
BB04 [0003]  1                            3599k  3598662    [027..02B)                                     i bwd IBC 
BB05 [0004]  2                            8576k  8575838    [02B..031)-> BB02 ( cond )                     i idxlen bwd IBC 
BB06 [0005]  1                             0           0    [031..03E)        (return)                     i rare idxlen IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

Default heuristic (no PGO)

Invoking compiler for the inlinee method Fasta_2:random(double):double
...
Inline candidate callsite is boring.  Multiplier increased to 1.3.
calleeNativeSizeEstimate=525
callsiteNativeSizeEstimate=85
benefit multiplier=1.3
threshold=110
Native estimate for function size exceeds threshold for inlining 52.5 > 11 (multiplier = 1.3)

Inline expansion aborted, inline not profitable

Default heuristic (Tiered PGO). Here "warm" just means nonzero profile count. Note this requires QJFL=1 or SelectRandom will bypass tiering. We might consider adjusting the default heuristic as it only considers a site HOT if the calls site profile weight is BB_MAX_WEIGHT which (given #44983) is now very unlikely.

Inline candidate callsite is warm.  Multiplier increased to 2.
calleeNativeSizeEstimate=525
callsiteNativeSizeEstimate=85
benefit multiplier=2
threshold=170
Native estimate for function size exceeds threshold for inlining 52.5 > 17 (multiplier = 2)

New PGO heuristic (note call site frequency is 1, so no extra boost; also we predict this is a size decreasing inline):

Have profile data for call site...
Inline is profitable: benefit=0.195187 (perCall=0.195187, local=0.195187, global=1, size=-7.3)
Jump targets:
  none
Computing inlinee profile scale:
   call site count 4975866 callee entry count 5080000 scale 0.979501

assembly for the PGO case below, note size is definitely larger.

; Assembly listing for method Fasta_2:SelectRandom(ref):ubyte
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 7033212
; invoked as altjit
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  5,  4.72)     ref  ->  rcx         class-hnd
;  V01 loc0         [V01,T07] (  2,  2.72)  double  ->  mm0        
;  V02 loc1         [V02,T04] (  5,  4.88)     int  ->  rax        
;  V03 OutArgs      [V03    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V04 tmp1         [V04,T01] (  3,  6   )     int  ->   r8         "dup spill"
;  V05 tmp2         [V05,T02] (  3,  6   )     int  ->   r8         "fgInsertCommaFormTemp is creating a new local variable"
;  V06 cse0         [V06,T05] (  3,  4.44)     ref  ->   r8         "CSE - aggressive"
;  V07 cse1         [V07,T06] (  6,  2.72)     int  ->  rdx         "CSE - aggressive"
;  V08 rat0         [V08,T03] (  3,  6   )     int  ->  rdx         "ReplaceWithLclVar is creating a new local variable"
;
; Lcl frame size = 40

G_M9962_IG01:              ;; offset=0000H
       4883EC28             sub      rsp, 40
       C5F877               vzeroupper 
                        ;; bbWeight=1    PerfScore 1.25
G_M9962_IG02:              ;; offset=0007H
       48B82410FFC1F87F0000 mov      rax, 0x7FF8C1FF1024
       446900250F0000       imul     r8d, dword ptr [rax], 0xF25
       4181C085730000       add      r8d, 0x7385
       BA8156F71D           mov      edx, 0x1DF75681
       8BC2                 mov      eax, edx
       41F7E8               imul     edx:eax, r8d
       8BC2                 mov      eax, edx
       C1E81F               shr      eax, 31
       C1FA0E               sar      edx, 14
       03C2                 add      eax, edx
       69C0C0220200         imul     eax, eax, 0x222C0
       442BC0               sub      r8d, eax
       48B82410FFC1F87F0000 mov      rax, 0x7FF8C1FF1024
       448900               mov      dword ptr [rax], r8d
       C5F857C0             vxorps   xmm0, xmm0
       C4C17B2AC0           vcvtsi2sd  xmm0, r8d
       C5FB590556000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD00]
       33C0                 xor      eax, eax
       8B5108               mov      edx, dword ptr [rcx+8]
       85D2                 test     edx, edx
       7E1D                 jle      SHORT G_M9962_IG05
                        ;; bbWeight=1    PerfScore 26.83
G_M9962_IG03:              ;; offset=0063H
       4C63C0               movsxd   r8, eax
       4E8B44C110           mov      r8, gword ptr [rcx+8*r8+16]
       C4C17B104808         vmovsd   xmm1, qword ptr [r8+8]
       C5F92EC8             vucomisd xmm1, xmm0
       7721                 ja       SHORT G_M9962_IG07
                        ;; bbWeight=1.72 PerfScore 10.75
G_M9962_IG04:              ;; offset=0077H
       FFC0                 inc      eax
       3BD0                 cmp      edx, eax
       7FE6                 jg       SHORT G_M9962_IG03
                        ;; bbWeight=0.72 PerfScore 1.08
G_M9962_IG05:              ;; offset=007DH
       8D42FF               lea      eax, [rdx-1]
       3BC2                 cmp      eax, edx
       731D                 jae      SHORT G_M9962_IG09
       FFCA                 dec      edx
       4863C2               movsxd   rax, edx
       488B44C110           mov      rax, gword ptr [rcx+8*rax+16]
       0FB64010             movzx    rax, byte  ptr [rax+16]
                        ;; bbWeight=0    PerfScore 0.00
G_M9962_IG06:              ;; offset=0092H
       4883C428             add      rsp, 40
       C3                   ret      
                        ;; bbWeight=0    PerfScore 0.00
G_M9962_IG07:              ;; offset=0097H
       410FB64010           movzx    rax, byte  ptr [r8+16]
                        ;; bbWeight=1    PerfScore 2.00
G_M9962_IG08:              ;; offset=009CH
       4883C428             add      rsp, 40
       C3                   ret      
                        ;; bbWeight=1    PerfScore 1.25
G_M9962_IG09:              ;; offset=00A1H
       E8EA3EA15F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     
                        ;; bbWeight=0    PerfScore 0.00
RWD00   dq  3EDDF75680FEB65Fh   ; 7.14449017e-06

; Total bytes of code 167, prolog size 7, PerfScore 60.16, instruction count 45 (MethodHash=a6f4d915) for method Fasta_2:SelectRandom(ref):ubyte

Profile data shows this is indeed where all the time is spent: For default:

  Jitted code     : 97.35% 4.56E+07 samples

02.37%   1.11E+06    ?        Unknown
42.75%   2.003E+07   FullOpt  [MicroBenchmarks]Fasta_2.SelectRandom(class Frequency[])
40.70%   1.907E+07   Tier-1   [MicroBenchmarks]Fasta_2.random(float64)
09.99%   4.68E+06    FullOpt  [MicroBenchmarks]Fasta_2.MakeRandomFasta(class System.String,class System.String,class Frequency[],int32,class System.IO.Stream)
03.84%   1.8E+06     FullOpt  [MicroBenchmarks]Fasta_2.MakeRepeatFasta(class System.String,class System.String,unsigned int8[],int32,class System.IO.Stream)
00.13%   6E+04       native   ntoskrnl.exe
00.09%   4E+04       native   coreclr.dll

and with PGO/inlining

  Jitted code     : 99.80% 4.47E+07 samples

95.62%   4.278E+07   Tier-1   [MicroBenchmarks]Fasta_2.MakeRandomFasta(class System.String,class System.String,class Frequency[],int32,class System.IO.Stream)
04.00%   1.79E+06    Tier-1   [MicroBenchmarks]Fasta_2.MakeRepeatFasta(class System.String,class System.String,unsigned int8[],int32,class System.IO.Stream)
00.13%   6E+04       native   ntoskrnl.exe
00.07%   3E+04       native   coreclr.dll

Per BDN the PGO inline version is not consistently faster in cycles. Should revisit once #44370 is merged.

AndyAyersMS commented 3 years ago

At any rate, PGO inlining does inline the hot methods here.

AndyAyersMS commented 3 years ago

Looking at instructions retired (per BDN)

;; default
Jit-generated code: 97.13% 2.74E+10 samples
  Jitted code     : 97.13% 2.74E+10 samples
  MinOpts code    : 00.00% 0        samples
  FullOpts code   : 69.78% 1.97E+10 samples
  Tier-0 code     : 00.00% 0        samples
  Tier-1 code     : 27.35% 7.72E+09 samples
  R2R code        : 00.00% 0        samples

02.18%   6.14E+08    ?        Unknown
50.69%   1.431E+10   FullOpt  [MicroBenchmarks]Fasta_2.SelectRandom(class Frequency[])
27.27%   7.694E+09   Tier-1   [MicroBenchmarks]Fasta_2.random(float64)
11.66%   3.289E+09   FullOpt  [MicroBenchmarks]Fasta_2.MakeRandomFasta(class System.String,class System.String,class Frequency[],int32,class System.IO.Stream)
07.42%   2.094E+09   FullOpt  [MicroBenchmarks]Fasta_2.MakeRepeatFasta(class System.String,class System.String,unsigned int8[],int32,class System.IO.Stream)
00.67%   1.9E+08     native   ntoskrnl.exe

;; TieredPGO / QJFL / Pgo inlining

Jit-generated code: 99.34% 2.43E+10 samples
  Jitted code     : 99.34% 2.43E+10 samples
  MinOpts code    : 00.00% 0        samples
  FullOpts code   : 00.00% 0        samples
  Tier-0 code     : 00.00% 0        samples
  Tier-1 code     : 99.34% 2.43E+10 samples
  R2R code        : 00.00% 0        samples

89.18%   2.177E+10   Tier-1   [MicroBenchmarks]Fasta_2.MakeRandomFasta(class System.String,class System.String,class Frequency[],int32,class System.IO.Stream)
10.02%   2.447E+09   Tier-1   [MicroBenchmarks]Fasta_2.MakeRepeatFasta(class System.String,class System.String,unsigned int8[],int32,class System.IO.Stream)
00.62%   1.52E+08    native   ntoskrnl.exe

So about 2.43/(2.73 + 0.06) = 0.87 reduction in instructions, and (in this run) similar reduction in cycles. So we'd expect about a 10% improvement overall.

dotnet / runtime

RyuJIT: Fasta benchmark: hot method random() is not in-lined by legacy policy into SelectRandom() #7311