Open sivarv opened 7 years ago
The method in question is
const int IM = 139968;
const int IA = 3877;
const int IC = 29573;
static int seed = 42;
static double random (double max)
{
    return max * ((seed = (seed * IA + IC) % IM) * (1.0 / IM));
}
This ends up being 22 bytes of IL because CSC turns 1.0 / IM into a double-precision literal, 7.144490169181527e-006. So the method size is greater than the legacy policy's always-inline threshold.
The code noted above still appears in the fasta-2 variant. The caller is SelectRandom. Current PGO work (with the default policy) still does not inline the call to random; the call site has weight 1.0.
Also note that profile data suggests the subsequent loop is not very hot: its weight is 1.73, so it tends not to iterate much. (It is not yet clear whether this profile data is accurate late in the JIT pipeline, so I will drill into that and the inlining heuristics.)
; Assembly listing for method Fasta_2:SelectRandom(ref):ubyte
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 4297046
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 5, 4.73) ref -> rsi class-hnd
; V01 loc0 [V01,T04] ( 2, 2.73) double -> mm0
; V02 loc1 [V02,T01] ( 5, 4.92) int -> rax
; V03 OutArgs [V03 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V04 cse0 [V04,T02] ( 3, 4.46) ref -> rcx "CSE - aggressive"
; V05 cse1 [V05,T03] ( 6, 2.73) int -> rdx "CSE - aggressive"
;
; Lcl frame size = 32
G_M9962_IG01: ;; offset=0000H
56 push rsi
4883EC20 sub rsp, 32
C5F877 vzeroupper
488BF1 mov rsi, rcx
;; bbWeight=1 PerfScore 2.50
G_M9962_IG02: ;; offset=000BH
C5FB100555000000 vmovsd xmm0, qword ptr [reloc @RWD00]
E8206BD9FF call Fasta_2:random(double):double
33C0 xor eax, eax
8B5608 mov edx, dword ptr [rsi+8]
85D2 test edx, edx
7E1C jle SHORT G_M9962_IG05
;; bbWeight=1 PerfScore 6.50
G_M9962_IG03: ;; offset=0021H
4863C8 movsxd rcx, eax
488B4CCE10 mov rcx, gword ptr [rsi+8*rcx+16]
C5FB104908 vmovsd xmm1, qword ptr [rcx+8]
C5F92EC8 vucomisd xmm1, xmm0
7723 ja SHORT G_M9962_IG07
;; bbWeight=1.73 PerfScore 10.81
G_M9962_IG04: ;; offset=0034H
FFC0 inc eax
3BD0 cmp edx, eax
7FE7 jg SHORT G_M9962_IG03
;; bbWeight=0.73 PerfScore 1.09
G_M9962_IG05: ;; offset=003AH
8D42FF lea eax, [rdx-1]
3BC2 cmp eax, edx
731E jae SHORT G_M9962_IG09
FFCA dec edx
4863C2 movsxd rax, edx
488B44C610 mov rax, gword ptr [rsi+8*rax+16]
0FB64010 movzx rax, byte ptr [rax+16]
;; bbWeight=0 PerfScore 0.00
G_M9962_IG06: ;; offset=004FH
4883C420 add rsp, 32
5E pop rsi
C3 ret
;; bbWeight=0 PerfScore 0.00
G_M9962_IG07: ;; offset=0055H
0FB64110 movzx rax, byte ptr [rcx+16]
;; bbWeight=1 PerfScore 2.00
G_M9962_IG08: ;; offset=0059H
4883C420 add rsp, 32
5E pop rsi
C3 ret
;; bbWeight=1 PerfScore 1.75
G_M9962_IG09: ;; offset=005FH
E84C3FA45F call CORINFO_HELP_RNGCHKFAIL
CC int3
;; bbWeight=0 PerfScore 0.00
RWD00 dq 3FF0000000000000h ; 1
; Total bytes of code 101, prolog size 11, PerfScore 35.06, instruction count 34 (MethodHash=a6f4d915) for method Fasta_2:SelectRandom(ref):ubyte
[edit: clarified the call site is not in a loop so no extra boost expected with PGO]
Importer profile data shows loop is indeed not very hot and 1.73 is the right weight.
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd weight IBC lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 4976k 4975866 [000..013)-> BB05 (always) i IBC
BB02 [0001] 1 8574k 8574455 [013..01E)-> BB04 ( cond ) i idxlen bwd bwd-target IBC
BB03 [0002] 1 4976k 4976439 [01E..027) (return) i idxlen IBC
BB04 [0003] 1 3599k 3598662 [027..02B) i bwd IBC
BB05 [0004] 2 8576k 8575838 [02B..031)-> BB02 ( cond ) i idxlen bwd IBC
BB06 [0005] 1 0 0 [031..03E) (return) i rare idxlen IBC
-----------------------------------------------------------------------------------------------------------------------------------------
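As a quick cross-check of the loop weight, the raw IBC counts in the table can be divided by the entry block count. This is only a sketch: the helper name `Normalize` is hypothetical, and treating the displayed weight as "block count over entry count" is an assumption about how the numbers relate, not a copy of the JIT's code.

```csharp
using System;

// Raw IBC counts copied from the importer table above.
long entryCount = 4975866;   // BB01, method entry
long loopHead   = 8574455;   // BB02, loop body with the compare
long loopBack   = 3598662;   // BB04, increment and back edge

// Hypothetical helper: scale a raw block count by the method entry count.
static double Normalize(long blockCount, long entry) => (double)blockCount / entry;

Console.WriteLine($"BB02 weight ~ {Normalize(loopHead, entryCount):F2}");
Console.WriteLine($"BB04 weight ~ {Normalize(loopBack, entryCount):F2}");
```

With these counts the ratios come out near 1.72 and 0.72, in line with the 1.72 to 1.73 and 0.72 to 0.73 block weights shown in the assembly listings, so the loop body runs fewer than two times per call on average.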
Default heuristic (no PGO)
Invoking compiler for the inlinee method Fasta_2:random(double):double
...
Inline candidate callsite is boring. Multiplier increased to 1.3.
calleeNativeSizeEstimate=525
callsiteNativeSizeEstimate=85
benefit multiplier=1.3
threshold=110
Native estimate for function size exceeds threshold for inlining 52.5 > 11 (multiplier = 1.3)
Inline expansion aborted, inline not profitable
Default heuristic (Tiered PGO). Here "warm" just means a nonzero profile count. Note this requires QJFL=1, or SelectRandom will bypass tiering. We might consider adjusting the default heuristic, as it only considers a site HOT if the call site profile weight is BB_MAX_WEIGHT, which (given #44983) is now very unlikely.
Inline candidate callsite is warm. Multiplier increased to 2.
calleeNativeSizeEstimate=525
callsiteNativeSizeEstimate=85
benefit multiplier=2
threshold=170
Native estimate for function size exceeds threshold for inlining 52.5 > 17 (multiplier = 2)
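The arithmetic behind the two rejections can be reproduced from the logged numbers alone. The sketch below assumes the threshold is the call site estimate times the multiplier, truncated to an integer, and that all sizes are in tenths of a byte (so 525 reads as 52.5). That matches the logged values, but the helper `IsProfitable` is hypothetical and not the JIT's actual code.

```csharp
using System;

// Hypothetical helper mirroring the numbers in the two logs above.
// Sizes are in tenths of a byte: 525 means 52.5 bytes.
static bool IsProfitable(int calleeEstimate, int callsiteEstimate,
                         double multiplier, out int threshold)
{
    // Assumed rule: 85 * 1.3 truncates to 110, and 85 * 2 gives 170.
    threshold = (int)(callsiteEstimate * multiplier);
    return calleeEstimate <= threshold;
}

foreach (double m in new[] { 1.3, 2.0 })
{
    bool ok = IsProfitable(525, 85, m, out int t);
    Console.WriteLine($"multiplier={m} threshold={t} profitable={ok}");
}
```

Under that assumption, the callee estimate of 525 is far above both thresholds, and the multiplier would have to reach roughly 6.2 (525 / 85) before the default heuristic accepted this inline.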
New PGO heuristic (note call site frequency is 1, so no extra boost; also we predict this is a size decreasing inline):
Have profile data for call site...
Inline is profitable: benefit=0.195187 (perCall=0.195187, local=0.195187, global=1, size=-7.3)
Jump targets:
none
Computing inlinee profile scale:
call site count 4975866 callee entry count 5080000 scale 0.979501
Assembly for the PGO case is below. Note the size is definitely larger (167 bytes of code versus 101), despite the size-decrease prediction.
; Assembly listing for method Fasta_2:SelectRandom(ref):ubyte
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 7033212
; invoked as altjit
; Final local variable assignments
;
; V00 arg0 [V00,T00] ( 5, 4.72) ref -> rcx class-hnd
; V01 loc0 [V01,T07] ( 2, 2.72) double -> mm0
; V02 loc1 [V02,T04] ( 5, 4.88) int -> rax
; V03 OutArgs [V03 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V04 tmp1 [V04,T01] ( 3, 6 ) int -> r8 "dup spill"
; V05 tmp2 [V05,T02] ( 3, 6 ) int -> r8 "fgInsertCommaFormTemp is creating a new local variable"
; V06 cse0 [V06,T05] ( 3, 4.44) ref -> r8 "CSE - aggressive"
; V07 cse1 [V07,T06] ( 6, 2.72) int -> rdx "CSE - aggressive"
; V08 rat0 [V08,T03] ( 3, 6 ) int -> rdx "ReplaceWithLclVar is creating a new local variable"
;
; Lcl frame size = 40
G_M9962_IG01: ;; offset=0000H
4883EC28 sub rsp, 40
C5F877 vzeroupper
;; bbWeight=1 PerfScore 1.25
G_M9962_IG02: ;; offset=0007H
48B82410FFC1F87F0000 mov rax, 0x7FF8C1FF1024
446900250F0000 imul r8d, dword ptr [rax], 0xF25
4181C085730000 add r8d, 0x7385
BA8156F71D mov edx, 0x1DF75681
8BC2 mov eax, edx
41F7E8 imul edx:eax, r8d
8BC2 mov eax, edx
C1E81F shr eax, 31
C1FA0E sar edx, 14
03C2 add eax, edx
69C0C0220200 imul eax, eax, 0x222C0
442BC0 sub r8d, eax
48B82410FFC1F87F0000 mov rax, 0x7FF8C1FF1024
448900 mov dword ptr [rax], r8d
C5F857C0 vxorps xmm0, xmm0
C4C17B2AC0 vcvtsi2sd xmm0, r8d
C5FB590556000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD00]
33C0 xor eax, eax
8B5108 mov edx, dword ptr [rcx+8]
85D2 test edx, edx
7E1D jle SHORT G_M9962_IG05
;; bbWeight=1 PerfScore 26.83
G_M9962_IG03: ;; offset=0063H
4C63C0 movsxd r8, eax
4E8B44C110 mov r8, gword ptr [rcx+8*r8+16]
C4C17B104808 vmovsd xmm1, qword ptr [r8+8]
C5F92EC8 vucomisd xmm1, xmm0
7721 ja SHORT G_M9962_IG07
;; bbWeight=1.72 PerfScore 10.75
G_M9962_IG04: ;; offset=0077H
FFC0 inc eax
3BD0 cmp edx, eax
7FE6 jg SHORT G_M9962_IG03
;; bbWeight=0.72 PerfScore 1.08
G_M9962_IG05: ;; offset=007DH
8D42FF lea eax, [rdx-1]
3BC2 cmp eax, edx
731D jae SHORT G_M9962_IG09
FFCA dec edx
4863C2 movsxd rax, edx
488B44C110 mov rax, gword ptr [rcx+8*rax+16]
0FB64010 movzx rax, byte ptr [rax+16]
;; bbWeight=0 PerfScore 0.00
G_M9962_IG06: ;; offset=0092H
4883C428 add rsp, 40
C3 ret
;; bbWeight=0 PerfScore 0.00
G_M9962_IG07: ;; offset=0097H
410FB64010 movzx rax, byte ptr [r8+16]
;; bbWeight=1 PerfScore 2.00
G_M9962_IG08: ;; offset=009CH
4883C428 add rsp, 40
C3 ret
;; bbWeight=1 PerfScore 1.25
G_M9962_IG09: ;; offset=00A1H
E8EA3EA15F call CORINFO_HELP_RNGCHKFAIL
CC int3
;; bbWeight=0 PerfScore 0.00
RWD00 dq 3EDDF75680FEB65Fh ; 7.14449017e-06
; Total bytes of code 167, prolog size 7, PerfScore 60.16, instruction count 45 (MethodHash=a6f4d915) for method Fasta_2:SelectRandom(ref):ubyte
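Much of the extra code in IG02 is the inlined `% IM`, which the JIT expands into a multiply by a magic constant rather than a divide. The sketch below restates that sequence in C# to make the listing easier to follow. The helper name `RemByMagic` is hypothetical, and the mapping of each line to an instruction is my reading of the listing.

```csharp
using System;

const int IM = 139968;          // 0x222C0 in the listing
const int Magic = 0x1DF75681;   // ceil(2^46 / IM)

// Hypothetical helper: x % IM computed without a divide instruction.
int RemByMagic(int x)
{
    long product = (long)x * Magic;                     // imul edx:eax, r8d
    int hi = (int)(product >> 32);                      // high half (edx)
    int quotient = (hi >> 14) + (int)((uint)hi >> 31);  // sar edx, 14 ; shr eax, 31 ; add
    return x - quotient * IM;                           // imul eax, eax, 0x222C0 ; sub r8d, eax
}

int seed = 42;
for (int i = 0; i < 5; i++)
{
    int next = seed * 3877 + 29573;   // IA = 0xF25, IC = 0x7385
    Console.WriteLine($"{next % IM} == {RemByMagic(next)}");
    seed = next % IM;
}
```

The two columns printed by the loop should match on every line, since the multiply-and-shift sequence computes the same truncated remainder as `%` for 32-bit inputs.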
Profile data shows this is indeed where all the time is spent. For default:
Jitted code : 97.35% 4.56E+07 samples
02.37% 1.11E+06 ? Unknown
42.75% 2.003E+07 FullOpt [MicroBenchmarks]Fasta_2.SelectRandom(class Frequency[])
40.70% 1.907E+07 Tier-1 [MicroBenchmarks]Fasta_2.random(float64)
09.99% 4.68E+06 FullOpt [MicroBenchmarks]Fasta_2.MakeRandomFasta(class System.String,class System.String,class Frequency[],int32,class System.IO.Stream)
03.84% 1.8E+06 FullOpt [MicroBenchmarks]Fasta_2.MakeRepeatFasta(class System.String,class System.String,unsigned int8[],int32,class System.IO.Stream)
00.13% 6E+04 native ntoskrnl.exe
00.09% 4E+04 native coreclr.dll
and with PGO/inlining
Jitted code : 99.80% 4.47E+07 samples
95.62% 4.278E+07 Tier-1 [MicroBenchmarks]Fasta_2.MakeRandomFasta(class System.String,class System.String,class Frequency[],int32,class System.IO.Stream)
04.00% 1.79E+06 Tier-1 [MicroBenchmarks]Fasta_2.MakeRepeatFasta(class System.String,class System.String,unsigned int8[],int32,class System.IO.Stream)
00.13% 6E+04 native ntoskrnl.exe
00.07% 3E+04 native coreclr.dll
Per BDN the PGO inline version is not consistently faster in cycles. Should revisit once #44370 is merged.
At any rate, PGO inlining does inline the hot methods here.
Looking at instructions retired (per BDN)
;; default
Jit-generated code: 97.13% 2.74E+10 samples
Jitted code : 97.13% 2.74E+10 samples
MinOpts code : 00.00% 0 samples
FullOpts code : 69.78% 1.97E+10 samples
Tier-0 code : 00.00% 0 samples
Tier-1 code : 27.35% 7.72E+09 samples
R2R code : 00.00% 0 samples
02.18% 6.14E+08 ? Unknown
50.69% 1.431E+10 FullOpt [MicroBenchmarks]Fasta_2.SelectRandom(class Frequency[])
27.27% 7.694E+09 Tier-1 [MicroBenchmarks]Fasta_2.random(float64)
11.66% 3.289E+09 FullOpt [MicroBenchmarks]Fasta_2.MakeRandomFasta(class System.String,class System.String,class Frequency[],int32,class System.IO.Stream)
07.42% 2.094E+09 FullOpt [MicroBenchmarks]Fasta_2.MakeRepeatFasta(class System.String,class System.String,unsigned int8[],int32,class System.IO.Stream)
00.67% 1.9E+08 native ntoskrnl.exe
;; TieredPGO / QJFL / Pgo inlining
Jit-generated code: 99.34% 2.43E+10 samples
Jitted code : 99.34% 2.43E+10 samples
MinOpts code : 00.00% 0 samples
FullOpts code : 00.00% 0 samples
Tier-0 code : 00.00% 0 samples
Tier-1 code : 99.34% 2.43E+10 samples
R2R code : 00.00% 0 samples
89.18% 2.177E+10 Tier-1 [MicroBenchmarks]Fasta_2.MakeRandomFasta(class System.String,class System.String,class Frequency[],int32,class System.IO.Stream)
10.02% 2.447E+09 Tier-1 [MicroBenchmarks]Fasta_2.MakeRepeatFasta(class System.String,class System.String,unsigned int8[],int32,class System.IO.Stream)
00.62% 1.52E+08 native ntoskrnl.exe
So the PGO version retires about 2.43/(2.73 + 0.06) = 0.87 times as many instructions, i.e. roughly a 13% reduction, with (in this run) a similar reduction in cycles. So we'd expect about a 10% improvement overall.
The CqPerf version of this benchmark is very sensitive to inlining. Forcing inlining of random() would cause RyuJIT to beat legacy Jit64 by 7.4% in execution perf. The model policy inlines random() into SelectRandom().
category:cq theme:inlining skill-level:expert cost:large
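For anyone wanting to reproduce the forced-inline comparison from source, one option is the `MethodImplOptions.AggressiveInlining` hint. The issue does not say how the inline was forced in the original experiment, so this is only one possible way to do it. The class name `Fasta2Sketch` is a hypothetical stand-in for the benchmark class, and the only change to the method is the attribute.

```csharp
using System;
using System.Runtime.CompilerServices;

Console.WriteLine(Fasta2Sketch.Random(1.0));

// Hypothetical stand-in for the benchmark class.
static class Fasta2Sketch
{
    const int IM = 139968;
    const int IA = 3877;
    const int IC = 29573;
    static int seed = 42;

    // Asks the JIT to inline this method even though its IL size
    // is above the always-inline threshold.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static double Random(double max)
    {
        return max * ((seed = (seed * IA + IC) % IM) * (1.0 / IM));
    }
}
```

The attribute is a request rather than a guarantee, so it is worth confirming in the disassembly of SelectRandom that the call has actually gone away.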