AndyAyersMS opened 4 years ago
And if you want to try this out (x64 only):
COMPlus_TC_QuickJitForLoops=1
COMPlus_TC_OnStackReplacement=1
By default a transition requires ~10,000 executions of a patchpoint. Unlike tiered compilation, transitions are currently synchronous, so there's no lag/delay once this threshold is reached.
To enable "aggressive OSR" -- where an OSR method is created the first time a patchpoint is encountered, also set
COMPlus_TC_OnStackReplacement_InitialCounter=1
COMPlus_OSR_HitLimit=1
To see OSR log messages from the runtime, also add
COMPlus_LogEnable=1
COMPlus_LogFacility=0x00400000
COMPlus_LogLevel=4
COMPlus_LogToConsole=1
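For reference, here's a minimal sketch (not taken from the runtime tests) of the kind of method that exercises this path: under the settings above the loop runs long enough in Tier0 code to hit the patchpoint threshold and transition to an OSR method during the same invocation.

```csharp
// Minimal sketch: a single long-running loop that would hit a loop
// patchpoint ~10,000 times under Tier0 and then transition to an OSR
// method mid-invocation (assuming QuickJitForLoops + OnStackReplacement
// are enabled as above).
using System;

class OsrDemo
{
    static int Main()
    {
        long sum = 0;
        for (int i = 0; i < 100_000_000; i++)
        {
            sum += i & 7; // cheap body; the point is the iteration count
        }
        Console.WriteLine(sum);
        return 100;
    }
}
```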
Try regions

As noted above and elsewhere (#35687) the jit has a strong assumption that control can only enter a try region by branching to the first block in the try. If there's an OSR patchpoint within the try, we currently branch directly from the prolog to the middle of the try, breaking this assumption.
The idea here is to add extra control flow to the starts of any try region that encloses the patchpoint, to conditionally transfer control from outside the try to the appropriate point. So all control flow still enters the try at the first block. If we're coming from method entry, we jump to the next enclosing try or the patchpoint as appropriate.
For example, suppose we have the following (the `Sn` are arbitrary blocks of code):
S0;
try {
    S1;
    try {
        S2;
        for (...) {}
    }
}
The `for` will inspire a patchpoint that is nested within two try regions. We would transform this as follows:
_firstCall = 1;
goto pt1;
S0;
pt1:
try {
    if (_firstCall == 1) goto pt2;
    S1;
    pt2:
    try {
        if (_firstCall == 1) goto pp;
        S2;
        pp:
        _firstCall = 0;
        for (...)
    }
}
Because of the conditional control flow we'll end up importing some `Sn` blocks that we currently skip. If it turns out the various `Sn` are unreachable (which should be common) we should be able to optimize them away.
Since we can detect the try-nesting of a patchpoint when we're asked to create an OSR method we won't need extra info passed in the patchpoint records from Tier0. We just apply this transformation automatically if we see the patchpoint is within a try.
Note patchpoints within handlers are still problematic, largely because of funclets. So if we see loops in catches or finallies we might still want to have the method bypass Tier0. These should be rare.
Started working on the try region fix noted just above. Seems like we ought to wait until after the importer has done its "transitive closure" to add the OSR inspired try region "step blocks" -- if we add them before importation then we will make all the code within tries reachable and this may be hard to undo.
If after importation an enclosing try's entry block is unreachable then the step block can unconditionally transfer to the enclosed try entry and/or actual entry point.
Still checking if it's ok for a try entry to have multiple preds or if we actually expect there to be a unique fall through pred.
Here's an example from the tailrecursetry test case with my current changes main..FixMidTryPatchpoint.
This is adding the new flow pre-importation; as noted above that might not be the best approach overall. Seems like perhaps we should defer this until `fgRemoveEmptyBlocks` to avoid wrangling with the importer, and then we can introduce conditional flow only if needed.
;; before (BB10->BB03 is a "mid try entry")
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd weight lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB10 [0009] 1 1 [???..???)-> BB03 (always) i internal
BB01 [0000] 1 1 [000..002) i bwd bwd-target
BB02 [0001] 1 0 1 [002..006)-> BB04 (always) T0 try { keep i try bwd
BB03 [0002] 2 0 1 [006..00E) T0 i bwd bwd-target
BB04 [0003] 2 0 1 [00E..012)-> BB03 ( cond ) T0 i bwd
BB05 [0004] 1 0 1 [012..014)-> BB07 (always) T0 } i bwd
BB06 [0005] 1 0 1 [014..017)-> BB07 ( cret ) H0 catch { } keep i bwd
BB07 [0006] 2 1 [017..01B)-> BB09 ( cond ) i bwd
BB08 [0007] 1 1 [01B..01D) (return) i
BB09 [0008] 1 1 [01D..02D) (return) i
-----------------------------------------------------------------------------------------------------------------------------------------
;; after (BB12 -> BB03 steps in via try entry BB02)
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd weight lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB12 [0011] 1 1 [???..???)-> BB02 (always) i internal
BB01 [0000] 1 1 [000..002) i bwd bwd-target
BB02 [0001] 2 0 1 [???..???)-> BB03 ( cond ) T0 try { keep i internal try bwd
BB11 [0010] 1 0 1 [002..006)-> BB04 (always) T0 keep i bwd
BB03 [0002] 2 0 1 [???..???) T0 i internal bwd bwd-target
BB10 [0009] 1 0 1 [006..00E) T0 i bwd
BB04 [0003] 2 0 1 [00E..012)-> BB03 ( cond ) T0 i bwd
BB05 [0004] 1 0 1 [012..014)-> BB07 (always) T0 } i bwd
BB06 [0005] 1 0 1 [014..017)-> BB07 ( cret ) H0 catch { } keep i bwd
BB07 [0006] 2 1 [017..01B)-> BB09 ( cond ) i bwd
BB08 [0007] 1 1 [01B..01D) (return) i
BB09 [0008] 1 1 [01D..02D) (return) i
-----------------------------------------------------------------------------------------------------------------------------------------
If a method has multiple patchpoints and multiple active threads, we can trigger an "OSR storm" of sorts:
Compiling 157 Program::fannkuch, IL size = 708, hash=0x19770f83 Tier0
Compiling 158 Microsoft.FSharp.Collections.ArrayModule::Create, IL size = 105, hash=0xb95c3591 Tier0
Compiling 159 Program::direct@15, IL size = 183, hash=0x7e81a925 Tier0
Compiling 160 Program::fannkuch, IL size = 708, hash=0x19770f83 Tier1-OSR @0x150
Compiling 160 Program::fannkuch, IL size = 708, hash=0x19770f83 Tier1-OSR @0x119
Compiling 160 Program::fannkuch, IL size = 708, hash=0x19770f83 Tier1-OSR @0x137
Compiling 160 Program::fannkuch, IL size = 708, hash=0x19770f83 Tier1-OSR @0x21f
Compiling 160 Program::fannkuch, IL size = 708, hash=0x19770f83 Tier1-OSR @0x25d
Compiling 160 Program::fannkuch, IL size = 708, hash=0x19770f83 Tier1-OSR @0xe5
This seems kind of unfortunate, but not sure there's any easy fix.
One could imagine, say, deferring OSR generation for a method if there's already an OSR method being built at any patchpoint in the method, with the hope that other threads eventually hit and can use the one OSR method being created. But there's no guarantee that any of these threads will ever hit other patchpoints.
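A sketch of the deferral idea, with made-up names for the per-method bookkeeping (this is illustrative only, not how the runtime's tiered compilation manager is structured):

```csharp
// Hypothetical sketch of "defer if an OSR compile is already in flight for
// this method". A thread that loses the race keeps counting at its own
// patchpoint instead of queueing another OSR compile; the names here are
// invented for illustration.
using System.Collections.Concurrent;

static class OsrThrottle
{
    // methodHandle -> present while some patchpoint in that method is being compiled
    private static readonly ConcurrentDictionary<long, bool> s_inFlight = new();

    public static bool TryBeginOsrCompile(long methodHandle)
        => s_inFlight.TryAdd(methodHandle, true);   // false => defer, keep running Tier0

    public static void EndOsrCompile(long methodHandle)
        => s_inFlight.TryRemove(methodHandle, out _);
}
```

As noted above, though, there's no guarantee a deferred thread ever reaches the patchpoint whose OSR method is being built, so this is at best a partial mitigation.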
(1) Been looking into what would be required to enable QJFL=1 via OSR. There are still some methods that OSR can't handle, and some loops within methods that OSR can't handle.
So the proposal is that with QJFL=1 and OSR=1 (which would become the default on x64, say), and provided the jit can guarantee it can escape the Tier0 method via OSR, we'd jit at Tier0. If the jit can't guarantee this then we instead optimize. So a much larger set of methods would jit at Tier0.
We could reserve QJFL=2 say to mean "always jit at Tier0" regardless, if we wanted such capability.
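A rough sketch of that policy decision, with hypothetical names standing in for whatever the jit/runtime plumbing would actually be:

```csharp
// Hypothetical sketch of the proposed policy; canEscapeViaOsr is a stand-in
// for the jit's (not yet existing) guarantee that every loop in the method
// can reach a patchpoint and transition out of Tier0 code.
enum CompileMode { Tier0, FullOpt }

static class TieringPolicySketch
{
    public static CompileMode Choose(int quickJitForLoops, bool osrEnabled, bool canEscapeViaOsr)
    {
        if (quickJitForLoops == 2)
            return CompileMode.Tier0;   // "always jit at Tier0", regardless of OSR
        if (quickJitForLoops == 1 && osrEnabled && canEscapeViaOsr)
            return CompileMode.Tier0;   // loops can escape via OSR, so Tier0 is safe
        return CompileMode.FullOpt;     // otherwise switch to optimized code
    }
}
```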
Bailing out of Tier0+OSR because of the "method" checks is easy enough, as it can use the same pattern we already use for QJFL=0. But for the per-loop failures it's not as clear how to handle the fallback, as we need to import to detect the problem.
If these cases are rare enough, it could work like we do for minopts fallback, failing the Tier0 jit request with OSR set; then the driver could request a new jit with OSR not set.
Or we could try and cleanly architect a jit reset that bails from mid-import back to pre-import somehow, but there's a lot of ambient jit state around....
(2) For OSR + {stack,loc}alloc, we'd need to have the jit and gc info support the idea of having 3 pointers into the frame. One would point at the original method's frame base (to access the OSR state), one at the OSR method's frame base (to access any state it introduces, eg spilled temps), and one to refer to the outgoing args area. The separation between these 3 would vary depending on how much allocation was done by the base method before triggering OSR, and how much is done by the OSR method.
Unfortunately the idea that there are at most two frame pointers (FP and SP) is baked in everywhere, including the GC encoding.
Fixing this is straightforward in principle, but also a fair amount of work. And while there are potentially other uses for three frame pointers (eg using the 3rd to reduce the need for large frame-relative offsets), at least initially this support would be exclusive to OSR, and within OSR exclusive to that subset of methods using localloc. So not faring well in terms of cost/benefit.
So thinking now that the right move is to initially exclude these methods from OSR by detecting this as one of the cases we can't handle, and bailing to optimized code (like we do currently for QJFL=0).
(3) Reverse PInvoke methods currently don't seem to be eligible for tiering -- at least the `UnmanagedCallersOnly` subset. If that ever changes, it seems easy enough for the jit to support OSR + Reverse PInvoke, but maybe not worth the trouble?
We'd need to update the patchpoint state to track the pinvoke frame var, give its OSR version a suitable location during `lvaFixVirtualFrameOffsets`, and update `fgAddReversePInvokeEnterExit` to just exit.
Moving it below the cut line for now.
(4) Looked at powershell startup and impact from OSR would be minimal; almost no jitted code is run.
(5) For explicit tail calls. I had thought this was working but didn't realize there was a secondary block.
There are likely two things we need to do here:
Note that enabling OSR for methods with explicit tail calls may not work out very well. Imagine a simple set of mutually tail calling methods -- eventually each call/jump to the method will force an OSR transition. These must be going via call counting stubs so perhaps instead we should just rely on normal tiering here.
Tail recursion is not converted to internal branching so should also be handled by tiering.
So it seems like we should suppress OSR in methods with explicit tail calls but not force them to switch to optimized.
(6) OSR and altjit (specifically cross-altjits)
Two issues:
For cross-altjit, will there ever be correct target-specific patchpoint info? We never ran the cross-altjit generated code.
Yes, the right patchpoint info gets produced by the Tier0 altjit, but then it's thrown away because the altjit reports failure. So presumably we could hold onto this info somehow and supply it to the OSR altjit request. But not clear how to do that.
I guess there's another implicit assumption here, that when an altjit is specified it will process both the Tier0 and any OSR method. Currently that's the case.
Another option is to recognize in the jit that we have an altjit and have the "wrong" patchpoint info, and just make up plausible values for anything host or abi specific. Currently not sure what information we might need.
So, specifically, `setPatchpointInfo` could, on the VM side, be associated with the (alt)jit that produced it, and then `getOSRInfo` could retrieve that specific information.
I guess the value you're trying to get here is, for example, testing both patchpoint info creation and OSR method consumption of an x64-hosted arm64-targeting altjit instead of requiring doing all OSR testing on arm64 hardware itself. And presumably you have a way to force OSR method creation for all (or some random collection of) patchpoints as a stress mode, so for cross-altjit it wouldn't depend on running (e.g., if we could force these compilations via PMI, that would be good). That does seem like a worthwhile goal.
Right, I'm currently bringing up OSR for Arm64 via cross altjit and would like to get as much mileage as possible out of this mode.
But I'm reluctant to plumb knowledge of the altjit deeper into the runtime so currently not sure how the altjit produced patchpoint info can be stored and retrieved.
Also there would be challenges describing the layout of the cross altjit patchpoint blob for the natively hosted runtime. Currently the data in the patchpoint is opaque to the runtime -- all it cares about is the size of the data. So only the altjit would need to know the exact format.
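To illustrate the opacity point: from the runtime's perspective the patchpoint info could be little more than a sized byte blob that is handed back verbatim to whichever jit asks for it; only the producing (alt)jit interprets the layout. This is just a sketch of the idea, not the actual runtime data structure:

```csharp
// Sketch only (not the real runtime type): the runtime-side view of
// patchpoint info as opaque bytes. The runtime records the size and stores
// the blob; the (alt)jit that produced it is the only component that knows
// the field layout inside.
using System;

sealed class OpaquePatchpointInfo
{
    private readonly byte[] _data;

    public OpaquePatchpointInfo(ReadOnlySpan<byte> data) => _data = data.ToArray();

    public int Size => _data.Length;              // all the runtime cares about
    public ReadOnlySpan<byte> Bytes => _data;     // handed back to the jit unmodified
}
```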
(7) Initial notes on Arm64 support
Assuming we don't want to constrain the Tier0 frame shape, the biggest challenge is likely to get the OSR method epilog correct -- this requires knowledge of the Tier0 frame shape or possibly related data. Not clear yet what all we will need.
Recall in OSR that the OSR frame sits on top of the Tier0 frame. All callee saves done by Tier0 are "undone" by the transition to the OSR method. When the OSR method is entered, SP points below the Tier0 frame.
The OSR method re-saves the callee saves it uses. So when we go to exit the OSR frame, we first do the normal epilog for the OSR method, and then we need to pop off the Tier0 frame.
Essentially, we need to be able to invoke Arm64's `genPopCalleeSavedRegistersAndFreeLclFrame` and only get the SP, FP, and LR adjustment that it does.
For example, on x64, if we have a Tier0 method with epilog:
;; Tier0 method
G_M8788_IG09:
4883C440 add rsp, 64
5D pop rbp
C3 ret
and the OSR method also needs to set up a frame, its epilog will be
;; OSR method
G_M8788_IG07:
;; pop off the OSR frame
4883C420 add rsp, 32
5E pop rsi
;; pop off the Tier0 frame
4883C448 add rsp, 72
5D pop rbp
C3 ret
The arm64 versions of this will be something like:
;; Tier0 method
G_M8788_IG09:
A8C37BFD ldp fp, lr, [sp],#48
D65F03C0 ret lr
;; OSR method (naive, wrong)
G_M8788_IG07:
F9400FF3 ldr x19, [sp,#24]
A8C27BFD ldp fp, lr, [sp],#32
D65F03C0 ret lr
;; OSR method (correct)
G_M8788_IG07:
;; pop off the OSR frame
F9400FF3 ldr x19, [sp,#24]
add sp, sp, #32
;; pop off the Tier0 frame
A8C37BFD ldp fp, lr, [sp],#48
D65F03C0 ret lr
For now, I am going to do the following:
That way I can presumably iterate through the frame type options with cross-altjit and verify each would produce plausible looking code.
Turns out I don't need the frame type for arm64 after all. The only dependence is the total Tier0 frame size and the fact that the patchpoint-recorded offsets are virtual offsets (which they are). So an arm64 altjit will produce plausible (if incorrect) code when run on an x64 method context.
Based on #65675 there are two Benchmarks Game tests in the performance repo that would regress with the default strategy (`WarmupCount=1`).
BenchmarkDotNet=v0.13.1.1694-nightly, OS=Windows 11 (10.0.22000.493/21H2)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
[Host] : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
Job-AARFFF : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-GTOQLX : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-KHECGD : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-CHOGBR : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
IterationTime=250.0000 ms MinIterationCount=15 WarmupCount=1
Method | Toolchain | Mean | Ratio | Diff |
---|---|---|---|---|
MandelBrot_7 | OSR | 66,089.7 us | 1.02 | |
MandelBrot_7 | Main | 64,890.2 us | 1.00 | |
BinaryTrees_2 | Main | 96,001.0 us | 1.00 | |
BinaryTrees_2 | OSR | 114,198.5 us | 1.19 | Regression |
Fasta_2 | OSR | 365.6 us | 0.99 | |
Fasta_2 | Main | 368.3 us | 1.00 | |
NBody_3 | OSR | 363,286.1 us | 1.01 | |
NBody_3 | Main | 360,440.7 us | 1.00 | |
SpectralNorm_1 | OSR | 807.9 us | 0.98 | |
SpectralNorm_1 | Main | 821.9 us | 1.00 | |
FannkuchRedux_5 | OSR | 24,287.8 us | 0.91 | Improvement |
FannkuchRedux_5 | Main | 26,648.5 us | 1.00 | |
FannkuchRedux_2 | OSR | 146,505.7 us | 0.96 | Improvement |
FannkuchRedux_2 | Main | 148,944.7 us | 1.00 | |
FannkuchRedux_9 | OSR | 220,402.3 us | 0.99 | |
FannkuchRedux_9 | Main | 222,524.9 us | 1.00 | |
PiDigits_3 | OSR | 529,273.2 us | 0.99 | |
PiDigits_3 | Main | 535,984.8 us | 1.00 | |
Mandelbrot_2 | OSR | 1,153,379.1 us | 1.01 | |
Mandelbrot_2 | Main | 1,137,558.3 us | 1.00 | |
Fasta_1 | OSR | 230.6 us | 1.01 | |
Fasta_1 | Main | 229.0 us | 1.00 | |
KNucleotide_1 | OSR | 216,297.6 us | 0.99 | |
KNucleotide_1 | Main | 218,618.9 us | 1.00 | |
KNucleotide_9 | OSR | 59,534.4 us | 1.00 | |
KNucleotide_9 | Main | 59,517.3 us | 1.00 | |
ReverseComplement_1 | OSR | 834.4 us | 1.01 | |
ReverseComplement_1 | Main | 823.1 us | 1.00 | |
ReverseComplement_6 | OSR | 30,207.6 us | 0.98 | |
ReverseComplement_6 | Main | 30,780.3 us | 1.00 | |
BinaryTrees_5 | OSR | 117,305.4 us | 0.94 | Improvement |
BinaryTrees_5 | Main | 124,728.5 us | 1.00 | |
BinaryTrees_6 | OSR | 515,972.2 us | 1.01 | |
BinaryTrees_6 | Main | 511,917.2 us | 1.00 | |
SpectralNorm_3 | OSR | 3,287.2 us | 1.21 | Regression |
SpectralNorm_3 | Main | 3,275.5 us | 1.00 | |
RegexRedux_1 | OSR | 45,682.4 us | 0.97 | |
RegexRedux_1 | Main | 46,738.9 us | 1.00 | |
RegexRedux_5 | OSR | 32,008.6 us | 0.99 | |
RegexRedux_5 | Main | 32,357.0 us | 1.00 | |
RegexRedux_5 | OSR | 5,787.7 us | 0.99 | |
RegexRedux_5 | Main | 5,814.0 us | 1.00 |
Here's that same set of benchmarks, this time with `--warmupCount=30` to try and ensure BDN is measuring Tier1 code:
BenchmarkDotNet=v0.13.1.1694-nightly, OS=Windows 11 (10.0.22000.493/21H2)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
[Host] : .NET 6.0.2 (6.0.222.6406), X64 RyuJIT
Job-BUXVXU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-UYRIYN : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-XQMGDN : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-IHENTS : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms MinIterationCount=15 WarmupCount=30
Method | Toolchain | Mean | Median | Ratio | OSR Impact |
---|---|---|---|---|---|
MandelBrot_7 | OSR | 64,545.3 us | 64,427.3 us | 1.01 | |
MandelBrot_7 | Main | 64,007.3 us | 63,941.7 us | 1.00 | |
BinaryTrees_2 | OSR | 91,795.9 us | 90,897.6 us | 0.96 | Improvement |
BinaryTrees_2 | Main | 95,891.1 us | 95,522.7 us | 1.00 | |
Fasta_2 | OSR | 375.9 us | 374.7 us | 1.01 | |
Fasta_2 | Main | 370.7 us | 370.6 us | 1.00 | |
NBody_3 | OSR | 358,882.7 us | 359,285.6 us | 0.97 | Improvement |
NBody_3 | Main | 371,405.4 us | 371,776.2 us | 1.00 | |
SpectralNorm_1 | OSR | 812.7 us | 810.8 us | 1.02 | |
SpectralNorm_1 | Main | 801.3 us | 798.5 us | 1.00 | |
FannkuchRedux_5 | OSR | 23,263.1 us | 23,313.2 us | 0.95 | Improvement |
FannkuchRedux_5 | Main | 25,023.1 us | 25,368.3 us | 1.00 | |
FannkuchRedux_2 | OSR | 156,437.4 us | 156,363.6 us | 1.06 | Regression |
FannkuchRedux_2 | Main | 147,277.7 us | 147,160.0 us | 1.00 | |
FannkuchRedux_9 | OSR | 226,931.0 us | 226,657.6 us | 0.95 | Improvement |
FannkuchRedux_9 | Main | 238,423.3 us | 235,644.1 us | 1.00 | |
PiDigits_3 | OSR | 554,562.8 us | 551,167.6 us | 1.02 | |
PiDigits_3 | Main | 542,265.1 us | 540,793.3 us | 1.00 | |
Mandelbrot_2 | OSR | 1,126,527.5 us | 1,124,591.6 us | 1.00 | |
Mandelbrot_2 | Main | 1,121,005.5 us | 1,120,413.5 us | 1.00 | |
Fasta_1 | OSR | 229.0 us | 228.7 us | 1.00 | |
Fasta_1 | Main | 228.0 us | 227.9 us | 1.00 | |
KNucleotide_1 | OSR | 213,158.8 us | 212,448.6 us | 1.00 | |
KNucleotide_1 | Main | 212,237.7 us | 211,199.3 us | 1.00 | |
KNucleotide_9 | OSR | 61,770.6 us | 61,192.8 us | 1.00 | |
KNucleotide_9 | Main | 61,383.4 us | 61,151.9 us | 1.00 | |
ReverseComplement_1 | OSR | 806.5 us | 805.1 us | 1.00 | |
ReverseComplement_1 | Main | 805.0 us | 804.3 us | 1.00 | |
ReverseComplement_6 | OSR | 30,818.8 us | 31,085.9 us | 1.01 | |
ReverseComplement_6 | Main | 30,650.2 us | 30,900.1 us | 1.00 | |
BinaryTrees_5 | OSR | 122,881.9 us | 124,089.6 us | 0.98 | |
BinaryTrees_5 | Main | 124,920.9 us | 124,950.9 us | 1.00 | |
BinaryTrees_6 | OSR | 525,078.3 us | 524,015.0 us | 1.02 | |
BinaryTrees_6 | Main | 516,264.8 us | 518,383.3 us | 1.00 | |
SpectralNorm_3 | OSR | 3,304.6 us | 3,448.3 us | 1.19 | Regression |
SpectralNorm_3 | Main | 3,201.9 us | 3,473.8 us | 1.00 | |
RegexRedux_1 | OSR | 44,930.3 us | 45,025.3 us | 0.99 | |
RegexRedux_1 | Main | 45,191.1 us | 45,111.4 us | 1.00 | |
RegexRedux_5 | OSR | 31,236.6 us | 31,617.2 us | 0.99 | |
RegexRedux_5 | Main | 31,434.0 us | 31,262.0 us | 1.00 | |
RegexRedux_5 | OSR | 5,722.6 us | 5,698.0 us | 0.95 | Improvement |
RegexRedux_5 | Main | 6,146.9 us | 5,845.8 us | 1.00 |
`SpectralNorm_3` regresses in both cases, so that's the first case I'll investigate.
Also note a number of tests change values with additional warmup runs... so would be nice to avoid needing this.
The perf diff in `SpectralNorm_3` (w/ default warmup=1) disappears if run as admin; the numbers also shift quite a bit.
https://github.com/dotnet/performance/issues/2278
Will need to sort this out first.
Looking at `BinaryTrees_2`, the analysis over in https://github.com/dotnet/runtime/pull/61934#issuecomment-989186069 suggests the difference is in warmup strategy, but I'm still seeing the regression with one warmup run each.
But the OSR run ends up triggering a Gen2 GC that Main doesn't, and this likely explains the regression. So the question is why we see this extra GC getting triggered; one possibility is that the set of GC roots in the OSR method differs from Tier1?
> the set of GC roots in the OSR method differs from Tier1?
That seems to be the case, and the culprit is (lack of) struct promotion in OSR for "osr locals". We hold onto a tree that we don't need to be holding onto, because the struct field is untracked.
Naively enabling struct promotion seems to fix this -- GC behavior seems to match, though the OSR version is still consistently a bit slower, maybe 3%? -- but I will need to look carefully at how to tell when/if promotion is unsafe. The challenge is that the OSR compile doesn't see all the code and so a struct field alias could be introduced in the Tier0 portion of the method, and we might miss detecting aliases properly in the OSR version.
Method | Job | Toolchain | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
BinaryTrees_2 | Job-PZNDTP | \main\corerun.exe | 94.51 ms | 1.848 ms | 1.897 ms | 94.18 ms | 92.15 ms | 98.12 ms | 1.00 | 0.00 | 38000.0000 | 1000.0000 | 227.33 MB | 1.00 |
BinaryTrees_2 | Job-RPQAKV | \osr\corerun.exe | 98.60 ms |
One thought here is that instead of trying to be clever in detecting aliasing, we might fall back to full importation for OSR if we see IL that potentially can introduce local aliases (ldflda, ldarga, ldloca) and make sure that we don't trim out the unreachable code until after we've done promotion analysis.
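A hypothetical repro sketch of the hazard: the alias-creating IL (here `ldflda` via `ref s.Field`) sits in the part of the method that only the Tier0 compile sees, so an OSR compile that starts importing at the loop would not notice it on its own.

```csharp
// Hypothetical sketch: the address of s.Field is taken before the loop, in
// code the OSR compile would normally skip. If the OSR method promoted the
// struct without seeing that IL, it could miss the aliasing.
using System;

struct S { public int Field; }

class AliasSketch
{
    static void Touch(ref int x) => x++;   // ldflda / by-ref use of the field

    static int Main()
    {
        S s = default;
        Touch(ref s.Field);                 // Tier0-only portion of the method
        int r = 0;
        for (int i = 0; i < 200_000; i++)   // OSR importation starts at this loop
        {
            r += s.Field;
        }
        Console.WriteLine(r);
        return 100;
    }
}
```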
Re the 3-5% slowdown we still see in `BinaryTrees_2` -- codegen for the OSR method is now virtually identical to eager full opt. Perf runs are very noisy, though OSR is consistently a bit slower. Will see if I can figure out why.
Looking at `Benchstones.BenchI.Array2` -- here we have a 4-level deep loop nest. OSR is unable to get good perf.
Method | Job | Toolchain | Mean |
---|---|---|---|
Array2 | Job-HUISAQ | \main\corerun.exe | 366.0 ms |
Array2 | Job-QFPZWJ | \osr\corerun.exe | 906.9 ms |
Main culprit seems to be lack of loop cloning. Without this the inner loop has a lot of bounds checking. With suitable warmup everything is fine (as we're in Tier1).
What seems to be happening is that the OSR method has non-standard loop flow, and this breaks one or more of
For a simpler example, consider this method with a hoistable invariant:
using System;
using System.Runtime.CompilerServices;
class X
{
[MethodImpl(MethodImplOptions.NoInlining)]
static int D() => 10;
public static int Main()
{
int d = D();
int r = 0;
for (int i = 0; i < 1_000_000; i++)
{
r += d * d;
}
return r / 1_000_000;
}
}
Currently OSR will produce this inner loop:
;; target patchpoints
G_M46779_IG02: ;; offset=000CH
448BC1 mov r8d, ecx
440FAFC1 imul r8d, ecx
4103D0 add edx, r8d
FFC0 inc eax
3D40420F00 cmp eax, 0xF4240
7CED jl SHORT G_M46779_IG02
As a mitigation we can switch the patchpoints so they are located at the sources of backedges, rather than the targets. Doing this gives us the expected hoist:
;; source patchpoints
G_M46779_IG02: ;; offset=000DH
81F940420F00 cmp ecx, 0xF4240
7D18 jge SHORT G_M46779_IG04
450FAFC0 imul r8d, r8d
0F1F8000000000 align [7 bytes for IG03]
G_M46779_IG03: ;; offset=0020H
4103D0 add edx, r8d
FFC1 inc ecx
81F940420F00 cmp ecx, 0xF4240
7CF3 jl SHORT G_M46779_IG03
However, if we modify the example to have a nested loop
using System;
using System.Runtime.CompilerServices;
class X
{
[MethodImpl(MethodImplOptions.NoInlining)]
static int D() => 10;
public static int Main()
{
int d = D();
int r = 0;
for (int i = 0; i < 1_000; i++)
{
for (int j = 0; j < 1_000; j++)
{
r += d * d;
}
}
return r / 1_000_000;
}
}
then source patchpoints end up worse off overall. Target patchpoints enable hoisting:
;; target patchpoints
G_M46779_IG02: ;; offset=0011H
450FAFC0 imul r8d, r8d
EB02 jmp SHORT G_M46779_IG04
;; bbWeight=1 PerfScore 4.00
G_M46779_IG03: ;; offset=0017H
33C0 xor eax, eax
;; bbWeight=2 PerfScore 0.50
G_M46779_IG04: ;; offset=0019H
4103D0 add edx, r8d
FFC0 inc eax
3DE8030000 cmp eax, 0x3E8
7CF4 jl SHORT G_M46779_IG04
;; bbWeight=64 PerfScore 112.00
G_M46779_IG05: ;; offset=0025H
FFC1 inc ecx
81F9E8030000 cmp ecx, 0x3E8
7CE8 jl SHORT G_M46779_IG03
but with source patchpoints, we won't hoist out of the inner loop:
;; source patchpoints
G_M46779_IG03: ;; offset=0013H
33C0 xor eax, eax
EB07 jmp SHORT G_M46779_IG05
;; bbWeight=2 PerfScore 4.50
G_M46779_IG04: ;; offset=0017H
3DE8030000 cmp eax, 0x3E8
7D13 jge SHORT G_M46779_IG06
;; bbWeight=1 PerfScore 1.25
G_M46779_IG05: ;; offset=001EH
448BC9 mov r9d, ecx
440FAFC9 imul r9d, ecx
4103D1 add edx, r9d
FFC0 inc eax
3DE8030000 cmp eax, 0x3E8
7CED jl SHORT G_M46779_IG05
;; bbWeight=16 PerfScore 64.00
G_M46779_IG06: ;; offset=0031H
41FFC0 inc r8d
4181F8E8030000 cmp r8d, 0x3E8
7CD6 jl SHORT G_M46779_IG03
What goes wrong? In the target case we invert the inner loop condition (outer loop is unconditionally entered, thanks to OSR). We then recognize the outer loop but not the inner loop:
;; target
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0007] 1 1 [???..???)-> BB03 (always) i internal
BB02 [0001] 1 BB04 0.50 [00C..010)-> BB04 ( cond ) i bwd bwd-target
BB03 [0002] 3 BB01,BB02,BB03 1 [010..022)-> BB03 ( cond ) i bwd bwd-target osr-entry
BB04 [0004] 2 BB02,BB03 1 [022..02E)-> BB02 ( cond ) i bwd
BB05 [0006] 1 BB04 1 [02E..036) (return) i
-----------------------------------------------------------------------------------------------------------------------------------------
*************** Starting PHASE Find loops
*************** In optFindLoops()
*************** In optMarkLoopHeads()
2 loop heads marked
*************** In optFindNaturalLoops()
New loop epoch 1
Recorded loop L00, from BB02 to BB04 (Head=BB01, Entry=BB03, Exit=BB04)
*************** Natural loop table
L00, from BB02 to BB04 (Head=BB01, Entry=BB03, Exit=BB04)
*************** In optFindAndScaleGeneralLoopBlocks()
Marking a loop from BB02 to BB04
BB02(wt=200)
BB03(wt=800)
BB04(wt=800)
Marking a loop from BB03 to BB03
BB03(wt=6400)
Found a total of 2 general loops.
Skip alignment for L00 that starts at BB02 weight=200.
In the source case we also invert the inner loop, but then fail to recognize either loop. And because we don't recognize any loops, we don't hoist.
;; source
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0007] 1 1 [???..???)-> BB04 (always) i internal
BB02 [0001] 1 BB06 0.50 [00C..010)-> BB06 ( cond ) i bwd bwd-target
BB03 [0009] 1 BB02 0.50 [???..???)-> BB05 (always) internal
BB04 [0008] 1 BB01 1 [???..???)-> BB06 ( cond ) internal
BB05 [0002] 3 BB03,BB04,BB05 0.50 [010..022)-> BB05 ( cond ) i bwd bwd-target
BB06 [0004] 3 BB02,BB04,BB05 1 [022..02E)-> BB02 ( cond ) i bwd
BB07 [0006] 1 BB06 1 [02E..036) (return) i
-----------------------------------------------------------------------------------------------------------------------------------------
*************** Starting PHASE Find loops
*************** In optFindLoops()
*************** In optMarkLoopHeads()
2 loop heads marked
*************** In optFindNaturalLoops()
*************** In optFindAndScaleGeneralLoopBlocks()
Marking a loop from BB02 to BB06
BB02(wt=200)
BB03(wt=200)
BB04(wt=100); unchanged: block not in loop
BB05(wt=200)
BB06(wt=800)
Marking a loop from BB05 to BB05
BB05(wt=1600)
Found a total of 2 general loops.
Not clear yet how best to address this.
cc @BruceForstall
A bit of progress, perhaps.
I have some provisional changes now that (in conjunction with using source patchpoints) get loop recognition working better for OSR. The main culprit here has been `fgOptimizeUncondBranchToSimpleCond` kicking in upstream of loop recognition and adding a second predecessor to some loops.
That change lets us recognize loops and enables some hoisting.
With that in place I see some improvement in OSR because we can hoist. That fixes the nested case above, and offers some improvement to `Array2`:
Method | Mean |
---|---|
Array2 | 715.7 ms |
But `Array2` is still ~2x slower than full opt, because of bounds checks.
`main` fixes this by cloning. For OSR we fail to set `LPFLG_ITER` and so won't clone. The OSR-entered loop won't have normal initialization, and any enclosing loop will look like it's multiple-entry. I have tweaked the cloning code to consider cloning if the initialization can't be found, by just using the iterator local to initialize itself. Not sure if this is completely kosher, but it seems like it might be, since for a variable init we should be doing a two-sided test for any array index involving that loop index variable. But the innermost loop's checks are too costly so we won't clone it, the next two enclosing loops are not `LPFLG_ITER` because of side entries, and the topmost loop is not `LPFLG_ITER` because it lacks initialization.
Incidentally, in `Benchstones.BenchI.Array2` on main we don't recognize the outermost loop as iterable either. It is a do-while with no initialization. So we clone the second-level loop instead of the top-level loop, which is a bit suboptimal. Seems like it should be tolerable as long as the controlling variable doesn't enter into any bounds checks.
Note blocking `fgOptimizeUncondBranchToSimpleCond` in this way generally seems to be a regression, so perhaps (w/o OSR) it is less likely to mess up loop recognition? I can just block it for OSR and/or see if I can refine the condition a bit better so that we're only applying it to the loop targeted by the OSR entry.
The loop cloning change (to be a bit more permissive about initialization) seems viable on its own but doesn't unblock OSR so will set it aside for now.
A bigger fix would be to support loop cloning "along the backedge", which is what OSR would need. We're already in the loop and want to escape to the cloned version. Here's how it could work for the case above:
![image - 2022-03-01T091036 474](https://user-images.githubusercontent.com/10121823/156216286-c8a91dd1-e93e-4176-b1af-ef8c1f72d052.png) | ![image - 2022-03-01T091424 734](https://user-images.githubusercontent.com/10121823/156216418-642f40d9-1f1e-47f6-9dcf-9ebb5abbcb81.png) |
This might be unappealing because if we can't ever use the clone we'll still try jumping to the optimized version each iteration. Instead we can clone two copies and have the clone condition pick between them. But now we've made two copies.
It might be simpler / more feasible to hoist the clone checks out and produce something more like the following (here `x` is the fast path set of loop blocks):
The drawback here is that we may need to think about how to hoist things out of the loop targeted by the OSR entry (BB03x-BB05x in the above). But we can fix that too, if we're willing to add suitable preheaders and adjust for the entry being in the middle of the loop, not the top.
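In source-level terms the shape being described is roughly the following (a hand-written sketch, not jit output; the method name, labels, and bounds are made up). The cloning condition is evaluated once at the OSR entry, and control then stays in either the fast or the slow copy of the nest:

```csharp
// Hand-written sketch of "decide once at OSR entry". The fast copy is the
// one where the cloning conditions make every index provably in range, so a
// cloning jit could drop its bounds checks; the slow copy is the original.
class CloneAtOsrEntrySketch
{
    static int SumNest(int[] a, int iStart, int jStart)
    {
        int r = 0;
        int i = iStart, j = jStart;

        // OSR entry: evaluate the cloning conditions a single time.
        if (iStart >= 0 && jStart >= 0 && a.Length >= 1000)
            goto FastNest;
        goto SlowNest;

    FastNest:
        for (; i < 1000; i++, j = 0)
            for (; j < 1000; j++)
                r += a[j];          // in range by the check above; bounds checks removable
        return r;

    SlowNest:
        for (; i < 1000; i++, j = 0)
            for (; j < 1000; j++)
                r += a[j];          // unmodified copy keeps its bounds checks
        return r;
    }
}
```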
`IniArray` is another test where OSR perf lags. Here the full-opt version is able to eliminate a bounds check because it sees an array newobj; the OSR method does not see the newobj. We could recover perf if we cloned, but we can't clone right now for similar reasons as above.
Loop nest
;; full opts
G_M30103_IG03: ;; offset=001AH
33C9 xor ecx, ecx
0F1F4000 align [4 bytes for IG04]
;; bbWeight=4 PerfScore 2.00
G_M30103_IG04: ;; offset=0020H
448BC1 mov r8d, ecx
6642C74440102000 mov word ptr [rax+2*r8+16], 32
FFC1 inc ecx
83F910 cmp ecx, 16
7CEE jl SHORT G_M30103_IG04
;; bbWeight=15.84 PerfScore 43.56
G_M30103_IG05: ;; offset=0032H
FFC2 inc edx
81FA80969800 cmp edx, 0x989680
7CDE jl SHORT G_M30103_IG03
;; osr
G_M30103_IG02: ;; offset=0011H
448B4208 mov r8d, dword ptr [rdx+8]
EB02 jmp SHORT G_M30103_IG04 ;; "side entry"
;; bbWeight=1 PerfScore 4.00
G_M30103_IG03: ;; offset=0017H
33C0 xor eax, eax
;; bbWeight=2 PerfScore 0.50
G_M30103_IG04: ;; offset=0019H
413BC0 cmp eax, r8d
7328 jae SHORT G_M30103_IG08
448BC8 mov r9d, eax
6642C7444A102000 mov word ptr [rdx+2*r9+16], 32
FFC0 inc eax
83F810 cmp eax, 16
7CE9 jl SHORT G_M30103_IG04
;; bbWeight=64 PerfScore 256.00
G_M30103_IG05: ;; offset=0030H
FFC1 inc ecx
81F980969800 cmp ecx, 0x989680
7CDD jl SHORT G_M30103_IG03
Portion of the jit dump from OSR, with some extra debug spew I added to surface failures to classify:
*************** In optMarkLoopHeads()
2 loop heads marked
*************** In optFindNaturalLoops()
@ Contemplating possible loop: head BB01 top BB02 bottom BB04
Entry is BB03
Finished cycle detect
New loop epoch 1
head empty
... no test incr
Recorded loop L00, from BB02 to BB04 (Head=BB01, Entry=BB03, Exit=BB04)
@ Contemplating possible loop: head BB02 top BB03 bottom BB01
Nope, bottom->top is not bbnum backwards
@ Contemplating possible loop: head BB02 top BB03 bottom BB02
Nope, bottom->top is not bbnum backwards
@ Contemplating possible loop: head BB02 top BB03 bottom BB03
Entry is BB03
Multiple entries BB03 and BB01
Nope, no cycle
@ Contemplating possible loop: head BB03 top BB04 bottom BB02
Nope, bottom->top is not bbnum backwards
@ Contemplating possible loop: head BB03 top BB04 bottom BB03
Nope, bottom->top is not bbnum backwards
@ Contemplating possible loop: head BB04 top BB05 bottom BB04
Nope, bottom->top is not bbnum backwards
*************** Natural loop table
L00, from BB02 to BB04 (Head=BB01, Entry=BB03, Exit=BB04)
*************** In optFindAndScaleGeneralLoopBlocks()
Marking a loop from BB02 to BB04
BB02(wt=200)
BB03(wt=800)
BB04(wt=800)
Marking a loop from BB03 to BB03
BB03(wt=6400)
Found a total of 2 general loops.
Skip alignment for L00 that starts at BB02 weight=200.
So with OSR we recognize the outer loop but don't think it is iterable; we don't recognize the inner loop because of the side entry, so we don't try to align it. And later we don't clone the outer loop because it is not iterable:
Considering loop L00 to clone for optimizations.
Loop cloning: rejecting loop L00. No LPFLG_ITER flag.
Hence the OSR code above -- bounds check in inner loop, and inner loop not aligned.
BDN manages to get to Tier1 code here, eventually ... note the mean vs median discrepancy below:
Method | Job | Toolchain | Mean | Error | StdDev | Median |
---|---|---|---|---|---|---|
IniArray | Job-GWBJMD | \main\corerun.exe | 40.45 ms | 0.664 ms | 0.621 ms | 40.50 ms |
IniArray | Job-TVGDYP | \osr\corerun.exe | 44.46 ms | 5.662 ms | 6.521 ms | 40.35 ms |
Looking at `RegexInterpreter.Go` for some benchmarks, we have an unusual loop shape with a switch at loop top and many backedges. If we put patchpoints at the backedge sources we may be less aggressive about triggering OSR (and put in a lot more patchpoints) than if we put a patchpoint at the target.
That suggests that perhaps we want to vary strategy depending on the number of backedge preds of the target:
fwiw, some additional comments on Array2 are here: https://github.com/dotnet/runtime/issues/54074
It looks like cloning wants to know the initialization condition so it can check that the initial value is >= 0. With a constant initialization it can check that statically, otherwise it checks it dynamically. Why? Because the iteration variable is being used as an array index and we want to eliminate bounds checks, so we need `index_start >= 0 && loop_index_limit <= array_length` (for `<` loops) to ensure no bounds check is needed.
(the constant init value is also needed for unrolling, to statically compute the number of iterations)
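In source terms, the cloned shape for a simple `<` loop is roughly this (hand-written illustration of the condition, not actual jit output):

```csharp
// Hand-written illustration of the cloning condition for a "<" loop: when
// start >= 0 and limit <= a.Length, every a[i] touched by the loop is in
// range, so the fast copy could run with its bounds checks removed. The
// slow copy is left as-is and throws IndexOutOfRangeException if needed.
class LoopCloneSketch
{
    static int Sum(int[] a, int start, int limit)
    {
        int r = 0;
        if (start >= 0 && limit <= a.Length)
        {
            for (int i = start; i < limit; i++)
                r += a[i];      // fast path: provably in range
        }
        else
        {
            for (int i = start; i < limit; i++)
                r += a[i];      // slow path: original, checked loop
        }
        return r;
    }
}
```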
The example of "cloning on the back-edge" (https://github.com/dotnet/runtime/issues/33658#issuecomment-1055671498) would create a case where the "slow path" would always execute all the cloning conditions on every iteration, which will make it much slower. (Of course, we always assume the slow path after cloning will throw an exception.)
In the example above, does the inner loop at BB03 not get cloned because it has multiple predecessors? (BB01, BB02)

> In the example above, does the inner loop at BB03 not get cloned because it has multiple predecessors? (BB01, BB02)
You mean the simple nested loop case?
With OSR enabled (and the adaptive strategy from #66208) loop recognition finds both loops, but neither is `LPFLG_ITER` and so we don't consider either loop for cloning.
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0007] 1 1 [???..???)-> BB03 (always) i internal
BB02 [0001] 1 BB05 0.50 [00C..010) i bwd bwd-target
BB03 [0008] 2 BB01,BB02 1 [???..???)-> BB05 ( cond ) internal
BB04 [0002] 2 BB03,BB04 0.50 [010..022)-> BB04 ( cond ) i bwd bwd-target
BB05 [0004] 2 BB03,BB04 1 [022..02E)-> BB02 ( cond ) i bwd
BB06 [0006] 1 BB05 1 [02E..036) (return) i
-----------------------------------------------------------------------------------------------------------------------------------------
L00, from BB02 to BB05 (Head=BB01, Entry=BB03, Exit=BB05), child loop = L01
L01, from BB04 to BB04 (Head=BB03, Entry=BB04, Exit=BB04, parent=L00)
Considering loop L00 to clone for optimizations.
Loop cloning: rejecting loop L00. No LPFLG_ITER flag.
------------------------------------------------------------
Considering loop L01 to clone for optimizations.
Loop cloning: rejecting loop L01. No LPFLG_ITER flag.
------------------------------------------------------------
No clonable loops
The outer loop's iteration search fails in `optExtractInitTestIncr`, and the inner loop's in `optPopulateInitInfo`.

For the inner loop, `BB03` is just the test; the initialization of `V03` is up in the Tier0 method and in `BB02`.
***** BB03
STMT00007 ( 0x01A[E-] ... ??? )
( 7, 9) [000032] ------------ * JTRUE void
( 5, 7) [000033] J------N---- \--* GE int
( 3, 2) [000034] ------------ +--* LCL_VAR int V03 loc3
( 1, 4) [000035] ------------ \--* CNS_INT int 0x3E8
I suppose there is a sticking point here in being confident that the conditions that hold at OSR entry (`BB01->BB04`) are also going to hold when the inner loop is re-entered via `BB02->BB03->BB04`, so just one set of tests covers both. I believe that is the case, but I suppose we also need to show that the initial values for the loop iters are nest invariants.
> The example of "cloning on the back-edge"
Yeah, for OSR I am hoping we can avoid repeatedly checking and have the OSR entry decide between fast and slow nest (as in this picture).
Current output of SOS's `ip2md` when looking at OSR code:
Need to update dac/sos to understand what OSR code means.
BDN version in perf repo was updated to include aggressive opt on the Overhead/Workload action methods (via https://github.com/dotnet/performance/pull/2307).
Some of the tests I expected to improve actually improve, eg
But not all:
> BDN version in perf repo was updated to include aggressive opt on the Overhead/Workload action methods (via dotnet/performance#2307).
Via this fix: https://github.com/dotnet/BenchmarkDotNet/pull/1935
Actually, some care is needed in interpreting these results, as the Mar 1 "baseline" is NoPGO, not default. And in some cases PGO data is causing regressions. For example, here's the default perf for `GetBytes(Input: Greek)`. Note it is around 220k, which is similar to what we see with OSR above.
There's too much here to sort through by hand. @DrewScoggins looks like I could use an updated comparison report...
Here's an interesting case:
This test regresses with OSR / Tier1 because we don't align the loop:
;; w/o OSR
7E22 jle SHORT G_M30591_IG04
0F1F80000000000F1F840000000000 align [15 bytes for IG03]
;; bbWeight=1 PerfScore 7.75
G_M30591_IG03: ;; offset=0040H
448BC2 mov r8d, edx
F34E0FB844C010 popcnt r8, qword ptr [rax+8*r8+16]
4103F0 add esi, r8d
FFC2 inc edx
3BCA cmp ecx, edx
7FED jg SHORT G_M30591_IG03
;; w/ OSR
7E14 jle SHORT G_M30591_IG04
G_M30591_IG03: ;; offset=001AH
448BC9 mov r9d, ecx
F34E0FB84CCA10 popcnt r9, qword ptr [rdx+8*r9+16]
4103C1 add eax, r9d
FFC1 inc ecx
443BC1 cmp r8d, ecx
7FEC jg SHORT G_M30591_IG03
We don't align because w/ OSR or at Tier1 the loop body comes from an inlinee with PGO data, inlined into a caller without PGO data. We scale the profile during inlining, then downscale during `optSetBlockWeights`, then don't upscale during `optScaleLoopBlocks`. So the loop body's weight ends up at 0.5 and we don't think this merits alignment.
;; w/o OSR
Trees after Find loops
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..00C)-> BB03 ( cond ) i hascall idxlen
BB02 [0001] 2 BB01,BB02 4 0 [00C..021)-> BB02 ( cond ) i Loop idxlen bwd bwd-target align
BB03 [0003] 2 BB01,BB02 1 [021..023) (return) i
-----------------------------------------------------------------------------------------------------------------------------------------
;; w OSR (or Tier1)
Trees after Find loops
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight IBC lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..00C)-> BB03 ( cond ) i idxlen
BB02 [0001] 2 BB01,BB02 0.50 50 0 [00C..021)-> BB02 ( cond ) i Loop idxlen bwd bwd-target IBC
BB03 [0003] 2 BB01,BB02 1 [021..023) (return) i
-----------------------------------------------------------------------------------------------------------------------------------------
Without OSR, we initially try to jit the method at Tier0, see there's a loop, and switch to FullOpts. Because of this we don't have the `BBOPT` flag, and so the jit doesn't try to read profile data for the inlinee.
This is unfortunate because we're not pulling in any profile data for these eagerly optimized methods. It may also partially explain why we've seen less impact from default PGO than we might expect -- the methods where we spend the most time are not using any PGO data, by default.
There are two avenues to pursue here (perhaps together):
> then downscale during `optSetBlockWeights`
I wonder if the loop body of inverted loops always gets downscaled because the zero trip test is "given" half the flow. Seems bad.
> then don't upscale during `optScaleLoopBlocks`
Did you look why that is?
> > then downscale during `optSetBlockWeights`
>
> I wonder if the loop body of inverted loops always gets downscaled because the zero trip test is "given" half the flow. Seems bad.
Yeah this seems a bit odd. Weights for loop blocks in inverted loops go from 1 to 0.5 to 4. Generally we should assume such loops rarely zero-trip.
> > then don't upscale during `optScaleLoopBlocks`
>
> Did you look why that is?
We don't upscale blocks with profile data.
All this is a consequence of our rather simplistic weighting/reweighting approach.
These problems get worse with PGO as we do a poor job of handling mixtures of profiled and unprofiled code (say inlining profiled callees into an unprofiled caller, or vice versa). OSR is highlighting some of this as (because of the inadvertent lack of BBOPT in the switch-to-optimized case) we are using PGO data "more often".
I have ambitions to redo all this but thought we could get by with the current scheme for a while longer.
Looks like `optSetBlockWeights` won't scale weights if the entire function is using profile data, i.e., `fgIsUsingProfileWeights()` is true. Looks like it should maybe not scale if a particular block has profile weights, the way `optScaleLoopBlocks` does:
if (curBlk->hasProfileWeight())
{
reportBlockWeight(curBlk, "; unchanged: has profile weight");
continue;
}
?
Looking at the `SocketsHttpHandlerPerfTest.Get(ssl: true, ...)` test, it seems like most of the time BDN chooses a strange iteration strategy where it skips the pilot and overhead stages and just runs the benchmark once per workload iteration. When that happens we get inaccurate measurements.
Note how `operations` is frequently very low. This seems to be some kind of issue with BDN that is exposed/exacerbated by using OSR but (given past history) seemingly can happen without OSR as well. Not sure what's going on just yet.
Seems to be related to https://github.com/dotnet/BenchmarkDotNet/issues/1780 where if the initial single invocation takes a long enough time then BDN has various threshold tests to decide if it should skip the pilot stages and just run one iteration per workload interval.
In my case the run of main triggers the "second chance" heuristic, and the second time around the runs are fast, so BDN decides to iterate a lot. But for OSR the initial jitting is below the magic threshold yet still somehow high enough to preclude pilot runs, and BDN decides to just run one iteration. This benchmark has some high first-run costs, so that change in strategy totally skews the results.
;; main
OverheadJitting 1: 1 op, 281000.00 ns, 281.0000 us/op
WorkloadJitting 1: 1 op, 358407800.00 ns, 358.4078 ms/op // exceeds 250 ms
OverheadJitting 2: 1 op, 100.00 ns, 100.0000 ns/op // "second chance" logic
WorkloadJitting 2: 1 op, 2746000.00 ns, 2.7460 ms/op
OverheadJitting 3: 16 op, 404800.00 ns, 25.3000 us/op
WorkloadJitting 3: 16 op, 3052100.00 ns, 190.7562 us/op
WorkloadPilot 1: 16 op, 2517600.00 ns, 157.3500 us/op // ok, let's compute iterations
WorkloadPilot 2: 1600 op, 149864700.00 ns, 93.6654 us/op
WorkloadPilot 3: 2672 op, 256163700.00 ns, 95.8696 us/op
WorkloadPilot 4: 2608 op, 197508100.00 ns, 75.7316 us/op
WorkloadPilot 5: 3312 op, 236875800.00 ns, 71.5205 us/op
WorkloadPilot 6: 3504 op, 250721500.00 ns, 71.5529 us/op
;;; OSR
OverheadJitting 1: 1 op, 270400.00 ns, 270.4000 us/op
WorkloadJitting 1: 1 op, 227972100.00 ns, 227.9721 ms/op // does not exceed 250ms? Seems like we should pilot. But no.
WorkloadWarmup 1: 1 op, 2733600.00 ns, 2.7336 ms/op
// BeforeActualRun
WorkloadActual 1: 1 op, 228100.00 ns, 228.1000 us/op // just measure one invocation per
WorkloadActual 2: 1 op, 212200.00 ns, 212.2000 us/op
WorkloadActual 3: 1 op, 162400.00 ns, 162.4000 us/op
Method | Job | Toolchain | ssl | chunkedResponse | responseLength | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Get | Job-ERLBPQ | \main-rel\corerun.exe | True | False | 1 | 70.82 us | 0.789 us | 0.700 us | 70.66 us | 70.01 us | 72.41 us | 1.00 | 0.00 | 0.2854 | 2.6 KB | 1.00 |
Get | Job-IQYELO | \osr-rel\corerun.exe | True | False | 1 | 159.31 us | 26.011 us | 28.911 us | 152.20 us | 117.60 us | 228.10 us | 2.37 | 0.38 | - | 3.59 KB | 1.38 |
// Warnings MinIterationTime SocketsHttpHandlerPerfTest.Get: PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog, Toolchain=\osr-rel\corerun.exe, IterationTime=250.0000 ms, MaxIterationCount=20, MinIterationCount=15, WarmupCount=1 -> The minimum observed iteration time is 117.6000 us which is very small. It's recommended to increase it to at least 100.0000 ms using more operations.
I guess it's the second bit of BDN logic -- if `desiredIterationTime / singleIterationTime < 1.5` then BDN will still do a single iteration. Here we have `250 / 227.9721 == 1.0966 < 1.5`.
So OSR is triggering this behavior by having less initial overhead than MAIN but not so much less that BDN decides it should run pilot stages to assess how many iterations are needed.
Here's a case where I thought a regression was caused by OSR, but it's largely there in main too. Just unfortunate timing.
The OSR numbers improved with the recent (3/12) update to BDN and are now on par with MAIN.
OSR
MAIN
Have an interim comparison report showing OSR vs MAIN. The top dozen or so regressions are the `ssl: true` tests; these should all be "fixed" by the next BDN update in the performance repo: https://github.com/dotnet/performance/pull/2323
Overall this report shows 395 regressions and 591 improvements. The vast majority of these are noise. Drew is going to get me an updated comparison report that takes history into account; when that shows up I'll post an update.
In the meantime, looking through what's there for actual regressions, `Benchstone.BenchF.NewtR.Test` is slower with OSR. This benchmark runs 4 operations per workload iteration, so it ends up executing OSR code for about 1/3 of its workload iterations. The OSR codegen is similar, but the inner loop has a couple of extra loads because of an OSR-exposed local.
I actually get somewhat inconsistent results locally. And not clear yet why we can't hoist/cse this load. Need to look more closely what's going on.
;; FullOpt/Tier1 inner loop
align [0 bytes for IG04]
G_M46661_IG04: ;; offset=0030H
C5CB59D6 vmulsd xmm2, xmm6, xmm6
C5FB101DE4000000 vmovsd xmm3, qword ptr [reloc @RWD00]
C5E35CD2 vsubsd xmm2, xmm3, xmm2
C5EB59D6 vmulsd xmm2, xmm2, xmm6
C5FB5CD2 vsubsd xmm2, xmm0, xmm2
C5E8541DF0000000 vandps xmm3, xmm2, qword ptr [reloc @RWD32]
C5F92ECB vucomisd xmm1, xmm3
7746 ja SHORT G_M46661_IG06
C5CB591DF2000000 vmulsd xmm3, xmm6, qword ptr [reloc @RWD48]
C5E359DE vmulsd xmm3, xmm3, xmm6
C5E35C1DB6000000 vsubsd xmm3, xmm3, qword ptr [reloc @RWD00]
C5D857E4 vxorps xmm4, xmm4
C5F92EDC vucomisd xmm3, xmm4
7A02 jp SHORT G_M46661_IG05
7421 je SHORT G_M46661_IG06
;; bbWeight=16 PerfScore 597.33
G_M46661_IG05: ;; offset=0076H
C5EB5ED3 vdivsd xmm2, xmm2, xmm3
C5CB5CF2 vsubsd xmm6, xmm6, xmm2
C5E85415BA000000 vandps xmm2, xmm2, qword ptr [reloc @RWD32]
C5F92ECA vucomisd xmm1, xmm2
7707 ja SHORT G_M46661_IG06
FFC0 inc eax
83F80A cmp eax, 10
7E9D jle SHORT G_M46661_IG04
;; OSR inner loop
align [0 bytes for IG04]
G_M46661_IG04: ;; offset=0066H
C5FB109424A0000000 vmovsd xmm2, qword ptr [rsp+A0H] // extra
C5EB59DA vmulsd xmm3, xmm2, xmm2
C5FB1025C5010000 vmovsd xmm4, qword ptr [reloc @RWD16]
C5DB5CDB vsubsd xmm3, xmm4, xmm3
C5E359D2 vmulsd xmm2, xmm3, xmm2
C5FB5CD2 vsubsd xmm2, xmm0, xmm2
C5E8541DC1010000 vandps xmm3, xmm2, qword ptr [reloc @RWD32]
C5F92ECB vucomisd xmm1, xmm3
7768 ja SHORT G_M46661_IG06
C5FB109C24A0000000 vmovsd xmm3, qword ptr [rsp+A0H] // extra
C5E35925BA010000 vmulsd xmm4, xmm3, qword ptr [reloc @RWD48]
C5DB59DB vmulsd xmm3, xmm4, xmm3
C5E35C1D8E010000 vsubsd xmm3, xmm3, qword ptr [reloc @RWD16]
C5D857E4 vxorps xmm4, xmm4
C5F92EDC vucomisd xmm3, xmm4
7A02 jp SHORT G_M46661_IG05
7439 je SHORT G_M46661_IG06
;; bbWeight=16 PerfScore 693.33
G_M46661_IG05: ;; offset=00BEH
C5EB5ED3 vdivsd xmm2, xmm2, xmm3
C5FB109C24A0000000 vmovsd xmm3, qword ptr [rsp+A0H] // extra
C5E35CDA vsubsd xmm3, xmm3, xmm2
C5FB119C24A0000000 vmovsd qword ptr [rsp+A0H], xmm3
C5E8541570010000 vandps xmm2, xmm2, qword ptr [reloc @RWD32]
C5F92ECA vucomisd xmm1, xmm2
770B ja SHORT G_M46661_IG06
FFC3 inc ebx
83FB0A cmp ebx, 10
0F8E75FFFFFF jle G_M46661_IG04
;; OSR iterations
// AfterActualRun
WorkloadResult 1: 4 op, 369514200.00 ns, 92.3786 ms/op
WorkloadResult 2: 4 op, 361280500.00 ns, 90.3201 ms/op
WorkloadResult 3: 4 op, 357723200.00 ns, 89.4308 ms/op
WorkloadResult 4: 4 op, 361092200.00 ns, 90.2730 ms/op
WorkloadResult 5: 4 op, 355732600.00 ns, 88.9331 ms/op
WorkloadResult 6: 4 op, 352204300.00 ns, 88.0511 ms/op
WorkloadResult 7: 4 op, 299682100.00 ns, 74.9205 ms/op
WorkloadResult 8: 4 op, 282106800.00 ns, 70.5267 ms/op
WorkloadResult 9: 4 op, 283662600.00 ns, 70.9156 ms/op
WorkloadResult 10: 4 op, 283777300.00 ns, 70.9443 ms/op
WorkloadResult 11: 4 op, 284411100.00 ns, 71.1028 ms/op
WorkloadResult 12: 4 op, 288849500.00 ns, 72.2124 ms/op
WorkloadResult 13: 4 op, 281122300.00 ns, 70.2806 ms/op
WorkloadResult 14: 4 op, 281138100.00 ns, 70.2845 ms/op
WorkloadResult 15: 4 op, 282122600.00 ns, 70.5306 ms/op
WorkloadResult 16: 4 op, 286633300.00 ns, 71.6583 ms/op
WorkloadResult 17: 4 op, 286545500.00 ns, 71.6364 ms/op
WorkloadResult 18: 4 op, 282719600.00 ns, 70.6799 ms/op
WorkloadResult 19: 4 op, 284713200.00 ns, 71.1783 ms/op
WorkloadResult 20: 4 op, 285504900.00 ns, 71.3762 ms/op
In the Tier0 method, `x0` is address exposed, so the OSR method has to reload it from memory each time.
The exact impact of this depends on when the Tier1 method shows up -- in the above you can see it kicked in around iteration 7.
> not clear yet why we can't hoist/cse this load. Need to look more closely what's going on.
C5FB119C24A0000000 vmovsd qword ptr [rsp+A0H], xmm3
`x0` is modified in the inner loop. But it's unaliased over that scope, so presumably it could be enregistered and spilled back to its stack home on loop exit. We don't have the technology to handle cases like this (scalar promotion or similar).
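Here's a reduced C# sketch of the pattern (hypothetical, standing in for what happens to `x0` in `NewtR`): once the local's address escapes, the jit treats it as address-exposed, so inside the loop each use reloads it from its stack home and each update stores it back, as in the OSR disassembly above.

```csharp
// Reduced sketch of an address-exposed local. Taking ref x0 before the loop
// exposes its address, so the loop body keeps x0 in memory even though it is
// unaliased over the loop itself (which is why the OSR code above shows the
// extra vmovsd loads/stores against [rsp+A0H]).
using System;

class AddressExposedSketch
{
    static void Observe(ref double x) { x += 0.0; }   // address of x0 escapes here

    static double Run()
    {
        double x0 = 1.5;
        Observe(ref x0);
        double acc = 0.0;
        for (int i = 0; i < 1_000_000; i++)
        {
            acc += x0 * x0;    // each use rereads x0 from the stack
            x0 -= 1e-7;        // each update writes it back
        }
        return acc + x0;
    }

    static void Main() => Console.WriteLine(Run());
}
```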
CSE does not consider these trees despite knowing that the two loads (at least) will return the same value.
N001 ( 3, 4) [000110] ----G------- \--* LCL_VAR double(AX) V03 loc2 $201
N001 ( 3, 4) [000122] ----G------- \--* LCL_VAR double(AX) V03 loc2 $201
presumably this runs into:
which seems a bit short-sighted? We know (as in this case) that we have some invariant/expensive local var loads; why not try CSEing them if we have the room?
Will experiment.
> we have some invariant/expensive local var loads, why not try CSEing them if we have the room.
Seems viable, though simply turning it on (even requiring cost >= `IND_COST_EX`) leads to a lot of code churn and a few asserts about incompatible types during CSE.
E.g., for benchmarks (via SPMI):
[14:25:42]
[14:25:42] Total bytes of base: 10418151 (overridden on cmd)
[14:25:42] Total bytes of diff: 10445323 (overridden on cmd)
[14:25:42] Total bytes of delta: 27172 (0.26 % of base)
[14:25:42] Top method regressions (bytes):
[14:25:42] 666 ( 6.63% of base) : 1247.dasm - DynamicClass:Regex2_TryMatchAtCurrentPosition(System.Text.RegularExpressions.RegexRunner,System.ReadOnlySpan`1[Char]):bool
[14:25:42] 546 (19.10% of base) : 9755.dasm - System.Xml.Serialization.XmlSerializationReaderILGen:GetArraySource(System.Xml.Serialization.TypeDesc,System.String,bool):System.String:this
[14:25:42] 486 (10.86% of base) : 9477.dasm - DynamicClass:Regex1_TryMatchAtCurrentPosition(System.Text.RegularExpressions.RegexRunner,System.ReadOnlySpan`1[Char]):bool
[14:25:42] 486 (10.86% of base) : 9480.dasm - DynamicClass:Regex1_TryMatchAtCurrentPosition(System.Text.RegularExpressions.RegexRunner,System.ReadOnlySpan`1[Char]):bool
[14:25:42] Top method improvements (bytes):
[14:25:42] -601 (-13.04% of base) : 8376.dasm - Newtonsoft.Json.JsonWriter:WriteValue(Newtonsoft.Json.JsonWriter,int,System.Object)
[14:25:42] -391 (-3.30% of base) : 2438.dasm - Utf8Json.Formatters.MicroBenchmarks_Serializers_LocationFormatter1:Serialize(byref,MicroBenchmarks.Serializers.Location,Utf8Json.IJsonFormatterResolver):this
[14:25:42] -351 (-10.04% of base) : 26136.dasm - Microsoft.CodeAnalysis.CSharp.Syntax.InternalSyntax.LanguageParser:ParseNamespaceBody(byref,byref,byref,ushort):this
[14:25:42] -306 (-21.32% of base) : 5590.dasm - System.Guid:TryParseExactN(System.ReadOnlySpan`1[Char],byref):bool
[14:25:42] -302 (-19.48% of base) : 637.dasm - System.Guid:TryParseExactD(System.ReadOnlySpan`1[Char],byref):bool
[14:25:42] 3888 total methods with Code Size differences (1029 improved, 2859 regressed), 1627 unchanged.
I recall a while back I was touting something like this as an alternative to EH Write Through but never pushed on it.
Anyways, probably something too radical to land in a timely manner.
Latest update to BDN has made its way to the perf repo and OSR runs. This should fix the `ssl: true` "regressions" where OSR consistently fared worse than main.
Seeing a new issue now in general OSR testing, presumably from the interaction of OSR and #66257.
Assert failure(PID 12384 [0x00003060], Thread: 10696 [0x29c8]): Assertion failed 'h2 != nullptr' in 'LUDecomp:DoLUIteration(System.Double[][],System.Double[],System.Double[][][],System.Double[][],int):long' during 'Clone loops' (IL size 144; hash 0x891a5971; Tier1-OSR)
Possible next steps now that #32969 is merged, in rough order of priority.
- Assert failure(PID 7028 [0x00001b74], Thread: 7084 [0x1bac]): `ppInfo->m_osrMethodCode == NULL` -- likely the logic guarding against threads racing to build the patchpoint method needs adjusting (likely fixed by #38165)
- look at how debuggers handle OSR frames; if the double-RBP restore is too confusing, think about relying on the original method's RBP (will still need split save areas). On further thought, it seems like (for x64) we can pass the tier0 method caller's RBP to the osr method and just have one unwind restore. This is what I'm doing for arm64 and it seems to be working out ok. (new plan is to revise arm64 to conform with how x64 will work, see below)
- #61934
- #63642
- https://github.com/dotnet/runtime/pull/65675

Issues and fixes after OSR was enabled
Performance Regressions
Other ideas: enhancements or optimizations
cc @dotnet/jit-contrib
category:cq theme:osr skill-level:expert cost:extra-large