AndyAyersMS commented 4 years ago

Possible next steps now that #32969 is merged, in rough order of priority.

[x] implement a regularly scheduled test run that enables OSR for pri1 tests on x64 windows and x64 Linux (both default OSR and with "OSR stress") (done, #33709)
[x] fix issues with OSR creating side entries into try regions (#35687, also seen during #34522). PR: #59784
[x] fix failures in jit-experimental runs with OSR (#43534, #47532, #51057)
[x] fix interaction of OSR and PGO (#47942 fixed via PR #61453; PR #62263)
[x] run tests with OSR and some GC and JIT stress modes (PR #61934)
[x] investigate assert seen in the weekly OSR tests: Assert failure(PID 7028 [0x00001b74], Thread: 7084 [0x1bac]): ppInfo->m_osrMethodCode == NULL. Likely the logic guarding against threads racing to build the patchpoint method needs adjusting (likely fixed by #38165)
[x] partial method compilation (eg don't initially jit exceptional paths, prototyped in #34522; PR #60791)
[x] support OSR from synchronous methods (#61712)
[x] don't do OSR in methods that are asking for debug codegen (fine as is, since such methods aren't eligible for tiering)
[x] look at how OSR performs for powershell startup (which uses QJFL=1) (see note 4 below)
[x] implement variant of QJFL that bails out of methods that can't be OSR'd (see note 1 below) (https://github.com/dotnet/runtime/pull/61851)
[x] run ASP.NET perf tests and verify startup improvements are as expected / no steady state losses
[x] look at how debuggers handle OSR frames; if the double-RBP restore is too confusing, think about relying on the original method's RBP (will still need split save areas). On further thought, it seems like (for x64) we can pass the tier0 method caller's RBP to the osr method and just have one unwind restore. This is what I'm doing for arm64 and it seems to be working out ok. (new plan is to revise arm64 to conform with how x64 will work, see below)
[x] arm64 platform support (see note 7 below). (PR: https://github.com/dotnet/runtime/pull/62831)
[x] sort out complications with OSR and ALTJIT (see note 6 below). Addressed in the Arm64 PR.
[x] Exclude CI tests of crossgen2 from using OSR as it is run via a 6.0 runtime and so hits some old OSR bugs (https://github.com/dotnet/runtime/pull/62968).
[x] Fix issues uncovered by OSR stress (https://github.com/dotnet/runtime/pull/62980, https://github.com/dotnet/runtime/pull/64116)
[x] fix problem with broken epilog unwind on x64 (see note) https://github.com/dotnet/runtime/pull/65609
[x] re-enable struct promotion https://github.com/dotnet/runtime/pull/65903
[x] Track OSR impact on Techempower (data now available on the PGO tab) https://github.com/aspnet/Benchmarks/pull/1726
[x] Run debugger tests
[x] Ensure BenchmarkDotNet optimizes key auto-generated methods to avoid holding on to GC references (see notes), also https://github.com/dotnet/BenchmarkDotNet/issues/1934 and https://github.com/dotnet/BenchmarkDotNet/pull/1935
[x] Update dotnet/performance with new BDN version and verify everything's ok. Some of the tests I expected to improve did, others did not -- see note below.
[x] Look more closely at interaction of OSR and loop optimizer (see notes below) https://github.com/dotnet/runtime/pull/66208
[x] Update perfview / traceevent to properly parse new jit type (data there already) https://github.com/microsoft/perfview/pull/1584
[x] Update DAC/SOS to properly understand new native code versions for methods (see notes) https://github.com/dotnet/diagnostics/pull/2928 and https://github.com/dotnet/runtime/pull/66507
[x] Fix issue with large OSR funclet frames on arm64 https://github.com/dotnet/runtime/issues/65996 (via https://github.com/dotnet/runtime/pull/66124)
[x] Update BDN iteration strategy for long running benchmarks. https://github.com/dotnet/BenchmarkDotNet/pull/1949 and https://github.com/dotnet/performance/pull/2323
[x] Fix bad interaction of OSR and the more general loop cloning introduced in #66257 (see note) https://github.com/dotnet/runtime/pull/67067
[x] Fix stress failure https://github.com/dotnet/runtime/issues/67078 (via https://github.com/dotnet/runtime/pull/67131)
[x] run perf test suite with OSR and investigate any regressions versus current default (see notes, more notes). We have temporarily co-opted the regularly scheduled "no pgo" perf lab runs for windows x64 to actually run OSR. Results here. And also enabled autofiling so OSR perf results that differ from the old no pgo perf are reported as regressions/improvements. Example of regressions.
[x] Investigate test failure https://github.com/dotnet/runtime/issues/67215 (likely unrelated, see #66924)
[x] Fix stress failure https://github.com/dotnet/runtime/issues/67152 (https://github.com/dotnet/runtime/pull/67274)
[x] Enable QJFL and OSR by default for x64/arm64 ~~#61934~~ ~~#63642~~ https://github.com/dotnet/runtime/pull/65675
[x] Enable use of sparse edge instrumentation in OSR methods (https://github.com/dotnet/runtime/issues/47942). #80481
[x] Import entire method initially and trim unneeded parts once we are done with morph. This fixes lingering issues with computing local exposure: #83910
[x] Ensure Tier0-exposed locals are normalized on load in the OSR method: #84000
[ ] Run diagnostic tests (blocked; they're not yet updated for the .NET 7 branch)

Issues and fixes after OSR was enabled

[x] https://github.com/dotnet/runtime/issues/67488 (fixed by https://github.com/dotnet/runtime/pull/67680)
[x] https://github.com/dotnet/runtime/issues/67668 (fixed by https://github.com/dotnet/runtime/pull/67678)
[x] https://github.com/dotnet/runtime/issues/67410 (fixed by https://github.com/dotnet/runtime/pull/67884)
[x] https://github.com/dotnet/runtime/issues/68003 (fixed by https://github.com/dotnet/runtime/pull/68048)
[x] https://github.com/dotnet/runtime/issues/68170 (fixed by https://github.com/dotnet/runtime/pull/68198)
[x] https://github.com/dotnet/runtime/issues/68194 (fixed by https://github.com/dotnet/runtime/pull/68202)
[x] https://github.com/dotnet/runtime/issues/70263 (fixed by https://github.com/dotnet/runtime/pull/70916)
[x] https://github.com/dotnet/runtime/issues/71005 (fixed by https://github.com/dotnet/runtime/pull/71245)
[x] https://github.com/dotnet/runtime/issues/75828 (fixed by https://github.com/dotnet/runtime/pull/75922)
[x] https://github.com/dotnet/runtime/issues/83783 (fixed by #83910)

Performance Regressions

[x] https://github.com/dotnet/runtime/issues/67594
[x] https://github.com/dotnet/runtime/issues/78127
[x] #78110
[ ] #80210
[x] #80757

Other ideas: enhancements or optimizations

[ ] Update arm64 to use the same split callee-save technique we now use on x64, and pass Tier0 FP to the OSR method. This gives arm64 methods standard epilogs.
[ ] Revise Arm64 frame layout to put PSPSym above callee-saves, so that the OSR method can share the Tier0 PSP, and OSR funclets don't need to pad their frames with the Tier0 frame (see notes starting here) and (more notes). Or, revise the OSR method so it shares the PSP slot with the Tier0 frame (requires split callee-save above).
[ ] look into enabling more independent promotion in OSR methods. Right now we use the Tier0 address exposure data and this is very conservative. Also see https://github.com/dotnet/runtime/pull/67131.
- Possibly addressed by https://github.com/dotnet/runtime/pull/83910
- Possibly addressed by https://github.com/dotnet/runtime/pull/83388
[ ] support OSR in methods with stackalloc (see note 2 below and further notes)
[ ] support OSR for reverse pinvoke methods (see note 3 below)
[ ] support OSR from methods that make explicit tail calls (see note 5 below)
[ ] implement aggressive frame trimming (reduce original method frame to just live extent)
[ ] look into viability of backpatching the patchpoint call with a jump to the OSR method instead
[ ] look into how to support limited Tier0 opts with OSR
[ ] look into emitting more compact patchpoint code sequences
[ ] look into emitting more compact patchpoint info blobs
[ ] think about asynchronous creation of OSR methods
[ ] look into the feasibility of having one OSR method cover all the patchpoints
[ ] look into using the "mutator" tool in jitutils to inject loops into methods that don't have them, so that we can trigger OSR in more cases.
- Note that random patchpoint placement and fast OSR triggers can accomplish something similar without needing to alter tests. There's nothing saying a patchpoint has to be within a loop. https://github.com/dotnet/runtime/pull/62980
[ ] OSR + GS -- perhaps OSR method should have its own cookie (if needed) in addition to the Tier0 cookie, and check them both on exit? Currently we just check the Tier0 cookie, but if the OSR frame holds saved LR/FP we might miss an overrun. Note: not needed if we move arm64 to the new x64 plan, as there's just one save area that gets restored, and it is in the Tier0 frame.
[ ] update runtime strategy to support "slow" OSR method creation, but quick transitions when OSR methods exist
[ ] support for mid-block patchpoints (where IL stack is empty). Among other things, this would let us do "source" patchpoint targeting more often.
[ ] support for patchpoints at non-stack empty IL offsets. Would require custom per-site patchpoint descriptors and more.
[ ] defer altering control flow for OSR until much later. Currently we do it very early and need to protect the original method entry specially in case we want to branch there during morph (see https://github.com/dotnet/runtime/pull/94597#issuecomment-1807163840).

cc @dotnet/jit-contrib

category:cq theme:osr skill-level:expert cost:extra-large

AndyAyersMS commented 2 years ago

The jit is trying to clone the outer loop here:

Cloning loop L00: [head: BB01, top: BB02, entry: BB04, bottom: BB09, child: L02].

AndyAyersMS commented 2 years ago

Suspect the fix is simple? Here we have BB01->bbNext == BB04. So we decide not to create a new block for h2 here:

https://github.com/dotnet/runtime/blob/ea4ebaa3c5162bcabc63284ba3b59aa683912af4/src/coreclr/jit/loopcloning.cpp#L1903-L1928

but BB01 does not transfer control to BB04.

BruceForstall commented 2 years ago

> Here we have BB01->bbNext == BB04

Doesn't BB01->bbNext == BB02? What is the actual lexical block order?

If you set COMPlus_JitDumpFgConstrained=1 what does it look like?

AndyAyersMS commented 2 years ago

> Here we have BB01->bbNext == BB04
>
> Doesn't BB01->bbNext == BB02? What is the actual lexical block order?

Yeah, sorry, I was off base. As you noticed in #67067 we get into this code but weren't setting up the right h2.

Constrained view of the flowgraph:

[image: constrained view of the flowgraph]

AndyAyersMS commented 2 years ago

There's one more libraries issue I want to investigate... on ubuntu x64

export COMPlus_TieredCompilation=1
export COMPlus_TC_OnStackReplacement=1
export COMPlus_TC_QuickJitForLoops=1

  Starting:    System.Text.Json.Tests (parallel test collections = on, max threads = 2)
    System.Text.Json.Tests.Utf8JsonReaderTests.ReadJsonStringsWithCommentsAndTrailingCommas(jsonString: "{\"Property1\": {\"Property1.1\": 42} // comment\n"...) [FAIL]
      System.Text.Json.JsonReaderException : The JSON object contains a trailing comma at the end which is not supported in this mode. Change the reader options. LineNumber: 1 | BytePositionInLine: 1
...

AndyAyersMS commented 2 years ago

The issue for the above is #67152; the fix is in PR #67274.

AndyAyersMS commented 1 year ago

#83910 improved a couple of the microbenchmarks, notably

[plot: microbenchmark results]

and might also fix some of the reported regressions -- taking a look.

AndyAyersMS commented 1 year ago

Seeing some benchmark cases where methods with stackalloc + loop bypass tiering, and hence also bypass PGO: https://github.com/dotnet/runtime/issues/84264#issuecomment-1520715409

In particular

https://github.com/dotnet/runtime/blob/f2a55e228b83df6aa6dc215e295bf3da5ab6fc17/src/libraries/System.Text.Json/src/System/Text/Json/Document/JsonDocument.TryGetProperty.cs#L135-L150

Not sure how common this is but something to keep an eye on. Supporting stackalloc in its full generality with OSR would be hard, because we potentially would need to track 3 addressable segments of the stack, but it's not impossible.

It might be easier to revise the BCL so this doesn't happen in places where we care about perf. The proposed mitigation would be to split the method into a caller that stackallocs and a callee that loops. These parts can be reunited (if deemed profitable) via normal inlining, or the callee marked with AggressiveInlining.
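A rough sketch of the split described above, for illustration only. The class, method names, and the 128-element threshold are hypothetical and are not taken from the BCL or from the linked JsonDocument code. The combined method holds both the stackalloc and the loop; the split version keeps the stackalloc in a loop-free caller and moves the loop into a callee that takes the span as an argument.

```csharp
using System;

public static class StackallocSplitSketch
{
    // Hypothetical threshold, chosen only for this example.
    private const int MaxStackInts = 128;

    // Before: stackalloc and the loop live in the same method. Per the
    // discussion above, a method shaped like this bypasses tiering.
    public static int SumOfSquaresCombined(int count)
    {
        Span<int> buffer = count <= MaxStackInts
            ? stackalloc int[MaxStackInts]
            : new int[count];
        buffer = buffer.Slice(0, count);

        int sum = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = i * i;
            sum += buffer[i];
        }
        return sum;
    }

    // After: the caller only allocates; it contains no loop.
    public static int SumOfSquaresSplit(int count)
    {
        Span<int> buffer = count <= MaxStackInts
            ? stackalloc int[MaxStackInts]
            : new int[count];
        return SumOfSquaresCore(buffer.Slice(0, count));
    }

    // The callee loops but has no stackalloc. It could instead be marked
    // with [MethodImpl(MethodImplOptions.AggressiveInlining)] if reuniting
    // the two parts is deemed profitable.
    private static int SumOfSquaresCore(Span<int> buffer)
    {
        int sum = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = i * i;
            sum += buffer[i];
        }
        return sum;
    }

    public static void Main()
    {
        // Both shapes compute the same value; only the method layout differs.
        Console.WriteLine(SumOfSquaresCombined(4)); // prints 14
        Console.WriteLine(SumOfSquaresSplit(4));    // prints 14
    }
}
```

Both versions assume a non-negative count. Whether the split pays off in a given BCL method would need to be measured.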

FYI @stephentoub -- possible pattern to avoid since it creates methods that can't benefit from Dynamic PGO.

Forked this off as #85548

hez2010 commented 1 year ago

> Not sure how common this is but something to keep an eye on.

I think it is common, because many developers who care about allocations and performance write code like the following nowadays.

const int StackAllocSize = 128;
Span<T> buffer = length < StackAllocSize ? stackalloc T[length] : new T[length];

dotnet / runtime

On Stack Replacement Next Steps #33658
