Open AndyAyersMS opened 4 years ago
The jit is trying to clone the outer loop here:
Cloning loop L00: [head: BB01, top: BB02, entry: BB04, bottom: BB09, child: L02].
Suspect the fix is simple? Here we have BB01->bbNext == BB04. So we decide not to create a new block for h2 here:
but BB01 does not transfer control to BB04.
Here we have BB01->bbNext == BB04
Doesn't BB01->bbNext == BB02? What is the actual lexical block order?
If you set COMPlus_JitDumpFgConstrained=1 what does it look like?
Yeah, sorry, I was off base. As you noticed in #67067 we get into this code but weren't setting up the right h2.
Constrained view of the flowgraph:
There's one more libraries issue I want to investigate... on ubuntu x64
export COMPlus_TieredCompilation=1
export COMPlus_TC_OnStackReplacement=1
export COMPlus_TC_QuickJitForLoops=1
Starting: System.Text.Json.Tests (parallel test collections = on, max threads = 2)
System.Text.Json.Tests.Utf8JsonReaderTests.ReadJsonStringsWithCommentsAndTrailingCommas(jsonString: "{\"Property1\": {\"Property1.1\": 42} // comment\n"...) [FAIL]
System.Text.Json.JsonReaderException : The JSON object contains a trailing comma at the end which is not supported in this mode. Change the reader options. LineNumber: 1 | BytePositionInLine: 1
...
Issue for above #67152 -- fix in PR at #67274.
and might also fix some of the reported regressions -- taking a look.
Seeing some benchmark cases where there are methods with stackalloc + loop that bypass tiering: https://github.com/dotnet/runtime/issues/84264#issuecomment-1520715409 and hence also bypass PGO.
In particular
Not sure how common this is but something to keep an eye on. Supporting stackalloc in its full generality with OSR would be hard, because we potentially would need to track 3 addressable segments of the stack, but it's not impossible.
It might be easier to revise the BCL so this doesn't happen in places where we care about perf. The proposed mitigation would be to split the method into a caller that stackallocs and a callee that loops. These parts can be reunited (if deemed profitable) via normal inlining, or the callee marked with AggressiveInlining.
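A minimal sketch of that split, for illustration only. The type and method names (StackAllocSplit, SumOfSquaresCombined, SumOfSquaresSplit, FillAndSum) and the 128-element threshold are made up for this example and are not taken from the BCL. The first method has the stackalloc and the loop in one body, which is the shape described above; the second keeps the stackalloc in a loop-free caller and moves the loops into a callee.

```csharp
using System;

public static class StackAllocSplit
{
    // Hypothetical threshold, chosen only for this example.
    private const int StackAllocSize = 128;

    // Shape to avoid: stackalloc and loops live in the same method.
    public static int SumOfSquaresCombined(int length)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));

        Span<int> buffer = length <= StackAllocSize ? stackalloc int[length] : new int[length];

        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = i * i;
        }

        int sum = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            sum += buffer[i];
        }
        return sum;
    }

    // Split version: the caller only allocates and contains no loop.
    public static int SumOfSquaresSplit(int length)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));

        Span<int> buffer = length <= StackAllocSize ? stackalloc int[length] : new int[length];
        return FillAndSum(buffer);
    }

    // The callee holds the loops and does no stackalloc of its own.
    // It could optionally be marked with
    // [MethodImpl(MethodImplOptions.AggressiveInlining)] as suggested above.
    private static int FillAndSum(Span<int> buffer)
    {
        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = i * i;
        }

        int sum = 0;
        for (int i = 0; i < buffer.Length; i++)
        {
            sum += buffer[i];
        }
        return sum;
    }
}
```

Both versions compute the same result; the only difference is which method contains the loops. Whether the split actually pays off for a given method would need to be measured.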
FYI @stephentoub -- possible pattern to avoid since it creates methods that can't benefit from Dynamic PGO.
Forked this off as #85548
Not sure how common this is but something to keep an eye on.
I think it is common, because many developers (those who care about allocations and performance) write code like the following nowadays.
const int StackAllocSize = 128;
Span<T> buffer = length < StackAllocSize ? stackalloc T[length] : new T[length];
Possible next steps now that #32969 is merged, in rough order of priority.
Assert failure(PID 7028 [0x00001b74], Thread: 7084 [0x1bac]): ppInfo->m_osrMethodCode == NULL -- likely the logic guarding against threads racing to build the patchpoint method needs adjusting (likely fixed by #38165)
look at how debuggers handle OSR frames; if the double-RBP restore is too confusing, think about relying on the original method's RBP (will still need split save areas). On further thought, it seems like (for x64) we can pass the tier0 method caller's RBP to the osr method and just have one unwind restore. This is what I'm doing for arm64 and it seems to be working out ok. (new plan is to revise arm64 to conform with how x64 will work, see below)
#61934
#63642
https://github.com/dotnet/runtime/pull/65675
Issues and fixes after OSR was enabled
Performance Regressions
Other ideas: enhancements or optimizations
cc @dotnet/jit-contrib
category:cq theme:osr skill-level:expert cost:extra-large