dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.98k stars 4.66k forks source link

On Stack Replacement Next Steps #33658

Open AndyAyersMS opened 4 years ago

AndyAyersMS commented 4 years ago

Possible next steps now that #32969 is merged, in rough order of priority.

Issues and fixes after OSR was enabled

Performance Regressions

Other ideas: enhancements or optimizations

cc @dotnet/jit-contrib

category:cq theme:osr skill-level:expert cost:extra-large

AndyAyersMS commented 2 years ago

The jit is trying to clone the outer loop here:

Cloning loop L00: [head: BB01, top: BB02, entry: BB04, bottom: BB09, child: L02].

image - 2022-03-23T104843 684

AndyAyersMS commented 2 years ago

Suspect the fix is simple? Here we have BB01->bbNext == BB04. So we decide not to create a new block for h2 here:

https://github.com/dotnet/runtime/blob/ea4ebaa3c5162bcabc63284ba3b59aa683912af4/src/coreclr/jit/loopcloning.cpp#L1903-L1928

but BB01 does not transfer control to BB04.

BruceForstall commented 2 years ago

Here we have BB01->bbNext == BB04

Doesn't BB01->bbNext == BB02? What is the actual lexical block order?

If you set COMPlus_JitDumpFgConstrained=1 what does it look like?

AndyAyersMS commented 2 years ago

Here we have BB01->bbNext == BB04

Doesn't BB01->bbNext == BB02? What is the actual lexical block order?

Yeah, sorry, I was off base. As you noticed in #67067 we get into this code but weren't setting up the right h2.

Constrained view of the flowgraph:

image - 2022-03-23T164633 422

AndyAyersMS commented 2 years ago

There's one more libraries issue I want to investigate... on ubuntu x64

export COMPlus_TieredCompilation=1
export COMPlus_TC_OnStackReplacement=1
export COMPlus_TC_QuickJitForLoops=1

  Starting:    System.Text.Json.Tests (parallel test collections = on, max threads = 2)
    System.Text.Json.Tests.Utf8JsonReaderTests.ReadJsonStringsWithCommentsAndTrailingCommas(jsonString: "{\"Property1\": {\"Property1.1\": 42} // comment\n"...) [FAIL]
      System.Text.Json.JsonReaderException : The JSON object contains a trailing comma at the end which is not supported in this mode. Change the reader options. LineNumber: 1 | BytePositionInLine: 1
...
AndyAyersMS commented 2 years ago

Issue for above #67152 -- fix in PR at #67274.

AndyAyersMS commented 1 year ago

83910 improved a couple of the microbenchmarks, notably

newplot - 2023-03-30T123918 036

and might also fix some of the reported regressions -- taking a look.

AndyAyersMS commented 1 year ago

Seeing some benchmark cases where there are methods with stackalloc + loop that bypass tiering: https://github.com/dotnet/runtime/issues/84264#issuecomment-1520715409 and hence also bypass PGO.

In particular

https://github.com/dotnet/runtime/blob/f2a55e228b83df6aa6dc215e295bf3da5ab6fc17/src/libraries/System.Text.Json/src/System/Text/Json/Document/JsonDocument.TryGetProperty.cs#L135-L150

Not sure how common this is but something to keep an eye on. Supporting stackalloc in its full generality with OSR would be hard, because we potentially would need to track 3 addressable segments of the stack, but it's not impossible.

It might be easier to revise the BCL so this doesn't happen in places where we care about perf. The proposed mitigation would be to split the method into a caller that stackallocs and a callee that loops. These parts can be reunited (if deemed profitable) via normal inlining, or the callee marked with AggressiveInlining.

FYI @stephentoub -- possible pattern to avoid since it creates methods that can't benefit from Dynamic PGO.

Forked this off as #85548

hez2010 commented 1 year ago

Not sure how common this is but something to keep an eye on.

I think it is common because many developers (who cares about allocations and performance) are writing code like below nowadays.

const int StackAllocSize = 128;
Span<T> buffer = length < StackAllocSize ? stackalloc T[length] : new T[length];