Significant lock contention with parallel JIT compilation via expression trees

PJB3005 commented 2 weeks ago

Description

When doing a lot of JIT compilation in parallel, there seems to be heavy contention surrounding CEEInfo::reportInliningDecision. I managed to trigger this through some heavy usage of expression tree compilation we have in our project.

I was trying to parallelize our integration tests, because for some reason somebody had put them to be limited to 3x parallelism. I removed that cap so the tests could go run at x24 on my machine, and found that they wouldn't progress at all. Digging into this with Rider's timeline profiler I found the following:

Almost all the CPU time in the parallel test run was being used by parallel expression tree compilation, and almost all of that was spent inside CEEInfo::reportInliningDecision blocking on a critical section. Screenshot of the profile I made down below.

I need to make it clear that we are effectively JITing some 20k+ dynamicmethods, times 24, in parallel here. This is due to how our integration tests and serialization system works (yeah, I've been meaning to fix our serialization system, I know). When just starting our game in normal conditions we still see significant amounts of CPU time in reportInliningInfo, but it's dwarfed by unwind info reporting from the profiler being attached. Screenshot below again.

Repro

I tried making a simple repro but couldn't.

Configuration

.NET Runtime: 8.0.8 CPU: AMD Ryzen 9 5900X 12-Core Processor 3.70 GHz RAM: 32 GB Windows info:

Edition Windows 11 Pro
Version 23H2
Installed on    ‎17/‎12/‎2023
OS build    22631.4112
Experience  Windows Feature Experience Pack 1000.22700.1034.0

Regression?

Not that I am aware of.

Data

Integration test backtrace view of the critical section contentions in the JIT, almost all of it is the reportInliningInfo:

"Normal" program run backtrace:

EgorBo commented 2 weeks ago

A minimal repro would be indeed helpful

EgorBo commented 2 weeks ago

It seems that reportInlineDecision takes a lock if profiler is presented (I assume dotTrace uses profiler api too) https://github.com/dotnet/runtime/blob/ac663e44b907618c631edd614231b73677843952/src/coreclr/vm/jitinterface.cpp#L8159-L8171

cc @dotnet/area-system-diagnostics if this can be avoided?

EgorBo commented 2 weeks ago

Ah, sorry, it turns out to be a different lock - inside InlineTrackingMap::AddInlining so is not profiler related.

EgorBo commented 2 weeks ago

From a quick look it seems that we do slow lookup in InliningExistsDontTakeLock (O(N)) under a lock.

PJB3005 commented 2 weeks ago

Got a repro down by copying half the surrounding code till it happens https://github.com/PJB3005/ExpressionTreeJitContentionTest

dotnet-policy-service[bot] commented 1 week ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

EgorBo commented 1 week ago

Ok, so I spent a bit of time looking at your repro: reportInliningDecision does two things: 1) Add some tracking for ReJIT profiler (which is not used when you just run the repro in console) 2) A small overhead for ETW (JittingStarted event)

if I leave reportInliningDecision to do nothing, your repro is still slow for me (in fact, I don't see any improvement at all). It looks like JIT is indeed the bottleneck here since it has to compile a lot, here is the flamegraph:

so I don't see what we can do (especially, for 9.0) here besides general work to improve JIT's throughput. cc @dotnet/jit-contrib

PJB3005 commented 1 week ago

if I leave reportInliningDecision to do nothing, your repro is still slow for me (in fact, I don't see any improvement at all). It looks like JIT is indeed the bottleneck here since it has to compile a lot, here is the flamegraph:

Is it possible I could try this branch myself? Then I could see how it stacks up in the real scenario.

EgorBo commented 1 week ago

if I leave reportInliningDecision to do nothing, your repro is still slow for me (in fact, I don't see any improvement at all). It looks like JIT is indeed the bottleneck here since it has to compile a lot, here is the flamegraph:

Is it possible I could try this branch myself? Then I could see how it stacks up in the real scenario.

Sure, you can just delete this part https://github.com/dotnet/runtime/blob/959c10e41f1180482a7bc25e2bba16d27aca343a/src/coreclr/vm/jitinterface.cpp#L8040-L8191

and build runtime via build.cmd Clr -c Release

dotnet / runtime