Open PJB3005 opened 2 weeks ago
A minimal repro would indeed be helpful.
It seems that `reportInliningDecision` takes a lock if a profiler is present (I assume dotTrace uses the profiler API too):
https://github.com/dotnet/runtime/blob/ac663e44b907618c631edd614231b73677843952/src/coreclr/vm/jitinterface.cpp#L8159-L8171
cc @dotnet/area-system-diagnostics, can this be avoided?
Ah, sorry, it turns out to be a different lock, inside `InlineTrackingMap::AddInlining`, so it is not profiler related.
From a quick look, it seems that we do a slow O(N) lookup in `InliningExistsDontTakeLock` under a lock.
Got a repro by copying half the surrounding code until it happens: https://github.com/PJB3005/ExpressionTreeJitContentionTest
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Ok, so I spent a bit of time looking at your repro. `reportInliningDecision` does two things:
1) Adds some tracking for the ReJIT profiler (which is not used when you just run the repro in a console).
2) A small overhead for ETW (the JittingStarted event).
If I make `reportInliningDecision` do nothing, your repro is still slow for me (in fact, I don't see any improvement at all). It looks like the JIT is indeed the bottleneck here since it has to compile a lot; here is the flamegraph:
So I don't see what we can do here (especially for 9.0) besides general work to improve the JIT's throughput. cc @dotnet/jit-contrib
> if I leave `reportInliningDecision` to do nothing, your repro is still slow for me (in fact, I don't see any improvement at all). It looks like JIT is indeed the bottleneck here since it has to compile a lot, here is the flamegraph:

Is it possible I could try this branch myself? Then I could see how it stacks up in the real scenario.
> Is it possible I could try this branch myself? Then I could see how it stacks up in the real scenario.
Sure, you can just delete this part https://github.com/dotnet/runtime/blob/959c10e41f1180482a7bc25e2bba16d27aca343a/src/coreclr/vm/jitinterface.cpp#L8040-L8191
and build the runtime via `build.cmd Clr -c Release`.
Description
When doing a lot of JIT compilation in parallel, there seems to be heavy contention around `CEEInfo::reportInliningDecision`. I managed to trigger this through some heavy usage of expression tree compilation we have in our project.

I was trying to parallelize our integration tests, because for some reason somebody had limited them to 3x parallelism. I removed that cap so the tests could run at 24x on my machine, and found that they wouldn't progress at all. Digging into this with Rider's timeline profiler, I found the following: almost all the CPU time in the parallel test run was being used by parallel expression tree compilation, and almost all of that was spent inside `CEEInfo::reportInliningDecision`, blocking on a critical section. Screenshot of the profile I made down below.

I need to make it clear that we are effectively JITing some 20k+ DynamicMethods, times 24, in parallel here. This is due to how our integration tests and serialization system work (yeah, I've been meaning to fix our serialization system, I know). When just starting our game in normal conditions, we still see significant amounts of CPU time in `reportInliningDecision`, but it's dwarfed by unwind info reporting from the attached profiler. Screenshot below again.

Repro
I tried making a simple repro but couldn't.
Configuration
.NET Runtime: 8.0.8
CPU: AMD Ryzen 9 5900X 12-Core Processor, 3.70 GHz
RAM: 32 GB
Windows info:
Regression?
Not that I am aware of.
Data
Integration test backtrace view of the critical section contention in the JIT; almost all of it is in `reportInliningDecision`:

"Normal" program run backtrace: