dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Investigate why crossgen works slower with TieredCompilation/PGO #83112

Open EgorBo opened 1 year ago

EgorBo commented 1 year ago

I was measuring `crossgen2.exe -O SPC.dll` (actually, the exact command we use for `build Clr.NativeCoreLib -c Release`) and noticed a few problems:

| Mode | Time to prejit SPC.dll, seconds |
|--------------------------|---------------------------------|
| TC=1 (Default) | 4.81 |
| TC=0 | 4.25 |
| TC=1, CCDelayMS=0 | 3.78 |
| TC=1, PGO=1 | 5.29 |
| TC=1, PGO=1, CCDelayMS=0 | 3.93 |

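Legend:

- TC: `DOTNET_TieredCompilation`
- PGO: `DOTNET_TieredPGO`
- CCDelayMS: `DOTNET_TC_CallCountingDelayMs`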

The difference is quite noticeable, so it is worth investigating; the numbers are quite stable across multiple runs. Judging by the effect of DOTNET_TC_CallCountThreshold, we seem to have some contention in call counting stub installation/promotion to tier1.
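
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a timing harness. It is not the script used for the numbers above. The helper names (`MODES`, `time_command`, `measure_all`) are hypothetical, the mode labels only mirror the table, and the command you pass in is a placeholder for whatever crossgen2 invocation your build actually uses. Modes only override the knobs they name, so the baseline depends on the defaults of the runtime being tested.

```python
import os
import statistics
import subprocess
import sys
import time

# Mode label -> environment overrides (assumed mapping of the table's labels
# onto the runtime's DOTNET_* knobs; unlisted knobs keep their defaults).
MODES = {
    "TC=1 (Default)": {},
    "TC=0": {"DOTNET_TieredCompilation": "0"},
    "TC=1, CCDelayMS=0": {"DOTNET_TC_CallCountingDelayMs": "0"},
    "TC=1, PGO=1": {"DOTNET_TieredPGO": "1"},
    "TC=1, PGO=1, CCDelayMS=0": {
        "DOTNET_TieredPGO": "1",
        "DOTNET_TC_CallCountingDelayMs": "0",
    },
}


def time_command(cmd, env_overrides, runs=5):
    """Run cmd `runs` times with the given env overrides; return median seconds."""
    env = {**os.environ, **env_overrides}
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True makes a failing run raise instead of producing a bogus sample.
        subprocess.run(cmd, env=env, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def measure_all(cmd, modes=MODES, runs=5):
    """Return {mode label: median wall-clock seconds} for every mode."""
    return {name: time_command(cmd, overrides, runs)
            for name, overrides in modes.items()}


# Usage (placeholder command, substitute the real crossgen2 command line):
# for mode, secs in measure_all(["crossgen2.exe", "-O", "SPC.dll"]).items():
#     print(f"{mode}: {secs:.2f}")
```

The median is used rather than the mean so that a single slow outlier run does not skew the comparison.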

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details
I was measuring `crossgen2.exe -O SPC.dll` (actually, the exact command we use for `build Clr.NativeCoreLib -c Release`) and noticed a few problems:

| Mode | Time to prejit SPC.dll, seconds |
|-------------|---------------------------------|
| TC=0 | 4.23 |
| TC=1 | 4.71 |
| TC=1, PGO=1 | 5.29 |

The difference is quite noticeable so worth investigating.

Few observations so far - it seems there is a huge benefit from increasing call counting threshold for R2R'd code to e.g `1000` - will file a PR with that since @davidwrighton made `IsReadyToRun(PCODE)` VM API cheap now. It's needed because we don't want to re-jit R2R to InstrumentedTier too early.

Investigating this in VTune now, e.g. here is a VTune comparison for `TC=1,PGO=1` vs `TC=1,PGO=0`:

![image](https://user-images.githubusercontent.com/523221/223566939-8b59f1cc-cf04-4c17-9baf-cb715f185d82.png)
Author: EgorBo
Assignees: EgorBo
Labels: `area-CodeGen-coreclr`, `untriaged`
Milestone: -
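
To illustrate the observation above about raising the call counting threshold for R2R'd code, here is a toy model of per-method call counting. It is not the VM's implementation: `Method`, `record_call`, and `count_rejits` are hypothetical names, `30` is assumed as the default threshold, and `1000` is just the example value floated in the issue. The point is only that a higher threshold for precompiled code means fewer methods get queued for re-jitting.

```python
from dataclasses import dataclass

DEFAULT_THRESHOLD = 30   # assumed default call count threshold
R2R_THRESHOLD = 1000     # example value from the issue for R2R'd code


@dataclass
class Method:
    name: str
    is_r2r: bool                  # precompiled (ReadyToRun) vs. tier0 JIT code
    calls: int = 0
    rejit_requested: bool = False


def record_call(method, r2r_threshold=DEFAULT_THRESHOLD):
    """Count one call; return True only on the call that crosses the threshold."""
    if method.rejit_requested:
        return False
    method.calls += 1
    limit = r2r_threshold if method.is_r2r else DEFAULT_THRESHOLD
    if method.calls >= limit:
        method.rejit_requested = True
        return True
    return False


def count_rejits(call_counts, r2r_threshold):
    """call_counts: iterable of (is_r2r, number_of_calls); returns re-jit requests."""
    rejits = 0
    for i, (is_r2r, n) in enumerate(call_counts):
        method = Method(f"m{i}", is_r2r)
        for _ in range(n):
            if record_call(method, r2r_threshold):
                rejits += 1
    return rejits


# Three R2R methods with 50, 200 and 5000 calls, plus one tier0 method with 40.
workload = [(True, 50), (True, 200), (True, 5000), (False, 40)]
print(count_rejits(workload, DEFAULT_THRESHOLD))  # prints 4
print(count_rejits(workload, R2R_THRESHOLD))      # prints 2
```

With the same threshold for all code, every method in this made-up workload is re-jitted; with the higher R2R threshold, only the hottest R2R method and the tier0 method are.
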
EgorBo commented 1 year ago

Fun fact: DOTNET_TC_CallCountingDelayMs=1 makes TC=1 (default) faster than TC=0, so apparently there is a lot of contention when installing call counting stubs.

EgorBo commented 1 year ago

cc @noahfalk @kouvel

kunalspathak commented 1 year ago

From https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362_RunKind%3dcrossgen_scenarios%2fCrossgen2%20Throughput%20-%20Single%20-%20System.Private.CoreLib.html

The regression is seen after enabling dynamic PGO in https://github.com/dotnet/runtime/pull/86225

@davidwrighton @mangod9

AndyAyersMS commented 1 year ago

As mentioned offline we also ought to start measuring with the NAOT'd crossgen2.

MichalStrehovsky commented 1 year ago

#89489 disabled tiering to work around this (matching the workaround already used in ILC), so if we still do TP measurements in the non-shipping configuration of crossgen2, there's going to be an improvement.