dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Native AOT - Static PGO MIBC passed to ILC appears to be ignored #95236

Open HighPerfDotNet opened 11 months ago

HighPerfDotNet commented 11 months ago

Description

Hello,

I've been trying to get Static PGO working in Native AOT builds, something that is really necessary because of the lower performance I otherwise observe in such builds. For this repro I followed one of the rare examples of how it is supposed to work, from here: https://znlive.com/how-to-use-pgo-to-improve-the-performance (Static PGO seems to be used very rarely, which made me sad).

I've followed all the instructions and finally collected profile stats into a MIBC file. However, after that all my attempts to get this data actually used failed - the resulting binaries were as slow as without profile data. Here are the outputs from the builds:

  1. Dynamic PGO - as expected first iteration is slow, but then things speed up:

     Iteration 0: 1212 ms.
     Iteration 1: 998 ms.
     Iteration 2: 1012 ms.
     Iteration 3: 1043 ms.
     Iteration 4: 1025 ms.
     Iteration 5: 1020 ms.
     Iteration 6: 1033 ms.
     Iteration 7: 1028 ms.
     Iteration 8: 1031 ms.
     Iteration 9: 1011 ms.

  2. Native AOT version as is, without Static PGO applied -

     Iteration 0: 1542 ms.
     Iteration 1: 1533 ms.
     Iteration 2: 1561 ms.
     Iteration 3: 1564 ms.
     Iteration 4: 1556 ms.
     Iteration 5: 1537 ms.
     Iteration 6: 1554 ms.
     Iteration 7: 1567 ms.
     Iteration 8: 1563 ms.
     Iteration 9: 1643 ms.

  3. Native AOT version supposedly WITH Static PGO:

     Iteration 0: 1546 ms.
     Iteration 1: 1529 ms.
     Iteration 2: 1527 ms.
     Iteration 3: 1533 ms.
     Iteration 4: 1522 ms.
     Iteration 5: 1533 ms.
     Iteration 6: 1541 ms.
     Iteration 7: 1535 ms.
     Iteration 8: 1534 ms.
     Iteration 9: 1524 ms.

The binaries are nearly identical too, so it seems that the PGO data is just being ignored.

I've read practically everything I could find on this subject on the web and in old issues here; the typical error appears to have been how multiple parameters are passed to crossgen2. I've tried the forms that are supposed to work, and all sorts of combinations, all to no avail. My current params are:

/p:PublishReadyToRunCrossgen2ExtraArgs=--embed-pgo-data%3b--mibc%3apgo.mibc%3b--compilebubblegenerics
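(For clarity: %3b and %3a are MSBuild escape sequences for ';' and ':', so after unescaping crossgen2 should receive these three arguments:)

--embed-pgo-data
--mibc:pgo.mibc
--compilebubblegenerics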

Note that --compilebubblegenerics was added based on reading this old issue https://github.com/dotnet/runtime/issues/55695

I hope I've made some silly mistake in the parameters, but at this point I've exhausted all possibilities and would love to hear from people who know this stuff better than me. I'm attaching a small repro with the MIBC file.

I ran it with logging (-v:n passed to dotnet) and noticed the following in the log:

Skipping target "_CreateR2RImages" because it has no outputs.

I don't know if this is significant - binaries are being created, and no playing with options changed this. My general feeling is that crossgen2 isn't used at all; there is no mention of it in the generated log. I can't see what I am doing wrong though, as this is a very basic default project that in theory should work just fine. HELP!

Reproduction Steps

StaticPGO.zip

File includes the following:

build_all.bat - batch file to build executables with and without PGO being applied
build_no_pgo.bat - build batch file without PGO file provided
build_with_pgo.bat - build batch file with PGO mibc file provided
pgo.mibc - profile data captured for this application (removed scripts to do it for simplicity but can provide)
PgoExperiment.csproj - project file
Program.cs - simple test source code

Expected behavior

The MIBC file with application profile data should have been taken into account to improve the performance of the Native AOT build, which is otherwise much slower than the non-Native-AOT build.

Actual behavior

Based on performance, the PGO file appears to be ignored completely, and the binaries are almost identical too, apart from a few bytes (time/version stamps most likely).

Regression?

Yes - it appears to have worked in the past with .NET 6.

Known Workarounds

None, since Native AOT has only just become kind-of-ready with the .NET 8 release.

Configuration

.NET 8.0.0 with the current SDK
Visual Studio 2022 17.8.1
Windows 10
Intel 9900K CPU, 32 GB RAM

Other information

As a side note - Native AOT is great, but it does require Static PGO to get closer to (or maybe even ahead of) normal builds, as right now it is noticeably slower. In one of our big projects I've observed a 20-25% perf regression compared with normal builds, BUT memory usage was a LOT better - so using Static PGO is totally essential in my view. In theory it should be simple, but in practice I've run into difficulty getting it off the ground, not least because of this reported issue.

Also, dotnet-pgo.exe should be made available as part of the SDK. The old executable that can be obtained for .NET 6 does not work on .NET 8 (which is understandable), and it was rather challenging to get it built from sources (I ran into issues there that I don't want to bring up here). Even though I was successful in the end, I think this tool needs to be provided in binary form in the SDK.

ghost commented 11 months ago

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas See info in area-owners.md if you want to be subscribed.

Author: HighPerfDotNet
Assignees: -
Labels: `area-crossgen2-coreclr`, `untriaged`, `area-NativeAOT-coreclr`
Milestone: -
hez2010 commented 11 months ago

/p:PublishReadyToRunCrossgen2ExtraArgs=--embed-pgo-data%3b--mibc%3apgo.mibc%3b--compilebubblegenerics

You are passing the arguments to crossgen2, which is used for R2R, not Native AOT. To use static PGO with Native AOT, you need to pass the PGO data to ILC:

<ItemGroup>
    <IlcArg Include="--mibc%3apgo.mibc" />
</ItemGroup>
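For context, a complete minimal project file with that change would look roughly like this (the target framework and the pgo.mibc path are just examples):

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <!-- Native AOT publish, so ILC (not crossgen2) compiles the app -->
    <PublishAot>true</PublishAot>
  </PropertyGroup>
  <ItemGroup>
    <!-- extra argument forwarded to ILC; %3a is the MSBuild escape for ':' -->
    <IlcArg Include="--mibc%3apgo.mibc" />
  </ItemGroup>
</Project>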
HighPerfDotNet commented 11 months ago

Thank you for your suggestion. I had not seen anything about passing MIBC to ILC (there are fewer than 20 matches on Google for "ItemGroup IlcArg"). I assume this just needs to be added to the project file, which I did, and the new compile produces a slightly different file size, so SOMETHING is happening. But I am not seeing any performance difference compared to the same Native AOT build without the PGO file, so it still does not appear to work. :-(

I've also tried compiling with PublishAot set to false, and that indeed kicked in crossgen2; however, this form of AOT isn't what I am looking for. I clearly was wrong to think crossgen2 is what would get me to "real" Native AOT with all the expected memory savings, which as I currently understand it can't be expected from the R2R stuff.

hez2010 commented 11 months ago

I just checked your pgo.mibc file; apparently it's not the PGO data of your program (you can verify this by using dotnet-pgo dump -i pgo.mibc -o pgo.txt).

Some notes:

HighPerfDotNet commented 11 months ago

Ah, I see - for collecting traces I used the command line from that sample blog entry. It did surprise me that the trace file was rather large, but I did not pay much attention since everything looked like it worked. Previously, to collect the data I used the following:

set DOTNET_EnableEventPipe=1
set DOTNET_EventPipeConfig=Microsoft-Windows-DotNETRuntime:0x1F000080018:5
rem # Trace file output path
set DOTNET_EventPipeOutputPath=trace.nettrace
rem # Disable AOT
set DOTNET_ReadyToRun=0
rem # Enable Tiered PGO
set DOTNET_TieredPGO=1
rem # Never generate tier 1 code
set DOTNET_TC_CallCounting=0
set DOTNET_TC_QuickJitForLoops=1
set DOTNET_JitCollect64BitCounts=1

dotnet run -c Release

Looks like THAT collected far more data, or something else happened. Based on your advice I now used -

dotnet bin\Release\net8.0\win-x64\PgoExperiment.dll

This collected a LOT LESS data, so it does look legit! So far so good, BUT the same problem remains - the profile stats did not improve performance one bit, although I am now getting slightly different binaries with and without the MIBC file.

Here is an updated ZIP with two projects now - one that passes the MIBC file to ILC and one that does not. The binaries are slightly different now, sufficiently so that it feels the PGO was fed in, but the problem remains: zero effect on perf.

build_all.bat builds both versions, and I've included the batch files I used to collect and convert traces. The MIBC dump file appears correct to me.
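For anyone following along, the trace-to-MIBC conversion in those batch files is roughly the following - a sketch only; the exact option spellings should be checked against dotnet-pgo create-mibc --help:

rem convert the collected EventPipe trace into a MIBC file for ILC
dotnet-pgo create-mibc --trace trace.nettrace --output pgo.mibc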

PgoExperiment.zip

PgoExperiment_NoPGO.exe

Iteration 0: 1520 ms.
Iteration 1: 1507 ms.
Iteration 2: 1516 ms.
Iteration 3: 1517 ms.
Iteration 4: 1523 ms.
Iteration 5: 1515 ms.
Iteration 6: 1517 ms.
Iteration 7: 1511 ms.
Iteration 8: 1512 ms.
Iteration 9: 1515 ms.

PgoExperiment_WithPGO.exe

Iteration 0: 1534 ms.
Iteration 1: 1519 ms.
Iteration 2: 1516 ms.
Iteration 3: 1524 ms.
Iteration 4: 1510 ms.
Iteration 5: 1511 ms.
Iteration 6: 1521 ms.
Iteration 7: 1526 ms.
Iteration 8: 1520 ms.
Iteration 9: 1508 ms.

Whereas the JITed version consistently gives the expected, much better perf -

bin\Release\net8.0\win-x64\PgoExperiment_WithPGO.exe

Iteration 0: 1209 ms.
Iteration 1: 991 ms.
Iteration 2: 987 ms.
Iteration 3: 1011 ms.
Iteration 4: 994 ms.
Iteration 5: 990 ms.
Iteration 6: 990 ms.
Iteration 7: 992 ms.
Iteration 8: 994 ms.
Iteration 9: 994 ms.

I am going to log off to catch some sleep now, and plan to test a much bigger real app that we have, in the hope that MAYBE for this small test app the collected PGO data was not enough, even though in theory it should have been sufficient.

HighPerfDotNet commented 11 months ago

Ok, I've tested an actual big app (can't give source, unfortunately) that we have on real data. I've double-checked that: 1) the PGO tracing stats are collected correctly (looked at the text dump); 2) ILC was given the MIBC file in the project - when a wrong file name was used, it failed saying it couldn't get it.

Here is what I've got (time in seconds)

.NET 8 (JIT): 257.987 259.523 252.885

.NET 8 Native AOT: 407.059 405.663 407.269

.NET 8 Native AOT + PGO: 405.984 409.598 403.084

Basically no effect, and as you can see the Native AOT builds are significantly slower (but with a LOT better memory usage!). I've passed --verbose to ILC and could not see anything about PGO or MIBC. I've tried looking at the runtime source code and found interesting stuff in runtime\src\coreclr\nativeaot\BuildIntegration\Microsoft.NETCore.Native.targets -

<IlcPgoOptimize Condition="'$(IlcPgoOptimize)' == '' and '$(OptimizationPreference)' == 'Speed'">true</IlcPgoOptimize>

and

<ItemGroup>
  <MibcFile Include="$(IlcMibcPath)*.mibc" Condition="'$(IlcPgoOptimize)' == 'true'" />
</ItemGroup>

This all seems very relevant, but I am no expert on MSBuild and not sure how to interpret these - would that condition turn on if MIBC files were given? Providing just a path via --mibc on the command line fails, so a specific file needs to be given there.
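Reading that excerpt literally, would the MSBuild-level way be to set the properties it checks, something like the following? (These names are taken straight from the targets snippet above; this is just my guess, not anything documented.)

<PropertyGroup>
  <OptimizationPreference>Speed</OptimizationPreference>
  <!-- or set IlcPgoOptimize directly, per the first condition above -->
  <IlcPgoOptimize>true</IlcPgoOptimize>
</PropertyGroup>
<ItemGroup>
  <!-- MibcFile is the item the targets file fills from $(IlcMibcPath) -->
  <MibcFile Include="pgo.mibc" />
</ItemGroup>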

There ought to be a working short example that demonstrates that PGO actually works with Native AOT - so far my feeling is that it has no effect, perhaps because some other switch needs to be provided?

hez2010 commented 11 months ago

Seems that you need to pass /p:OptimizationPreference=Speed as well.

HighPerfDotNet commented 11 months ago

I believe I've got that in effect from the project file via -

<IlcArg Include="--Ot" />

I've verified that changing that value to --Os optimises for size.

I also have this in the project file -

<OptimizationPreference>Speed</OptimizationPreference>

I've added /p:OptimizationPreference=Speed to the ZIPped project builds and it has zero effect - the binaries are almost identical, just a few bytes different.

One VERY strange thing I noticed is that sometimes I get execution times of around 1500 ms on my system, but other times 1700 ms. When optimising for size I get around 1950-2000 ms. I get the VERY weird feeling that the middle-range perf is the default optimisation level, but quite why it would differ from the requested Speed I don't know.

I am now going to see whether the non-Native-AOT approach used in the blog post at the top (i.e. R2R via crossgen2) will actually take the MIBC file into account and improve perf. My guess is that R2R won't be substantially different memory-wise from JIT, and the main attraction of Native AOT for me is the clear, substantial memory reductions - but I see no valid reason why performance should suffer so much; if anything, I would have expected slightly faster execution from the stripped-down model.

HighPerfDotNet commented 11 months ago

Apologies for posting too much, but something interesting is really going on here: I've moved testing to another box, an AMD 7800X3D, and was finally able to SEE a performance difference between the two binaries:

JIT (much much quicker than 9900K):

Iteration 0: 663 ms.
Iteration 1: 435 ms.
Iteration 2: 431 ms.
Iteration 3: 431 ms.
Iteration 4: 432 ms.

No PGO (only slightly quicker than 9900K - which is what I expected):

Iteration 0: 1612 ms.
Iteration 1: 1619 ms.
Iteration 2: 1679 ms.
Iteration 3: 1654 ms.
Iteration 4: 1652 ms.

With PGO:

Iteration 0: 1427 ms
Iteration 1: 1447 ms
Iteration 2: 1455 ms
Iteration 3: 1436 ms
Iteration 4: 1443 ms

It's finally quicker! Not by much, but still - clearly PGO data HAD SOME EFFECT!

Now here is the even more interesting part: I've changed the source code so that the top-level loop has more than 10 iterations, and when that number was 30 or higher I got consistently improved perf, but again ONLY on AMD:

With PGO (loop 30+)

Iteration 0: 1228 ms.
Iteration 1: 1218 ms.
Iteration 2: 1212 ms.
Iteration 3: 1233 ms.
Iteration 4: 1213 ms.

Again, this was ONLY on AMD hardware - I've tested the same binaries on another 9900K without .NET 8 and on a Skylake-class server (no improvement there), and also on a Zen 2 server, which did show quicker execution for the build with PGO.

Now, my new AMD box has lots of L3 cache, but I doubt very much that matters in this case - the whole program is just 1.2 MB, so it fits into L3 cache on all the processors I've tested.

Sadly most of our servers are Intel, so I'd prefer the speed-ups to go the other way. In any case, even on AMD the JIT version is crazy quick - more than double the speed of the 9900K and, most importantly, still 3 times quicker than the Native AOT version with Static PGO.

Will continue to investigate...

hez2010 commented 11 months ago

I think the performance regression in Native AOT might be related to the use of AVX internally. By default Native AOT doesn't use AVX; to enable it you need to set:

<PropertyGroup>
  <IlcInstructionSet>native</IlcInstructionSet>
</PropertyGroup>
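(Equivalently, the same property can presumably be passed on the publish command line, for example:)

dotnet publish -c Release /p:IlcInstructionSet=native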
agocke commented 11 months ago

cc @MichalStrehovsky if there's anything else needed to make static pgo work

HighPerfDotNet commented 11 months ago

I've tried setting different instruction sets, i.e. x86-x64-v3 or Skylake, and tried native as above too - no difference, which is expected, as IMHO this small piece of code doesn't have much scope for AVX-related stuff. It doesn't call much internal logic that could have benefited, either.

I've been thinking and searching past issues and saw this one -

https://github.com/dotnet/runtime/issues/94052

It seems to me that this is a far better approach than the current collection of Static PGO and its analysis by dotnet-pgo - collecting this data from the JIT itself while running over sample data seems far more logical to me. That should ensure static compilation gets the same data the JIT got, so at the very least there should be no practical difference in performance between JIT and Native AOT - certainly not the JIT being MUCH quicker.

I am going to dabble now into disasm territory and hopefully get some clues from it, but my asm skills are rather rusty...

jkotas commented 11 months ago

I am going to dabble now into disasm territory

You may want to run it under a sampling profiler to see the distribution of the slowdown. Is the slowdown distributed evenly over all methods of your app, or is it concentrated in a few methods?

HighPerfDotNet commented 11 months ago

Yes, I'll do that now - I've got VTune here. The "app in question" isn't mine; I've provided the repro above in the file PgoExperiment.zip.

I've also got "my apps", but they are sadly too big to be useful for analysis here, the reason I mentioned it is so that it does not look like this small repro is isolated artificial example.

HighPerfDotNet commented 11 months ago

Ok, here is the top-level view from Intel's VTune Profiler (this is on the Intel 9900K CPU, using hardware-based event sampling, 1 ms):

JIT based:

(VTune screenshots: JIT, JIT_Stack)

Native AOT with PGO:

(VTune screenshots: WithPGO, WithPGO_Stack)

It seems to me that MoveNext isn't getting inlined here perhaps?

PGO dump file mentions this with some stats -

  "Method": "[S.P.CoreLib]System.Runtime.CompilerServices.ConditionalWeakTable`2+Enumerator<System.__Canon,System.__Canon>.MoveNext()",

Source code (.cs file with an extra .txt extension): Program.cs.txt

HighPerfDotNet commented 11 months ago

Perhaps the core of the issue is bad PGO collection? That method is called a LOT of times, but it seems to me this isn't obvious from the data available in the MIBC file's dump - I don't know enough about it to interpret it, though one would have thought a high call count would be reflected in big numbers there.

"Method": "[S.P.CoreLib]System.Runtime.CompilerServices.ConditionalWeakTable`2+Enumerator<System.Canon,System.Canon>.MoveNext()", "InstrumentationData": [ { "ILOffset": 41, "InstrumentationKind": "EdgeLongCount", "Other": 93, "Data": 0 }, { "ILOffset": 93, "InstrumentationKind": "EdgeLongCount", "Other": 107, "Data": 2 }, { "ILOffset": 107, "InstrumentationKind": "EdgeLongCount", "Other": 119, "Data": 2 }, { "ILOffset": 112, "InstrumentationKind": "EdgeLongCount", "Other": 118, "Data": 4 }, { "ILOffset": 118, "InstrumentationKind": "EdgeLongCount", "Other": 109, "Data": 4 }, { "ILOffset": 119, "InstrumentationKind": "EdgeLongCount", "Other": 0, "Data": 2 }, { "ILOffset": 121, "InstrumentationKind": "EdgeLongCount", "Other": 0, "Data": 2 } ]

AndyAyersMS commented 11 months ago

If you can dump the jit disasm for the hot methods you should be able to see if the JIT is being fed profile data and roughly how much of it was available.

I think this can be done with something like

<IlcArg Include="--codegenopt JitDisasm=Program::<Main>$" />
<IlcArg Include="--codegenopt JitStdOutFile=... some file path..." />

where ...some file path... is an existing directory plus file name to use for the output.

If the profile data is thin the jit may not be willing to bet on it for key decisions like inlining. You should force your profiling run to produce instrumented code right away, and (as you were doing above) force methods to remain in Tier0 + instrumented code for longer so that the MIBC file has more counts.

So try adding

set DOTNET_TieredPGO_InstrumentOnlyHotCode=0

to the set of env vars you are using for the profiling run.
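Putting that together with the collection variables from earlier in the thread, the profiling run setup would look roughly like this (a sketch assembled from the commands already posted above):

set DOTNET_EnableEventPipe=1
set DOTNET_EventPipeConfig=Microsoft-Windows-DotNETRuntime:0x1F000080018:5
set DOTNET_EventPipeOutputPath=trace.nettrace
set DOTNET_ReadyToRun=0
set DOTNET_TieredPGO=1
set DOTNET_TC_CallCounting=0
set DOTNET_TC_QuickJitForLoops=1
set DOTNET_JitCollect64BitCounts=1
rem new: keep instrumenting code even once it is considered hot
set DOTNET_TieredPGO_InstrumentOnlyHotCode=0
dotnet bin\Release\net8.0\win-x64\PgoExperiment.dll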

MichalStrehovsky commented 11 months ago

The sample code shouldn't even need PGO data on Native AOT. The interface calls should trivially be devirtualized because there's only one implementation in the entire program. I don't see quite that happening in .NET 8 for some reason, but it does happen on .NET 9 (main branch).

.NET 8 native AOT disassembly of Test:

  ; Assembly listing for method Program:Test(IGeneratorFactory):int (FullOpts)
  ; Emitting BLENDED_CODE for generic X64 - Windows
  ; FullOpts code
  ; NativeAOT compilation
  ; optimized code
  ; rsp based frame
  ; partially interruptible

  G_M000_IG01:                ;; offset=0x0000
         push     rsi
         push     rbx
         sub      rsp, 40

  G_M000_IG02:                ;; offset=0x0006
         mov      rax, qword ptr [rcx]
         call     [rax+0x30]IGeneratorFactory:CreateGenerator():IGenerator:this
         mov      rbx, rax
         xor      esi, esi
         mov      rcx, rbx
         cmp      dword ptr [rcx], ecx
         call     MyGenerator:MoveNext():bool:this
         test     eax, eax
         je       SHORT G_M000_IG04

  G_M000_IG03:                ;; offset=0x001F
         add      esi, dword ptr [rbx+0x0C]
         mov      rcx, rbx
         call     MyGenerator:MoveNext():bool:this
         test     eax, eax
         jne      SHORT G_M000_IG03

  G_M000_IG04:                ;; offset=0x002E
         mov      eax, esi

  G_M000_IG05:                ;; offset=0x0030
         add      rsp, 40
         pop      rbx
         pop      rsi
         ret

  ; Total bytes of code 55

.NET 8 JIT disassembly:

; Assembly listing for method Program:Test(IGeneratorFactory):int (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 18
; 1 inlinees with PGO data; 7 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000
       push     rdi
       push     rsi
       push     rbx
       sub      rsp, 32

G_M000_IG02:                ;; offset=0x0007
       mov      rax, 0x7FFF5BF8DB38
       cmp      qword ptr [rcx], rax
       jne      G_M000_IG13
       mov      rcx, 0x7FFF5BF8F9E8
       call     CORINFO_HELP_NEWSFAST
       mov      rbx, rax

G_M000_IG03:                ;; offset=0x002C
       xor      esi, esi
       test     rbx, rbx
       je       SHORT G_M000_IG07
       mov      rcx, 0x7FFF5BF8F9E8
       cmp      qword ptr [rbx], rcx
       jne      SHORT G_M000_IG07
       jmp      SHORT G_M000_IG05

G_M000_IG04:                ;; offset=0x0044
       mov      edi, dword ptr [rbx+0x0C]
       add      esi, edi

G_M000_IG05:                ;; offset=0x0049
       mov      rcx, rbx
       cmp      byte  ptr [rcx+0x10], 0
       jne      SHORT G_M000_IG11
       mov      ecx, dword ptr [rbx+0x08]
       inc      ecx
       mov      dword ptr [rbx+0x08], ecx
       cmp      ecx, 0x3E8
       jg       G_M000_IG14
       mov      dword ptr [rbx+0x0C], ecx
       jmp      SHORT G_M000_IG04

G_M000_IG06:                ;; offset=0x006B
       add      esi, edi

G_M000_IG07:                ;; offset=0x006D
       mov      rcx, 0x7FFF5BF8F9E8
       cmp      qword ptr [rbx], rcx
       jne      SHORT G_M000_IG10
       mov      rcx, rbx
       cmp      byte  ptr [rcx+0x10], 0
       jne      SHORT G_M000_IG11
       mov      ecx, dword ptr [rbx+0x08]
       inc      ecx
       mov      dword ptr [rbx+0x08], ecx
       cmp      ecx, 0x3E8
       jg       SHORT G_M000_IG09
       mov      dword ptr [rbx+0x0C], ecx
       mov      edi, dword ptr [rbx+0x0C]
       jmp      SHORT G_M000_IG06

G_M000_IG08:                ;; offset=0x009D
       mov      rcx, rbx
       mov      r11, 0x7FFF5BC70088
       call     [r11]IGenerator:get_Current():int:this
       mov      edi, eax
       jmp      SHORT G_M000_IG06

G_M000_IG09:                ;; offset=0x00B1
       mov      byte  ptr [rbx+0x10], 1
       jmp      SHORT G_M000_IG11

G_M000_IG10:                ;; offset=0x00B7
       mov      rcx, rbx
       mov      r11, 0x7FFF5BC70080
       call     [r11]IGenerator:MoveNext():bool:this
       test     eax, eax
       jne      SHORT G_M000_IG08

G_M000_IG11:                ;; offset=0x00CB
       mov      eax, esi

G_M000_IG12:                ;; offset=0x00CD
       add      rsp, 32
       pop      rbx
       pop      rsi
       pop      rdi
       ret

G_M000_IG13:                ;; offset=0x00D5
       mov      rax, qword ptr [rcx]
       mov      rax, qword ptr [rax+0x40]
       call     [rax+0x20]IGeneratorFactory:CreateGenerator():IGenerator:this
       mov      rbx, rax
       jmp      G_M000_IG03

G_M000_IG14:                ;; offset=0x00E7
       mov      byte  ptr [rbx+0x10], 1
       jmp      SHORT G_M000_IG11

; Total bytes of code 237

The JIT version needs to use guarded devirtualizations but native AOT does not. For some reason there's a virtual call left behind to CreateGenerator. This seems to be fixed in .NET 9 native AOT:

  ; Assembly listing for method Program:Test(IGeneratorFactory):int (FullOpts)
  ; Emitting BLENDED_CODE for generic X64 - Windows
  ; FullOpts code
  ; NativeAOT compilation
  ; optimized code
  ; rsp based frame
  ; partially interruptible

  G_M000_IG01:                ;; offset=0x0000
         push     rsi
         push     rbx
         sub      rsp, 40

  G_M000_IG02:                ;; offset=0x0006
         cmp      byte  ptr [rcx], cl
         lea      rcx, [(reloc 0x4227e0)]
         call     CORINFO_HELP_NEWSFAST
         mov      rbx, rax
         xor      esi, esi
         mov      rcx, rbx
         call     MyGenerator:MoveNext():ubyte:this
         test     eax, eax
         je       SHORT G_M000_IG04

  G_M000_IG03:                ;; offset=0x0025
         add      esi, dword ptr [rbx+0x0C]
         mov      rcx, rbx
         call     MyGenerator:MoveNext():ubyte:this
         test     eax, eax
         jne      SHORT G_M000_IG03

  G_M000_IG04:                ;; offset=0x0034
         mov      eax, esi

  G_M000_IG05:                ;; offset=0x0036
         add      rsp, 40
         pop      rbx
         pop      rsi
         ret

The .NET 9 native AOT codegen looks almost perfect to me. It is however still a lot slower than JIT.

That brings us to the MyGenerator.MoveNext method. JIT:

; Assembly listing for method MyGenerator:MoveNext():ubyte:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rsp based frame
; partially interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 10625
; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0x0000

G_M000_IG02:                ;; offset=0x0000
       cmp      byte  ptr [rcx+0x10], 0
       jne      SHORT G_M000_IG07
       mov      edx, dword ptr [rcx+0x08]
       inc      edx
       mov      dword ptr [rcx+0x08], edx
       cmp      edx, 0x3E8
       jg       SHORT G_M000_IG05

G_M000_IG03:                ;; offset=0x0016
       mov      dword ptr [rcx+0x0C], edx
       mov      eax, 1

G_M000_IG04:                ;; offset=0x001E
       ret

G_M000_IG05:                ;; offset=0x001F
       mov      byte  ptr [rcx+0x10], 1
       xor      eax, eax

G_M000_IG06:                ;; offset=0x0025
       ret

G_M000_IG07:                ;; offset=0x0026
       xor      eax, eax

G_M000_IG08:                ;; offset=0x0028
       ret

; Total bytes of code 41

AOT:

  ; Assembly listing for method MyGenerator:MoveNext():ubyte:this (FullOpts)
  ; Emitting BLENDED_CODE for generic X64 - Windows
  ; FullOpts code
  ; NativeAOT compilation
  ; optimized code
  ; rsp based frame
  ; partially interruptible

  G_M000_IG01:                ;; offset=0x0000

  G_M000_IG02:                ;; offset=0x0000
         cmp      byte  ptr [rcx+0x10], 0
         je       SHORT G_M000_IG05

  G_M000_IG03:                ;; offset=0x0006
         xor      eax, eax

  G_M000_IG04:                ;; offset=0x0008
         ret

  G_M000_IG05:                ;; offset=0x0009
         mov      edx, dword ptr [rcx+0x08]
         inc      edx
         mov      dword ptr [rcx+0x08], edx
         cmp      edx, 0x3E8
         jle      SHORT G_M000_IG07
         mov      byte  ptr [rcx+0x10], 1
         xor      eax, eax

  G_M000_IG06:                ;; offset=0x001F
         ret

  G_M000_IG07:                ;; offset=0x0020
         mov      dword ptr [rcx+0x0C], edx
         mov      eax, 1

  G_M000_IG08:                ;; offset=0x0028
         ret

  ; Total bytes of code 41

I can't explain why AOT is slower based on this.

Note that the above was with zero PGO use for AOT. The compiler can devirtualize these.

jkotas commented 11 months ago

JIT version uses guarded devirtualizations and inlines the MoveNext method under the guard. AOT version does not inline the MoveNext method. The call overhead is responsible for the difference.

Is the static PGO expected to handle this case?

HighPerfDotNet commented 11 months ago

Ok, as suggested I used this for profiling -

set DOTNET_TieredPGO_InstrumentOnlyHotCode=0

This resulted in slightly bigger trace data files. In the project file I used the following options -

    <IlcArg Include="--mibc%3apgo.mibc" />
    <IlcArg Include="--Ot" />
    <IlcArg Include="--instruction-set=x86-x64-v3" />
    <IlcArg Include="--parallelism=1" />
    <IlcArg Include="--codegenopt:JitDisasm=&lt;Main&gt;$" />
    <IlcArg Include="--codegenopt:JitStdOutFile=PgoExperiment_WithPGO.asm" />

The JIT version runs quickest, at 1000 ms per iteration. The no-PGO Native AOT build runs at around 1744 ms per iteration; its asm file (307 lines) for Main says at the top:

; No PGO data
; 21 inlinees with PGO data; 34 single block inlinees; 1 inlinees without PGO data

The old-PGO build runs at around the same time; its asm file (2837 lines) says at the top:

; with Blended PGO: edge weights are invalid, and fgCalledCount is 100
; 47 inlinees with PGO data; 69 single block inlinees; 3 inlinees without PGO data

What's the significance of "edge weights are invalid"? And what does fgCalledCount mean - is 100 a good, high number?

With the new PGO settings Native AOT is quicker (!) - around 1300 ms per iteration; asm file (710 lines):

; with Blended PGO: edge weights are invalid, and fgCalledCount is 100
; 48 inlinees with PGO data; 69 single block inlinees; 2 inlinees without PGO data

+1 inlinee with PGO data and -1 without.

However, there is still no inlining of MoveNext, which is probably what accounts for the remaining 300 ms difference with JIT now. Screenshot from the profiler: Profiler_WithPGO

The assembly differences between the two seem minor, yet they resulted in improved perf -

AsmDiff_New_vs_Old_PGO.txt

This is the first marked improvement with Static PGO that I can observe on my Intel system with this small test, but a bigger test with my big app did not move the needle in the right direction.

I am going to test with .NET 9 now.

HighPerfDotNet commented 11 months ago

.NET 8 JIT:

Iteration 0: 1149 ms.
Iteration 1: 990 ms.
Iteration 2: 998 ms.
Iteration 3: 1002 ms.

.NET 9 (9.0.100-alpha.1.23579.33) JIT (slower!!!):

Iteration 0: 1393 ms
Iteration 1: 1180 ms
Iteration 2: 1179 ms
Iteration 3: 1196 ms

.NET 8 AOT with PGO:

Iteration 0: 1293 ms.
Iteration 1: 1306 ms.
Iteration 2: 1309 ms.
Iteration 3: 1308 ms.

.NET 9 AOT with PGO:

Iteration 0: 1316 ms
Iteration 1: 1298 ms
Iteration 2: 1299 ms
Iteration 3: 1307 ms

HighPerfDotNet commented 11 months ago

What setting should be used to log inliner's decisions?

Is there a comprehensive list of all the available (even if not officially supported) DOTNET_-style knobs/settings that allow seeing the internal logic better?

MichalStrehovsky commented 11 months ago

I had a look at the JitDump for both cases and I think I see what the problem is. The relevant part seems to be in the inlining decision:

JIT:

4 ldfld or stfld over arguments which are structs.  Multiplier increased to 1.
Inline candidate has 1 binary expressions with constants.  Multiplier increased to 1.5.
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 2.5.
Inline candidate callsite is in a loop.  Multiplier increased to 5.5.
Callsite has profile data: 1.51962e+06.  Multiplier limited to 24.75.
calleeNativeSizeEstimate=678
callsiteNativeSizeEstimate=85
benefit multiplier=24.75
threshold=2103
Native estimate for function size is within threshold for inlining 67.8 <= 210.3 (multiplier = 24.75)

AOT:

4 ldfld or stfld over arguments which are structs.  Multiplier increased to 1.
Inline candidate has 1 binary expressions with constants.  Multiplier increased to 1.5.
Inline candidate has an arg that feeds a constant test.  Multiplier increased to 2.5.
Inline candidate callsite is in a loop.  Multiplier increased to 5.5.
calleeNativeSizeEstimate=678
callsiteNativeSizeEstimate=85
benefit multiplier=5.5
threshold=467
Native estimate for function size exceeds threshold for inlining 67.8 > 46.7 (multiplier = 5.5)

Inline expansion aborted, inline not profitable

The AOT case is missing "Callsite has profile data: 1.51962e+06. Multiplier limited to 24.75.". Did we forget there was profile data on the call in the AOT case?

Here's the JitDumps for someone who's more of an expert to confirm:

jitdump_test_aot.txt jitdump_test_jit.txt

I wonder if this is relevant to the discussion we had here: https://github.com/dotnet/runtime/pull/86551#issuecomment-1557037420. Cc @EgorBo

MichalStrehovsky commented 11 months ago

What setting should be used to log inliner's decisions?

You can use DOTNET_JitDump=... (same syntax as JitDisasm) to see everything the JIT did. But you need a debug/checked build of the JIT. This doesn't work in the shipping configuration. I don't know if there's a more restrictive setting. The full list is in code. Some of these have docs in markdown elsewhere in this repo: https://github.com/dotnet/runtime/blob/8c07adcda9ed72249e47fe7e7e5e8a984333251e/src/coreclr/jit/jitconfigvalues.h
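For example (the method pattern and output path are just illustrations; on the ILC side the equivalent would be the --codegenopt:JitDump=... form of the argument used earlier for JitDisasm):

rem requires a checked/debug clrjit; same pattern syntax as JitDisasm
set DOTNET_JitDump=MyGenerator:MoveNext
rem output file path is only an example
set DOTNET_JitStdOutFile=C:\temp\jitdump_movenext.txt
dotnet bin\Release\net8.0\win-x64\PgoExperiment.dll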

HighPerfDotNet commented 11 months ago

Excellent, Michael, thank you, is there a checked/debug build binaries available for download?

I assume you've used the MIBC profile data, but just in case not, here is the MIBC file that I've collected:

pgo_mibc.zip

It sure feels like the PGO data isn't reaching where it should, but is it possible to check that it was actually collected correctly and is present in the MIBC file with the same quality that the JIT is seeing?

MichalStrehovsky commented 11 months ago

I collected my own PGO data and used that. I think it was incorporated because the JIT dump has *************** Starting PHASE Profile incorporation.

I don't think we have any packages with a debug JIT. You'll have to build it. If you're lucky (like I was), you can just run build.sh clr.jit -c Debug from the repo root of the main branch and replace clrjit.so in the latest .NET 9 dotnet installer with the one you just built. It doesn't always work because sometimes there are changes to the interface, but it worked today. Otherwise the full instructions are at https://github.com/dotnet/runtime/tree/main/docs/workflow/building/coreclr.

HighPerfDotNet commented 11 months ago

So, it looks like the collected PGO data is either not reaching the JIT, or not being taken into account by it when making inlining decisions in the AOT scenario.

This would explain why our much larger app is also not seeing any improvements with static PGO data.

ghost commented 11 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Author: HighPerfDotNet
Assignees: -
Labels: `tenet-performance`, `area-CodeGen-coreclr`
Milestone: 9.0.0