HighPerfDotNet opened 11 months ago
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas. See info in area-owners.md if you want to be subscribed.
Author: | HighPerfDotNet |
---|---|
Assignees: | - |
Labels: | `area-crossgen2-coreclr`, `untriaged`, `area-NativeAOT-coreclr` |
Milestone: | - |
> `/p:PublishReadyToRunCrossgen2ExtraArgs=--embed-pgo-data%3b--mibc%3apgo.mibc%3b--compilebubblegenerics`
You are passing the arguments to the crossgen2, which is used for R2R, not NativeAOT. To use static PGO with NativeAOT, you need to pass the pgo data to ilc:
<ItemGroup>
<IlcArg Include="--mibc%3apgo.mibc" />
</ItemGroup>
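For context, in a Native AOT project the wiring would look roughly like this (a minimal sketch; the relative pgo.mibc path is an assumption and may need adjusting):

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <!-- PublishAot makes 'dotnet publish' run ilc instead of crossgen2 -->
    <PublishAot>true</PublishAot>
  </PropertyGroup>
  <ItemGroup>
    <!-- %3a is an MSBuild-escaped ':'; pgo.mibc is assumed to sit next to the project file -->
    <IlcArg Include="--mibc%3apgo.mibc" />
  </ItemGroup>
</Project>
```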
Thank you for your suggestion. I have not seen anything documented about passing MIBC to ILC (there are fewer than 20 Google matches for "ItemGroup IlcArg"). I assumed this just needs to be added to the project file, which I did, and the new compile produces a file of slightly different size, so SOMETHING is happening, but I am not seeing any performance difference compared to the same Native AOT build without the PGO file - so it still does not appear to work. :-(
I've also tried compiling with PublishAot disabled, and that indeed kicked in crossgen2, but that form of AOT isn't what I am looking for - I clearly (and wrongly) thought crossgen2 was the path to "real" Native AOT with all the expected memory savings, which, as I now understand, can't be expected from the R2R approach.
I just checked your pgo.mibc file; apparently it's not the PGO data of your program (you can verify this by using `dotnet-pgo dump -i pgo.mibc -o pgo.txt`).
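For example, something like this would show whether your own methods made it into the file (the findstr pattern is just illustrative):

```
dotnet-pgo dump -i pgo.mibc -o pgo.txt
findstr /C:"PgoExperiment" pgo.txt
```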
Some notes:
- Don't use `dotnet run`; instead use `dotnet /path/to/your/assembly`, otherwise you could possibly collect the tracing data for the `dotnet run` utility itself.
- Clear the tracing environment variables before running `dotnet-pgo`, otherwise `trace.nettrace` will be overwritten by the tracing data of `dotnet-pgo` itself.

Ah, I see - for collecting traces I used the command line from that sample blog entry. It did surprise me that the trace file was rather large, but I did not pay much attention to it since everything appeared to work. To collect the data I previously used the following:
```
set DOTNET_EnableEventPipe=1
set DOTNET_EventPipeConfig=Microsoft-Windows-DotNETRuntime:0x1F000080018:5
rem # Trace file output path
set DOTNET_EventPipeOutputPath=trace.nettrace
rem # Disable AOT
set DOTNET_ReadyToRun=0
rem # Enable Tiered PGO
set DOTNET_TieredPGO=1
rem # Never generate tier 1 code
set DOTNET_TC_CallCounting=0
set DOTNET_TC_QuickJitForLoops=1
set DOTNET_JitCollect64BitCounts=1

dotnet run -c Release
```
Looks like THAT collected far more data, or something else happened. Based on your advice I now used -
dotnet bin\Release\net8.0\win-x64\PgoExperiment.dll
This collected a LOT LESS data, so it does look legit! So far so good, BUT the same problem remains - the profile stats did not improve performance one bit, although the binaries are slightly different with and without the MIBC file.
Here is an updated ZIP with two projects now - one that passes the MIBC file to ILC and one that does not. The binaries are slightly different now, enough to suggest that the PGO data was fed in, but the problem remains - zero effect on perf.
build_all.bat builds both versions, and I've included the batch files I used to collect and convert traces. The MIBC dump file appears correct to me.
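For reference, the collect-and-convert steps boil down to roughly this (the dotnet-pgo create-mibc option names are a sketch from memory, so double-check them):

```
rem assumes the DOTNET_* tracing variables from the earlier block are already set

rem run the published assembly directly, not "dotnet run"
dotnet bin\Release\net8.0\win-x64\PgoExperiment.dll

rem clear the tracing variables so dotnet-pgo itself does not overwrite the trace
set DOTNET_EnableEventPipe=
set DOTNET_EventPipeConfig=
set DOTNET_EventPipeOutputPath=

rem convert the collected trace into a MIBC file
dotnet-pgo create-mibc --trace trace.nettrace --output pgo.mibc
```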
PgoExperiment_NoPGO.exe
Iteration 0: 1520 ms.
Iteration 1: 1507 ms.
Iteration 2: 1516 ms.
Iteration 3: 1517 ms.
Iteration 4: 1523 ms.
Iteration 5: 1515 ms.
Iteration 6: 1517 ms.
Iteration 7: 1511 ms.
Iteration 8: 1512 ms.
Iteration 9: 1515 ms.
PgoExperiment_WithPGO.exe
Iteration 0: 1534 ms.
Iteration 1: 1519 ms.
Iteration 2: 1516 ms.
Iteration 3: 1524 ms.
Iteration 4: 1510 ms.
Iteration 5: 1511 ms.
Iteration 6: 1521 ms.
Iteration 7: 1526 ms.
Iteration 8: 1520 ms.
Iteration 9: 1508 ms.
Whereas the JITed version consistently gives the expected, much better perf -
bin\Release\net8.0\win-x64\PgoExperiment_WithPGO.exe
Iteration 0: 1209 ms.
Iteration 1: 991 ms.
Iteration 2: 987 ms.
Iteration 3: 1011 ms.
Iteration 4: 994 ms.
Iteration 5: 990 ms.
Iteration 6: 990 ms.
Iteration 7: 992 ms.
Iteration 8: 994 ms.
Iteration 9: 994 ms.
I am going to log off and catch some sleep now; I plan to test a much bigger real app that we have, in the hope that MAYBE the PGO collected for this small test app was simply not enough - even though in theory it should have been sufficient.
Ok, I've tested our actual big app (can't share the source unfortunately) on real data. I've double-checked that: 1) the PGO tracing stats are collected correctly (looked at the text dump); 2) ILC was provided the MIBC file in the project - when a wrong file name was used, it failed saying it couldn't load it.
Here is what I've got (time in seconds):
.NET 8 (JIT): 257.987 259.523 252.885
.NET 8 Native AOT: 407.059 405.663 407.269
.NET 8 Native AOT + PGO: 405.984 409.598 403.084
Basically no effect, and as you can see the Native AOT builds are significantly slower (but with a LOT better memory usage!). I've passed --verbose to ILC and could not see anything about PGO or MIBC. I've also tried looking at the runtime source code and found some interesting stuff in runtime\src\coreclr\nativeaot\BuildIntegration\Microsoft.NETCore.Native.targets -
<IlcPgoOptimize Condition="'$(IlcPgoOptimize)' == '' and '$(OptimizationPreference)' == 'Speed'">true</IlcPgoOptimize>
and
<ItemGroup>
<MibcFile Include="$(IlcMibcPath)*.mibc" Condition="'$(IlcPgoOptimize)' == 'true'" />
</ItemGroup>
This all seems very relevant, but I am not an MSBuild expert and not sure how to interpret those - would that condition turn on if MIBC files were given? Providing just a path via the --mibc command line fails, so a specific file needs to be given there.
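If I read those targets right, something along these lines in the project file should flip those conditions on (just my interpretation, not verified; the IlcMibcPath value is a guess):

```xml
<PropertyGroup>
  <!-- per the quoted targets, OptimizationPreference=Speed defaults IlcPgoOptimize to true -->
  <OptimizationPreference>Speed</OptimizationPreference>
  <IlcPgoOptimize>true</IlcPgoOptimize>
  <!-- the MibcFile item globs $(IlcMibcPath)*.mibc, so point it at the folder holding pgo.mibc -->
  <IlcMibcPath>$(MSBuildProjectDirectory)\</IlcMibcPath>
</PropertyGroup>
```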
There ought to be a short working example demonstrating that PGO actually works with Native AOT - so far my feeling is that it has no effect, perhaps because some other switch needs to be provided?
Seems that you need to pass /p:OptimizationPreference=Speed
as well.
I believe I've got that in effect from the project file via -
<IlcArg Include="--Ot" />
I've verified that changing that value to --Os optimises for size.
I also have this in the project file -
<OptimizationPreference>Speed</OptimizationPreference>
I've added /p:OptimizationPreference=Speed to the zipped project builds and it has zero effect - the binaries are almost identical, just a few bytes different.
One VERY strange thing I noticed is that sometimes I get execution times of around 1500 ms on my system, but other times 1700 ms. When optimising for Size I get around 1950-2000 ms. I get this VERY weird feeling that the middle-range perf is the default optimisation level, but quite why that would differ from the requested Speed I don't know.
I am going to check whether the actual non-Native-AOT approach used in the blog post above (i.e. R2R via crossgen2) will take the MIBC file into account and improve perf. My guess is that R2R won't be substantially different memory-wise from JIT, and the main attraction of Native AOT for me is the clear, substantial memory reduction - but I see no valid reason why performance should suffer so much; if anything I would have expected slightly faster execution from the stripped-down model.
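i.e. something along the lines of the following publish (not yet run; PublishReadyToRun is assumed to be switched on for that test):

```
dotnet publish -c Release -r win-x64 /p:PublishAot=false /p:PublishReadyToRun=true "/p:PublishReadyToRunCrossgen2ExtraArgs=--embed-pgo-data%3b--mibc%3apgo.mibc%3b--compilebubblegenerics"
```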
Apologies for posting too much, but something interesting is really going on here: I've moved testing to another box, an AMD 7800X3D, and was finally able to SEE a performance difference between the two binaries:
JIT (much much quicker than 9900K):
Iteration 0: 663 ms.
Iteration 1: 435 ms.
Iteration 2: 431 ms.
Iteration 3: 431 ms.
Iteration 4: 432 ms.
No PGO (only slightly quicker than 9900K - which is what I expected):
Iteration 0: 1612 ms.
Iteration 1: 1619 ms.
Iteration 2: 1679 ms.
Iteration 3: 1654 ms.
Iteration 4: 1652 ms.
With PGO:
Iteration 0: 1427 ms
Iteration 1: 1447 ms
Iteration 2: 1455 ms
Iteration 3: 1436 ms
Iteration 4: 1443 ms
It's finally quicker! Not by much, but still - clearly PGO data HAD SOME EFFECT!
Now here is even more interesting part, I've changed source code to have top level loop to have more than 10 iterations, and when that number was 30 or higher, I've got consistently improved perf, but also ONLY on AMD:
With PGO (loop 30+)
Iteration 0: 1228 ms.
Iteration 1: 1218 ms.
Iteration 2: 1212 ms.
Iteration 3: 1233 ms.
Iteration 4: 1213 ms.
Again, this was ONLY on AMD hardware - I've tested the same binaries on another 9900K (without .NET 8 installed) and on a Skylake-class server, and also on a Zen 2 server, which also showed quicker execution for the build with PGO.
Now, my new AMD box has lots of L3 cache, but I doubt that matters much in this case - the whole program is just 1.2 MB, so it fits into L3 on all the processors I've tested.
Sadly most of our servers are Intel, so I'd prefer the speed-ups to go the other way, but in any case even on AMD the JIT version is crazy quick - more than double the speed of the 9900K and, most importantly, still 3 times quicker than the Native AOT version with Static PGO.
Will continue to investigate...
I think the performance regression in nativeaot might be related to the internal usage of AVX. By default nativeaot doesn't use AVX; to enable it you need to set:
<PropertyGroup>
<IlcInstructionSet>native</IlcInstructionSet>
</PropertyGroup>
cc @MichalStrehovsky if there's anything else needed to make static pgo work
I've tried setting different instruction sets, i.e. x86-x64-v3 or skylake, and also tried native as above - no difference, which is expected, as IMHO this small piece of code has little scope for AVX-related gains. It doesn't call internal logic that could have benefited much either.
I've been thinking and searching past issues and saw this one -
https://github.com/dotnet/runtime/issues/94052
It seems to me that this is a far better approach than the current collection of Static PGO and its analysis by dotnet-pgo - collecting this data from the JIT itself while running over sample data seems far more logical to me. That should ensure static compilation gets the same data the JIT got, so at the very least there should be no practical difference in performance between JIT and Native AOT - certainly not the JIT being MUCH quicker.
I am going to dabble now into disasm territory and hopefully get some clues from it, but my asm skills are rather rusty...
> I am going to dabble now into disasm territory
You may want to run it under a sampling profiler to see the distribution of the slowdown. Is the slowdown spread evenly over all methods of your app, or is it concentrated in a few methods?
Yes, I'll do that now - I've got vTune here. The "app in question" isn't really mine; it's the repro I've provided above in PgoExperiment.zip.
I also have "my apps", but they are sadly too big to be useful for analysis here; the reason I mentioned them is so it doesn't look like this small repro is an isolated, artificial example.
Ok, here is the top-level view from Intel's vTune Profiler (this is on the Intel 9900K CPU, using hardware event-based sampling, 1 ms):
JIT based:
Native AOT with PGO:
It seems to me that MoveNext isn't getting inlined here perhaps?
PGO dump file mentions this with some stats -
"Method": "[S.P.CoreLib]System.Runtime.CompilerServices.ConditionalWeakTable`2+Enumerator<System.__Canon,System.__Canon>.MoveNext()",
Source code (.cs file with an extra .txt extension): Program.cs.txt
Perhaps the core of the issue is bad PGO collection? That method is called a LOT of times, but that doesn't seem obvious from the data available in the MIBC file's dump - I don't know enough about the format to interpret it, though one would have thought a high call count would be reflected in big numbers there.
"Method": "[S.P.CoreLib]System.Runtime.CompilerServices.ConditionalWeakTable`2+Enumerator<System.Canon,System.Canon>.MoveNext()", "InstrumentationData": [ { "ILOffset": 41, "InstrumentationKind": "EdgeLongCount", "Other": 93, "Data": 0 }, { "ILOffset": 93, "InstrumentationKind": "EdgeLongCount", "Other": 107, "Data": 2 }, { "ILOffset": 107, "InstrumentationKind": "EdgeLongCount", "Other": 119, "Data": 2 }, { "ILOffset": 112, "InstrumentationKind": "EdgeLongCount", "Other": 118, "Data": 4 }, { "ILOffset": 118, "InstrumentationKind": "EdgeLongCount", "Other": 109, "Data": 4 }, { "ILOffset": 119, "InstrumentationKind": "EdgeLongCount", "Other": 0, "Data": 2 }, { "ILOffset": 121, "InstrumentationKind": "EdgeLongCount", "Other": 0, "Data": 2 } ]
If you can dump the jit disasm for the hot methods you should be able to see if the JIT is being fed profile data and roughly how much of it was available.
I think this can be done with something like
<IlcArg Include="--codegenopt JitDisasm=Program::<Main>$" />
<IlcArg Include="--codegenopt JitStdOutFile=... some file path..." />
where `...some file path...` is an existing directory plus file name to use for the output.
If the profile data is thin the jit may not be willing to bet on it for key decisions like inlining. You should force your profiling run to produce instrumented code right away, and (as you were doing above) force methods to remain in Tier0 + instrumented code for longer so that the MIBC file has more counts.
So try adding
set DOTNET_TieredPGO_InstrumentOnlyHotCode=0
to the set of env vars you are using for the profiling run.
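i.e. the profiling environment would look roughly like this:

```
set DOTNET_EnableEventPipe=1
set DOTNET_EventPipeConfig=Microsoft-Windows-DotNETRuntime:0x1F000080018:5
set DOTNET_EventPipeOutputPath=trace.nettrace
set DOTNET_ReadyToRun=0
set DOTNET_TieredPGO=1
set DOTNET_TieredPGO_InstrumentOnlyHotCode=0
set DOTNET_TC_CallCounting=0
set DOTNET_TC_QuickJitForLoops=1
set DOTNET_JitCollect64BitCounts=1
```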
The sample code shouldn't even need PGO data on Native AOT. The interface calls should trivially be devirtualized because there's only one implementation in the entire program. I don't see quite that happening in .NET 8 for some reason, but it does happen on .NET 9 (main branch).
.NET 8 native AOT disassembly of `Test`:
; Assembly listing for method Program:Test(IGeneratorFactory):int (FullOpts)
; Emitting BLENDED_CODE for generic X64 - Windows
; FullOpts code
; NativeAOT compilation
; optimized code
; rsp based frame
; partially interruptible
G_M000_IG01: ;; offset=0x0000
push rsi
push rbx
sub rsp, 40
G_M000_IG02: ;; offset=0x0006
mov rax, qword ptr [rcx]
call [rax+0x30]IGeneratorFactory:CreateGenerator():IGenerator:this
mov rbx, rax
xor esi, esi
mov rcx, rbx
cmp dword ptr [rcx], ecx
call MyGenerator:MoveNext():bool:this
test eax, eax
je SHORT G_M000_IG04
G_M000_IG03: ;; offset=0x001F
add esi, dword ptr [rbx+0x0C]
mov rcx, rbx
call MyGenerator:MoveNext():bool:this
test eax, eax
jne SHORT G_M000_IG03
G_M000_IG04: ;; offset=0x002E
mov eax, esi
G_M000_IG05: ;; offset=0x0030
add rsp, 40
pop rbx
pop rsi
ret
; Total bytes of code 55
.NET 8 JIT disassembly:
; Assembly listing for method Program:Test(IGeneratorFactory):int (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 18
; 1 inlinees with PGO data; 7 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0x0000
push rdi
push rsi
push rbx
sub rsp, 32
G_M000_IG02: ;; offset=0x0007
mov rax, 0x7FFF5BF8DB38
cmp qword ptr [rcx], rax
jne G_M000_IG13
mov rcx, 0x7FFF5BF8F9E8
call CORINFO_HELP_NEWSFAST
mov rbx, rax
G_M000_IG03: ;; offset=0x002C
xor esi, esi
test rbx, rbx
je SHORT G_M000_IG07
mov rcx, 0x7FFF5BF8F9E8
cmp qword ptr [rbx], rcx
jne SHORT G_M000_IG07
jmp SHORT G_M000_IG05
G_M000_IG04: ;; offset=0x0044
mov edi, dword ptr [rbx+0x0C]
add esi, edi
G_M000_IG05: ;; offset=0x0049
mov rcx, rbx
cmp byte ptr [rcx+0x10], 0
jne SHORT G_M000_IG11
mov ecx, dword ptr [rbx+0x08]
inc ecx
mov dword ptr [rbx+0x08], ecx
cmp ecx, 0x3E8
jg G_M000_IG14
mov dword ptr [rbx+0x0C], ecx
jmp SHORT G_M000_IG04
G_M000_IG06: ;; offset=0x006B
add esi, edi
G_M000_IG07: ;; offset=0x006D
mov rcx, 0x7FFF5BF8F9E8
cmp qword ptr [rbx], rcx
jne SHORT G_M000_IG10
mov rcx, rbx
cmp byte ptr [rcx+0x10], 0
jne SHORT G_M000_IG11
mov ecx, dword ptr [rbx+0x08]
inc ecx
mov dword ptr [rbx+0x08], ecx
cmp ecx, 0x3E8
jg SHORT G_M000_IG09
mov dword ptr [rbx+0x0C], ecx
mov edi, dword ptr [rbx+0x0C]
jmp SHORT G_M000_IG06
G_M000_IG08: ;; offset=0x009D
mov rcx, rbx
mov r11, 0x7FFF5BC70088
call [r11]IGenerator:get_Current():int:this
mov edi, eax
jmp SHORT G_M000_IG06
G_M000_IG09: ;; offset=0x00B1
mov byte ptr [rbx+0x10], 1
jmp SHORT G_M000_IG11
G_M000_IG10: ;; offset=0x00B7
mov rcx, rbx
mov r11, 0x7FFF5BC70080
call [r11]IGenerator:MoveNext():bool:this
test eax, eax
jne SHORT G_M000_IG08
G_M000_IG11: ;; offset=0x00CB
mov eax, esi
G_M000_IG12: ;; offset=0x00CD
add rsp, 32
pop rbx
pop rsi
pop rdi
ret
G_M000_IG13: ;; offset=0x00D5
mov rax, qword ptr [rcx]
mov rax, qword ptr [rax+0x40]
call [rax+0x20]IGeneratorFactory:CreateGenerator():IGenerator:this
mov rbx, rax
jmp G_M000_IG03
G_M000_IG14: ;; offset=0x00E7
mov byte ptr [rbx+0x10], 1
jmp SHORT G_M000_IG11
; Total bytes of code 237
The JIT version needs to use guarded devirtualizations but native AOT does not. For some reason there's a virtual call left behind to `CreateGenerator`. This seems to be fixed in .NET 9 native AOT:
; Assembly listing for method Program:Test(IGeneratorFactory):int (FullOpts)
; Emitting BLENDED_CODE for generic X64 - Windows
; FullOpts code
; NativeAOT compilation
; optimized code
; rsp based frame
; partially interruptible
G_M000_IG01: ;; offset=0x0000
push rsi
push rbx
sub rsp, 40
G_M000_IG02: ;; offset=0x0006
cmp byte ptr [rcx], cl
lea rcx, [(reloc 0x4227e0)]
call CORINFO_HELP_NEWSFAST
mov rbx, rax
xor esi, esi
mov rcx, rbx
call MyGenerator:MoveNext():ubyte:this
test eax, eax
je SHORT G_M000_IG04
G_M000_IG03: ;; offset=0x0025
add esi, dword ptr [rbx+0x0C]
mov rcx, rbx
call MyGenerator:MoveNext():ubyte:this
test eax, eax
jne SHORT G_M000_IG03
G_M000_IG04: ;; offset=0x0034
mov eax, esi
G_M000_IG05: ;; offset=0x0036
add rsp, 40
pop rbx
pop rsi
ret
The .NET 9 native AOT codegen looks almost perfect to me. It is however still a lot slower than JIT.
That brings us to the `MyGenerator.MoveNext` method. JIT:
; Assembly listing for method MyGenerator:MoveNext():ubyte:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rsp based frame
; partially interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 10625
; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data
G_M000_IG01: ;; offset=0x0000
G_M000_IG02: ;; offset=0x0000
cmp byte ptr [rcx+0x10], 0
jne SHORT G_M000_IG07
mov edx, dword ptr [rcx+0x08]
inc edx
mov dword ptr [rcx+0x08], edx
cmp edx, 0x3E8
jg SHORT G_M000_IG05
G_M000_IG03: ;; offset=0x0016
mov dword ptr [rcx+0x0C], edx
mov eax, 1
G_M000_IG04: ;; offset=0x001E
ret
G_M000_IG05: ;; offset=0x001F
mov byte ptr [rcx+0x10], 1
xor eax, eax
G_M000_IG06: ;; offset=0x0025
ret
G_M000_IG07: ;; offset=0x0026
xor eax, eax
G_M000_IG08: ;; offset=0x0028
ret
; Total bytes of code 41
AOT:
; Assembly listing for method MyGenerator:MoveNext():ubyte:this (FullOpts)
; Emitting BLENDED_CODE for generic X64 - Windows
; FullOpts code
; NativeAOT compilation
; optimized code
; rsp based frame
; partially interruptible
G_M000_IG01: ;; offset=0x0000
G_M000_IG02: ;; offset=0x0000
cmp byte ptr [rcx+0x10], 0
je SHORT G_M000_IG05
G_M000_IG03: ;; offset=0x0006
xor eax, eax
G_M000_IG04: ;; offset=0x0008
ret
G_M000_IG05: ;; offset=0x0009
mov edx, dword ptr [rcx+0x08]
inc edx
mov dword ptr [rcx+0x08], edx
cmp edx, 0x3E8
jle SHORT G_M000_IG07
mov byte ptr [rcx+0x10], 1
xor eax, eax
G_M000_IG06: ;; offset=0x001F
ret
G_M000_IG07: ;; offset=0x0020
mov dword ptr [rcx+0x0C], edx
mov eax, 1
G_M000_IG08: ;; offset=0x0028
ret
; Total bytes of code 41
I can't explain why AOT is slower based on this.
Note that the above was with zero PGO use for AOT. The compiler can devirtualize these.
The JIT version uses guarded devirtualization and inlines the MoveNext method under the guard. The AOT version does not inline the MoveNext method. The call overhead is responsible for the difference.
Is the static PGO expected to handle this case?
Ok, as suggested I used this for profiling -
set DOTNET_TieredPGO_InstrumentOnlyHotCode=0
This resulted in slightly bigger trace data files. In project file I used the following options -
<IlcArg Include="--mibc%3apgo.mibc" />
<IlcArg Include="--Ot" />
<IlcArg Include="--instruction-set=x86-x64-v3" />
<IlcArg Include="--parallelism=1" />
<IlcArg Include="--codegenopt:JitDisasm=<Main>$" />
<IlcArg Include="--codegenopt:JitStdOutFile=PgoExperiment_WithPGO.asm" />
The JIT version runs quickest, at about 1000 ms per iteration. The no-PGO Native AOT build runs at around 1744 ms per iteration; its asm file (307 lines) for Main says at the top:
; No PGO data
; 21 inlinees with PGO data; 34 single block inlinees; 1 inlinees without PGO data
The build with the old PGO data runs at around the same time; its asm file (2837 lines) says at the top:
; with Blended PGO: edge weights are invalid, and fgCalledCount is 100
; 47 inlinees with PGO data; 69 single block inlinees; 3 inlinees without PGO data
What's the significance of "edge weights are invalid"? And what does fgCalledCount mean - is 100 a good, high number?
With the new PGO settings Native AOT is quicker (!) - around 1300 ms per iteration; asm file (710 lines):
; with Blended PGO: edge weights are invalid, and fgCalledCount is 100
; 48 inlinees with PGO data; 69 single block inlinees; 2 inlinees without PGO data
+1 inlinee with PGO data and -1 without.
However there is still no inlining of MoveNext, which probably accounts for the remaining ~300 ms difference with JIT. Screenshot from the profiler -
The assembly differences between the two builds seem minor, yet they resulted in improved perf -
This is the first marked improvement with Static PGO that I can observe on my Intel system with this small test, but a bigger test with my big app did not move the needle in the right direction.
I am going to test with .NET 9 now.
.NET 8 JIT:
Iteration 0: 1149 ms.
Iteration 1: 990 ms.
Iteration 2: 998 ms.
Iteration 3: 1002 ms.

.NET 9 (9.0.100-alpha.1.23579.33) JIT (slower!!!):
Iteration 0: 1393 ms
Iteration 1: 1180 ms
Iteration 2: 1179 ms
Iteration 3: 1196 ms

.NET 8 AOT with PGO:
Iteration 0: 1293 ms.
Iteration 1: 1306 ms.
Iteration 2: 1309 ms.
Iteration 3: 1308 ms.

.NET 9 AOT with PGO:
Iteration 0: 1316 ms
Iteration 1: 1298 ms
Iteration 2: 1299 ms
Iteration 3: 1307 ms
What setting should be used to log inliner's decisions?
Is there a comprehensive list of all available (even if not officially supported) DOTNET_ knobs/settings that would let one see the internal logic better?
I had a look at the JitDump for both cases and I think I see what the problem is. The relevant part seems to be in the inlining decision:
JIT:
4 ldfld or stfld over arguments which are structs. Multiplier increased to 1.
Inline candidate has 1 binary expressions with constants. Multiplier increased to 1.5.
Inline candidate has an arg that feeds a constant test. Multiplier increased to 2.5.
Inline candidate callsite is in a loop. Multiplier increased to 5.5.
Callsite has profile data: 1.51962e+06. Multiplier limited to 24.75.
calleeNativeSizeEstimate=678
callsiteNativeSizeEstimate=85
benefit multiplier=24.75
threshold=2103
Native estimate for function size is within threshold for inlining 67.8 <= 210.3 (multiplier = 24.75)
AOT:
4 ldfld or stfld over arguments which are structs. Multiplier increased to 1.
Inline candidate has 1 binary expressions with constants. Multiplier increased to 1.5.
Inline candidate has an arg that feeds a constant test. Multiplier increased to 2.5.
Inline candidate callsite is in a loop. Multiplier increased to 5.5.
calleeNativeSizeEstimate=678
callsiteNativeSizeEstimate=85
benefit multiplier=5.5
threshold=467
Native estimate for function size exceeds threshold for inlining 67.8 > 46.7 (multiplier = 5.5)
Inline expansion aborted, inline not profitable
The AOT case is missing `Callsite has profile data: 1.51962e+06. Multiplier limited to 24.75.` Did we forget there was profile data on the call in the AOT case?
Here's the JitDumps for someone who's more of an expert to confirm:
jitdump_test_aot.txt jitdump_test_jit.txt
I wonder if this is relevant to the discussion we had here: https://github.com/dotnet/runtime/pull/86551#issuecomment-1557037420. Cc @EgorBo
> What setting should be used to log inliner's decisions?
You can use `DOTNET_JitDump=...` (same syntax as JitDisasm) to see everything the JIT did. But you need a debug/checked build of the JIT. This doesn't work in the shipping configuration. I don't know if there's a more restrictive setting. The full list is in code; some of these have docs in markdown elsewhere in this repo: https://github.com/dotnet/runtime/blob/8c07adcda9ed72249e47fe7e7e5e8a984333251e/src/coreclr/jit/jitconfigvalues.h
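For example, for the JIT (non-AOT) run it would be something like this (requires a Checked/Debug clrjit; the method pattern and output file name are illustrative):

```
set DOTNET_JitDump=MyGenerator:MoveNext
set DOTNET_JitStdOutFile=jitdump_jit.txt
dotnet bin\Release\net8.0\win-x64\PgoExperiment.dll
```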
Excellent, Michal, thank you - are there checked/debug build binaries available for download?
I assume you've used MIBC profile data, but just in case not, here is the MIBC file that I've collected.
It sure feels like the PGO data isn't reaching where it should be, but is it possible to check that it was actually collected correctly and is present in the MIBC file with the same sufficient quality that the JIT sees?
I collected my own PGO data and used that. I think it was incorporated because the JIT dump has `*************** Starting PHASE Profile incorporation`.
I don't think we have any packages with a debug JIT. You'll have to build it. If you're lucky (like I was), you can just run build.sh clr.jit -c Debug
from the repo root of the main branch and replace clrjit.so in the latest .NET 9 dotnet installer with the one you just built. It doesn't always work because sometimes there's changes to the interface, but it worked today. Otherwise the full instructions are at https://github.com/dotnet/runtime/tree/main/docs/workflow/building/coreclr.
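For a Windows setup the equivalent would be roughly the following (the artifact and installation paths are from memory and may differ):

```
rem from the repo root of dotnet/runtime main
build.cmd clr.jit -c Debug

rem overwrite the JIT in the installed .NET 9 shared framework (adjust the version folder)
copy /Y artifacts\bin\coreclr\windows.x64.Debug\clrjit.dll "C:\Program Files\dotnet\shared\Microsoft.NETCore.App\<your 9.0 version>\clrjit.dll"
```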
So, it looks like the collected PGO data is either not reaching the JIT or not being taken into account by it when making inlining decisions in the AOT scenario.
This would explain why our much larger app is also not seeing any improvements with static PGO data.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch. See info in area-owners.md if you want to be subscribed.
Author: | HighPerfDotNet |
---|---|
Assignees: | - |
Labels: | `tenet-performance`, `area-CodeGen-coreclr` |
Milestone: | 9.0.0 |
Description
Hello,
I've been trying to get Static PGO working in Native AOT builds, something that is really necessary due to the otherwise lower performance I've observed in such builds. For this repro I've tried to follow one of the rare examples of how it should work, from here: https://znlive.com/how-to-use-pgo-to-improve-the-performance (Static PGO seems to be very rarely used, which made me sad).
I've followed all the instructions and finally collected profile stats into a MIBC file. However, after that, all my attempts to get this data actually used failed - the resulting binaries were as slow as without profile data. Here are outputs from the builds:
Native AOT version as is, without Static PGO applied -
Native AOT version supposedly WITH Static PGO:
The binaries are nearly identical too, so it seems that the PGO data is just being ignored.
I've read practically everything I could find on this subject on the web and in old issues here; the typical error appears to have been how multiple parameters are passed to crossgen2. I've tried the forms that are supposed to work, and all sorts of combinations, all to no avail. My current params are:
/p:PublishReadyToRunCrossgen2ExtraArgs=--embed-pgo-data%3b--mibc%3apgo.mibc%3b--compilebubblegenerics
Note that --compilebubblegenerics was added based on reading this old issue https://github.com/dotnet/runtime/issues/55695
I hope that I've made some silly mistake in the parameters; however, at this point I've exhausted all possibilities and would love to hear from people who know this stuff better than me. I'm attaching a small repro with the MIBC file.
I ran it with logging (-v:n passed to dotnet) and noticed the following in the log:
I don't know if this is significant - the binaries are being created, and no amount of playing with options changed this. My general feeling is that crossgen2 isn't used at all; there is no mention of it in the generated log. I can't see what I am doing wrong though, as this is a very basic default project that in theory should work just fine. HELP!
Reproduction Steps
StaticPGO.zip
File includes the following:
- build_all.bat - batch file to build executables with and without PGO being applied
- build_no_pgo.bat - build batch file without PGO file provided
- build_with_pgo.bat - build batch file with PGO mibc file provided
- pgo.mibc - profile data captured for this application (removed scripts to do it for simplicity but can provide)
- PgoExperiment.csproj - project file
- Program.cs - simple test source code
Expected behavior
The MIBC file with application profile data should have been taken into account to improve the performance of the Native AOT build, which is otherwise much slower than the non-Native-AOT build.
Actual behavior
Based on performance, the PGO file appears to be ignored completely, and the binaries are almost identical too, apart from a few bytes (time/version stamps most likely).
Regression?
Yes - based on that blog post, it appears to have worked in the past with .NET 6.
Known Workarounds
None since Native AOT is only kind-of-ready now with .NET 8 release.
Configuration
.NET 8.0.0 with current SDK
Visual Studio 2022 17.8.1
Windows 10
Intel 9900K CPU, 32 GB RAM
Other information
As a side note - Native AOT is great, but it does need Static PGO to get closer to (or maybe even ahead of) normal builds, because right now it is noticeably slower: in one of our big projects I've observed a 20-25% perf regression compared with normal builds, BUT memory usage was a LOT better. So using Static PGO is totally essential in my view; in theory it should be simple, but in practice I've run into difficulty getting it off the ground, not least due to this reported issue.
Also, dotnet-pgo.exe should be made available as part of the SDK. The old executable available for .NET 6 does not work on .NET 8 (which is understandable), and it was rather challenging to get it built from sources (I ran into issues there that I don't want to bring up here). Even though I was successful in the end, I think this tool needs to be provided in binary form in the SDK.