Closed danmoseley closed 1 year ago
Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.
Author: danmoseley
Assignees: -
Labels: `area-System.Runtime`, `tenet-performance`, `regression-from-last-release`
Milestone: 7.0.0
For .NET 7 you can use `DOTNET_JitDisasm` in BDN to obtain the JIT disasm, which will tell you whether PGO data was found (at least for the root method).
Given the regression exists with PGO disabled, maybe it isn't critical.
@dotnet/jit-contrib can someone on the JIT team comment on @gfoidl 's findings above? We are trying to figure out why there has been a regression in some cases.
@stephentoub did you try a variety of tests (length of input and number of hits) and see a general regression?
with PGO disabled
We can only disable dynamic PGO (D-PGO), not static PGO. Correct? So if static PGO is the culprit here, then JitDisablePGO=1 shouldn't have a serious effect, which the numbers confirm.
With `JitDisasm` set and `JitDisablePGO` not set, I get:
; Assembly listing for method String:Replace(ushort,ushort):String:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Static PGO: edge weights are invalid, and fgCalledCount is 47166
; 1 inlinees with PGO data; 16 single block inlinees; 0 inlinees without PGO data
What is the meaning of "edge weights are invalid"?
With `JitDisablePGO` set, it's:
; Assembly listing for method String:Replace(ushort,ushort):String:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; PGO data available, but JitDisablePgo > 0
; 0 inlinees with PGO data; 16 single block inlinees; 2 inlinees without PGO data
JitDisasm is really cool, big thanks for that feature 👍🏻
@gfoidl you mentioned you see the same regression vs 6.0 on this example, but do you still see the wins in the scenarios you tried while working on the PR -- or did they vanish somehow?
@DrewScoggins or @EgorBo can you help point me at recent results for our perf test here? I am not sure how to find them. Specifically I'd like to see what change they saw either side of the PR https://github.com/dotnet/runtime/pull/67049.
@dotnet/jit-contrib do you see anything in the assembly before and after that @gfoidl posted above? including the annotated version higher up.
We can only disable dynamic PGO (D-PGO), not the static PGO. Correct?
DOTNET_JitDisablePGO=1 should disable both. Static PGO specifically can be disabled by simply using DOTNET_ReadyToRun=0
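For reference, this is how those knobs can be set for a run; the variable names are as given in this thread, and the disasm method pattern is illustrative:

```shell
# Disable PGO (per the comment above, this disables both dynamic and static PGO):
export DOTNET_JitDisablePGO=1

# Alternatively, disable static PGO specifically by skipping ReadyToRun code:
export DOTNET_ReadyToRun=0

# Dump the JIT-generated assembly for a method (name pattern is illustrative):
export DOTNET_JitDisasm="String:Replace"
```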
What is the meaning of "edge weights are invalid"?
Nvm, it's just a sign that the JIT made some mistakes calculating edge weights - it happens in many cases.
@DrewScoggins or @EgorBo can you help point me at recent results for our perf test here?
Here is the link: https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/TestHistoryIndexIndex.html You can open any machine, e.g. Ubuntu x64 and then find the benchmark you need via Ctrl+F, e.g. https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362%2fSystem.Tests.Perf_String.Replace_Char(text%3a%20%22Hello%22%2c%20oldChar%3a%20%27a%27%2c%20newChar%3a%20%27b%27).html
do you still see the wins in the scenarios you tried while working on the PR -- or did they vanish somehow?
While working on the PR I duplicated the code of `string.Replace` in order to compare both versions (default and PR) within the same benchmark run on .NET 7. There was a clear win for the PR.
When I use the benchmark from your gist (.NET 6 vs. .NET 7, with the bits provided by the SDK installer) with different input strings (a kind of manual random exploration), the results are really weird. Most of the time .NET 6 is faster, even for inputs where the vectorized code path isn't hit at all. Only when nothing needs to be replaced is .NET 7 consistently faster.
With the same inputs and the duplicated code (the benchmark run to compare old and new version of code with .NET 7) the new version is faster.
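The duplicate-and-compare technique described above can be sketched roughly like this (a hypothetical benchmark class; `Replace_Old` and `Replace_New` are placeholders standing in for the pasted pre-PR and post-PR implementations, and the BenchmarkDotNet package is assumed):

```csharp
using BenchmarkDotNet.Attributes;

public class ReplaceComparison
{
    private const string Input = "Hello world";

    // Placeholder: copy of the pre-PR implementation, pasted into the benchmark
    // project so both versions run under the exact same runtime and JIT settings.
    private static string Replace_Old(string s, char oldChar, char newChar) => /* pasted .NET 6 code */ s;

    // Placeholder: copy of the post-PR (vectorized) implementation.
    private static string Replace_New(string s, char oldChar, char newChar) => /* pasted PR code */ s;

    [Benchmark(Baseline = true)]
    public string Old() => Replace_Old(Input, 'l', '!');

    [Benchmark]
    public string New() => Replace_New(Input, 'l', '!');
}
```

Running both implementations in one process removes cross-run differences (runtime version, alignment, environment) from the comparison, which is why the standalone comparison can show a win while the shipped bits do not.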
Bearing in mind the change went in on Aug 16:
- Some scenarios show no change (but a regression earlier in the month)
- Some show a big improvement
- Some show a regression
Not sure whether you can access these files @gfoidl I'm assuming not (and if not ideally we would fix that)
With the same inputs and the duplicated code (the benchmark run to compare old and new version of code with .NET 7) the new version is faster.
This is curious... what could cause the standalone version to be faster, but the in-product version slower, especially if PGO is disabled... code alignment?
Yes I can access these files.
It's also interesting that arm64 / ubuntu 20.04 shows improvements, e.g.
Wasm (if I interpret "CompilationMode=wasm,RunKind=micro" correctly) shows a regression, and I assume that wasm doesn't support vectors (or does it?).
code alignment?
This is my bet. Some results in the charts look unstable -- which might be from alignment (amongst others).
What bothers me is that with the duplicated code the theory of "processing the remaining elements in one vectorized pass is faster" is confirmed. But with the shipped bits it is not clear; rather, it regresses. For sure, theory is one thing and measurements are the real thing, but a couple of vector instructions should beat a scalar loop with up to 16 iterations (AVX machine).
@AndyAyersMS @kunalspathak is there some way we can help determine whether it is an alignment issue?
Not sure whether you can access these files @gfoidl I'm assuming not (and if not ideally we would fix that)
The perf history should be visible by anyone.
@AndyAyersMS @kunalspathak is there some way we can help determine whether it is an alignment issue?
For code alignment you can (with suitable checked jit) also set DOTNET_JitDasmWithAlignmentBoundaries=1
. But there is subtlety here, in particular around the interaction of jumps and 32 byte boundaries per the intel jcc erratum. We currently do not have the ability to mitigate this in the jit.
Given we're dealing with strings, data alignment may also play a role. You might try rerunning the BDN results with an increased number of measurement intervals and memory randomization (`--memoryRandomization --iterationCount 100`).
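Combining these suggestions, a run might look like the following (a sketch; the alignment-boundary dump requires a checked JIT build, and the benchmark filter is illustrative):

```shell
# Annotate the disasm with 32-byte alignment boundaries (checked JIT only):
export DOTNET_JitDasmWithAlignmentBoundaries=1

# Rerun with more measurement iterations and randomized memory layout:
dotnet run -f net7.0 -c Release --filter 'System.Tests.Perf_String.Replace_Char*' \
  --memoryRandomization true --iterationCount 100
```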
Running the benchmarks with `--memoryRandomization true` doesn't change the picture in general for the comparison of .NET 6 <-> .NET 7.
What's a "suitable checked jit"? A checked build of the JIT, with which one can get the JIT disasm?
What else can I do to move forward here?
@AndyAyersMS @kunalspathak is there some way we can help determine whether it is an alignment issue?
In both .NET 6 and .NET 7, the vectorized loop is not aligned, and I don't see the JCC erratum coming into play in those loops.
One thing that I noticed is that in .NET 6, `Replace` is optimized (I believe with QuickJitLoopBody), but in .NET 7, it goes through tiering.
in .NET 7, it goes through tiering.
Is there an environment variable to try that would force it to be optimized? Although, I thought BDN defeats tiering.
I did try `COMPlus_TC_QuickJitForLoops=0`, but the codegen doesn't change. So I don't think that is the issue.
I believe @stephentoub you said that you'd tried various scenarios (lengths, position of hits if any) and it was slower in general. So this is not a case of "are the improvements worth the regressions" -- we have to either fix the regression or revert this change for 7.0 to get back to a known state. Given the schedule, we need to do one or the other by the end of the week.
Is there a clear next step for investigation, or should we revert for 7.0? We can continue working on the problem in order to have the change in 8.0.
I spent a bit more time running various tests. I suspect this is actually not related to the Replace PR and instead related more to something allocation-related, like regions in .NET 7. I see comparable regressions with these:
const string Input = """
Whose woods these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his woods fill up with snow.
My little horse must think it queer
To stop without a farmhouse near
Between the woods and frozen lake
The darkest evening of the year.
He gives his harness bells a shake
To ask if there is some mistake.
The only other sound’s the sweep
Of easy wind and downy flake.
The woods are lovely, dark and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.
""";
private char[] _chars = Input.ToCharArray();
[Benchmark]
public string WithContent1() => new string(_chars);
[Benchmark]
public string WithContent2() => string.Create(Input.Length, Input, (dest, state) => state.AsSpan().CopyTo(dest));
[Benchmark]
public string WithoutContent1() => string.Create(Input.Length, Input, (dest, state) => { });
[Benchmark]
public string WithoutContent2() => new string('\0', Input.Length);
Method | Runtime | Mean | Ratio |
---|---|---|---|
WithContent1 | .NET 6.0 | 105.97 ns | 1.00 |
WithContent1 | .NET 7.0 | 120.65 ns | 1.15 |
WithContent2 | .NET 6.0 | 104.12 ns | 1.00 |
WithContent2 | .NET 7.0 | 122.60 ns | 1.18 |
WithoutContent1 | .NET 6.0 | 79.04 ns | 1.00 |
WithoutContent1 | .NET 7.0 | 103.15 ns | 1.30 |
WithoutContent2 | .NET 6.0 | 76.69 ns | 1.00 |
WithoutContent2 | .NET 7.0 | 100.13 ns | 1.31 |
Does disabling GC regions help?
Does disabling GC regions help?
Latest RC2:
Method | Mean |
---|---|
WithContent1 | 121.54 ns |
WithContent2 | 121.04 ns |
WithoutContent1 | 95.10 ns |
WithoutContent2 | 93.52 ns |
Latest RC2 w/ DOTNET_GCName="clrgc.dll"
Method | Mean |
---|---|
WithContent1 | 107.23 ns |
WithContent2 | 108.93 ns |
WithoutContent1 | 87.14 ns |
WithoutContent2 | 84.36 ns |
@mangod9, is this expected?
Some similar smaller microbenchmarks have regressed due to the GC write barrier work. Please see: https://github.com/dotnet/runtime/issues/74014. @PeterSolMS has investigated and determined that the write barrier work should help with most real-world workloads. Does `COMPLUS_GCWriteBarrier=3` help improve perf for this?
Figured this was a good opportunity to make use of the new tooling we created during quality week that automates running microbenchmarks and generates comparative results. (CC: @dotnet/gc)
I was able to repro this regression locally (.NET 7 execution time is much higher than that of .NET 6 for longer strings) by running all microbenchmarks matching the filter `System.Tests.Perf_String.Replace_Char*` for:
baseline (.NET 6):
dotnet run -f net6.0 --filter System.Tests.Perf_String.Replace_Char -c Release --noOverwrite --memory --artifacts C:\String.Replace_Char\baseline

writebarrier_1 (COMPlus_GCWriteBarrier=1):
dotnet run -f net7.0 --filter System.Tests.Perf_String.Replace_Char -c Release --noOverwrite --memory --artifacts C:\String.Replace_Char\writebarrier_1

writebarrier_3 (COMPlus_GCWriteBarrier=3 and COMPlus_EnableWriteXorExecute=0):
dotnet run -f net7.0 --filter System.Tests.Perf_String.Replace_Char -c Release --noOverwrite --memory --artifacts C:\String.Replace_Char\writebarrier_3 --envVars COMPlus_GCWriteBarrier:3 COMPlus_EnableWriteXorExecute:0

segments (COMPLUS_GCName=clrgc.dll):
dotnet run -f net7.0 --filter System.Tests.Perf_String.Replace_Char -c Release --noOverwrite --memory --artifacts C:\String.Replace_Char\segments --envVars COMPLUS_GCName:clrgc.dll
The main observation was that there was not a significant enough difference after setting COMPlus_GCWriteBarrier=3 and COMPlus_EnableWriteXorExecute=0 and therefore, we regressed by ~7.5% for .NET 7 in comparison to .NET 6 for System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi792e5gkuuzevo5qm8qt83edag7zovoe686gmtw730kms2i5xgji4xcp25287q68fvhwszd3mszht2uh7bchlgkj5qnq1x9m4lg7vwn8cq5l756akua6oyx9k71bmxbysnmhvxvlxde4k9maumfgxd8gxhxx4mwpph2ttyox9zilt3ylv1q9s4bopfuoa8qlrzodg2q67sh85wx4slcd6w7ufnendaxai633ove2ktbaxdt2sz6y6mo42473xd274gz833p6hj3mu77c4m4od9e5s8btxleh0efqnu9zj9rwtbk5758lio35b3q426j5fwwq1qyknfedrsmqyfw1m38mkkotdf7n0vr6p3erhy8dkzntr9fwjrslxjgrbegih0n6bpb5bfuy55bu65ce9kejcfifxwpcs05umrsb8kvd64q2iwugbbi7vd35g5ho0rff9rhombgzzaniyq7bbjbqr88jyw4ccgnoyl31of3a5thv0vg08gnrqzxas800hewtw8tnwgw5pav81ntdpdd62689x3iqpc317y82b3e2trbpdzieoxldaz009tz37gqmh4bdp1bv9lnl5s58udb11z0h7i2sdl5nbyhjyfzxwzezmp4qx0i3eyvsd3fg8sryq9jhlvkonnfcvb4snl4mcbimdzg49tzdhqjmfxfcq3p1st6b9x2xyevo17evpqp4yc4f2rm0f26ivr3t2f5m0boc44vituxaovcqy1jrkcs6im2kdu3jvcexx2k76egve63aon5a6nbxss4rcke90npmqp35qluf571ms160y2nhaqef835wah41qru8tauu362v0r8konl8", oldChar: 'b', newChar: '+').
Also, we didn't observe the regression for this large string for segments implying that this microbenchmark regressed once we enabled regions (full details below); unfortunately, the trend for this microbenchmark doesn't go back to before we enabled regions:
BenchmarkDotNet=v0.13.1.1847-nightly, OS=Windows 11 (10.0.22000.856/21H2)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
Microbenchmarks Considered:
Benchmark Name | Baseline | Comparand | Baseline Mean Duration | Comparand Mean Duration | Δ Mean Duration | Δ% Mean Duration |
---|---|---|---|---|---|---|
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'z', newChar: 'y') | baseline | segments | 7.1 | 5.59 | -1.51 | -21.29 |
Benchmark Name | Baseline | Comparand | Baseline Mean Duration | Comparand Mean Duration | Δ Mean Duration | Δ% Mean Duration |
---|---|---|---|---|---|---|
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58l", oldChar: 'b', newChar: '+') | baseline | writebarrier_3 | 19.74 | 23.14 | 3.4 | 17.22 |
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58l", oldChar: 'b', newChar: '+') | baseline | writebarrier_1 | 19.74 | 21.81 | 2.07 | 10.49 |
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi792e5gkuuzevo5qm8qt83edag7zovoe686gmtw730kms2i5xgji4xcp25287q68fvhwszd3mszht2uh7bchlgkj5qnq1x9m4lg7vwn8cq5l756akua6oyx9k71bmxbysnmhvxvlxde4k9maumfgxd8gxhxx4mwpph2ttyox9zilt3ylv1q9s4bopfuoa8qlrzodg2q67sh85wx4slcd6w7ufnendaxai633ove2ktbaxdt2sz6y6mo42473xd274gz833p6hj3mu77c4m4od9e5s8btxleh0efqnu9zj9rwtbk5758lio35b3q426j5fwwq1qyknfedrsmqyfw1m38mkkotdf7n0vr6p3erhy8dkzntr9fwjrslxjgrbegih0n6bpb5bfuy55bu65ce9kejcfifxwpcs05umrsb8kvd64q2iwugbbi7vd35g5ho0rff9rhombgzzaniyq7bbjbqr88jyw4ccgnoyl31of3a5thv0vg08gnrqzxas800hewtw8tnwgw5pav81ntdpdd62689x3iqpc317y82b3e2trbpdzieoxldaz009tz37gqmh4bdp1bv9lnl5s58udb11z0h7i2sdl5nbyhjyfzxwzezmp4qx0i3eyvsd3fg8sryq9jhlvkonnfcvb4snl4mcbimdzg49tzdhqjmfxfcq3p1st6b9x2xyevo17evpqp4yc4f2rm0f26ivr3t2f5m0boc44vituxaovcqy1jrkcs6im2kdu3jvcexx2k76egve63aon5a6nbxss4rcke90npmqp35qluf571ms160y2nhaqef835wah41qru8tauu362v0r8konl8", oldChar: 'b', newChar: '+') | baseline | writebarrier_1 | 117.59 | 126.24 | 8.65 | 7.35 |
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi792e5gkuuzevo5qm8qt83edag7zovoe686gmtw730kms2i5xgji4xcp25287q68fvhwszd3mszht2uh7bchlgkj5qnq1x9m4lg7vwn8cq5l756akua6oyx9k71bmxbysnmhvxvlxde4k9maumfgxd8gxhxx4mwpph2ttyox9zilt3ylv1q9s4bopfuoa8qlrzodg2q67sh85wx4slcd6w7ufnendaxai633ove2ktbaxdt2sz6y6mo42473xd274gz833p6hj3mu77c4m4od9e5s8btxleh0efqnu9zj9rwtbk5758lio35b3q426j5fwwq1qyknfedrsmqyfw1m38mkkotdf7n0vr6p3erhy8dkzntr9fwjrslxjgrbegih0n6bpb5bfuy55bu65ce9kejcfifxwpcs05umrsb8kvd64q2iwugbbi7vd35g5ho0rff9rhombgzzaniyq7bbjbqr88jyw4ccgnoyl31of3a5thv0vg08gnrqzxas800hewtw8tnwgw5pav81ntdpdd62689x3iqpc317y82b3e2trbpdzieoxldaz009tz37gqmh4bdp1bv9lnl5s58udb11z0h7i2sdl5nbyhjyfzxwzezmp4qx0i3eyvsd3fg8sryq9jhlvkonnfcvb4snl4mcbimdzg49tzdhqjmfxfcq3p1st6b9x2xyevo17evpqp4yc4f2rm0f26ivr3t2f5m0boc44vituxaovcqy1jrkcs6im2kdu3jvcexx2k76egve63aon5a6nbxss4rcke90npmqp35qluf571ms160y2nhaqef835wah41qru8tauu362v0r8konl8", oldChar: 'b', newChar: '+') | baseline | writebarrier_3 | 117.59 | 125.53 | 7.94 | 6.75 |
Benchmark Name | Baseline | Comparand | Baseline Mean Duration | Comparand Mean Duration | Δ Mean Duration | Δ% Mean Duration |
---|---|---|---|---|---|---|
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'z', newChar: 'y') | baseline | writebarrier_3 | 7.1 | 5.84 | -1.26 | -17.78 |
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'z', newChar: 'y') | baseline | writebarrier_1 | 7.1 | 5.88 | -1.23 | -17.26 |
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'i', newChar: 'I') | baseline | segments | 16.78 | 15.71 | -1.06 | -6.34 |
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'l', newChar: '!') | baseline | segments | 10.37 | 9.8 | -0.57 | -5.52 |
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'l', newChar: '!') | baseline | writebarrier_1 | 10.37 | 9.82 | -0.55 | -5.27 |
Benchmark Name | Baseline | Comparand | Baseline Mean Duration | Comparand Mean Duration | Δ Mean Duration | Δ% Mean Duration |
---|---|---|---|---|---|---|
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'l', newChar: '!') | baseline | writebarrier_3 | 10.37 | 9.91 | -0.46 | -4.4 |
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'a', newChar: 'b') | baseline | segments | 3.26 | 3.13 | -0.13 | -3.87 |
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'i', newChar: 'I') | baseline | writebarrier_1 | 16.78 | 16.18 | -0.59 | -3.55 |
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'i', newChar: 'I') | baseline | writebarrier_3 | 16.78 | 16.9 | 0.13 | 0.76 |
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi792e5gkuuzevo5qm8qt83edag7zovoe686gmtw730kms2i5xgji4xcp25287q68fvhwszd3mszht2uh7bchlgkj5qnq1x9m4lg7vwn8cq5l756akua6oyx9k71bmxbysnmhvxvlxde4k9maumfgxd8gxhxx4mwpph2ttyox9zilt3ylv1q9s4bopfuoa8qlrzodg2q67sh85wx4slcd6w7ufnendaxai633ove2ktbaxdt2sz6y6mo42473xd274gz833p6hj3mu77c4m4od9e5s8btxleh0efqnu9zj9rwtbk5758lio35b3q426j5fwwq1qyknfedrsmqyfw1m38mkkotdf7n0vr6p3erhy8dkzntr9fwjrslxjgrbegih0n6bpb5bfuy55bu65ce9kejcfifxwpcs05umrsb8kvd64q2iwugbbi7vd35g5ho0rff9rhombgzzaniyq7bbjbqr88jyw4ccgnoyl31of3a5thv0vg08gnrqzxas800hewtw8tnwgw5pav81ntdpdd62689x3iqpc317y82b3e2trbpdzieoxldaz009tz37gqmh4bdp1bv9lnl5s58udb11z0h7i2sdl5nbyhjyfzxwzezmp4qx0i3eyvsd3fg8sryq9jhlvkonnfcvb4snl4mcbimdzg49tzdhqjmfxfcq3p1st6b9x2xyevo17evpqp4yc4f2rm0f26ivr3t2f5m0boc44vituxaovcqy1jrkcs6im2kdu3jvcexx2k76egve63aon5a6nbxss4rcke90npmqp35qluf571ms160y2nhaqef835wah41qru8tauu362v0r8konl8", oldChar: 'b', newChar: '+') | baseline | segments | 117.59 | 120.41 | 2.82 | 2.4 |
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'a', newChar: 'b') | baseline | writebarrier_3 | 3.26 | 3.36 | 0.1 | 2.97 |
System.Tests.Perf_String.Replace_Char(text: "Hello", oldChar: 'a', newChar: 'b') | baseline | writebarrier_1 | 3.26 | 3.36 | 0.1 | 3.06 |
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58l", oldChar: 'b', newChar: '+') | baseline | segments | 19.74 | 20.69 | 0.95 | 4.82 |
Does COMPLUS_GCWriteBarrier=3 help improve perf for this?
Not really, in fact on its own it seems to mostly make it worse. DOTNET_GCName="clrgc.dll" to disable regions consistently improves things:
Method | EnvironmentVariables | Mean | Ratio |
---|---|---|---|
WithContent1 | COMPLUS_GCWriteBarrier=3 | 118.50 ns | 1.05 |
WithContent1 | DOTNET_GCName=clrgc.dll | 104.32 ns | 0.94 |
WithContent1 | DOTNET_GCName=clrgc.dll,COMPLUS_GCWriteBarrier=3 | 103.97 ns | 0.93 |
WithContent1 | Empty | 111.99 ns | 1.00 |
WithContent2 | COMPLUS_GCWriteBarrier=3 | 119.47 ns | 1.04 |
WithContent2 | DOTNET_GCName=clrgc.dll | 103.81 ns | 0.91 |
WithContent2 | DOTNET_GCName=clrgc.dll,COMPLUS_GCWriteBarrier=3 | 105.34 ns | 0.92 |
WithContent2 | Empty | 114.48 ns | 1.00 |
WithoutContent1 | COMPLUS_GCWriteBarrier=3 | 93.04 ns | 0.96 |
WithoutContent1 | DOTNET_GCName=clrgc.dll | 86.99 ns | 0.89 |
WithoutContent1 | DOTNET_GCName=clrgc.dll,COMPLUS_GCWriteBarrier=3 | 84.69 ns | 0.87 |
WithoutContent1 | Empty | 97.29 ns | 1.00 |
WithoutContent2 | COMPLUS_GCWriteBarrier=3 | 92.96 ns | 1.02 |
WithoutContent2 | DOTNET_GCName=clrgc.dll | 84.50 ns | 0.92 |
WithoutContent2 | DOTNET_GCName=clrgc.dll,COMPLUS_GCWriteBarrier=3 | 82.33 ns | 0.90 |
WithoutContent2 | Empty | 91.44 ns | 1.00 |
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;
using System;
[Config(typeof(ConfigWithCustomEnvVars))]
public partial class Program
{
static void Main(string[] args) => BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
private sealed class ConfigWithCustomEnvVars : ManualConfig
{
public ConfigWithCustomEnvVars()
{
AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).AsBaseline());
AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithEnvironmentVariables(new EnvironmentVariable("DOTNET_GCName", "clrgc.dll")));
AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithEnvironmentVariables(new EnvironmentVariable("COMPLUS_GCWriteBarrier", "3")));
AddJob(Job.Default.WithRuntime(CoreRuntime.Core70).WithEnvironmentVariables(new EnvironmentVariable("DOTNET_GCName", "clrgc.dll"), new EnvironmentVariable("COMPLUS_GCWriteBarrier", "3")));
}
}
const string Input = """
Whose woods these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his woods fill up with snow.
My little horse must think it queer
To stop without a farmhouse near
Between the woods and frozen lake
The darkest evening of the year.
He gives his harness bells a shake
To ask if there is some mistake.
The only other sound’s the sweep
Of easy wind and downy flake.
The woods are lovely, dark and deep,
But I have promises to keep,
And miles to go before I sleep,
And miles to go before I sleep.
""";
private char[] _chars = Input.ToCharArray();
[Benchmark]
public string WithContent1() => new string(_chars);
[Benchmark]
public string WithContent2() => string.Create(Input.Length, Input, (dest, state) => state.AsSpan().CopyTo(dest));
[Benchmark]
public string WithoutContent1() => string.Create(Input.Length, Input, (dest, state) => { });
[Benchmark]
public string WithoutContent2() => new string('\0', Input.Length);
}
Seeing possibly similar issues over in #64626 (ignore the PGO aspect; we're regressed even w/o PGO).
Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.
Author: danmoseley
Assignees: -
Labels: `blocking-release`, `tenet-performance`, `area-GC-coreclr`, `regression-from-last-release`
Milestone: 7.0.0
Repeated the `WithContent` experiment in a GC-centric way, where we hardcoded the iterations, reduced the number of forced/induced GCs by BDN, and prevented the removal of outliers, between segments, regions, and regions with a more precise write barrier, and reached similar conclusions as above:
Next steps: We are tracking decommit issues here: https://github.com/dotnet/runtime/issues/73592. Since the diagnosis of this regression is perceivably linked to that issue, which @PeterSolMS is currently working on fixing, I'm moving the milestone to 8.0. Please let me know if that's an incorrect follow-up.
CC: @dotnet/gc
BenchmarkDotNet=v0.13.1.1847-nightly, OS=Windows 11 (10.0.22000.856/21H2)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK=7.0.100-preview.7.22377.5
[Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
Job-YUBOFT : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
Benchmark Name | Baseline | Comparand | Baseline Mean Duration | Comparand Mean Duration | Δ Mean Duration | Δ% Mean Duration | Baseline number of iterations | Comparand number of iterations | Δ number of iterations | Δ% number of iterations | Baseline gc count | Comparand gc count | Δ gc count | Δ% gc count | Baseline median | Comparand median | Δ median | Δ% median | Baseline non induced gc count | Comparand non induced gc count | Δ non induced gc count | Δ% non induced gc count | Baseline total allocated (mb) | Comparand total allocated (mb) | Δ total allocated (mb) | Δ% total allocated (mb) | Baseline total pause time (msec) | Comparand total pause time (msec) | Δ total pause time (msec) | Δ% total pause time (msec) | Baseline gc pause time % | Comparand gc pause time % | Δ gc pause time % | Δ% gc pause time % | Baseline PageFault | Comparand PageFault | Δ PageFault | Δ% PageFault |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
System.Tests.Perf_String.WithContent1 | baseline | writebarrier_3 | 209.38 | 675 | 465.62 | 222.39 | 64 | 64 | 0 | 0 | 4 | 4 | 0 | 0 | 0 | 100 | 100 | ∞ | 0 | 0 | 0 | NaN | 1.91 | 1.92 | 0 | 0.08 | 1.52 | 1.52 | -0 | -0 | 2.04 | 2.44 | 0.4 | 19.42 | 11 | 13 | 2 | 18.18 |
System.Tests.Perf_String.WithContent1 | baseline | regions | 209.38 | 389.06 | 179.69 | 85.82 | 64 | 64 | 0 | 0 | 4 | 4 | 0 | 0 | 0 | 0 | 0 | NaN | 0 | 0 | 0 | NaN | 1.91 | 1.92 | 0 | 0.09 | 1.52 | 1.52 | -0 | -0.01 | 2.04 | 2.2 | 0.16 | 7.86 | 11 | 16 | 5 | 45.45 |
OutlierMode=DontRemove EnvironmentVariables=COMPlus_GCName=clrgc.dll PowerPlanMode=00000000-0000-0000-0000-000000000000
Force=False InvocationCount=1 IterationCount=64
IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15
UnrollFactor=1 WarmupCount=1
Method | Mean | Error | StdDev | StdErr | Median | Min | Max | Q1 | Q3 | Op/s | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
WithContent1 | 209.4 ns | 198.4 ns | 459.7 ns | 57.46 ns | 0.0 ns | 0.0 ns | 2,200.0 ns | 0.0 ns | 100.0 ns | 4,776,119.4 | 1.79 KB |
OutlierMode=DontRemove PowerPlanMode=00000000-0000-0000-0000-000000000000 Force=False
InvocationCount=1 IterationCount=64 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 UnrollFactor=1
WarmupCount=1
Method | Mean | Error | StdDev | StdErr | Median | Min | Max | Q1 | Q3 | Op/s | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
WithContent1 | 389.1 ns | 356.7 ns | 826.8 ns | 103.3 ns | 0.0 ns | 0.0 ns | 2,500.0 ns | 0.0 ns | 0.0 ns | 2,570,281.1 | 1.79 KB |
OutlierMode=DontRemove EnvironmentVariables=COMPlus_GCWriteBarrier=3,COMPlus_EnableWriteXorExecute=0 PowerPlanMode=00000000-0000-0000-0000-000000000000
Force=False InvocationCount=1 IterationCount=64
IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15
UnrollFactor=1 WarmupCount=1
Method | Mean | Error | StdDev | StdErr | Median | Min | Max | Q1 | Q3 | Op/s | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
WithContent1 | 675.0 ns | 833.7 ns | 1,932.3 ns | 241.5 ns | 100.0 ns | 0.0 ns | 14,400.0 ns | 100.0 ns | 125.0 ns | 1,481,481.5 | 1.79 KB |
I'm moving the milestone to 8.0; Please let me know if that's an incorrect follow-up.
@mrsharm the latest data comparing 6.0 to 7.0 (with standard settings) seems to show significant regressions. I don't think we want to willingly ship with these regressions. I think we need to keep this in the .NET 7 milestone and on the GC path, one way or another.
Given we don't want to regress, are these our options for .NET 7?
- revert the original change as a workaround
What change? The most recent benchmark examples aren't using Replace.
This should be investigated to understand what exactly is causing the regression before we move the milestone to 8.0. Last time I was going to investigate these regressions with Moko, we discovered that BDN was doing things very differently between two runs, so he's been focusing on making the runs repeatable. So I'm going to take another look with him.
What change? The most recent benchmark examples aren't using Replace.
Ah, I didn't see that. Scratch that option then. So we should either confirm it's not something real code would see, or find a fix.
Others are noticing this as well, e.g. https://twitter.com/realDotNetDave/status/1569724335220088832
We have been working through root-causing and fixing this issue and here are our conclusions:
`--outliers DontRemove`. Any thoughts / concerns here?
Benchmark | Baseline | Comparand | ΔMean | ΔMean % |
---|---|---|---|---|
System.Tests.Perf_String.Replace_Char_Custom | segments | regions | 8.04 | 9.28 |
System.Tests.Perf_String.Replace_Char_Custom | segments | decommit_fix | 4.84 | 5.72 |
`System.Tests.Perf_String.Replace_Char_Custom` is the benchmark created here.
Benchmark | Baseline | Comparand | ΔMean | ΔMean % |
---|---|---|---|---|
System.Tests.Perf_String.WithContent1 | segments | regions | 3.21 | 4.44 |
System.Tests.Perf_String.WithContent1 | segments | decommit_fix | 1.84 | 2.55 |
`System.Tests.Perf_String.WithContent1` is the benchmark created here.
Benchmark | Baseline | Comparand | ΔMean | ΔMean % |
---|---|---|---|---|
System.Tests.Perf_String.WithoutContent2 | segments | regions | 2.78 | 5.26 |
System.Tests.Perf_String.WithoutContent2 | segments | decommit_fix | 1.43 | 2.7 |
`System.Tests.Perf_String.WithoutContent2` is the benchmark created here.
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK=7.0.100-preview.7.22377.5
[Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
Job-OERVVK : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Full comparison table with the virtual commit and decommit data: Compare.xlsx
CC: @dotnet/gc
Thanks for the analysis @mrsharm. If the overall regression is < 5% with the decommit fix, we should be able to make further improvements in 8. @stephentoub would you agree?
@mrsharm
1. Is it possible to characterize which types of scenarios are affected by this regression? What do they have in common? 10% can be a lot, so it seems important to know what customers will see and where.
The common symptoms are higher virtual commits and decommits - both in terms of bytes and the number of calls. We are tracking all other regressed benchmarks with the same root cause here: https://github.com/dotnet/runtime/issues/73592
2. Will customers experience this full regression or do we have reason to believe it's worse in benchmarking situations?
Other members of @dotnet/gc can chime in here with more details; however, we do believe that we are doing better in actual customer scenarios based on our perf testing. Additionally, we believe that prospective updates will improve GC pause times even further.
@mrsharm, I'm a little confused. https://github.com/dotnet/runtime/pull/73620 is about workstation GC, right? Does it also apply to server GC? All the regressions I've measured have been with server GC.
Correct, the current fix is for workstation GC (wks), but @PeterSolMS @Maoni0 are working on a change for server GC (svr). @stephentoub, I assume you observe the regression for wks as well, or is it with svr only?
assume you observe regression for wks as well, or is it with svr only?
I'd only measured with server.
We didn't try the server GC fix on these since we didn't know you were running this with server GC. How many heaps are you running this with? Would it be possible for you to take a GCCollectOnly trace when you run it?
perfview /nogui /accepteula /GCCollectOnly collect test.etl
We didn't try the server GC fix on these since we didn't know you were running this with server GC. How many heaps are you running this with? Would it be possible for you to take a GCCollectOnly trace while you're running it?
This is on a 12-logical-core machine, so I assume 12 (I've not overridden any defaults other than having <ServerGarbageCollection>true</ServerGarbageCollection>
in my csproj). These are all trivial microbenchmarks, with the effect visible in or out of BenchmarkDotNet. For example, this:
using System;
using System.Diagnostics;
using System.Runtime;
internal class Program
{
static void Main()
{
Console.WriteLine($"IsServerGC: {GCSettings.IsServerGC}");
var sw = new Stopwatch();
for (int trials = 0; trials < 10; trials++)
{
sw.Restart();
for (int i = 0; i < 100_000_000; i++)
{
_ = new string('\0', 256);
}
sw.Stop();
Console.WriteLine(sw.Elapsed.TotalMilliseconds);
}
}
}
outputs the following on .NET 6:
IsServerGC: True
3541.6588
3853.6238
3792.7101
3909.1076
3854.9526
3445.2005
3344.1826
3343.1901
3331.0722
3410.051
and the following on .NET 7:
IsServerGC: True
4627.28
4488.4879
4512.589
4499.0618
5184.8121
4649.4892
4735.7333
4693.8634
4869.3155
4920.06
(obviously times vary a bit from run to run, but the relative size of the gap remains).
I can send you a GCCollectOnly trace if you can't repro it. I assumed from all of the comments from @mrsharm earlier that it was easily reproed.
I assumed from all of the comments from @mrsharm earlier that it was easily reproed.
He never attempted to repro it with server GC since none of us knew you ran this with server GC :) and this is a very unusual scenario to run server GC with (we don't run microbenchmarks with server GC because they are in general tiny things). But yes, we could repro it after we tried it, so we don't need a trace.
he never attempted to repro it with server GC since none of us knew you ran this with server GC :) and this is a very unusual scenario for running server GC with (we don't run microbenchmarks with server gc because they are in general tiny things)
Except for specific scenarios that demand otherwise, I always benchmark with server GC, since that's the default ASP.NET configuration.
Keep in mind that ASP.NET does not do all these induced GCs - the induced GCs that BDN does can dramatically change the memory perf behavior.
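(For context, the induced GCs in question are the forced collections BenchmarkDotNet runs around iterations to isolate them from each other. Simplified, they amount to something like the following C# sketch; the exact sequence is a BDN internal, not shown here verbatim:)

```csharp
// Simplified sketch of the collections BDN induces between iterations
// (not the actual BenchmarkDotNet source):
GC.Collect();                   // induced blocking gen2 GC
GC.WaitForPendingFinalizers();  // drain the finalizer queue
GC.Collect();                   // collect objects freed by finalizers
```

With regions, each of these blocking gen2 GCs can hand back a fully committed gen1 region, which is what drives the extra decommit work described later in this thread.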
It's why I also replicate findings without benchmarkdotnet, e.g. in the simple console app shown earlier where there aren't any GC.Collect calls.
So I took a look at the test @stephentoub had running in BDN and standalone. The reasons for the regression are very different. For the BDN scenario we are simply doing a ton of decommitting with regions, because BDN induced a lot of gen2 blocking GCs, which meant we often got a new gen1 region that's fully committed, so we ended up always decommitting the end of it. So it's more an artifact of running BDN. For the standalone case we do a bit of decommitting, but the major regression comes from memset costing more. Peter attempted a fix for it, but we don't understand the microarchitectural effect we are seeing there well enough for me to be comfortable merging it. And the fix makes the segment case improve just as much, so regions is not better off :D I'm also seeing higher GC cost, but this could totally be due to the fact it's a microbenchmark. I'm not so worried about this because we already validated the GC part a bunch.
The summary is we are not trying to get anything in for this issue this week. We might try to get something in next week if we understand the microarchitectural issue, but the likelihood is we'll make some fixes in the first servicing release.
Removing the release-blocking tag; we will possibly consider this for servicing.
We have improved the numbers for a number of microbenchmarks in .NET 8 and are tracking: https://github.com/dotnet/runtime/issues/73592
Closing this issue as we are tracking this with the aforementioned issue.
Moving discussion from the PR https://github.com/dotnet/runtime/pull/67049
@gfoidl, at least on my machine, comparing string.Replace in .NET 6 vs .NET 7, multiple examples I've tried have shown .NET 7 to have regressed, e.g.
| Method  | Runtime  | Mean     | Ratio |
|---------|----------|----------|-------|
| Replace | .NET 6.0 | 108.1 ns | 1.00  |
| Replace | .NET 7.0 | 136.0 ns | 1.26  |

Do you see otherwise?
@gfoidl
gfoidl commented yesterday Hm, that is not expected...
When I duplicate the string.Replace(char, char) method in order to compare the old and the new implementation, both on .NET 7, then I see
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1889 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host]     : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT
| Method  | Mean     | Error   | StdDev  | Median   | Ratio | RatioSD |
|---------|----------|---------|---------|----------|-------|---------|
| Default | 142.0 ns | 3.48 ns | 9.98 ns | 138.6 ns | 1.00  | 0.00    |
| PR      | 132.9 ns | 2.68 ns | 3.40 ns | 132.8 ns | 0.92  | 0.07    |

So a result I'd expect, as after the vectorized loop 6 chars remain, which the old code processes in the scalar for-loop whilst the new code handles in one vectorized pass.
I checked the dasm (via DisassemblyDiagnoser of BDN) and that looks OK.
Can this be something from different machine-code layout (loops), PGO, etc. that causes the difference between .NET 6 and .NET 7? How can I investigate this further? I need some guidance on how to check code layout, please.
@stephentoub stephentoub commented yesterday • Thanks, @gfoidl. Do you see a similar 6 vs 7 difference as I do? (It might not be specific to this PR.) @EgorBo, can you advise?
@tannergooding tannergooding commented yesterday When I duplicate the string.Replace(char, char) method in order to compare the old and the new implementation both on .NET 7 then I see
This could be related to stale PGO data
@danmoseley danmoseley commented yesterday Is there POGO data en-route that has trained with this change in place? I am not sure how to follow it.
@danmoseley danmoseley commented yesterday Also, it wouldn't matter here, but are we consuming POGO data trained on main bits in the release branches?
@stephentoub stephentoub commented yesterday • I don't think this particular case is related to stale PGO data. I set COMPlus_JitDisablePGO=1, and I still see an ~20% regression from .NET 6 to .NET 7.
@danmoseley danmoseley commented 21 hours ago • I ran the example above with
and got
BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2)
Intel Core i7-10510U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-rc.2.22426.5
  [Host]     : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
  Job-DGTURM : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2
  Job-PYGDYG : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
  Job-ZEPFOF : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2
  Job-PSEWWK : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256
  Job-WGVIGL : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2
  Job-HBSVYM : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
  Job-VWWZUC : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2
  Job-LDCOEC : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256
code https://gist.github.com/danmoseley/c31bc023d6ec671efebff7352e3b3251
(should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS )
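For reference, the switch mentioned above can be set for the whole benchmarked child process via an environment variable; on .NET 6+ the DOTNET_ prefix works in addition to the older COMPlus_ prefix (a sketch, using the variable named earlier in this thread):

```shell
# Disable dynamic PGO for the benchmarked process; unset to restore defaults.
# (COMPlus_JitDisablePGO=1 is the equivalent spelling on older runtimes.)
export DOTNET_JitDisablePGO=1
```

Note this disables only dynamic PGO; statically collected PGO data embedded in the runtime images is unaffected, which matters for interpreting the results above.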
@danmoseley danmoseley commented 21 hours ago and just for interest
@gfoidl gfoidl commented 9 hours ago Do you see a similar 6 vs 7 difference as I do?
Yes (sorry for slow response, was Sunday...). @danmoseley thanks for your numbers.
This is the machine code I get (from BDN) when running @danmoseley's benchmark (.NET 7 only). I left some comments there.
So from a code-layout perspective, one major difference from .NET 6 is that the call to System.Buffer.Memmove is moved out of the hot path. But I doubt that this alone is the cause of the regression.
I also wonder why vpblendvb is gone when using string.Replace in the benchmark from the .NET bits. If I use string.Replace-duplicated code for the benchmark, then it is emitted, which is what I expect, as https://github.com/dotnet/runtime/commit/10d8a36ab669ac95f554e5efcc3c8780b5c50f11 got merged on 2022-05-25. But that shouldn't cause the regression either, as for .NET 6 the same series of vector instructions is emitted.
The beginning of the method, right after the prolog, looks different between .NET 6 and .NET 7, although this PR didn't change anything here. I don't expect that this causes the regression, as with the given input the vectorized loop with 33 iterations should be dominant enough (just my feeling, maybe wrong).
So far the "static analysis", but I doubt this is enough. With Intel VTune I see some results, but with my interpretation the conclusions are just the same as stated in this comment. I hope some JIT experts can shed some light on this (and give some advice on how to investigate, as I'm eager to learn).
Machine code for .NET 6 (for reference)
```asm
; System.String.Replace(Char, Char)
       push      r15
       push      r14
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,28
       vzeroupper
       mov       rsi,rcx
       movzx     edi,dx
       movzx     ebx,r8w
       cmp       edi,ebx
       jne       short M01_L00
       mov       rax,rsi
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L00:
       lea       rbp,[rsi+0C]
       mov       rcx,rbp
       mov       r14d,[rsi+8]
       mov       r8d,r14d
       mov       edx,edi
       call      System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)
       mov       r15d,eax
       test      r15d,r15d
       jge       short M01_L01
       mov       rax,rsi
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L01:
       mov       esi,r14d
       sub       esi,r15d
       mov       ecx,r14d
       call      System.String.FastAllocateString(Int32)
       mov       r14,rax
       test      r15d,r15d
       jle       short M01_L02
       cmp       [r14],r14d
       lea       rcx,[r14+0C]
       mov       rdx,rbp
       mov       r8d,r15d
       add       r8,r8
       call      System.Buffer.Memmove(Byte ByRef, Byte ByRef, UIntPtr)
M01_L02:
       movsxd    rax,r15d
       add       rax,rax
       add       rbp,rax
       cmp       [r14],r14d
       lea       rdx,[r14+0C]
       add       rdx,rax
       cmp       esi,10
       jl        short M01_L04
       imul      eax,edi,10001
       vmovd     xmm0,eax
       vpbroadcastd ymm0,xmm0
       imul      eax,ebx,10001
       vmovd     xmm1,eax
       vpbroadcastd ymm1,xmm1
M01_L03:
       vmovupd   ymm2,[rbp]
       vpcmpeqw  ymm3,ymm2,ymm0
       vpand     ymm4,ymm1,ymm3
       vpandn    ymm2,ymm3,ymm2
       vpor      ymm2,ymm4,ymm2
       vmovupd   [rdx],ymm2
       add       rbp,20
       add       rdx,20
       add       esi,0FFFFFFF0
       cmp       esi,10
       jge       short M01_L03
M01_L04:
       test      esi,esi
       jle       short M01_L08
       nop       word ptr [rax+rax]
M01_L05:
       movzx     eax,word ptr [rbp]
       mov       rcx,rdx
       cmp       eax,edi
       je        short M01_L06
       jmp       short M01_L07
M01_L06:
       mov       eax,ebx
M01_L07:
       mov       [rcx],ax
       add       rbp,2
       add       rdx,2
       dec       esi
       test      esi,esi
       jg        short M01_L05
M01_L08:
       mov       rax,r14
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
; Total bytes of code 307
```
@AndyAyersMS
AndyAyersMS commented 2 hours ago (should we be surprised that disabling PGO didn't make any difference? perhaps it doesn't exercise this method? cc @AndyAyersMS )
Hard to say without looking deeper -- from the .NET 7 code above I would guess PGO is driving the code layout changes.
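For readers following the disassembly: the vectorized loop (M01_L03 in the .NET 6 listing above) is the standard compare-and-blend pattern. In C# terms it is roughly the following sketch; this is a hypothetical helper for illustration, not the actual String.Replace source:

```csharp
using System.Runtime.Intrinsics;

// Sketch of what one iteration of the vector loop computes for a block of
// 16 chars (ushorts): compare against oldChar, then blend in newChar where
// the comparison matched. Hypothetical helper, not the BCL implementation.
static Vector256<ushort> ReplaceBlock(Vector256<ushort> data, ushort oldChar, ushort newChar)
{
    Vector256<ushort> mask = Vector256.Equals(data, Vector256.Create(oldChar)); // vpcmpeqw
    // vpand/vpandn/vpor -- or a single vpblendvb -- computes (mask & new) | (~mask & data)
    return Vector256.ConditionalSelect(mask, Vector256.Create(newChar), data);
}
```

Whether the JIT emits the three-instruction and/andn/or sequence or a single vpblendvb for ConditionalSelect is exactly the codegen difference discussed earlier in the thread; both compute the same blend.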
For .NET 7 you can use DOTNET_JitDisasm in BDN to obtain the JIT disasm, which will tell you whether PGO data was found (at least for the root method).