dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Stabilize performance measurement #43227

Open kunalspathak opened 3 years ago

kunalspathak commented 3 years ago

The code quality and performance of RyuJIT is tracked internally by running MicroBenchmarks in our performance lab. We regularly triage the performance issues opened by the .NET performance team. After going through these issues for the past several months, we have identified some key points.

Stability

Many times, the set of commits flagged as introducing a regression in a benchmark do not touch the code that the benchmark tests. In fact, the assembly code generated for the .NET code under test is often identical, and yet the measurements show differences. Our investigation reveals that the fluctuation in benchmark measurements happens because of misalignment of the generated JIT code in process memory. Below is an example of the LoopReturn benchmark that shows such behavior.

image

It is very time consuming for .NET developers to analyze benchmarks that regressed because of things that are out of the control of the .NET runtime. In the past, we have closed several issues like https://github.com/dotnet/runtime/issues/13770, https://github.com/dotnet/runtime/issues/39721 and https://github.com/dotnet/runtime/issues/39722 because the regressions were caused by code alignment. A great example we found while investigating those issues: the change introduced in https://github.com/dotnet/runtime/pull/38586 eliminated a test instruction and should have shown an improvement in the benchmarks, but it introduced a regression because the code (the loop inside the method) became misaligned and the method ran slower.

Alignment issues have been brought up a few times, in https://github.com/dotnet/runtime/issues/9912 and https://github.com/dotnet/runtime/issues/8108, and this issue tracks the progress toward the goal of stabilizing, and possibly improving, the performance of .NET apps that are heavily affected by code alignment.

Performance lab infrastructure

Once we address the code alignment issue, the next big thing will be to identify and make the required infrastructure changes in our performance lab so that it can easily flag such issues without needing much interaction from .NET developers. For example, https://github.com/dotnet/BenchmarkDotNet/issues/1513 proposes randomizing memory alignment in benchmark runs to catch these issues early; once we address the underlying problem in .NET, we should no longer see bimodal behavior in those benchmarks. After that, if the performance lab does find a regression in a benchmark, we need robust tooling to extract useful metrics from performance runs so that a developer investigating it can easily identify the cause of the regression, for example the time spent in various phases of the .NET runtime such as jitting, the JIT interface, Tier0/Tier1 JIT code, hot methods, instructions retired during benchmark execution, and so forth.
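For illustration, here is a rough conceptual sketch of the randomized-alignment idea (this is not BenchmarkDotNet's actual implementation, and the helper name is made up): give the benchmark's working buffer a random starting offset on each run so that alignment-sensitive bimodality shows up consistently instead of only on unlucky runs.

```csharp
// Conceptual sketch only: perturb the buffer's starting alignment by a random
// offset so alignment-sensitive benchmarks expose their bimodality early
// rather than by chance.
static Span<byte> AllocateWithRandomOffset(int size, Random random)
{
    const int MaxOffset = 64;                        // up to one cache line of slack
    int offset = random.Next(0, MaxOffset / 8) * 8;  // random multiple of 8 bytes
    byte[] backing = new byte[size + MaxOffset];     // over-allocate to leave room for the offset
    return backing.AsSpan(offset, size);
}
```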

Reliable benchmarks collection

Lastly, for developers working on the JIT, we want to identify a set of benchmarks that are stable enough to be trusted to give reliable measurements whenever we need to verify the performance of changes to the JIT codebase. This will help us conduct performance testing ahead of time and identify potential regressions, rather than waiting for them to show up in the performance lab.

Here is the set of work items that we have identified to achieve all of the above:

Code alignment work

Future work

Performance tooling work

kunalspathak commented 3 years ago

@dotnet/jit-contrib , @danmosemsft

danmoseley commented 3 years ago

cc @DrewScoggins @billwert @adamsitnik

danmoseley commented 3 years ago

To state the obvious (I think) -- I believe we have good data to suggest that alignment is the dominant reason for bimodality. I'm not sure though that we can be sure there aren't other common causes for bimodality -- my assumption is that we'll find out how much is left when you've completed some of this work.

kunalspathak commented 3 years ago

To state the obvious (I think) -- I believe we have good data to suggest that alignment is the dominant reason for bimodality. I'm not sure though that we can be sure there aren't other common causes for bimodality -- my assumption is that we'll find out how much is left when you've completed some of this work.

Yes, there will definitely be more reasons for bimodality, but alignment will fix most of the obvious ones that we know of, and then it will be easier for us to focus on the remaining ones. Currently, there are just too many bimodal benchmarks.

JulieLeeMSFT commented 3 years ago

Bolded the WIP items in the plan.

kunalspathak commented 3 years ago

With #44370, we will start aligning inner loops at a 32-byte boundary with padding of up to 16 bytes.
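Roughly, the padding rule looks like this (a simplified sketch with a hypothetical helper, not the JIT's actual code):

```csharp
// Pad so the loop head lands on a 32-byte boundary, but only when that costs
// at most 16 bytes of padding; otherwise emit no padding at all.
static int ComputeLoopPadding(int loopHeadOffset, int boundary = 32, int maxPadding = 16)
{
    int padding = (boundary - (loopHeadOffset % boundary)) % boundary;
    return padding <= maxPadding ? padding : 0;
}
```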

The code size impact on .NET libraries (from PMI runs):

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 51622033
Total bytes of diff: 51639087
Total bytes of delta: 17054 (0.03% of base)
    diff is a regression.
Summary of Allocation Size diffs:
(Lower is better)

Total bytes of base: 51940550
Total bytes of diff: 51957619
Total bytes of delta: 17069 (0.03% of base)
    diff is a regression.

Details : https://github.com/dotnet/runtime/pull/44370#issuecomment-749263096

Out of 380K+ methods in the .NET libraries (using PMI), based on the loop alignment heuristics, we will align 1766 loops in 1515 methods. Details: https://github.com/dotnet/runtime/pull/44370#issuecomment-758133875

danmoseley commented 3 years ago

@DrewScoggins it will be interesting to see whether you start seeing less noise tomorrow.

kunalspathak commented 3 years ago

Below is from the weekly comparison report:

Linux x64:

image

https://pvscmdupload.blob.core.windows.net/reports/01_18_2021/report_Weekly_ca=x64_cb=refs-heads-master_co=Ubuntu1804_cr=dotnetruntime_cc=CompliationMode=tiered-RunKind=micro_2021-01-18.html#Improvements

Windows x64:

image

https://pvscmdupload.blob.core.windows.net/reports/01_18_2021/report_Weekly_ca=x64_cb=refs-heads-master_co=Windows1018362_cr=dotnetruntime_cc=CompliationMode=tiered-RunKind=micro_2021-01-18.html

Windows x86:

image

https://pvscmdupload.blob.core.windows.net/reports/01_18_2021/report_Weekly_ca=x86_cb=refs-heads-master_co=Windows1018362_cr=dotnetruntime_cc=CompliationMode=tiered-RunKind=micro_2021-01-18.html

Some of the improvements above come from @stephentoub's https://github.com/dotnet/runtime/pull/46895. The next thing I will do is analyze the regressions.

adamsitnik commented 3 years ago

@kunalspathak from the reports that you have provided, the following benchmarks look fishy to me:

# Linux x64
System.Memory.Span<Int32>.SequenceEqual(Size: 512)
System.Memory.Span<Int32>.EndsWith(Size: 512)
Span.IndexerBench.KnownSizeCtor2(length: 1024)
System.Memory.Span<Byte>.SequenceEqual(Size: 512)
System.MathBenchmarks.Single.Max // this can be a benchmark regression, but not actual product code regression
Benchstone.BenchI.LogicArray.Test
# Windows x64
System.Text.Json.Tests.Utf8JsonReaderCommentsTests.Utf8JsonReaderCommentParsing(CommentHandling: Skip, SegmentSize: 0, TestCase: LongMultiLine)
System.Collections.ContainsFalse<String>.ICollection(Size: 512)  // this can be a benchmark regression, but not actual product code regression
# Windows x86
System.Collections.ContainsKeyFalse<String, String>.Dictionary(Size: 512)
PerfLabTests.CastingPerf.CheckObjIsInterfaceYes // this can be a benchmark regression, but not actual product code regression
PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceNo // this can be a benchmark regression, but not actual product code regression
PerfLabTests.CastingPerf.CheckObjIsInterfaceNo // this can be a benchmark regression, but not actual product code regression

but there are many more improvements than regressions!

danmoseley commented 3 years ago

It's interesting that some of the regressing scenarios have also started becoming noisy, e.g. System.MathBenchmarks.Single.Max.

kunalspathak commented 3 years ago

Thanks @adamsitnik. Could you please explain the data points that make you think "this can be a benchmark regression, but not actual product code regression"? It will help me investigate these benchmarks more quickly. Here are a couple of benchmarks that I investigated yesterday:

System.Text.Json.Tests.Utf8JsonReaderCommentsTests.Utf8JsonReaderCommentParsing(CommentHandling: Skip, SegmentSize: 0, TestCase: LongMultiLine)

This looks like a real regression because of the execution of the align instruction inside JsonReaderHelper::CountNewLines.

image

It was improved in November after the https://github.com/dotnet/runtime/pull/42909 change, but with the loop alignment changes, it went back to the higher end. In the screenshot below:

  1. is the method alignment change
  2. is the loop alignment change.

image

From VTune:

image

I am not sure how much weight VTune assigns to the nop execution, but my guess is that it contributes to the slowness, given that CountNewLines is the hottest method.

image

System.Collections.ContainsFalse.ICollection(Size: 512)

I was not able to reproduce this on my machine. I also compared the disassembly, and we do not add an align instruction in the benchmark code or in the methods being tested. My initial guess is that it could be because of data alignment, since we allocate an array during setup.

Here is the history of that test, which confirms the bimodality.

image

If there was any code alignment issue (which I will dig into more later today), the loop alignment would have addressed it and the benchmark would have settled at the slower numbers, but data alignment in future runs might change that and we might again see the bimodality.

image

@danmosemsft - Thanks for pointing it out. We noticed it as well, and it is interesting because it shouldn't be affected by data alignment. I will investigate further.

adamsitnik commented 3 years ago

Could you please explain the data point that makes you think "this can be a benchmark regression, but not actual product code regression"

Sure! Some of these benchmarks have loops inside them (while they should not), like the casting benchmark:

[Benchmark]
public bool CheckObjIsInterfaceYes()
{
    bool res = false;
    for (int i = 0; i < InnerIterationCount; i++)
        res = myClass1 is IMyInterface1;
    return res;
}

By "this can be a benchmark regression, but not actual product code regression" I meant that the part of the product that the benchmark is trying to measure (casting, in this example) has not regressed, but it's exercised in a loop, and that is what shows up as a benchmark regression.
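For comparison, a loop-free version of that benchmark (just a sketch; BenchmarkDotNet drives the iterations itself, so no manual loop is needed) would isolate the cast being measured:

```csharp
[Benchmark]
public bool CheckObjIsInterfaceYes() => myClass1 is IMyInterface1;
```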

kunalspathak commented 3 years ago

Some of these benchmarks have loops inside them (while they should not)

Do we have a plan to fix them?

Also, we noticed the following numbers for BubbleSort2. The benchmark is less bimodal after loop alignment, but the remaining swing might be because of data alignment, since we create an array inside the benchmark.

https://github.com/dotnet/performance/blob/a25a963e3f14214102207a2b75a5ece25162a333/src/benchmarks/micro/runtime/Benchstones/BenchI/BubbleSort2.cs#L32

Here is the overall history:

image

and then zoomed in to the loop alignment impact:

image

kunalspathak commented 3 years ago

I went through most of the regressions in https://github.com/dotnet/runtime/issues/43227#issuecomment-765684981 and here is my analysis.

Summary

To conclude, I did not see any surprises in my analysis, and we would expect these results from loop alignment. I am working with @DrewScoggins to get a similar report on how much the "loop alignment" work improved stability.

Benchmark's analysis

CastingPerf2.CastingPerf.ObjO

We do not do any loop alignment in the benchmark code because the loop has calls. Historically, the test is very flaky.

image

KnownSizeCtor2

I am not sure why the benchmark calls into a test method that contains the actual benchmark code. But @adamsitnik is right about this benchmark. We add 4 bytes of NOP compensation during the prolog and then 4 bytes of alignment before the 2nd loop, but none of the alignment happens for the actual library code (new Span()) that the benchmark is testing. So the regression might be from the extra 8 bytes of NOP/align instructions, but it shouldn't affect the library code. Perhaps we should change this benchmark to not have a loop and just have the benchmark code in the method annotated with [Benchmark].

JsonNet_

These might be real regressions, although from VTune profiling it was not that obvious. However, the key observation was that we added align instructions in some of the hot library methods, and by doing "advanced padding" we should be able to win back some of the regressions. For instance, in the assembly code for SpanHelpers.Contains(), we added an align instruction in a block that had a weight of 4, although there were colder blocks before it where we could have added this align instruction.

SpanHelpers.Contains assembly code:

```asm
; Assembly listing for method System.SpanHelpers:Contains(byref,ushort,int):bool
; Emitting BLENDED_CODE for generic X86 CPU - Windows
;
; Lcl frame size = 40

G_M35514_IG01:        ;; offset=0000H
    55                push     ebp
    8BEC              mov      ebp, esp
    57                push     edi
    56                push     esi
    53                push     ebx
    83EC28            sub      esp, 40
    C5F877            vzeroupper
    33C0              xor      eax, eax
    8945F0            mov      dword ptr [ebp-10H], eax
    8BD9              mov      ebx, ecx
    8BFA              mov      edi, edx
    8B7508            mov      esi, dword ptr [ebp+08H]
                      ;; bbWeight=1 PerfScore 8.25
G_M35514_IG02:        ;; offset=0018H
    8BCE              mov      ecx, esi
    F7D1              not      ecx
    C1E91F            shr      ecx, 31
    8B153820A306      mov      edx, gword ptr [06A32038H] ''
    85C9              test     ecx, ecx
    7507              jne      SHORT G_M35514_IG04
                      ;; bbWeight=1 PerfScore 5.75
G_M35514_IG03:        ;; offset=0029H
    8BCA              mov      ecx, edx
    E858598CF8        call     System.Diagnostics.Debug:Fail(System.String,System.String)
                      ;; bbWeight=0.25 PerfScore 0.31
G_M35514_IG04:        ;; offset=0030H
    895DF0            mov      bword ptr [ebp-10H], ebx
    8D0473            lea      eax, [ebx+2*esi]
    83FE20            cmp      esi, 32
    7C55              jl       SHORT G_M35514_IG07
                      ;; bbWeight=1 PerfScore 2.75
G_M35514_IG05:        ;; offset=003BH
    8BCB              mov      ecx, ebx
    83E11F            and      ecx, 31
    8BD1              mov      edx, ecx
    C1EA1F            shr      edx, 31
    03CA              add      ecx, edx
    D1F9              sar      ecx, 1
    8BF1              mov      esi, ecx
    F7DE              neg      esi
    83C610            add      esi, 16
    83E60F            and      esi, 15
    83FE04            cmp      esi, 4
    7C3D              jl       SHORT G_M35514_IG08
                      ;; bbWeight=0.50 PerfScore 2.88
G_M35514_IG06:        ;; offset=0058H
    83C6FC            add      esi, -4
    0FB7D7            movzx    edx, di
    0FB70B            movzx    ecx, word ptr [ebx]
    3BD1              cmp      edx, ecx
    0F84FE000000      je       G_M35514_IG17
    0FB74B02          movzx    ecx, word ptr [ebx+2]
    3BD1              cmp      edx, ecx
    0F84F2000000      je       G_M35514_IG17
    0FB74B04          movzx    ecx, word ptr [ebx+4]
    3BD1              cmp      edx, ecx
    0F84E6000000      je       G_M35514_IG17
    0FB74B06          movzx    ecx, word ptr [ebx+6]
    3BD1              cmp      edx, ecx
    0F84DA000000      je       G_M35514_IG17
    83C308            add      ebx, 8
                      ;; bbWeight=2 PerfScore 27.50
G_M35514_IG07:        ;; offset=0090H
    83FE04            cmp      esi, 4
    7DC3              jge      SHORT G_M35514_IG06
                      ;; bbWeight=16 PerfScore 20.00
G_M35514_IG08:        ;; offset=0095H
    85F6              test     esi, esi
    7E20              jle      SHORT G_M35514_IG10
    0FB7D7            movzx    edx, di
    90909090          align
                      ;; bbWeight=4 PerfScore 7.00
```

For this particular benchmark, I gathered all the align instructions we added in the various methods that were JITted during the execution of the benchmark. Note that some of them might not contribute to the benchmark measurement.

| Method Name | Align bytes | Block weight |
| --- | --- | --- |
| System.SpanHelpers:SequenceCompareTo(byref,int,byref,int):int | 4 | 0.5 |
| System.SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,int):int | 3 | 0.5 |
| System.SpanHelpers:Contains(byref,ushort,int):bool | 4 | 4 |
| Newtonsoft.Json.DefaultJsonNameTable:Add(System.String):System.String:this | 7 | 0.5 |
| System.SpanHelpers:IndexOf(byref,ubyte,int):int | 5 | 2 |
| Newtonsoft.Json.Utilities.JavaScriptUtils:ShouldEscapeJavaScriptString(System.String,System.Boolean[]):bool | 3 | 0.5 |
| System.SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int | 12 | 0.5 |
| Newtonsoft.Json.DefaultJsonNameTable:Add(System.String):System.String:this | 7 | 0.5 |

CheckObjIsInterfaceYes

This one was a little surprising, as the benchmark is straightforward. We do not align the loop inside the benchmark because of the presence of a call, so I am not sure why it would regress.

public bool CheckObjIsInterfaceYes()
{
    bool res = false;
    for (int i = 0; i < InnerIterationCount; i++)
        res = myfoo is IFoo;
    return res;
}

Single.Max

Like the previous one, this too has a loop and, as @adamsitnik pointed out, although the benchmark regresses, the product code doesn't. In this benchmark, we disable the VEX prefix encoding optimization for 2 instructions that are part of the benchmark code, and then we add a 1-byte align instruction; the regression is coming from that.

image

image

LogicArray

As I mentioned earlier, this is a typical array benchmark that contains allocation (and hence data alignment issues). In this benchmark, we align the inner loop, but there is an enclosing loop that ends up executing the padding added for alignment.

image

image

HeapSort

Here too, we align the inner loop, but the padding instructions get executed by the outer loop, which might be why we see the regression.

image

image

BenchJagged

We should just delete this benchmark or trim it. It does array allocation inside the benchmark code, and the measurements can be prone to GC behavior, data alignment, etc. I didn't deep dive into this one beyond looking at its historical data.

image

Bubble sort

Finally, here is my favorite. This one too has data alignment issues, but after fixing the code alignment, we see less swing in the measurements. In the graph below, the first drop, around October, came after we started aligning methods at a 32-byte boundary in https://github.com/dotnet/runtime/pull/42909, and the 2nd drop came after "loop alignment".

image

Zoomed to "loop alignment" change:

image

All this happens because we align the inner loop present in BubbleSort. Note that we have added 5 bytes of padding in IG03 to align the loop. However, IG03 has weight=4 because it is part of the outer loop. We could get more performance benefit had we added the padding in IG01 or IG02, which have weight=1.

Assembly code of Inner() method:

```asm
; Assembly listing for method MiniBench.BubbleSort2:Inner(System.Int32[])
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
;
; Lcl frame size = 32

G_M27289_IG01:        ;; offset=0000H
    56                push     rsi
    4883EC20          sub      rsp, 32
                      ;; bbWeight=1 PerfScore 1.25
G_M27289_IG02:        ;; offset=0005H
    B801000000        mov      eax, 1
                      ;; bbWeight=1 PerfScore 0.25
G_M27289_IG03:        ;; offset=000AH
    8BD0              mov      edx, eax
    81FA4C1D0000      cmp      edx, 0x1D4C
    7F3F              jg       SHORT G_M27289_IG07
    448B4108          mov      r8d, dword ptr [rcx+8]
    4C63C8            movsxd   r9, eax
    0F1F440000        align
                      ;; bbWeight=4 PerfScore 16.00
G_M27289_IG04:        ;; offset=0020H
    413BC0            cmp      eax, r8d
    733D              jae      SHORT G_M27289_IG09
    468B548910        mov      r10d, dword ptr [rcx+4*r9+16]
    413BD0            cmp      edx, r8d
    7333              jae      SHORT G_M27289_IG09
    4C63DA            movsxd   r11, edx
    468B5C9910        mov      r11d, dword ptr [rcx+4*r11+16]
    453BD3            cmp      r10d, r11d
    7E0D              jle      SHORT G_M27289_IG06
                      ;; bbWeight=16 PerfScore 128.00
G_M27289_IG05:        ;; offset=003CH
    4863F2            movsxd   rsi, edx
    448954B110        mov      dword ptr [rcx+4*rsi+16], r10d
    46895C8910        mov      dword ptr [rcx+4*r9+16], r11d
                      ;; bbWeight=8 PerfScore 18.00
G_M27289_IG06:        ;; offset=0049H
    FFC2              inc      edx
    81FA4C1D0000      cmp      edx, 0x1D4C
    7ECD              jle      SHORT G_M27289_IG04
                      ;; bbWeight=16 PerfScore 24.00
G_M27289_IG07:        ;; offset=0053H
    FFC0              inc      eax
    3D4B1D0000        cmp      eax, 0x1D4B
    7EAE              jle      SHORT G_M27289_IG03
                      ;; bbWeight=4 PerfScore 6.00
G_M27289_IG08:        ;; offset=005CH
    4883C420          add      rsp, 32
    5E                pop      rsi
    C3                ret
                      ;; bbWeight=1 PerfScore 1.75
G_M27289_IG09:        ;; offset=0062H
    E8D9E8095F        call     CORINFO_HELP_RNGCHKFAIL
    CC                int3
                      ;; bbWeight=0 PerfScore 0.00
```

Thanks to @adamsitnik for helping me do a faster analysis by pointing out the presence of loops inside the benchmark code.

danmoseley commented 3 years ago

Nice analysis @kunalspathak. There were 2 or 3 TODOs in your summary above; do they need issues and a listing in the tracking list at the top of this issue? I am happy to see this continuing progress.

As an aside, is there more that BDN can do to help regularize data alignment when the buffer is allocated by the test and reused? Presumably, with some tricks involving padding in a struct and some unsafe code, one can arrange for an array to have a certain alignment. Maybe BDN could handle such ugly tricks and expose an API for a test to allocate a nicely aligned array. @adamsitnik
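Something along these lines, perhaps (a rough sketch of the over-allocate-and-slice trick; AllocateAligned is a made-up name, not an actual BDN API):

```csharp
// Rough sketch: over-allocate a pinned array and hand back a span whose first
// element sits on the requested byte boundary. Requires an unsafe context.
public static unsafe Span<int> AllocateAligned(int length, int alignment = 32)
{
    int extra = alignment / sizeof(int);
    int[] backing = GC.AllocateUninitializedArray<int>(length + extra, pinned: true);
    fixed (int* p = backing)
    {
        int misalignment = (int)((nuint)p % (nuint)alignment);
        int skip = misalignment == 0 ? 0 : (alignment - misalignment) / sizeof(int);
        return backing.AsSpan(skip, length);
    }
}
```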

kunalspathak commented 3 years ago

Another benchmark for which I would like to highlight the impact on stability and performance is LoopReturn:

Ubuntu x64

image

Source: https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmaster_x64_ubuntu%2018.04%2fLayout.SearchLoops.LoopReturn.html

Windows x64

image

Source: https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmaster_x64_Windows%2010.0.18362%2fLayout.SearchLoops.LoopReturn.html

kunalspathak commented 3 years ago

There were 2 or 3 todo's in your summary above, do they need issues, and listing in the tracking list at the top of this issue?

Some of them were tracked in this issue's description. I have updated it to include my other findings.

danmoseley commented 3 years ago

As an aside, is there more than BDN can do to help regularize data alignment, when the buffer is allocated by the test and reused?

Never mind, I forgot about https://github.com/dotnet/performance/pull/1587 etc. We're attacking both sides. 😊

kunalspathak commented 3 years ago

Another round of analysis.

Stability

I went through the stability data for Windows x64 and Linux x64 that @DrewScoggins shared with me. In this report, "regression" means that loop alignment introduced more variance to the measurements and thus made the benchmark unstable, while "improvement" means that the loop alignment work actually stabilized the benchmark and reduced the variance. I analyzed around 150 benchmarks to see if the regressions were real by checking the historical data of Windows x64 and Linux x64 for each of the 150 benchmarks. If a benchmark had a lot of variance historically, I concluded that loop alignment didn't add to the variance. There were a few exceptions, like Perf_Enumerable.Count and Single.Max, where we saw some variance post-loop alignment, but other than that, most of the regressions pointed out in the report were not related to the loop alignment work.

Performance

Here are some of the takeaway points:

FannkuchRedux_9

image

The regression happens because of the extra padding we added to align some of the loops inside the benchmark code. Asmdiffs: https://www.diffchecker.com/ryZqqO6I.

MulMatrix

image

The regression happens because of the addition of 5 "NOP overestimate compensation" paddings in hot blocks. The overestimation happened for jump instructions. The way to address it is by combining branch tightening with the loop alignment adjustments (already planned work). Asmdiffs: https://www.diffchecker.com/IcktyNPi

TryGetUInt16:

image

For this benchmark, TryParseUInt16D() is the hot method, as seen in the VTune screenshot:

image image

With loop alignment, we ended up adding padding at the end of the hot block that VTune points to above (Block 2). Asmdiffs: https://www.diffchecker.com/jHt5pv99

GetTypeCode

image

Although this shows a regression, I verified that there is no loop inside the benchmark, nor in the library method that it tests. Further, I also compared the asm diffs, and there is no "align" instruction.

ContainsTrue/ContainsFalse

image

image

image

This one is interesting. The hot method for all these benchmarks is GenericEqualityComparer`1[Canon][System.Canon]:IndexOf. In this method, we clone a loop but later decide not to align it because it has a call. However, we do not record that fact on the cloned loop, because of which we end up aligning the cloned loop containing the call. Additionally, until we emit the padding in the cloned loop, we also have to compensate for the over-estimated instructions. One of the places where such an over-estimated instruction is compensated happens to be the actual hot loop that the method executes. You can see my detailed analysis in https://github.com/dotnet/runtime/issues/43713#issuecomment-772781240.

JulieLeeMSFT commented 3 years ago

Good analysis showing that the loop alignment work didn't introduce regressions.

What is a cloned loop in the loop table?

Thanks, Julie



kunalspathak commented 3 years ago

What is a cloned loop in the loop table? Thanks, Julie

Sometimes, if a loop has checks like an array's bounds check, we will clone the loop. One variant of the loop will have no bounds checks (and hence will be a fast loop), while the other variant will have bounds checks (and will be a slow loop). Depending on the input, we decide whether to execute the fast loop or the cloned (slow) loop. You can see an example here. Here, the fast loop is L003e ~ L0058 while the cloned slow loop is L005c ~ L0080. During JIT, we track all the loops in a data structure called optLoopTable. Whenever we want to do any analysis on loops, we iterate over this table to decide whether a given loop needs any update/optimization. When we clone a loop, we do not insert an entry for the cloned loop into optLoopTable, because of which further analysis doesn't happen on the cloned loop. Towards the end of my comment in https://github.com/dotnet/runtime/issues/43713#issuecomment-772781240, I have listed what kind of analysis we miss out on for cloned loops because of the lack of an entry in optLoopTable. A conceptual C# illustration is below.
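Conceptually, cloning turns one loop into a guarded fast/slow pair, something like this hand-written C# equivalent (an illustration only, not what the JIT literally emits):

```csharp
static int Sum(int[] arr, int count)
{
    int sum = 0;
    if (arr != null && (uint)count <= (uint)arr.Length)
    {
        // fast variant: the guard proves every access is in range,
        // so the per-iteration bounds checks can be dropped
        for (int i = 0; i < count; i++)
            sum += arr[i];
    }
    else
    {
        // slow variant: bounds checks remain, and an out-of-range
        // access throws IndexOutOfRangeException as usual
        for (int i = 0; i < count; i++)
            sum += arr[i];
    }
    return sum;
}
```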

JulieLeeMSFT commented 3 years ago

Moved to Future, since the .NET 6 scope items are complete now.

MithrilMan commented 2 years ago

I read your article about loop alignment here: https://devblogs.microsoft.com/dotnet/loop-alignment-in-net-6/ and it mentions that a loop like this

for (int l = 0; l < M; l++) {
    // body
    OtherMethod();
}

isn't a candidate for alignment because it contains a call to a method. What about methods that are inlined? Shouldn't an inlined method be considered part of the loop body and be eligible for alignment if its body is small enough?

kunalspathak commented 2 years ago

What about methods that are inlined?

Inlining decisions happen in an early phase of compilation, so by the time we decide whether or not to align a loop, we already know whether the call inside it was inlined. We won't align the loop if we know for sure that the method will not be inlined.
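For example (an illustrative sketch; OtherMethod here is just a stand-in small callee): once the callee is inlined, the loop body no longer contains a call, so the loop can be considered for alignment; if the call survives, the loop is skipped.

```csharp
using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static int OtherMethod(int x) => x * 2;     // small enough to be inlined

static int Sum(int M)
{
    int total = 0;
    for (int l = 0; l < M; l++)
        total += OtherMethod(l);            // after inlining, no call remains in the loop body
    return total;
}
```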