Open kunalspathak opened 3 years ago
@dotnet/jit-contrib , @danmosemsft
cc @DrewScoggins @billwert @adamsitnik
To state the obvious (I think) -- I believe we have good data to suggest that alignment is the dominant reason for bimodality. I'm not sure though that we can be sure there aren't other common causes for bimodality -- my assumption is that we'll find out how much is left when you've completed some of this work.
Yes, there will definitely be more reasons for bimodality, but alignment will fix most of the obvious ones that we know of, and then it will be easier for us to focus on the remaining ones. Currently, there are just too many bimodal benchmarks.
Bolded the WIP items in the plan.
With #44370, we will start aligning inner loops at a 32-byte boundary, with padding of up to 16 bytes.
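To make the rule concrete, here is a simplified sketch (`PaddingFor` is a hypothetical helper, not the JIT's actual code):

```csharp
// Simplified sketch: align a loop head to the next 32-byte boundary only
// when doing so costs at most 16 bytes of padding; otherwise emit nothing.
static int PaddingFor(int loopHeadOffset, int boundary = 32, int maxPadding = 16)
{
    int padding = (boundary - (loopHeadOffset % boundary)) % boundary;
    return padding <= maxPadding ? padding : 0;
}
```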
The code size impact on .NET libraries (from PMI runs):
Summary of Code Size diffs:
(Lower is better)
Total bytes of base: 51622033
Total bytes of diff: 51639087
Total bytes of delta: 17054 (0.03% of base)
diff is a regression.
Summary of Allocation Size diffs:
(Lower is better)
Total bytes of base: 51940550
Total bytes of diff: 51957619
Total bytes of delta: 17069 (0.03% of base)
diff is a regression.
Details: https://github.com/dotnet/runtime/pull/44370#issuecomment-749263096
Out of the 380K+ methods in the .NET libraries (using PMI), the loop alignment heuristics will align 1766 loops in 1515 methods. Details: https://github.com/dotnet/runtime/pull/44370#issuecomment-758133875
@DrewScoggins it will be interesting to see whether you start seeing less noise tomorrow.
Below is from the weekly comparison report:
Linux x64:
Windows x64:
Windows x86:
Some of the improvements above come from @stephentoub's https://github.com/dotnet/runtime/pull/46895. The next thing I will do is analyze the regressions.
@kunalspathak from the reports that you have provided, the following benchmarks look fishy to me:
# Linux x64
System.Memory.Span<Int32>.SequenceEqual(Size: 512)
System.Memory.Span<Int32>.EndsWith(Size: 512)
Span.IndexerBench.KnownSizeCtor2(length: 1024)
System.Memory.Span<Byte>.SequenceEqual(Size: 512)
System.MathBenchmarks.Single.Max // this can be a benchmark regression, but not actual product code regression
Benchstone.BenchI.LogicArray.Test
# Windows x64
System.Text.Json.Tests.Utf8JsonReaderCommentsTests.Utf8JsonReaderCommentParsing(CommentHandling: Skip, SegmentSize: 0, TestCase: LongMultiLine)
System.Collections.ContainsFalse<String>.ICollection(Size: 512) // this can be a benchmark regression, but not actual product code regression
# Windows x86
System.Collections.ContainsKeyFalse<String, String>.Dictionary(Size: 512)
PerfLabTests.CastingPerf.CheckObjIsInterfaceYes // this can be a benchmark regression, but not actual product code regression
PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceNo // this can be a benchmark regression, but not actual product code regression
PerfLabTests.CastingPerf.CheckObjIsInterfaceNo // this can be a benchmark regression, but not actual product code regression
but there are many more improvements than regressions!
It's interesting that some of the regressing scenarios have also started becoming noisy, e.g. System.MathBenchmarks.Single.Max.
Thanks @adamsitnik. Could you please explain the data point that makes you think "this can be a benchmark regression, but not actual product code regression"? It will help me investigate these benchmarks more quickly. Here are a couple of benchmarks that I investigated yesterday:
This looks like a real regression because of the execution of the `align` instruction inside `JsonReaderHelper::CountNewLines`.
It improved in November after the https://github.com/dotnet/runtime/pull/42909 change, but with the loop alignment changes, it moved to the higher end. In the screenshot below:
From VTune:
I am not sure how much VTune attributes to the `nop` execution, but my guess is that it contributes to the slowness, given that `CountNewLines` is the hottest method.
I was not able to reproduce this on my machine. I also compared the disassembly, and we don't add an `align` instruction in the benchmark code or the methods that we are testing. My initial guess is that it could be because of data alignment, since we allocate an array during setup.
Here is the history of that test, which confirms the bimodality.
If there was a code alignment issue (which I will dig into more later today), loop alignment would have addressed it and settled the benchmark at the slower numbers; but data alignment in future runs might change that, and we might see the bimodality again.
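For illustration, a hypothetical probe (not part of the benchmark) showing why allocation alone can cause run-to-run swings:

```csharp
using System;

class AlignmentProbe
{
    // Requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks> in the project.
    static unsafe void Main()
    {
        // A freshly allocated array's start address modulo 32 can differ on
        // every process launch, changing how the hot data straddles 32-byte
        // boundaries even when the generated code is identical.
        int[] data = new int[512];
        fixed (int* p = data)
        {
            Console.WriteLine($"data address % 32 = {(nint)p % 32}");
        }
    }
}
```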
@danmosemsft - Thanks for pointing it out. We noticed it too, and it is interesting because it shouldn't be affected by data alignment. I will investigate further.
Could you please explain the data point that makes you think "this can be a benchmark regression, but not actual product code regression"
Sure! Some of these benchmarks have loops inside them (while they should not) like the casting benchmark:
```csharp
[Benchmark]
public bool CheckObjIsInterfaceYes()
{
    bool res = false;
    for (int i = 0; i < InnerIterationCount; i++)
        res = myClass1 is IMyInterface1;
    return res;
}
```
By "this can be a benchmark regression, but not actual product code regression" I meant that the part of the product that the benchmark is trying to measure has not regressed (casting in this example), but it's exercised in a loop, which shows benchmark regression.
Some of these benchmarks have loops inside them (while they should not)
Do we have a plan to fix them?
Also, we noticed the following numbers for bubblesort2. The benchmark is less bimodal after loop alignment, but the remaining swing might be because of data alignment, since we create an array inside the benchmark.
Here is the overall history:
and then zoomed in to the loop alignment impact:
I went through most of the regressions in https://github.com/dotnet/runtime/issues/43227#issuecomment-765684981 and here is my analysis.
As @adamsitnik pointed out, much of the benchmark code contains loops, and the improvements/regressions were mostly because those loops got aligned. The disassembly of the library methods being tested was identical (with some exceptions, see below). The best example of this is `MathBenchmarks.Single.Max`. @danmosemsft
There were benchmarks that are heavily impacted by data alignment because allocation happens in the benchmark code itself. Most of the array benchmarks fall in this category. So their stability might not have been impacted by loop alignment, or in certain cases the numbers were more stable than before, but the remaining bimodality is because of data alignment, e.g. bubble sort (see below).
For x86, we should improve the encoding we use for multi-byte `NOP` instructions. Today, we just output repeated single-byte `90`, but we could do better, like we do for x64.
Lastly, there are places I have noticed where we emit the `align` instruction in blocks that are expensive. We should find a way to instead put it either after a `jmp` or `ret` instruction that comes before the `align` instruction (so the `NOP`s are never fetched and decoded), or add it in one of the preceding blocks that has lower weight. We already have an "Advanced padding" work item to do the part of squeezing pads after `jmp` or `ret` (wherever possible), but sometimes we might not find such instructions that provide a blind spot in which to insert the `align` instruction. In such cases, we should make a best effort to insert the `align` in a block that has lower weight than the one that immediately precedes the loop block. The most prominent reason to do this work is that currently many hot SpanHelpers APIs like `IndexOf`, `IndexOfAny`, `Contains` and `SequenceCompareTo` include an `align` instruction that is greater than 4 bytes. Placing the `align` instruction appropriately should hopefully improve these methods.
To conclude, I did not see any surprises in my analysis; these are the results we would expect from loop alignment. I am working with @DrewScoggins to get a similar report on how much "loop alignment" improved stability.
We do not do any loop alignment in the benchmark code because the loop has calls. Historically, this test has been very flaky.
I am not sure why the benchmark calls into a test method that contains the actual benchmark code, but @adamsitnik is right about this benchmark. We add 4 bytes of NOP compensation during the prolog and then 4 bytes of alignment before the 2nd loop, but none of the alignment happens for the actual library code (`new Span()`) that the benchmark is testing. So the regression might come from the extra 8 bytes of NOP/align instructions, but it shouldn't affect the library code. Perhaps we should change this benchmark to not have a loop and just have the benchmark code in the method annotated with `[Benchmark]`.
These might be real regressions, although from VTune profiling it was not obvious. The key observation is that we added the `align` instruction in some of the hot library methods, and by doing "advanced padding" we should be able to win back some of the regressions. For instance, in the assembly code for SpanHelpers.Contains(), we added the `align` instruction in a block that had a weight of 4, although there were colder blocks before it where we could have added it.
For this particular benchmark, I gathered all the `align` instructions we added in the various methods that were JITted during the execution of the benchmark. Note that some of them might not contribute to the benchmark measurement.
| Method Name | Align bytes | Block weight |
|---|---|---|
| System.SpanHelpers:SequenceCompareTo(byref,int,byref,int):int | 4 | 0.5 |
| System.SpanHelpers:IndexOfAny(byref,ushort,ushort,ushort,int):int | 3 | 0.5 |
| System.SpanHelpers:Contains(byref,ushort,int):bool | 4 | 4 |
| Newtonsoft.Json.DefaultJsonNameTable:Add(System.String):System.String:this | 7 | 0.5 |
| System.SpanHelpers:IndexOf(byref,ubyte,int):int | 5 | 2 |
| Newtonsoft.Json.Utilities.JavaScriptUtils:ShouldEscapeJavaScriptString(System.String,System.Boolean[]):bool | 3 | 0.5 |
| System.SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int | 12 | 0.5 |
| Newtonsoft.Json.DefaultJsonNameTable:Add(System.String):System.String:this | 7 | 0.5 |
This one was a little surprising, as the benchmark is straightforward. We do not align the loop inside the benchmark because of the presence of a call, so I am not sure why it would regress.
```csharp
public bool CheckObjIsInterfaceYes()
{
    bool res = false;
    for (int i = 0; i < InnerIterationCount; i++)
        res = myfoo is IFoo;
    return res;
}
```
Like the previous one, this too has a loop, and as @adamsitnik pointed out, although the benchmark regresses, the product code doesn't. In this benchmark, we disable the VEX prefix encoding optimization for 2 instructions that are part of the benchmark code, and then we add a 1-byte align instruction. So the regression comes from that.
As I mentioned earlier, this is a typical array benchmark that contains allocation (and hence data alignment issues). In this benchmark we align the inner loop, but there is an enclosing loop that ends up executing the padding added for alignment.
Here too, we "align" the inner loop, but those padding instructions get executed by the outer loop, which might be why we see a regression.
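Schematically (a hand-written illustration of the shape, not the benchmark's actual source):

```csharp
static void BubbleSort(int[] a)
{
    int n = a.Length;
    for (int i = 0; i < n; i++)                // outer loop
    {
        // Alignment NOPs emitted just before the inner loop head sit inside
        // the outer loop's body, so they are fetched and decoded on every
        // outer iteration.
        for (int j = 0; j < n - i - 1; j++)    // aligned inner loop
        {
            if (a[j] > a[j + 1])
                (a[j], a[j + 1]) = (a[j + 1], a[j]);
        }
    }
}
```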
We should just delete this benchmark or trim it down. It does array allocation inside the benchmark code, and the measurements can be prone to GC behavior, data alignment, etc. I didn't dig deep into this one beyond looking at its historical data.
Finally, here is my favorite. This one too has data alignment issues, but after fixing the code alignment, we see less swing in the measurements. In the graph below, the first drop around October came after we started aligning methods at 32B boundaries in https://github.com/dotnet/runtime/pull/42909, and the 2nd drop came after "loop alignment".
Zoomed to "loop alignment" change:
All this happens because we align the inner loop present in bubble sort. Note that we have added 5 bytes of padding in `IG03` to align the loop. However, `IG03` has `weight=4` because it is part of the outer loop. We could have gotten more performance benefit had we added the padding in `IG01` or `IG02`, which have `weight=1`.
Thanks to @adamsitnik for helping me analyze these faster by pointing out the presence of loops inside the benchmark code.
Nice analysis @kunalspathak. There were 2 or 3 TODOs in your summary above; do they need issues, and listing in the tracking list at the top of this issue? I am happy to see this continuing progress.
As an aside, is there more that BDN can do to help regularize data alignment when the buffer is allocated by the test and reused? Presumably by doing some tricks with padding in a struct and some unsafe code, one can arrange for an array to have a certain alignment. Maybe BDN could handle such ugly tricks and expose an API for a test to allocate a nicely aligned array. @adamsitnik
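Something along those lines seems doable; a minimal sketch, assuming .NET 5's pinned-object-heap allocation (`AlignedAlloc.AllocateAligned` is a hypothetical helper, not an existing BDN API):

```csharp
using System;
using System.Runtime.CompilerServices;

static class AlignedAlloc
{
    // Over-allocate a pinned array and return a 32-byte-aligned window into
    // it. Pinned (POH) arrays never move, so the computed offset stays valid.
    // Requires <AllowUnsafeBlocks>true</AllowUnsafeBlocks>.
    public static unsafe Span<int> AllocateAligned(int length, int alignment = 32)
    {
        int slackElements = alignment / sizeof(int);
        int[] buffer = GC.AllocateArray<int>(length + slackElements, pinned: true);
        nint addr = (nint)Unsafe.AsPointer(ref buffer[0]);
        int offsetBytes = (int)((alignment - (addr % alignment)) % alignment);
        return buffer.AsSpan(offsetBytes / sizeof(int), length);
    }
}
```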
Another benchmark whose impact on stability and performance I would like to highlight is `LoopReturn`:
Ubuntu x64
Windows x64
There were 2 or 3 todo's in your summary above, do they need issues, and listing in the tracking list at the top of this issue?
Some of them were tracked in this issue's description. I have updated it to include my other findings.
As an aside, is there more that BDN can do to help regularize data alignment when the buffer is allocated by the test and reused?
Never mind, I forgot about https://github.com/dotnet/performance/pull/1587 etc. we're attacking both sides. 😊
Another round of analysis.
Stability

I went through the stability data for Windows x64 and Linux x64 that @DrewScoggins shared with me. In this report, "regression" means that loop alignment introduced more variance to the measurements and thus made the benchmark unstable, while "improvement" means that the loop alignment work actually stabilized the benchmark and reduced the variance. I analyzed around 150 benchmarks to see if the regressions were real by checking the historical data of Windows x64 and Linux x64 for each of the 150 benchmarks. If I saw that a benchmark had a lot of variance historically, I concluded that loop alignment didn't add to the variance. There were a few exceptions, like `Perf_Enumerable.Count` and `Single.Max`, where we saw some variance after loop alignment, but other than that, most of the regressions pointed out in the report were not related to the loop alignment work.
Performance

Here are some of the takeaway points:
The JIT maintains a loop table to track all the loops it has seen and uses it to perform various optimizations and analyses. It turns out that there is a bug (or a missing feature, whatever you want to call it) where we do not record cloned loops inside the loop table. Because of that, we miss out on opportunities to do the required analysis on cloned loops. Tracking issue: https://github.com/dotnet/runtime/issues/43713
We sometimes add "NOP compensation" for over-estimated instructions. Most of the over-estimated instructions are jumps, and that happens because branch tightening runs before the loop alignment adjustment phase. In branch tightening we adjust the size of jump instructions depending on the target offset, but later, the loop alignment adjustment can further shorten the IG offsets, resulting in over-estimated jump instructions. To address this, we need to combine branch tightening and loop alignment adjustment into the same phase rather than separating them. I have updated the work item list to capture this explicitly. After doing this, we might further think about adding the compensation behind a `jmp` or in a cold block.
The encoding of a jump instruction can grow after loop alignment because of the added padding instructions. I do not think we can do much about that.
FannkuchRedux_9

The regression happens because of the extra padding we added to align some of the loops inside the benchmark code. Asmdiffs: https://www.diffchecker.com/ryZqqO6I.
MulMatrix

The regression happens because of the addition of 5 "NOP over-estimation compensations" in hot blocks. The over-estimation happened for jump instructions. The way to address this is by combining branch tightening with loop alignment adjustments (already planned work). Asmdiffs: https://www.diffchecker.com/IcktyNPi
TryGetUInt16

In this benchmark, `TryParseUInt16D()` is the hot method, as seen in the VTune screenshot:

With loop alignment, we ended up adding padding at the end of the hot block that VTune points at above (Block 2). Asmdiffs: https://www.diffchecker.com/jHt5pv99
GetTypeCode

Although this shows a regression, I verified that there is no loop inside the benchmark, nor a loop in the library method that it tests. Further, I also compared the asm diffs and there is no "align" instruction.
ContainsTrue/ContainsFalse

This one is interesting. The hot method for all these benchmarks is ``GenericEqualityComparer`1[__Canon][System.__Canon]:IndexOf``. In this method, we clone a loop but later decide not to align it because it has a call. However, we do not record that fact in the cloned loop, because of which we end up aligning the cloned loop that contains the call. Additionally, until we emit the padding in the cloned loop, we also have to compensate for the over-estimated instructions. One of the places where such an over-estimated instruction gets compensated happens to be the actual hot loop that the method executes. You can see my detailed analysis in https://github.com/dotnet/runtime/issues/43713#issuecomment-772781240.
Good analysis; it shows that the loop alignment work didn't introduce regressions.
What is a cloned loop in the loop table?
Thanks, Julie
What is a cloned loop in the loop table?
Sometimes, if a loop has checks like an array bounds check, we will clone the loop. One variant of the loop will have no bounds checks (and hence will be a fast loop), while the other variant will have bounds checks (and will be a slow loop). Depending on the input, we decide whether to execute the fast loop or the cloned (slow) loop. You can see an example here: the fast loop is `L003e ~ L0058` while the cloned slow loop is `L005c ~ L0080`. During JIT, we track all the loops in a data structure called `optLoopTable`. Whenever we want to do any analysis on loops, we iterate over this table to decide whether a given loop needs any update/optimization. When we clone a loop, we do not insert an entry for the cloned loop inside `optLoopTable`, because of which further analysis doesn't happen on the cloned loop. Towards the end of my comment in https://github.com/dotnet/runtime/issues/43713#issuecomment-772781240, I have listed the types of analysis we miss out on for cloned loops because their entries are not recorded inside `optLoopTable`.
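Expressed back in C# (a conceptual illustration of the transformation, not actual JIT output), cloning turns one loop into a guarded fast/slow pair:

```csharp
static int Sum(int[] a, int n)
{
    int sum = 0;
    // Guard produced by cloning: when it passes, the fast variant runs.
    if (a != null && (uint)n <= (uint)a.Length)
    {
        // Fast clone: the guard proves every a[i] is in range,
        // so the per-iteration bounds checks can be elided.
        for (int i = 0; i < n; i++)
            sum += a[i];
    }
    else
    {
        // Slow clone: every a[i] keeps its bounds check.
        for (int i = 0; i < n; i++)
            sum += a[i];
    }
    return sum;
}
```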
Moved to Future, since the .NET 6 scope items are complete now.
I read your article about loop alignment, https://devblogs.microsoft.com/dotnet/loop-alignment-in-net-6/, and it mentions that a loop like this
```csharp
for (int l = 0; l < M; l++) {
    // body
    OtherMethod();
}
```
isn't a candidate for alignment because it contains a call to a method. What about methods that are inlined? Shouldn't an inlined method be considered part of the loop body, making the loop eligible for alignment if its body is small enough?
What about methods that are inlined?
Inlining decisions happen in an early phase of compilation, so by the time we decide whether to align a loop, we already know whether the call was inlined or not. We won't align the loop if we know for sure that the method will not be inlined.
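For example (a hypothetical illustration; `NoInlining` just forces the two outcomes):

```csharp
using System.Runtime.CompilerServices;

// Small enough to be inlined: after inlining, the loop body is call-free,
// so the loop remains an alignment candidate.
static int Inlined(int x) => x + 1;

// Inlining blocked: the call survives into codegen, so the JIT skips
// aligning any loop that contains it.
[MethodImpl(MethodImplOptions.NoInlining)]
static int NotInlined(int x) => x + 1;
```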
The code quality and performance of RyuJIT are tracked internally by running MicroBenchmarks in our performance lab. We regularly triage the performance issues opened by the .NET performance team. After going through these issues for the past several months, we have identified some key points.
Stability
Many times, the set of commits flagged as introducing a regression in a benchmark does not touch the code that the benchmark tests. In fact, the assembly code generated for the .NET code being tested is often identical, and yet the measurements show differences. Our investigation reveals that the fluctuation in benchmark measurements happens because of misalignment of the generated JIT code in process memory. Below is an example of the LoopReturn benchmark that shows such behavior.
It is very time consuming for .NET developers to analyze benchmarks that regressed because of things that are outside the control of the .NET runtime. In the past, we have closed several issues like https://github.com/dotnet/runtime/issues/13770, https://github.com/dotnet/runtime/issues/39721 and https://github.com/dotnet/runtime/issues/39722 because they were regressions caused by code alignment. A great example we found while investigating those issues: the change introduced in https://github.com/dotnet/runtime/pull/38586 eliminated a `test` instruction and should have shown an improvement in the benchmarks, but instead introduced a regression because the code (loop code inside the method) became misaligned and the method ran slower. Alignment issues were brought up a few times in https://github.com/dotnet/runtime/issues/9912 and https://github.com/dotnet/runtime/issues/8108, and this issue tracks the progress towards the goal of stabilizing and possibly improving the performance of .NET apps that are heavily affected by code alignment.
Performance lab infrastructure
Once we address the code alignment issue, the next big thing will be to identify and make the required infrastructure changes in our performance lab so that it can easily flag such issues without needing much interaction from .NET developers. For example, https://github.com/dotnet/BenchmarkDotNet/issues/1513 proposes randomizing memory alignment in benchmark runs to catch these issues early; once we address the underlying problem in .NET, we should never see bimodal behavior in those benchmarks. After that, if the performance lab does find a regression in a benchmark, we need robust tooling support to extract useful metrics from performance runs, so that a developer doing the investigation can easily identify the cause of the regression: for example, identifying the time spent in various phases of the .NET runtime like jitting, the JIT interface, Tier0/Tier1 JIT code, hot methods, instructions retired during benchmark execution, and so forth.
Reliable benchmarks collection
Lastly, for developers working on the JIT, we want to identify a set of benchmarks that are stable enough to be trusted to give us reliable measurements whenever there is a need to verify the performance of changes to the JIT codebase. This will help us conduct performance testing ahead of time and identify potential regressions, rather than waiting for them to show up in the performance lab.
Here are the work items that we have identified to achieve all of the above:
Code alignment work
Future work
- Advanced padding: squeeze pads after a `jmp` or `ret` instruction that comes before the `align` instruction.
- For x86, improve the encoding we use for multi-byte `NOP` instructions. Today, we just output repeated single-byte `90`, but could do better like we do for x64.

Performance tooling work