dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Regressions in System.Text.Json.Document.Tests.Perf_EnumerateArray #67176

Closed: danmoseley closed this issue 4 months ago

danmoseley commented 2 years ago

broken out of https://github.com/dotnet/runtime/issues/67101

![image](https://user-images.githubusercontent.com/6385855/160213493-305e3c01-f481-4070-91e1-b777e70f9c03.png)

Commit range: https://github.com/dotnet/runtime/compare/6bf873a991bcae3f80f5de155a594cefc8824eea...bc5e386676ec5c806eef78d5fa754f1eadfe28c2.

Seems the only relevant change is "JIT: enable reading PGO data when switching to optimized" (https://github.com/dotnet/runtime/pull/66618, commit https://github.com/dotnet/runtime/commit/2737da5817ab121fb43ff5577d1e146e0ec6d213).

category:performance theme:benchmarks

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT. See info in area-owners.md if you want to be subscribed.

Issue Details
Author: danmoseley
Assignees: -
Labels: `area-CodeGen-coreclr`
Milestone: -
AndyAyersMS commented 2 years ago

Hopefully addressed by the PGO update we're waiting for... cc @EgorBo

If not I'll take a look...

AndyAyersMS commented 2 years ago

Windows ARM64

[chart: newplot 2022-04-25T135849]

Related benchmark shows similar but smaller regression

[chart: newplot 2022-04-25T140435]

Windows x64 shows similar but much smaller regression

[chart: newplot 2022-04-25T140117]

These are likely the same issues we see in #67318 where block reordering driven by PGO data is interfering with loop optimizations. Will need to drill in to be sure.

AndyAyersMS commented 2 years ago

Regression has persisted even after we deferred some block reordering in https://github.com/dotnet/runtime/pull/69878.

[chart: newplot 2022-06-10]

Need to investigate this one.

AndyAyersMS commented 2 years ago

Can repro this locally, digging in.

| Method | Job | Toolchain | TestCase | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|--------|-----|-----------|----------|------|-------|--------|--------|-----|-----|-------|-----------|-------------|
| EnumerateArray | Job-YJYIXG | \base-rel\corerun.exe | Json400KB | 2.052 us | 0.0020 us | 0.0019 us | 2.052 us | 2.049 us | 2.055 us | 1.00 | - | NA |
| EnumerateArray | Job-SGSSCH | \diff-rel\corerun.exe | Json400KB | 2.400 us | 0.0045 us | 0.0042 us | 2.398 us | 2.396 us | 2.410 us | 1.17 | - | NA |
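For reference, here is a minimal sketch of the shape of this benchmark, assuming it simply parses a JSON array once during setup and walks it with JsonElement.EnumerateArray in the measured method; the names and the tiny inline payload are illustrative, not the exact dotnet/performance source:

```csharp
using System.Text.Json;
using BenchmarkDotNet.Attributes;

public class Perf_EnumerateArray_Sketch
{
    private JsonDocument _document;

    [GlobalSetup]
    public void Setup()
    {
        // Illustrative payload; the real benchmark loads a ~400 KB JSON test case (Json400KB).
        _document = JsonDocument.Parse("[{\"a\":1},{\"a\":2},{\"a\":3}]");
    }

    [Benchmark]
    public void EnumerateArray()
    {
        // The enumeration loop ends up calling JsonDocument.CheckNotDisposed on each step,
        // which is the callee that shows up in the codegen and profiles discussed below.
        foreach (JsonElement element in _document.RootElement.EnumerateArray())
        {
        }
    }

    [GlobalCleanup]
    public void Cleanup() => _document.Dispose();
}
```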
AndyAyersMS commented 2 years ago

Trying to find a way to run these with `-p ETW`, but running into issues because the host is arm64.

Got past it by updating BDN to use BenchmarkDotNet.Diagnostics.Windows 3.0.1, doing a private build (which I had to do over on an x64 box; see https://github.com/dotnet/BenchmarkDotNet/issues/2028), and then updating the performance repo to use that build of BDN.
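For anyone trying to reproduce this, here is a hedged sketch of wiring the ETW profiler up through a BDN config instead of the `-p ETW` switch; it assumes the BenchmarkDotNet.Diagnostics.Windows package and an elevated prompt, since ETW collection requires admin rights:

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Diagnostics.Windows;
using BenchmarkDotNet.Running;

public static class Program
{
    public static void Main(string[] args)
    {
        // EtwProfiler collects an ETW trace (including CPU samples) per benchmark
        // and writes .etl files next to the usual BDN artifacts.
        var config = DefaultConfig.Instance
            .AddDiagnoser(new EtwProfiler());

        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config);
    }
}
```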

AndyAyersMS commented 2 years ago

Assembly for the key loops is below. The main differences are the call sequence for CheckNotDisposed (a direct bl in BEFORE, an indirect call through an IAT_PVALUE cell in AFTER) and a CSE of the [x0,#8] length load that is missed in AFTER.

Likely it is the difference in the call to CheckNotDisposed that accounts for most of the perf difference. Because the indir introduced for the call only shows up in lowering it's not a candidate for CSE/hoisting (there are two other call sites in this method). We should possibly expand this earlier.

Hoisting the IAT_PVALUE indir may end up being risky as the indir target presumably might get changed. Perhaps for frequently running loops we could split the loop somehow and refresh the hoisted value every so often.

But we could easily hoist/cse the creation of the indir addresses.

Before:

```asm
G_M9251_IG17:        ;; offset=0180H
        B9404BB3  ldr     w19, [fp,#72]
        B9404FB4  ldr     w20, [fp,#76]
        6B14027F  cmp     w19, w20
        5400062A  bge     G_M9251_IG29
G_M9251_IG18:        ;; offset=0190H
        7100027F  cmp     w19, #0
        540000AA  bge     G_M9251_IG20
G_M9251_IG19:        ;; offset=0198H
        B9405BA0  ldr     w0, [fp,#88]
        11003000  add     w0, w0, #12
        B9004BA0  str     w0, [fp,#72]
        1400001D  b       G_M9251_IG24
G_M9251_IG20:        ;; offset=01A8H
        F9402BB5  ldr     x21, [fp,#80]
        39C002BF  ldrsb   wzr, [x21]
        AA1503E0  mov     x0, x21         ;; CheckNotDisposed
        97F95131  bl      **
        F9401AA0  ldr     x0, [x21,#48]
        B50000A0  cbnz    x0, G_M9251_IG21
        350003B3  cbnz    w19, G_M9251_IG27
        AA1F03E1  mov     x1, xzr
        2A1F03E2  mov     w2, wzr
        14000009  b       G_M9251_IG22
G_M9251_IG21:        ;; offset=01D0H
        B9400801  ldr     w1, [x0,#8]
        6B13003F  cmp     w1, w19
        540002E3  blo     G_M9251_IG27
        91004002  add     x2, x0, #16
        4B130020  sub     w0, w1, w19
        2A1303E1  mov     w1, w19
        8B010041  add     x1, x2, x1
        2A0003E2  mov     w2, w0
G_M9251_IG22:        ;; offset=01F0H
        7100305F  cmp     w2, #12
        540001AB  blt     G_M9251_IG26
        B940003F  ldr     wzr, [x1]
        B9400820  ldr     w0, [x1,#8]
        531C7C01  lsr     w1, w0, #28
        53001C21  uxtb    w1, w1
        7100143F  cmp     w1, #5
        5400018B  blt     G_M9251_IG28
        11003261  add     w1, w19, #12
G_M9251_IG23:        ;; offset=0214H
        B9004BA1  str     w1, [fp,#72]
G_M9251_IG24:        ;; offset=0218H
        B9404BA0  ldr     w0, [fp,#72]
        6B14001F  cmp     w0, w20
        54FFFB0B  blt     G_M9251_IG17
```

After:

```asm
G_M9251_IG14:        ;; offset=0168H
        B94043B4  ldr     w20, [fp,#64]
        B94047B6  ldr     w22, [fp,#68]
        6B16029F  cmp     w20, w22
        540007AA  bge     G_M9251_IG27
G_M9251_IG15:        ;; offset=0178H
        7100029F  cmp     w20, #0
        540000AA  bge     G_M9251_IG17
G_M9251_IG16:        ;; offset=0180H
        B94053A0  ldr     w0, [fp,#80]
        11003000  add     w0, w0, #12
        B90043A0  str     w0, [fp,#64]
        1400001D  b       G_M9251_IG21
G_M9251_IG17:        ;; offset=0190H
        F94027B3  ldr     x19, [fp,#72]
        39C0027F  ldrsb   wzr, [x19]
        AA1303E0  mov     x0, x19
        D2935001  movz    x1, #0x9a80
        F2AE1C21  movk    x1, #0x70e1 LSL #16
        F2CFFFA1  movk    x1, #0x7ffd LSL #32
        F9400021  ldr     x1, [x1]        ;; CheckNotDisposed
        D63F0020  blr     x1
        F9401A60  ldr     x0, [x19,#48]
        B4000480  cbz     x0, G_M9251_IG25
        B9400801  ldr     w1, [x0,#8]
        6B14003F  cmp     w1, w20
        54000283  blo     G_M9251_IG23
        91004001  add     x1, x0, #16
        2A1403E2  mov     w2, w20
        8B020021  add     x1, x1, x2
        B9400800  ldr     w0, [x0,#8]     ;; -CSE
        4B140002  sub     w2, w0, w20
G_M9251_IG18:        ;; offset=01D8H
        7100305F  cmp     w2, #12
        5400026B  blt     G_M9251_IG24
        B940003F  ldr     wzr, [x1]
        B9400820  ldr     w0, [x1,#8]
        531C7C01  lsr     w1, w0, #28
        53001C21  uxtb    w1, w1
        7100143F  cmp     w1, #5
        5400030B  blt     G_M9251_IG26
G_M9251_IG19:        ;; offset=01F8H
        11003281  add     w1, w20, #12
G_M9251_IG20:        ;; offset=01FCH
        B90043A1  str     w1, [fp,#64]
G_M9251_IG21:        ;; offset=0200H
        B94043A0  ldr     w0, [fp,#64]
        6B16001F  cmp     w0, w22
        54FFFB0B  blt     G_M9251_IG14
G_M9251_IG25:        ;; offset=0244H
        35FFFE74  cbnz    w20, G_M9251_IG23
        AA1F03E1  mov     x1, xzr
        2A1F03E2  mov     w2, wzr
        17FFFFE2  b       G_M9251_IG18
```
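To make the hoist-and-refresh idea above concrete in source terms, here is a purely illustrative C# sketch (not JIT code, and not what the JIT emits): the loop is strip-mined into chunks so that an indirectly loaded call target is re-read once per chunk instead of on every iteration, trading a small delay in observing a changed target for keeping the load out of the hot loop.

```csharp
using System;

public static class HoistAndRefreshSketch
{
    // Stand-in for the IAT_PVALUE cell: a mutable slot holding the current call target.
    private static Func<int, int> s_target = x => x + 1;

    // Naive form: the target is re-read from the slot on every iteration,
    // analogous to the per-iteration load of the call address in the AFTER codegen.
    public static int SumNaive(int[] data)
    {
        int sum = 0;
        for (int i = 0; i < data.Length; i++)
            sum += s_target(data[i]);
        return sum;
    }

    // Split form: hoist the target out of the inner loop and refresh it once per chunk,
    // so a changed target is still picked up after at most chunkSize iterations.
    public static int SumChunked(int[] data, int chunkSize = 1024)
    {
        int sum = 0;
        for (int start = 0; start < data.Length; start += chunkSize)
        {
            Func<int, int> target = s_target;            // hoisted load, refreshed per chunk
            int end = Math.Min(start + chunkSize, data.Length);
            for (int i = start; i < end; i++)
                sum += target(data[i]);
        }
        return sum;
    }
}
```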
AndyAyersMS commented 2 years ago

I cherry-picked the changes from https://github.com/dotnet/runtime/pull/70988, and while I see a more compact call in the AFTER case, it doesn't make the benchmark any faster. So seemingly the perf issue lies elsewhere?

Profiling is strongly implicating code within the method. Comparing two BDN runs with the same number of iterations/invocations, we see all the extra exclusive cycles in EnumerateArray.

[screenshot: profile comparison]
AndyAyersMS commented 2 years ago

Looking at the raw sample data from the ETL, this bit of code in the AFTER version, right after the bl, seems especially costly:

        F9401A60          ldr     x0, [x19,#48]
        B4000480          cbz     x0, G_M9251_IG25
        B9400801          ldr     w1, [x0,#8]
        6B14003F          cmp     w1, w20
        54000283          blo     G_M9251_IG23

The BEFORE code layout inverts the first branch.

x0 is never zero here, so the branch is always taken in BEFORE and never taken in AFTER, and should be quite predictable either way. I can't see offhand why this bit of code should be slower; if anything, a not-taken branch should be cheaper in general. There is a dependence chain here, but it's the same in the BEFORE code.

AndyAyersMS commented 2 years ago

Ubuntu arm64 shows a similar regression at the same change. Feels like I must be missing something obvious here.

[chart: newplot 2022-06-30]

cc @EgorBo

AndyAyersMS commented 2 years ago

Also (note to self): PerfView will dump the raw counts if you tell it to annotate the source for a method. Look in the log and you'll see something like:

```
GetSourceLine: Getting source line for code address index 0000215A
GetSourceLine: address for code address is 7ffd70e54ad4 module microbenchmarks
GetSourceLine: Found JITTed method System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateArray(), index 0000086E token 6001add
GetSourceLine: Found an il-to-native mapping MethodIdx 0000086E Start 7ffd70e54a80 Len 33c
GetSourceLine: NativeOffset 54 ILOffset = c
FindSymbolFilePath: *{ Locating PDB c:\repos\performance\artifacts\obj\MicroBenchmarks\Release\net7.0\MicroBenchmarks.pdb GUID d7c3c786-f5ca-4fea-a8a2-646b668b2aba Age 1 Version 
FindSymbolFilePath: Pdb is for DLL c:\repos\performance\artifacts\bin\microbenchmarks\release\net7.0\e3f010d1-375f-440a-8d2c-94eda8290ad4\bin\release\net7.0\publish\microbenchmarks.dll
FindSymbolFilePath: } Hit Cache, returning c:\repos\performance\artifacts\obj\MicroBenchmarks\Release\net7.0\MicroBenchmarks.pdb
Opened Pdb file c:\repos\performance\artifacts\obj\MicroBenchmarks\Release\net7.0\MicroBenchmarks.pdb
Opened Portable Pdb Source File: c:\repos\performance\src\benchmarks\micro\libraries\System.Text.Json\Document\EnumerateArray.cs

Metric as a function of code address
      Address    :   Line     Metric
    7ffd70e54a94 :     57        1.0
    7ffd70e54aa8 :     57        1.0
    7ffd70e54ab4 :     57        3.0
    7ffd70e54ab8 :     57        1.0
    7ffd70e54ad4 :     57        3.0
    7ffd70e54ad8 :     57        3.0
    7ffd70e54ae4 :     57        1.0
    7ffd70e54aec :     57        1.0
    7ffd70e54b08 :     57        1.0
    7ffd70e54b10 :     57        2.0
    7ffd70e54b20 :     57        2.0
    7ffd70e54b60 :      0        1.0
    7ffd70e54b6c :      0        1.0
    7ffd70e54b88 :      0        1.0
    7ffd70e54bc4 :      0        4.0
    7ffd70e54bdc :      0        7.0
    7ffd70e54be8 :      0      206.0
    7ffd70e54bec :      0       25.0
    7ffd70e54bf0 :      0       29.0
```

This gives similar (though not exactly the same) data as my instructions-retired tool with the new -show-samples option, which for the AFTER run above produces:

```
Samples for corerun: 5935 events for Benchmark Intervals
Jitting           : 00.52% 2.3E+05  samples 1729 methods
  JitInterface    : 00.23% 1E+05    samples
Jit-generated code: 99.28% 4.41E+07 samples
  Jitted code     : 99.28% 4.41E+07 samples
  MinOpts code    : 00.00% 0        samples
  FullOpts code   : 00.00% 0        samples
  Tier-0 code     : 00.00% 0        samples
  Tier-1 code     : 99.28% 4.41E+07 samples
  R2R code        : 00.00% 0        samples

93.09%   4.136E+07   Tier-1   [MicroBenchmarks]Perf_EnumerateArray.EnumerateArray()
05.96%   2.65E+06    Tier-1   [System.Text.Json]JsonDocument.CheckNotDisposed()
00.29%   1.3E+05     native   clrjit.dll
00.23%   1E+05       Tier-1   [e3f010d1-375f-440a-8d2c-94eda8290ad4]Runnable_0.WorkloadActionUnroll(int64)
00.14%   6E+04       native   coreclr.dll
00.14%   6E+04       native   ntoskrnl.exe
Raw samples for [MicroBenchmarks]Perf_EnumerateArray.EnumerateArray() at 0x00007FFD70E54A80 -- 0x00007FFD70E54DBC (length 0x033C)
0x0014 : 1
0x0028 : 1
0x0034 : 3
0x0038 : 1
0x0058 : 3
0x0064 : 1
0x006C : 1
0x0088 : 1
0x0090 : 2
0x00A0 : 2
0x00E0 : 1
0x00EC : 1
0x0108 : 1
0x0144 : 4
0x015C : 6
0x0168 : 186
```

The difference may be that my tool uses the exact BDN events to filter the samples, while the perfview report above is using an "include filter" on WorkloadActionUnroll and so may capture a bit more.
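For reference, here is a rough sketch of how CPU samples can be restricted to the measured intervals with the TraceEvent library; the BenchmarkDotNet.EngineEventSource provider name and the WorkloadActualStart/WorkloadActualStop event names are assumptions about what BDN emits, not something verified here:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Diagnostics.Tracing;
using Microsoft.Diagnostics.Tracing.Parsers.Kernel;

public static class SampleFilterSketch
{
    public static void CountSamplesInWorkloadIntervals(string etlPath, int benchmarkPid)
    {
        bool inWorkload = false;
        var samplesByAddress = new Dictionary<ulong, int>();

        using (var source = new ETWTraceEventSource(etlPath))
        {
            // Toggle the "measured interval" flag from BDN's EventSource markers
            // (provider and event names assumed, see the note above).
            source.Dynamic.All += evt =>
            {
                if (evt.ProviderName == "BenchmarkDotNet.EngineEventSource")
                {
                    if (evt.EventName == "WorkloadActualStart") inWorkload = true;
                    else if (evt.EventName == "WorkloadActualStop") inWorkload = false;
                }
            };

            // Count kernel CPU sample events that land inside the measured intervals.
            source.Kernel.PerfInfoSample += (SampledProfileTraceData sample) =>
            {
                if (inWorkload && sample.ProcessID == benchmarkPid)
                {
                    ulong ip = sample.InstructionPointer;
                    samplesByAddress.TryGetValue(ip, out int count);
                    samplesByAddress[ip] = count + 1;
                }
            };

            source.Process();  // events are delivered in timestamp order
        }

        Console.WriteLine($"{samplesByAddress.Count} distinct sample addresses inside measured intervals");
    }
}
```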

AndyAyersMS commented 2 years ago

Similar results on the M1

```
BenchmarkDotNet=v0.13.1.1786-nightly, OS=macOS Monterey 12.3 (21E230) [Darwin 21.4.0]
Apple M1 Max, 1 CPU, 10 logical and 10 physical cores
.NET SDK=7.0.100-preview.2.22153.17
  [Host]     : .NET 6.0.3 (6.0.322.12309), Arm64 RyuJIT
  Job-VMNNRZ : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT
  Job-DELMRH : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  MaxIterationCount=20  
MinIterationCount=15  WarmupCount=1
```

| Method | Job | Toolchain | TestCase | Mean | Error | StdDev | Median | Min | Max | Ratio | Allocated | Alloc Ratio |
|--------|-----|-----------|----------|------|-------|--------|--------|-----|-----|-------|-----------|-------------|
| EnumerateArray | Job-VMNNRZ | AFTER | Json400KB | 2.314 us | 0.0020 us | 0.0018 us | 2.314 us | 2.313 us | 2.318 us | 1.22 | - | NA |
| EnumerateArray | Job-DELMRH | BEFORE | Json400KB | 1.901 us | 0.0030 us | 0.0027 us | 1.902 us | 1.895 us | 1.905 us | 1.00 | - | NA |
EgorBo commented 2 years ago

@AndyAyersMS so if I understand you correctly, you blame the indirect call to CheckNotDisposed (via PRECODE, or maybe even via jump-stub + precode)?

EgorBo commented 2 years ago

@AndyAyersMS

> it doesn't make the benchmark any faster. So seemingly the perf issue lies elsewhere?

Good news: it turns out my PR https://github.com/dotnet/runtime/pull/70988 doesn't need the `&& ftn->HasNativeCodeSlot()` check. With it removed, I now see this codegen diff: https://www.diffchecker.com/nKMTF7nE

EgorBo commented 2 years ago

However, the actual improvement from the direct call is still small (~1%)

AndyAyersMS commented 2 years ago

Right, the older version of #70988 had the same effect on the codegen and didn't fix the perf issue here.

EgorBo commented 2 years ago

The benchmark is basically "fixed" with https://github.com/dotnet/runtime/pull/71534

AndyAyersMS commented 2 years ago

Interesting. Would still be nice to understand why there was a regression here. Since it repros on many different devices, it seems like something fundamental rather than a quirk of one microarchitecture. Seems like there might be something important to be learned.

EgorBo commented 2 years ago

I agree, but I wasn't able to identify anything unusual so far in VTune on Windows x64; I guess I need to do it on Windows arm64, where the regression is far more visible. On win-x64 it looks like it slightly fixed itself over time:

[chart: win-x64 benchmark history]

AndyAyersMS commented 2 years ago

My current hunch is that something in the stub chain between this method and CheckNotDisposed ends up blowing out the return branch predictor, and that's why we see extra sample hits just after the bl. But I could be wrong. Trying to step through the actual code sequence now to see for sure.

EgorBo commented 2 years ago

Btw, performance was slightly faster (>2%) if I removed this condition together with https://github.com/dotnet/runtime/pull/70988 (without the `&& ftn->HasNativeCodeSlot()` check).

That means: just call the absolute address; don't try to reloc it (or emit jump stubs).

AndyAyersMS commented 2 years ago

Running on an Ampere machine; overview from perf:

BASE

```
Performance counter stats for 'dotnet run -c Release -f net7.0 -- --filter System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateArray --parameter-filter TestCase:Json400KB --corerun /home/andya/bugs/r67176/base-rel/corerun --iterationCount 100 --invocationCount 100000':

         53,735.74 msec task-clock                #    1.185 CPUs utilized          
            50,373      context-switches          #  937.421 /sec                   
                91      cpu-migrations            #    1.693 /sec                   
           422,809      page-faults               #    7.868 K/sec                  
   152,920,646,006      cycles                    #    2.846 GHz                    
   230,502,866,275      instructions              #    1.51  insn per cycle         
   <not supported>      branches                                                    
       986,085,875      branch-misses                                               

      45.357072152 seconds time elapsed

      49.674995000 seconds user
       4.062895000 seconds sys
```

DIFF

```
Performance counter stats for 'dotnet run -c Release -f net7.0 -- --filter System.Text.Json.Document.Tests.Perf_EnumerateArray.EnumerateArray --parameter-filter TestCase:Json400KB --corerun /home/andya/bugs/r67176/diff-rel/corerun --iterationCount 100 --invocationCount 100000':

         61,693.81 msec task-clock                #    1.185 CPUs utilized          
            48,765      context-switches          #  790.436 /sec                   
                91      cpu-migrations            #    1.475 /sec                   
           430,535      page-faults               #    6.979 K/sec                  
   176,402,651,222      cycles                    #    2.859 GHz                    
   237,532,018,384      instructions              #    1.35  insn per cycle         
   <not supported>      branches                                                    
       979,340,016      branch-misses                                               

      52.049848600 seconds time elapsed

      57.298960000 seconds user
       4.391213000 seconds sys
```

So seems to be primarily a stall issue. Drilling in a bit:

BASE

```
    37,537,472,128      stalled-cycles-backend    #    0.00% backend cycles idle    
    29,884,183,341      stalled-cycles-frontend  
```

DIFF

```
    51,543,373,659      stalled-cycles-backend    #    0.00% backend cycles idle    
    29,460,067,083      stalled-cycles-frontend  
```

So it looks like a backend stall. I was not able to pin this down to any particular backend event, even following the approach in the Arm Neoverse N1 Core Performance Analysis Methodology. Also consulted the Arm Neoverse N1 PMU Guide.

Sampling on stall_backend and annotating (DIFF on left, BASE on right), we still see the hits clustering after the BLR in the DIFF codegen (the codegen is very similar to the Windows codegen above).

[screenshot: perf annotate output]

AndyAyersMS commented 2 years ago

This regression is now fixed thanks to @EgorBo and https://github.com/dotnet/runtime/pull/71534 (note that right after the fix went in, we had a big widespread regression, which is also now fixed):

[chart: newplot 2022-07-11]

I would still like to better understand the root cause of this regression, but it's no longer a 7.0 issue, so I'll move this to Future.

AndyAyersMS commented 4 months ago

We no longer benchmark that bit of hardware, so not much we can do at this point.