
[ARM64] Performance regression: Sorting arrays of primitive types #41741

Closed (adamsitnik closed this issue 3 years ago)

adamsitnik commented 4 years ago

After running benchmarks for 3.1 vs 5.0 using the "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found a few regressions related to sorting.

It may be related to the fact that we moved the sorting implementation from native to managed code (cc @stephentoub).

@DrewScoggins is there any way to see the full historical data for ARM64?

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'System.Collections.Sort<Int32>.Array' 'System.Collections.Sort<Int32>.List'
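
For context, the benchmark being filtered here lives in the dotnet/performance repo; below is a simplified sketch of its shape (a hypothetical reconstruction, not the actual source — the copy that resets the data is included in the measured body for simplicity):

using System;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;

public class Sort
{
    [Params(512)]
    public int Size;

    private int[] _source, _array;
    private List<int> _list;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);              // fixed seed for repeatable data
        _source = new int[Size];
        for (int i = 0; i < Size; i++) _source[i] = rng.Next();
        _array = new int[Size];
        _list = new List<int>(_source);
    }

    [Benchmark]
    public void Array()
    {
        _source.AsSpan().CopyTo(_array);       // reset to unsorted data
        System.Array.Sort(_array);             // managed sort in 5.0, native in 3.1
    }

    [Benchmark]
    public void List()
    {
        for (int i = 0; i < Size; i++) _list[i] = _source[i];
        _list.Sort();
    }
}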

System.Collections.Sort<Int32>.Array(Size: 512)

| Result | Base (ns) | Diff (ns) | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base Version | Diff Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Same | 4500.17 | 4250.27 | 1.06 | +0 | | Windows 10.0.19041.388 | X64 | AMD Ryzen 9 3900X | 3.1.6 | 5.0.20.41714 |
| Faster | 4129.79 | 3573.10 | 1.16 | +0 | several? | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 4197.36 | 3672.08 | 1.14 | +0 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 5247.31 | 4572.02 | 1.15 | +0 | multimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 6560.47 | 6483.01 | 1.01 | +0 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 3.1.6 | 5.0.20.40416 |
| Same | 3315.74 | 3404.51 | 0.97 | +0 | | Windows 10.0.19042 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 4821.34 | 8602.70 | 0.56 | +0 | several? | Windows 10.0.19041.450 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 3.1.6 | 5.0.20.41714 |
| Faster | 6214.65 | 3992.76 | 1.56 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 6441.58 | 4187.34 | 1.54 | +0 | | manjaro | X64 | Intel Core i7-4771 CPU 3.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Same | 5831.01 | 5518.00 | 1.06 | +0 | bimodal | pop 20.04 | X64 | Intel Core i7-6600U CPU 2.60GHz (Skylake) | 3.1.6 | 5.0.20.41714 |
| Same | 5602.64 | 5486.66 | 1.02 | +0 | bimodal | alpine 3.11 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.41714 |
| Same | 6178.62 | 5918.08 | 1.04 | +0 | bimodal | ubuntu 18.04 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 16288.81 | 27067.03 | 0.60 | +0 | bimodal | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 16150.43 | 27030.75 | 0.60 | +0 | bimodal | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.7 | 5.0.20.41714 |
| Slower | 16139.02 | 26145.59 | 0.62 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 14011.07 | 17551.48 | 0.80 | +0 | | ubuntu 18.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Faster | 7375.30 | 4817.14 | 1.53 | +0 | several? | Windows 10.0.19041.450 | X86 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 7244.05 | 7548.09 | 0.96 | +0 | several? | Windows 10.0.18363.1016 | Arm | Microsoft SQ1 3.0 GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 7932.56 | 5120.47 | 1.55 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7060.74 | 4554.61 | 1.55 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7145.76 | 4761.73 | 1.50 | +0 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40203 |

System.Collections.Sort<Int32>.List(Size: 512)

| Result | Base (ns) | Diff (ns) | Ratio | Alloc Delta | Modality | Operating System | Bit | Processor Name | Base Version | Diff Version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Faster | 4520.64 | 4071.51 | 1.11 | +0 | | Windows 10.0.19041.388 | X64 | AMD Ryzen 9 3900X | 3.1.6 | 5.0.20.41714 |
| Faster | 4332.89 | 3567.66 | 1.21 | +0 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 6967.61 | 3518.82 | 1.98 | +0 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 5015.13 | 4448.40 | 1.13 | +0 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 7293.88 | 7863.56 | 0.93 | +0 | several? | Windows 10.0.19041.450 | X64 | Intel Core i7-6700 CPU 3.40GHz (Skylake) | 3.1.6 | 5.0.20.40416 |
| Same | 3339.08 | 3267.09 | 1.02 | +0 | | Windows 10.0.19042 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 4806.28 | 8450.61 | 0.57 | +0 | bimodal | Windows 10.0.19041.450 | X64 | Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) | 3.1.6 | 5.0.20.41714 |
| Faster | 6222.44 | 3938.96 | 1.58 | +0 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | 3.1.6 | 5.0.20.40203 |
| Faster | 6146.14 | 4254.57 | 1.44 | +0 | | manjaro | X64 | Intel Core i7-4771 CPU 3.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Same | 5834.00 | 5493.04 | 1.06 | +0 | | pop 20.04 | X64 | Intel Core i7-6600U CPU 2.60GHz (Skylake) | 3.1.6 | 5.0.20.41714 |
| Same | 5571.07 | 6275.42 | 0.89 | +0 | several? | alpine 3.11 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.41714 |
| Same | 6217.53 | 6206.80 | 1.00 | +0 | multimodal | ubuntu 18.04 | X64 | Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) | 3.1.6 | 5.0.20.40416 |
| Slower | 16281.08 | 27217.40 | 0.60 | +0 | bimodal | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 16329.85 | 27111.99 | 0.60 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.7 | 5.0.20.41714 |
| Slower | 16395.71 | 26771.27 | 0.61 | +0 | | ubuntu 16.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Slower | 7791.41 | 10397.92 | 0.75 | +0 | several? | ubuntu 18.04 | Arm64 | Unknown processor | 3.1.6 | 5.0.20.41714 |
| Faster | 7371.80 | 4824.34 | 1.53 | +0 | | Windows 10.0.19041.450 | X86 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40416 |
| Same | 7002.96 | 7065.54 | 0.99 | +0 | several? | Windows 10.0.18363.1016 | Arm | Microsoft SQ1 3.0 GHz | 3.1.6 | 5.0.20.40416 |
| Faster | 7976.12 | 5098.72 | 1.56 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i5-4278U CPU 2.60GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7114.58 | 4576.65 | 1.55 | +0 | | macOS Catalina 10.15.6 | X64 | Intel Core i7-4870HQ CPU 2.50GHz (Haswell) | 3.1.6 | 5.0.20.41714 |
| Faster | 7157.62 | 4735.35 | 1.51 | +0 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | 3.1.6 | 5.0.20.40203 |

/cc @JulieLeeMSFT

category:cq theme:needs-triage skill-level:expert cost:large

AndyAyersMS commented 4 years ago

What's the difference between the two tables? Is it just two different sets of runs?

Never mind -- List vs Array.

AndyAyersMS commented 4 years ago

I'll investigate.

AndyAyersMS commented 4 years ago

I can repro this on an Rpi4:

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 20.04
Unknown processor
.NET Core SDK=5.0.100-rc.1.20422.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.42118, CoreFX 5.0.20.42118), Arm64 RyuJIT
  Job-HPEYTJ : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), Arm64 RyuJIT
  Job-LNVBZK : .NET Core 5.0.0 (CoreCLR 5.0.20.42118, CoreFX 5.0.20.42118), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  InvocationCount=5000  
IterationTime=250.0000 ms  MaxIterationCount=20  MinIterationCount=15  
UnrollFactor=1  WarmupCount=1  

| Method |        Job |       Runtime |     Toolchain | Size |     Mean |    Error |   StdDev |   Median |      Min |      Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------- |----------- |-------------- |-------------- |----- |---------:|---------:|---------:|---------:|---------:|---------:|------:|--------:|------:|------:|------:|----------:|
|  Array | Job-HPEYTJ | .NET Core 3.1 | netcoreapp3.1 |  512 | 26.97 us | 0.352 us | 0.329 us | 26.86 us | 26.43 us | 27.56 us |  1.00 |    0.00 |     - |     - |     - |         - |
|  Array | Job-LNVBZK | .NET Core 5.0 | netcoreapp5.0 |  512 | 35.61 us | 0.196 us | 0.184 us | 35.62 us | 35.29 us | 35.90 us |  1.32 |    0.02 |     - |     - |     - |         - |
|        |            |               |               |      |          |          |          |          |          |          |       |         |       |       |       |           |
|   List | Job-HPEYTJ | .NET Core 3.1 | netcoreapp3.1 |  512 | 28.54 us | 1.530 us | 1.762 us | 27.54 us | 27.12 us | 31.89 us |  1.00 |    0.00 |     - |     - |     - |         - |
|   List | Job-LNVBZK | .NET Core 5.0 | netcoreapp5.0 |  512 | 36.15 us | 0.417 us | 0.390 us | 36.20 us | 34.91 us | 36.51 us |  1.25 |    0.08 |     - |     - |     - |       1 B |
AndyAyersMS commented 4 years ago

Here's a bit more data for the array case (from an extracted test, via perf, doing the same amount of work for each runtime). As the profiles below show, the cycles moved from native methods to jitted methods, so comparative analysis will be a bit trickier than normal.
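
(An extracted test of that kind can be as simple as the following sketch — a hypothetical reconstruction, with a fixed seed and iteration count so both runtimes do identical work.)

// Hypothetical standalone repro (not the exact test used here): sort the
// same pseudo-random data a fixed number of times and report wall time.
using System;
using System.Diagnostics;

class Program
{
    static void Main()
    {
        var rng = new Random(42);
        var source = new int[512];
        for (int i = 0; i < source.Length; i++) source[i] = rng.Next();

        var buffer = new int[512];
        var sw = Stopwatch.StartNew();
        for (int iter = 0; iter < 200_000; iter++)
        {
            Array.Copy(source, buffer, source.Length); // reset to unsorted
            Array.Sort(buffer);                        // the code under test
        }
        Console.WriteLine($"Took {sw.ElapsedMilliseconds}ms");
    }
}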

;; 3.1

Samples: 8K of event 'cycles', Event count (approx.): 3065502893
Overhead  Command  Shared Object               Symbol
  45.12%  dotnet   libcoreclr.so               [.] ArrayHelpers<int>::PickPivotAndPartition
  19.97%  dotnet   libcoreclr.so               [.] ArrayHelpers<int>::IntroSort
   4.41%  dotnet   libcoreclr.so               [.] ArrayNative::GetLowerBound
   2.49%  dotnet   [unknown]                   [k] 0xffffc6a925388e04
   1.61%  dotnet   [unknown]                   [k] 0xffffc6a924af2194

;; 5.0

Samples: 11K of event 'cycles', Event count (approx.): 4235451510
Overhead  Command          Shared Object               Symbol
  41.31%  dotnet           [JIT] tid 196901            [.] int32 [System.Private.CoreLib] System.Collections.Generic.GenericArraySortHelper`1[System.Int32]::PickPivotAndPartition(valuetype System.Span`1<!0>)[OptimizedTier1]
  22.38%  dotnet           [JIT] tid 196901            [.] void [System.Private.CoreLib] System.Collections.Generic.GenericArraySortHelper`1[System.Int32]::InsertionSort(valuetype System.Span`1<!0>)[OptimizedTier1]
   3.30%  dotnet           [JIT] tid 196901            [.] void [System.Private.CoreLib] System.Buffer::Memmove(uint8&,uint8&,native uint)[OptimizedTier1]
   2.47%  dotnet           [JIT] tid 196901            [.] void [System.Private.CoreLib] System.Collections.Generic.GenericArraySortHelper`1[System.Int32]:
   2.24%  dotnet           [JIT] tid 196901            [.] void [System.Private.CoreLib] System.Array::Copy(class System.Array,class System.Array,int32)[Op
   1.66%  dotnet           [unknown]                   [k] 0xffffc6a925388e04
   1.08%  dotnet           [unknown]                   [k] 0xffffc6a924af2194

The total cycle difference per perf is about 1.17B, or around 30%; this compares well with the BDN ratios above.

It seems likely that the top two methods in each case perform similar functions, so we'll compare them:

| Method | Version | Cycles |
| --- | --- | --- |
| PickPivotAndPartition | 3.1 | 1.38B |
| PickPivotAndPartition | 5.0 | 1.75B |
| IntroSort | 3.1 | 0.61B |
| IntroSort | 5.0 | 0.95B |

Both methods are slower in 5.0, and the slowdown in IntroSort is relatively more pronounced; collectively they account for about 0.71B cycles, or around 2/3 of the perf difference we observe. The remaining 0.46B cycle diff is harder to pin down.
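
For reference, the arithmetic behind those figures, from the event counts above:

$$4.24\,\mathrm{B} - 3.07\,\mathrm{B} \approx 1.17\,\mathrm{B}, \qquad (1.75-1.38)+(0.95-0.61) = 0.71\,\mathrm{B} \approx \tfrac{2}{3}\times 1.17\,\mathrm{B}, \qquad 1.17-0.71 = 0.46\,\mathrm{B}$$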

Let's focus first on IntroSort... (details forthcoming).

AndyAyersMS commented 4 years ago

[edit: pasted in the right version of C# InsertionSort; see #35297 for how we ended up with this]

Here's the codegen for the innermost loop of InsertionSort, which is the hottest part of the respective methods (for native codegen it gets inlined into IntroSort).

;; InsertionSort, innermost loop

;; 3.1 (via perf; Clang codegen)

;; 0x1e4

 17.05  40cd90:   ldr   w13, [x19, w12, sxtw #2]
        40cd94:   cmp   w11, w13
        40cd98: ↓ b.ge  40cdb0 <ArrayHelpers<int>::IntroSort(int*, int*, int, int, int)+0x204>  // b.tcont

 26.26  40cd9c:   add   w14, w12, #0x1
        40cda0:   sub   w12, w12, #0x1
        40cda4:   cmp   w12, w21
  3.47  40cda8:   str   w13, [x19, w14, sxtw #2]
        40cdac: ↑ b.ge  40cd90 <ArrayHelpers<int>::IntroSort(int*, int*, int, int, int)+0x1e4>  // b.tcont

;; 5.0 (via jit disasm)

                        ;; bbWeight=4    PerfScore 38.00
G_M6103_IG04:
        910043A3          add     x3, fp, #16   // [V00 arg0]    ;; hoistable
        F9400063          ldr     x3, [x3]                       ;; hoistable
        11000405          add     w5, w0, #1                     ;; address mode
        93407CA5          sxtw    x5, w5                         ;;  + iv widen
        D37EF4A5          lsl     x5, x5, #2                     ;; ...
        8B0300A3          add     x3, x5, x3                     ;; ...
        910043A5          add     x5, fp, #16   // [V00 arg0]    ;; redundant
        F94000A5          ldr     x5, [x5]                       ;; redundant
        B86468A4          ldr     w4, [x5, x4]                   ;; redundant around backedge           
        B9000064          str     w4, [x3]
        51000400          sub     w0, w0, #1
                        ;; bbWeight=8    PerfScore 112.00
G_M6103_IG05:
        7100001F          cmp     w0, #0
        5400010B          blt     G_M6103_IG07
                        ;; bbWeight=32    PerfScore 48.00
G_M6103_IG06:
        910043A3          add     x3, fp, #16   // [V00 arg0]    ;; redundant
        F9400063          ldr     x3, [x3]                       ;; redundant
        93407C04          sxtw    x4, w0                         ;; address mode
        D37EF484          lsl     x4, x4, #2                     ;;  + iv widen
        B8646863          ldr     w3, [x3, x4]
        6B01007F          cmp     w3, w1
        54FFFDAC          bgt     G_M6103_IG04

We see that the jitted version emits quite a few more instructions, and a number of things stand out (see the inline annotations: hoistable computation, redundant reloads of the span from the stack frame, and missed address modes / IV widening). For reference, here's the C# source:

private static void InsertionSort(Span<T> keys)
{
    for (int i = 0; i < keys.Length - 1; i++)
    {
        T t = Unsafe.Add(ref MemoryMarshal.GetReference(keys), i + 1);

        int j = i;
        while (j >= 0 && (t == null || LessThan(ref t, ref Unsafe.Add(ref MemoryMarshal.GetReference(keys), j))))
        {
            Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = Unsafe.Add(ref MemoryMarshal.GetReference(keys), j);
            j--;
        }

        Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = t;
    }
}

and here's the native source (simplified since items is known to be NULL):

static void InsertionSort(KIND keys[], KIND items[], int lo, int hi)
{
    int i, j;
    KIND t, ti = NULL;
    for (i = lo; i < hi; i++)
    {
        j = i;
        t = keys[i + 1];
        // if(items != NULL)
        //    ti = items[i + 1];
        while (j >= lo && t < keys[j])
        {
            keys[j + 1] = keys[j];
            // if(items != NULL)
            //    items[j + 1] = items[j];
            j--;
        }
        keys[j + 1] = t;
        // if(items != NULL)
        //    items[j + 1] = ti;
    }
}

Analysis of jitted code

As the annotations above show, the jitted code does 3 loads in the upper block, compared to 1 for the native code, and 1 load in the lower block, compared to none for the native code.

So for the most part we see known issues, and unfortunately none of this is actionable for 5.0. We are considering enhancing the loop optimizer for .NET 6.0, and this loop might make a good study candidate.

cc @dotnet/jit-contrib

Next I'll look at PickPivotAndPartition ...

AndyAyersMS commented 4 years ago

Here's the codegen for the inner loops of PickPivotAndPartition:


;; 3.1 (via perf; Clang)

;; 0x1a0

  2.20  40d118:   cmp   w11, w12
        40d11c: ↓ b.ge  40d16c <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1f4>  // b.tcont
  3.58  40d120:   sbfiz x13, x11, #2, #32
        40d124:   sbfiz x14, x12, #2, #32
  4.48  40d128:   ldr   w15, [x0, x14]
  1.67  40d12c:   ldr   w16, [x0, x13]
  0.62  40d130:   str   w15, [x0, x13]
  1.76  40d134:   str   w16, [x0, x14]

;; 0x1c0

 12.01  40d138:   cmp   w11, w9
        40d13c: ↓ b.ge  40d150 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1d8>  // b.tcont
  3.77  40d140:   add   w11, w11, #0x1
  7.81  40d144:   ldr   w13, [x0, w11, sxtw #2]
        40d148:   cmp   w13, w10
        40d14c: ↑ b.lt  40d138 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1c0>  // b.tstop

;; 0x1d8

 23.09  40d150:   cmp   w12, w2
        40d154: ↑ b.le  40d118 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1a0>
  6.31  40d158:   sub   w12, w12, #0x1
 14.98  40d15c:   ldr   w13, [x0, w12, sxtw #2]
        40d160:   cmp   w10, w13
        40d164: ↑ b.lt  40d150 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1d8>  // b.tstop

 10.99  40d168: ↑ b     40d118 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1a0>

;; 5.0

G_M55484_IG09:
        EB01005F          cmp     x2, x1
        540000A2          bhs     G_M55484_IG10
        91001042          add     x2, x2, #4
        B9400045          ldr     w5, [x2]
        6B05009F          cmp     w4, w5
        54FFFF6C          bgt     G_M55484_IG09
                        ;; bbWeight=0.50 PerfScore 3.25
G_M55484_IG10:
        EB00007F          cmp     x3, x0
        540000A9          bls     G_M55484_IG11
        D1001063          sub     x3, x3, #4
        B9400065          ldr     w5, [x3]
        6B05009F          cmp     w4, w5
        54FFFF6B          blt     G_M55484_IG10
                        ;; bbWeight=4    PerfScore 26.00
G_M55484_IG11:
        EB03005F          cmp     x2, x3
        540000E2          bhs     G_M55484_IG13
                        ;; bbWeight=4    PerfScore 6.00
G_M55484_IG12:
        B9400045          ldr     w5, [x2]
        B9400066          ldr     w6, [x3]
        B9000046          str     w6, [x2]
        B9000065          str     w5, [x3]
        EB03005F          cmp     x2, x3
        54FFFDA3          blo     G_M55484_IG09

Here the codegen is pretty comparable. It's not obvious why the jitted version appears to be slower. It could be an artifact of running perf over the entire process rather than isolating the part where the process is executing the Tier1 codegen. Repeating the perf measurements with COMPlus_ReadyToRun=0 and COMPlus_TieredCompilation=0 seems to bear this out.

AndyAyersMS commented 4 years ago

So the likely suspect is codegen in InsertionSort. There's still a fair chunk of cycles unaccounted for beyond the delta for this method -- it could also be that my repro sample was too short and I'm seeing the impact of time spent in the jit, etc.

I don't see anything we can do here for 5.0 in the jit. We might be able to rewrite the C# for InsertionSort to apply some of the above loop optimizations manually (one possible shape is sketched below), but since this seemingly doesn't impact xArch as badly, perhaps we can hold off on that.
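
For illustration, such a manual rewrite might hoist the interior ref and the length out of the loops, along these lines (a sketch only; this is not the workaround actually tried later, which copies the span to a local instead):

// Sketch of a manual loop optimization (illustrative; not the workaround
// tried below). Hoisting the interior ref means the JIT no longer has to
// re-read the span struct from the stack on every iteration.
private static void InsertionSort(Span<T> keys)
{
    ref T keysRef = ref MemoryMarshal.GetReference(keys);
    int n = keys.Length;
    for (int i = 0; i < n - 1; i++)
    {
        T t = Unsafe.Add(ref keysRef, i + 1);

        int j = i;
        while (j >= 0 && (t == null || LessThan(ref t, ref Unsafe.Add(ref keysRef, j))))
        {
            Unsafe.Add(ref keysRef, j + 1) = Unsafe.Add(ref keysRef, j);
            j--;
        }

        Unsafe.Add(ref keysRef, j + 1) = t!;
    }
}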

AndyAyersMS commented 4 years ago

Thinking about this a bit more, it seems odd that we can't promote the span argument to InsertionSort; doing so would also solve the aliasing problem. Let's see why...

*************** In fgPromoteStructs()
;  V00 arg0           struct <System.Span`1[Int32], 16> ld-addr-op
Not promoting promotable struct local V00, because lvIsParam is true and #fields = 2.

This is a jit limitation. For the ARM64 ABI the struct is passed by value, and the first thing we end up doing is spilling it to the stack:

IN002e: 000008                    str     x0, [fp,#16]  // [V00 arg0]
IN002f: 00000C                    str     x1, [fp,#24]  // [V00 arg0+0x08]

So this likely explains why we didn't see this regression on xArch; for those machines this structure is passed by reference and gets promoted, and so the code becomes more optimizable.
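
For context, Span<T> is a two-field struct, which is where the "#fields = 2" in the dump above comes from. A simplified sketch (illustrative only; the real .NET 5 type used an internal ByReference<T> holder rather than the C# 11 ref field shown here):

// Simplified sketch of the Span<T> layout: one byref plus one length.
// Two fields is exactly what the promotion check quoted above sees.
public readonly ref struct SpanSketch<T>
{
    private readonly ref T _reference; // byref to element 0
    private readonly int _length;      // element count

    public SpanSketch(ref T reference, int length)
    {
        _reference = ref reference;
        _length = length;
    }

    public int Length => _length;
}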

A potential fix / workaround is to manually promote the span. Let me try this.

CarolEidt commented 4 years ago

> This is a jit limitation. For the ARM64 ABI the struct is passed by value, and the first thing we end up doing is spilling it to the stack

This was something I was hoping to fix for 5.0 (https://github.com/dotnet/runtime/pull/39326) but was unable to complete in time.

DrewScoggins commented 4 years ago

@adamsitnik here is a link to an index page with links to every test's full history.

AndyAyersMS commented 4 years ago

Looking at the history for the array case, the lone point in November is the 3.1 baseline; we regressed and then recovered most of it. I can't readily pin down what caused the recovery, as the historical data uses SDK build versions.

Current runs show fairly extreme bimodality. There may be some aspect of the ARM64 microarchitecture that we need to understand better.

[chart: System.Collections.Sort<Int32>.Array benchmark history]

The List case does not appear to have a 3.1-equivalent data point, but it shows similar bimodality.

[chart: System.Collections.Sort<Int32>.List benchmark history]

AndyAyersMS commented 4 years ago

Possible workaround:

@@ -541,18 +543,20 @@ private static void DownHeap(Span<T> keys, int i, int n, int lo)

         private static void InsertionSort(Span<T> keys)
         {
-            for (int i = 0; i < keys.Length - 1; i++)
+            // Workaround for 41741. Copy span to local to allow promotion on ARM64
+            Span<T> _keys = keys;
+            for (int i = 0; i < _keys.Length - 1; i++)
             {
-                T t = Unsafe.Add(ref MemoryMarshal.GetReference(keys), i + 1);
+                T t = Unsafe.Add(ref MemoryMarshal.GetReference(_keys), i + 1);

                 int j = i;
-                while (j >= 0 && (t == null || LessThan(ref t, ref Unsafe.Add(ref MemoryMarshal.GetReference(keys), j))))
+                while (j >= 0 && (t == null || LessThan(ref t, ref Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j))))
                 {
-                    Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = Unsafe.Add(ref MemoryMarshal.GetReference(keys), j);
+                    Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j + 1) = Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j);
                     j--;
                 }

-                Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = t!;
+                Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j + 1) = t!;
             }
         }

Inner loop codegen for ARM64 becomes:

G_M6103_IG04:
        11000446          add     w6, w2, #1
        93407CC6          sxtw    x6, w6
        D37EF4C6          lsl     x6, x6, #2
        8B0100C6          add     x6, x6, x1
        B90000C5          str     w5, [x6]
        51000442          sub     w2, w2, #1
                                                ;; bbWeight=8    PerfScore 32.00
G_M6103_IG05:
        7100005F          cmp     w2, #0
        540000CB          blt     G_M6103_IG07
                                                ;; bbWeight=32    PerfScore 48.00
G_M6103_IG06:
        93407C45          sxtw    x5, w2
        D37EF4A5          lsl     x5, x5, #2
        B8656825          ldr     w5, [x1, x5]
        6B0400BF          cmp     w5, w4
        54FFFE8C          bgt     G_M6103_IG04

This is, modulo IV widening and address-mode formation, not too far off from the native code.

On a local standalone test, perf is now:

| Build | Time (ms) |
| --- | --- |
| 3.1 | 2662 |
| 5.0 RC1 | 3930 |
| Fix | 3438 |

Though the RC1 runs are not very stable and sometimes drop down to ~3400, so there are still things here that are not well understood.

AndyAyersMS commented 4 years ago

BDN reports that the workaround above gets back about half the perf; that dovetails pretty well with my standalone data.

Next steps:

| Method | Job | Runtime | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Array | Job-HKFDOX | .NET Core 3.1 | netcoreapp3.1 | 512 | 26.72 us | 0.262 us | 0.205 us | 26.75 us | 26.29 us | 27.04 us | 0.69 | - | - | - | - |
| Array | Job-WBTPMT | .NET Core 5.0 | netcoreapp5.0 | 512 | 38.91 us | 0.470 us | 0.440 us | 39.11 us | 37.74 us | 39.25 us | 1.00 | - | - | - | - |
| Array | Corerun | custom | net.50 | 512 | 33.03 us | 0.120 us | 0.100 us | 33.01 us | 32.92 us | 33.21 us | 0.85 | - | - | - | - |
AndyAyersMS commented 4 years ago

perf stat sees this as a path-length issue; IPC is pretty even (note this is the entire standalone benchmark, so it includes jitting, etc.)

andy@ubuntu:~/bugs/r41741$ perf stat dotnet exec /home/andy/bugs/r41741/bin/Release/net5.0/r41741.dll
Took 3906ms

 Performance counter stats for 'dotnet exec /home/andy/bugs/r41741/bin/Release/net5.0/r41741.dll':

           5220.00 msec task-clock                #    1.014 CPUs utilized          
               493      context-switches          #    0.094 K/sec                  
                 3      cpu-migrations            #    0.001 K/sec                  
             54074      page-faults               #    0.010 M/sec                  
        7784582680      cycles                    #    1.491 GHz                    
        7419443217      instructions              #    0.95  insn per cycle         
   <not supported>      branches                                                    
         164150759      branch-misses                                               

       5.146804442 seconds time elapsed

       4.846692000 seconds user
       0.373440000 seconds sys

andy@ubuntu:~/bugs/r41741$ perf stat dotnet exec /home/andy/bugs/r41741/bin/Release/netcoreapp3.1/r41741.dll
Took 2869ms

 Performance counter stats for 'dotnet exec /home/andy/bugs/r41741/bin/Release/netcoreapp3.1/r41741.dll':

           4061.60 msec task-clock                #    1.004 CPUs utilized          
               402      context-switches          #    0.099 K/sec                  
                 8      cpu-migrations            #    0.002 K/sec                  
             53856      page-faults               #    0.013 M/sec                  
        6050121690      cycles                    #    1.490 GHz                    
        5198855530      instructions              #    0.86  insn per cycle         
   <not supported>      branches                                                    
         118657198      branch-misses                                               

       4.047158027 seconds time elapsed

       3.762541000 seconds user
       0.296200000 seconds sys

andy@ubuntu:~/bugs/r41741$ ./repro.sh
Took 3469ms
 Performance counter stats for 'corerun /home/andy/bugs/r41741/r41741.dll':

           4926.49 msec task-clock                #    0.970 CPUs utilized          
              1146      context-switches          #    0.233 K/sec                  
                10      cpu-migrations            #    0.002 K/sec                  
             54949      page-faults               #    0.011 M/sec                  
        7307046870      cycles                    #    1.483 GHz                    
        6407127469      instructions              #    0.88  insn per cycle         
   <not supported>      branches                                                    
         147762325      branch-misses                                               

       5.076936778 seconds time elapsed

       4.498607000 seconds user
       0.433794000 seconds sys

so you would think differential profiling should be able to pin down the hotspots. Per the counts above, 5.0 executes about 43% more instructions (7.42B vs. 5.20B) at slightly better IPC (0.95 vs. 0.86), so the regression really does look like extra path length.

Branch misses are also up (164M vs. 119M) but don't seem to impact IPC. Not sure what to make of this.

AndyAyersMS commented 4 years ago

Have been looking at method and loop alignment impact. Here's a chart showing what happens as we vary the mod-32 alignment of the top of the Tier1 InsertionSort inner loop from 0 to 28, on a couple of different processors, for a standalone test (so profiling includes startup and tier-0; data gathered via perf stat -r 5; data relative to 0 mod 32). The green line is time, yellow is cycles, the grey dotted line is instructions executed, and blue is branch mispredicts.

[charts: time, cycles, instructions, and branch misses vs. inner-loop alignment (mod 32), one panel per processor]

The slight upslope in instructions (grey line) is the impact of the extra nop padding (here placed just after the prolog): as we pad more, we execute more instructions overall. But the total instruction count is pretty well constant.

The cycles vary more widely, and the fluctuation in cycles is moderately well correlated with branch misses; the impact of bad alignment varies. Note this isn't a completely pure experiment, as modifying the alignment of one method may well impact the alignment of others. But we can see that the fluctuations here are in the 3-8% range.

AndyAyersMS commented 4 years ago

For the remainder of the perf loss, I suspect time may be going into one of the unmapped stubs.

Per dotnet/coreclr#26897, I am going to build perf so I can look at annotations on jitted code.

AndyAyersMS commented 4 years ago

Haven't had any luck yet with a locally built version of perf.

AndyAyersMS commented 4 years ago

Marking as blocking release, at least until I can get a clearer picture of where the perf is going in 5.0. Currently I'm having trouble getting accurate profiles.

Using perf inject --jit to pull in jitdump data should help, but it is not yet working reliably for me.

AndyAyersMS commented 4 years ago

Took another look now that perf annotate can annotate jitted code (this needed perf built from recent Linux sources). It seems like on the Qualcomm machine all of the lost perf is in InsertionSort; here are the relevant annotations (cycles):

;; 3.1 "insertion sort" inner loop (native)

   94328597 │40cc90:   ldr   w13, [x19,w12,sxtw #2]
   49260763 │40cc94:   cmp   w11, w13
   57762715 │40cc98: ↓ b.ge  40ccb0
   77756830 │40cc9c:   add   w14, w12, #0x1
   47159165 │40cca0:   sub   w12, w12, #0x1
   44848974 │40cca4:   cmp   w12, w21
   25660062 │40cca8:   str   w13, [x19,w14,sxtw #2]
   17110607 │40ccac: ↑ b.ge  40cc90
  442575592 │40ccb0:   add   w12, w12, #0x1    // branch latency skid?

;; 5.0 "insertion sort" inner loop (jitted)

  497165872  80:   add  x3, x29, #0x10
   32905197  84:   ldr  x3, [x3]      
             88:   add  w5, w0, #0x1  
   23609205  8c:   sxtw x5, w5        
   67146280  90:   lsl  x5, x5, #2    
   67408728  94:   add  x3, x5, x3    
             98:   add  x5, x29, #0x10
   30088023  9c:   ldr  x5, [x5]      
  133792275  a0:   ldr  w4, [x5,x4]   
             a4:   str  w4, [x3]      
   67686545  a8:   sub  w0, w0, #0x1  
   51829740  ac:   cmp  w0, #0x0      
   63269723  b0: ↓ b.lt d0
   25938857  b4:   add  x3, x29, #0x10
  116122893  b8:   ldr  x3, [x3]      
   58047416  bc:   sxtw x4, w0        
    4154869  c0:   lsl  x4, x4, #2    
   35269291  c4:   ldr  w3, [x3,x4]   
  162843745  c8:   cmp  w3, w1        
   11409506  cc: ↑ b.gt 80
 1566344914  d0:   add  x3, x29, #0x10    // branch latency skid?

Direct comparison is a bit tricky as the loops are rotated differently: each is basically in two parts, and the top of the native version matches the bottom of the jitted version, and vice versa.

Short of the source hack above to copy the span, there is not much we can do for 5.0. Going to move this to 6.0.

JulieLeeMSFT commented 4 years ago

Does this happen mainly on Linux? Windows results show Faster.

AndyAyersMS commented 4 years ago

Linux arm64. Probably Windows arm64 too, but we don't track that.

AndyAyersMS commented 3 years ago

It looks like we regained the missing perf with #43870.

6.0 data now shows about 16 us, which is roughly where we were in the 3.1 timeframe:

[chart: System.Collections.Sort<Int32>.Array benchmark history through 6.0]

The List case shows similar improvements:

[chart: System.Collections.Sort<Int32>.List benchmark history through 6.0]

Going to close this one as fixed.

danmoseley commented 3 years ago

Those graphs also appear to show two distinct improvements in reducing noise - presumably the loop alignment work? If so, that's nice to see.

AndyAyersMS commented 3 years ago

This is ARM64 -- I don't think we do anything special for loops on this platform yet. Not really sure why the noise level dropped here.

danmoseley commented 3 years ago

I see. @adamsitnik could it be BDN or test changes that randomize data alignment, perhaps? @kunalspathak

kunalspathak commented 3 years ago

We have the "randomize data alignment" ability but haven't enabled it for any benchmark yet. Also, as Andy pointed out, loop alignment techniques are currently only done for xarch.

danmoseley commented 3 years ago

Interesting, unexplained then. Thanks.

kunalspathak commented 3 years ago

Once I have the Arm64 historical data, I will try to dig in and narrow down the changes.