[ARM64] Possible perf regression: slicing

After running benchmarks for 3.1 vs 5.0 using "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found few regressions related to slicing.

It looks like these are ARM64 specific regressions, I was not able to reproduce it for ARM (the 32-bit variant).

Repro

git clone https://github.com/dotnet/performance.git
py ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'System.Memory.Slice*'

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 16.04
Unknown processor
  [Host]     : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-PVNQZA : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-PXIHWO : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT

Type	Method	Toolchain	Mean	Ratio
Slice<Byte>	SpanStart	netcoreapp3.1	3.831 ns	1.00
Slice<Byte>	SpanStart	netcoreapp5.0	2.550 ns	0.67

Slice<String>	SpanStart	netcoreapp3.1	11.526 ns	1.00
Slice<String>	SpanStart	netcoreapp5.0	16.482 ns	1.43

Slice<Byte>	SpanStartLength	netcoreapp3.1	3.782 ns	1.00
Slice<Byte>	SpanStartLength	netcoreapp5.0	3.202 ns	0.85

Slice<String>	SpanStartLength	netcoreapp3.1	11.720 ns	1.00
Slice<String>	SpanStartLength	netcoreapp5.0	16.823 ns	1.44

Slice<Byte>	ReadOnlySpanStart	netcoreapp3.1	3.801 ns	1.00
Slice<Byte>	ReadOnlySpanStart	netcoreapp5.0	2.867 ns	0.75

Slice<String>	ReadOnlySpanStart	netcoreapp3.1	9.144 ns	1.00
Slice<String>	ReadOnlySpanStart	netcoreapp5.0	16.039 ns	1.75

Slice<Byte>	ReadOnlySpanStartLength	netcoreapp3.1	3.779 ns	1.00
Slice<Byte>	ReadOnlySpanStartLength	netcoreapp5.0	3.156 ns	0.83

Slice<String>	ReadOnlySpanStartLength	netcoreapp3.1	9.279 ns	1.00
Slice<String>	ReadOnlySpanStartLength	netcoreapp5.0	16.418 ns	1.77

Slice<Byte>	MemoryStart	netcoreapp3.1	3.779 ns	1.00
Slice<Byte>	MemoryStart	netcoreapp5.0	6.418 ns	1.70

Slice<String>	MemoryStart	netcoreapp3.1	12.952 ns	1.00
Slice<String>	MemoryStart	netcoreapp5.0	25.550 ns	1.97

Slice<Byte>	MemoryStartSpan	netcoreapp3.1	6.416 ns	1.00
Slice<Byte>	MemoryStartSpan	netcoreapp5.0	10.515 ns	1.64

Slice<String>	MemoryStartSpan	netcoreapp3.1	24.265 ns	1.00
Slice<String>	MemoryStartSpan	netcoreapp5.0	33.036 ns	1.36

Slice<Byte>	MemoryStartLength	netcoreapp3.1	3.805 ns	1.00
Slice<Byte>	MemoryStartLength	netcoreapp5.0	5.899 ns	1.55

Slice<String>	MemoryStartLength	netcoreapp3.1	12.285 ns	1.00
Slice<String>	MemoryStartLength	netcoreapp5.0	18.409 ns	1.50

Slice<Byte>	MemoryStartLengthSpan	netcoreapp3.1	6.245 ns	1.00
Slice<Byte>	MemoryStartLengthSpan	netcoreapp5.0	9.975 ns	1.60

Slice<String>	MemoryStartLengthSpan	netcoreapp3.1	23.963 ns	1.00
Slice<String>	MemoryStartLengthSpan	netcoreapp5.0	31.040 ns	1.30

Slice<Byte>	ReadOnlyMemoryStart	netcoreapp3.1	3.807 ns	1.00
Slice<Byte>	ReadOnlyMemoryStart	netcoreapp5.0	6.394 ns	1.68

Slice<String>	ReadOnlyMemoryStart	netcoreapp3.1	9.371 ns	1.00
Slice<String>	ReadOnlyMemoryStart	netcoreapp5.0	22.140 ns	2.36

Slice<Byte>	ReadOnlyMemoryStartSpan	netcoreapp3.1	6.379 ns	1.00
Slice<Byte>	ReadOnlyMemoryStartSpan	netcoreapp5.0	9.150 ns	1.43

Slice<String>	ReadOnlyMemoryStartSpan	netcoreapp3.1	22.247 ns	1.00
Slice<String>	ReadOnlyMemoryStartSpan	netcoreapp5.0	32.229 ns	1.45

Slice<Byte>	ReadOnlyMemoryStartLength	netcoreapp3.1	3.733 ns	1.00
Slice<Byte>	ReadOnlyMemoryStartLength	netcoreapp5.0	6.005 ns	1.61

Slice<String>	ReadOnlyMemoryStartLength	netcoreapp3.1	8.833 ns	1.00
Slice<String>	ReadOnlyMemoryStartLength	netcoreapp5.0	16.008 ns	1.81

Slice<Byte>	ReadOnlyMemoryStartLengthSpan	netcoreapp3.1	6.087 ns	1.00
Slice<Byte>	ReadOnlyMemoryStartLengthSpan	netcoreapp5.0	9.430 ns	1.55

Slice<String>	ReadOnlyMemoryStartLengthSpan	netcoreapp3.1	22.489 ns	1.00
Slice<String>	ReadOnlyMemoryStartLengthSpan	netcoreapp5.0	29.502 ns	1.31

Slice<Byte>	MemorySpanStart	netcoreapp3.1	8.985 ns	1.00
Slice<Byte>	MemorySpanStart	netcoreapp5.0	11.703 ns	1.31

Slice<String>	MemorySpanStart	netcoreapp3.1	23.013 ns	1.00
Slice<String>	MemorySpanStart	netcoreapp5.0	23.544 ns	1.02

Slice<Byte>	MemorySpanStartLength	netcoreapp3.1	8.289 ns	1.00
Slice<Byte>	MemorySpanStartLength	netcoreapp5.0	9.989 ns	1.21

Slice<String>	MemorySpanStartLength	netcoreapp3.1	23.611 ns	1.00
Slice<String>	MemorySpanStartLength	netcoreapp5.0	23.401 ns	0.99

Slice<Byte>	ReadOnlyMemorySpanStart	netcoreapp3.1	8.519 ns	1.00
Slice<Byte>	ReadOnlyMemorySpanStart	netcoreapp5.0	11.698 ns	1.37

Slice<String>	ReadOnlyMemorySpanStart	netcoreapp3.1	19.716 ns	1.00
Slice<String>	ReadOnlyMemorySpanStart	netcoreapp5.0	22.038 ns	1.12

Slice<Byte>	ReadOnlyMemorySpanStartLength	netcoreapp3.1	6.770 ns	1.00
Slice<Byte>	ReadOnlyMemorySpanStartLength	netcoreapp5.0	10.912 ns	1.61

Slice<String>	ReadOnlyMemorySpanStartLength	netcoreapp3.1	21.624 ns	1.00
Slice<String>	ReadOnlyMemorySpanStartLength	netcoreapp5.0	22.292 ns	1.03

@kunalspathak is there any chance you could take a look at the produced assembly code and verify if this is an actual regression in code gen or not?

category:cq theme:ssa skill-level:expert cost:large

@kunalspathak please look into this.

CC @dotnet/jit-contrib

Noting down some observations, not necessarily related to the regression. For Slice.ReadOnlySpanStartLength() , we don't inline calls to Slice() because of following code: https://github.com/dotnet/runtime/blob/6072e4d3a7a2a1493f514cdf4be75a3d56580e84/src/coreclr/src/jit/importer.cpp#L1917

Another observation is in .NET 3.1, we didn't JIT SequenceEqual(), but I see that we JIT that method in .NET 5. It is contrary to the fact that it was marked with AggresiveOptimization in the past and was removed in https://github.com/dotnet/runtime/pull/32371 to get it JIT during R2R.

The benchmarks that regressed operates on Slice<string> and Slice<byte>. The calls to Slice<string> don't get inlined as expected, but the ones for Slice<byte> gets inlined. My following analysis is for benchmarks for Slice<string> particularly ReadOnlyMemoryStart() but most likely it is the cause for other benchmarks for Slice<string>. I will share my findings for benchmarks under Slice<byte> separately, once I do that analysis.

The jit assembly for Slice() is unchanged, but the regression is inside the benchmark code ReadOnlyMemoryStart(). The underlying issue is that in .NET 3.1, we do a null check for obj and if it is null, call HELPER_RUNTIMELOOKUP. If not, we will just operate on that object. This is how we generated code in .NET 3.1 for such checks:

G_M18749_IG07:
            add     x0, fp, #32 // [V01 loc0]
            mov     w2, #5
            bl      ReadOnlyMemory`1:Slice(int):struct:this
            str     x0, [fp,#16]    // [V02 loc1]
            str     x1, [fp,#24]    // [V02 loc1+0x08]
            mov     x0, x20
            ldr     x2, [x21,#16]
            cbnz    x2, G_M18749_IG08    ; <----------- This condition checks if obj != null
            movz    x1, #0xd1ffab1e
            movk    x1, #0xd1ffab1e LSL #16
            movk    x1, #0xd1ffab1e LSL #32
            bl      CORINFO_HELP_RUNTIMEHANDLE_CLASS
            mov     x2, x0

G_M18749_IG08:
            add     x1, fp, #16 // [V02 loc1]
            mov     x0, x2
            bl      Slice`1:Consume(byref)
            mov     x0, x20
            ldr     x1, [x21,#24]
            cbnz    x1, G_M18749_IG09
            movz    x1, #0xd1ffab1e
            movk    x1, #0xd1ffab1e LSL #16
            movk    x1, #0xd1ffab1e LSL #32
            bl      CORINFO_HELP_RUNTIMEHANDLE_CLASS
            mov     x1, x0

For happy path (where obj != null), we would jump to G_M18749_IG08 and use the obj. Otherwise, we would call CORINFO_HELP_RUNTIMEHANDLE_CLASS. However, in .NET 5, we have flipped the condition of this check:

G_M6359_IG14:
            add     x0, fp, #64 // [V01 loc0]
            mov     w2, #5
            bl      ReadOnlyMemory`1:Slice(int):ReadOnlyMemory`1:this
            str     x0, [fp,#48]    // [V02 loc1]
            str     x1, [fp,#56]    // [V02 loc1+0x08]
            mov     x0, x20
            ldr     x1, [x21,#24]
            cbz     x1, G_M6359_IG16    ; <----------- This condition now checks if obj == null
                        ;; bbWeight=1    PerfScore 8.50
G_M6359_IG15:
            str     x1, [fp,#32]    // [V14 tmp11]
            b       G_M6359_IG17
                        ;; bbWeight=0.25 PerfScore 0.50
G_M6359_IG16:
            movz    x1, #0xd1ffab1e
            movk    x1, #0xd1ffab1e LSL #16
            movk    x1, #0xd1ffab1e LSL #32
            bl      CORINFO_HELP_RUNTIMEHANDLE_CLASS
            str     x0, [fp,#32]    // [V14 tmp11]
                        ;; bbWeight=0.25 PerfScore 0.88
G_M6359_IG17:
            add     x1, fp, #48 // [V02 loc1]
            ldr     x0, [fp,#32]    // [V14 tmp11]
            bl      Slice`1:Consume(byref)
            mov     x0, x20
            ldr     x1, [x21,#32]
            cbz     x1, G_M6359_IG19

Now, in happy path, we check for obj == null condition and if true, would call CORINFO_HELP_RUNTIMEHANDLE_CLASS, otherwise, we would goto G_M6359_IG15 and do a jump to G_M6359_IG17. Not only we introduce extra jumps in .NET 5, we also have to spill the values on stack. The local frame size in .NET 3.1 is just 40 vs. 168 for .NET 5.0.

The branches in G_M6359_IG15 and G_M6359_IG16 are weighted same as 0.25. As per @AndyAyersMS , I tried to bump up the weight of non-call branch higher, but that doesn't change the condition. I will further see what else we can do here.

Edit:

In .NET 3.1, the condition would just have 1 arm and so it gets flipped, however in .NET 5, we have 2 arms on that condition, one coming from "expandable generic dictionaries" work that we did in .NET 5, so most like given that, bumping the non-call branch won't make a difference.

.NET 3.1 tree:

               [000163] ------------              *  STMT      void  (IL   ???...  ???)
               [000162] -AC-G-------              \--*  ASG       long  
               [000161] D------N----                 +--*  LCL_VAR   long   V13 tmp9         
               [000160] --C-G-------                 \--*  QMARK     long  
               [000157] Q-----------    if              +--*  NE        int   
               [000150] ------------                    |  +--*  LCL_VAR   long   V13 tmp9         
               [000156] ------------                    |  \--*  CNS_INT   long   0
               [000159] --C-G-------    if              \--*  COLON     long  
               [000155] --C-G------- else                  +--*  CALL help long   HELPER.CORINFO_HELP_RUNTIMEHANDLE_CLASS
               [000138] ------------ arg0                  |  +--*  LCL_VAR   long   V12 tmp8         
               [000152] ------------ arg1                  |  \--*  CNS_INT(h) long   0xd1ffab1e token
               [000158] ------------ then                  \--*  NOP       void

.NET 5.0 tree

               [000109] -AC-G+------              *  ASG       long  
               [000108] D----+-N----              +--*  LCL_VAR   long   V10 tmp7         
               [000107] --C-G+------              \--*  QMARK     long  
               [000097] J----+-N----    if           +--*  EQ        int   
               [000093] n----+------                 |  +--*  IND       long  
               [000092] -----+------                 |  |  \--*  ADD       long  
               [000090] #----+------                 |  |     +--*  IND       long  
               [000089] #----+------                 |  |     |  \--*  IND       long  
               [000088] -----+------                 |  |     |     \--*  ADD       long  
               [000086] -----+------                 |  |     |        +--*  LCL_VAR   long   V09 tmp6         
               [000087] -----+------                 |  |     |        \--*  CNS_INT   long   48
               [000091] -----+------                 |  |     \--*  CNS_INT   long   24
               [000096] -----+------                 |  \--*  CNS_INT   long   0
               [000106] --C-G+?-----    if           \--*  COLON     long  
               [000095] --C-G+?----- else               +--*  CALL help long   HELPER.CORINFO_HELP_RUNTIMEHANDLE_CLASS
               [000085] -----+?----- arg0 in x0         |  +--*  LCL_VAR   long   V09 tmp6         
               [000094] -----+?----- arg1 in x1         |  \--*  CNS_INT(h) long   0x7ffb0671e8e8 token
               [000098] n----+?----- then               \--*  IND       long  
               [000099] -----+?-----                       \--*  ADD       long  
               [000100] #----+?-----                          +--*  IND       long  
               [000101] #----+?-----                          |  \--*  IND       long  
               [000102] -----+?-----                          |     \--*  ADD       long  
               [000103] -----+?-----                          |        +--*  LCL_VAR   long   V09 tmp6         
               [000104] -----+?-----                          |        \--*  CNS_INT   long   48
               [000105] -----+?-----                          \--*  CNS_INT   long   24

Here is my analysis for benchmarks in Slice<byte>. The regression is coming from the inlined Slice<byte>() and the ctor for ReadOnlyMemory<byte> present inside Slice() as seen here.

Below is the assembly code for

Consume(memory.Slice(Size / 2)); // private const int Size = 10;

In .NET 3.1, here is disassembly. IG06 does this check and if within bounds, proceed to creating the ReadOnlyMemory() object in IG07 as seen here.


G_M45388_IG06:
        B94037B3          ldr     w19, [fp,#52] // [V01 loc0+0x0c]
        7100167F          cmp     w19, #5
        540005A3          blo     G_M45388_IG12

G_M45388_IG07:
        F94017B4          ldr     x20, [fp,#40] // [V01 loc0]
        AA1403E0          mov     x0, x20
        51001675          sub     w21, w19, #5
        2A1503E1          mov     w1, w21
        528000A2          mov     w2, #5
        F9000FA0          str     x0, [fp,#24]  // [V21 tmp18] ; this._object = _object
        B90023A2          str     w2, [fp,#32]  // [V22 tmp19] ; this._index = _index
        B90027A1          str     w1, [fp,#36]  // [V23 tmp20] ; this._length = _length
        910063A0          add     x0, fp, #24   // [V02 loc1]
        94000000          bl      Slice`1:Consume(byref)
        7100167F          cmp     w19, #5
        54000483          blo     G_M45388_IG13

In .NET 5, here is disassembly. There are more memory access in IG06 (4 vs. 7).

G_M6359_IG05:
        B9402FA0          ldr     w0, [fp,#44]  // [V23 tmp20]
        7100141F          cmp     w0, #5
        540006C3          blo     G_M6359_IG11
                        ;; bbWeight=0.50 PerfScore 1.75

G_M6359_IG06:
        B9402BA0          ldr     w0, [fp,#40]  // [V22 tmp19]
        11001400          add     w0, w0, #5  ; <-- In .NET 3.1, we constant prop value of _index and turns this to " = 5"
        B9402FA1          ldr     w1, [fp,#44]  // [V23 tmp20] ; <-- could have skipped because we already load it in IG05, the way it is skipped in .NET 3.1
        51001421          sub     w1, w1, #5
        F94013A2          ldr     x2, [fp,#32]  // [V21 tmp18]
        F9000BA2          str     x2, [fp,#16]  // [V24 tmp21] ; this._object = _object
        B9001BA0          str     w0, [fp,#24]  // [V25 tmp22] ; this._index = _index
        B9001FA1          str     w1, [fp,#28]  // [V26 tmp23] ; this._length = _length
        910043A0          add     x0, fp, #16   // [V02 loc1]
        94000000          bl      Slice`1:Consume(byref)
        B9402FA0          ldr     w0, [fp,#44]  // [V23 tmp20] ; <-- could have been skipped
        7100141F          cmp     w0, #5
        54000523          blo     G_M6359_IG11

I believe some of them are happening because we fail to constant propagate the value of _index which is zero. Instead we load the value from stack and add 5 to it. In .NET 3.1, we detected this and converted it to assigment.

.NET 3.1 dump:


***** BB05, stmt 11 (before)
N005 (  5,  6) [000118] -A------R---              *  ASG       int   
N004 (  1,  1) [000117] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2
N003 (  5,  6) [000073] ------------              \--*  ADD       int   
N001 (  3,  4) [000070] ------------                 +--*  LCL_FLD   int    V01 loc0         u:6[+8] Fseq[_index]
N002 (  1,  1) [000072] ------------                 \--*  CNS_INT   int    5

  VNApplySelectors:
    VNForHandle(_index) is $142, fieldType is int
      AX2: $142 != $143 ==> select([$283]store($281, $143, $2c1), $142) ==> select($281, $142).
      AX1: select([$241]store($281, $142, $40), $142) ==> $40.
    VNForMapSelect($380, $142):int returns $40 {IntCns 0}
  VNApplySelectors:
    VNForHandle(_index) is $142, fieldType is int
    VNForMapSelect($380, $142):int returns $40 {IntCns 0}
N001 [000070]   LCL_FLD   V01 loc0         u:6[+8] Fseq[_index] => $40 {IntCns 0}
N002 [000072]   CNS_INT   5 => $42 {IntCns 5}
N003 [000073]   ADD       => $42 {IntCns 5}
N004 [000117]   LCL_VAR   V07 tmp4         d:2 => $42 {IntCns 5}
N005 [000118]   ASG       => $42 {IntCns 5}

***** BB05, stmt 11 (after)
N005 (  5,  6) [000118] -A------R---              *  ASG       int    $42
N004 (  1,  1) [000117] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2 $42
N003 (  5,  6) [000073] ------------              \--*  ADD       int    $42
N001 (  3,  4) [000070] ------------                 +--*  LCL_FLD   int    V01 loc0         u:6[+8] Fseq[_index] $40
N002 (  1,  1) [000072] ------------                 \--*  CNS_INT   int    5 $42

.NET 5.0 dump:


***** BB04, STMT00019(before)
N005 (  3,  4) [000092] -A------R---              *  ASG       int   
N004 (  1,  1) [000091] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2
N003 (  3,  4) [000058] ------------              \--*  ADD       int   
N001 (  1,  1) [000056] ------------                 +--*  LCL_VAR   int    V10 tmp7         
N002 (  1,  2) [000057] ------------                 \--*  CNS_INT   int    5

N001 [000056]   LCL_VAR   V10 tmp7          => $282 {282}
N002 [000057]   CNS_INT   5 => $42 {IntCns 5}
N003 [000058]   ADD       => $206 {ADD($42, $282)}
N004 [000091]   LCL_VAR   V07 tmp4         d:2 => $206 {ADD($42, $282)}
N005 [000092]   ASG       => $206 {ADD($42, $282)}

***** BB04, STMT00019(after)
N005 (  3,  4) [000092] -A------R---              *  ASG       int    $206
N004 (  1,  1) [000091] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2 $206
N003 (  3,  4) [000058] ------------              \--*  ADD       int    $206
N001 (  1,  1) [000056] ------------                 +--*  LCL_VAR   int    V10 tmp7          $282
N002 (  1,  2) [000057] ------------                 \--*  CNS_INT   int    5 $42

I am still trying to see why we fail to detect _index being constant.

@CarolEidt pointed out that in .NET 3.1, this._index remains LCL_FLD and gets to SSA form while in .NET 5, we do struct promotion, but because of this condition, we don't convert it to SSA form. Because of this, we don't do constant propagation. We need to support multi-reg defs for SSA and might not be something we want to fix as part of perf regression fix. We should port this to .NET 6 (at least for benchmarks under Slice<byte>()).

Also, it turns out that the fix for Slice<string>() needs more thinking given the introduction of runtime lookup in .NET 5. We need to have a broader fix for this which should be done in .NET 6

Here are the benchmarks that regressed (taken from the description but filtered just the ones that regressed)

Class	Method	.NET5 / .NET 3.0 ratio
Slice	ReadOnlyMemoryStart	2.36
Slice	MemoryStart	1.97
Slice	ReadOnlyMemoryStartLength	1.81
Slice	ReadOnlySpanStartLength	1.77
Slice	ReadOnlySpanStart	1.75
Slice	MemoryStart	1.7
Slice	ReadOnlyMemoryStart	1.68
Slice	MemoryStartSpan	1.64
Slice	ReadOnlyMemoryStartLength	1.61
Slice	ReadOnlyMemorySpanStartLength	1.61
Slice	MemoryStartLengthSpan	1.6
Slice	MemoryStartLength	1.55
Slice	ReadOnlyMemoryStartLengthSpan	1.55
Slice	MemoryStartLength	1.5
Slice	ReadOnlyMemoryStartSpan	1.45
Slice	SpanStartLength	1.44
Slice	SpanStart	1.43
Slice	ReadOnlyMemoryStartSpan	1.43
Slice	ReadOnlyMemorySpanStart	1.37
Slice	MemoryStartSpan	1.36
Slice	ReadOnlyMemoryStartLengthSpan	1.31
Slice	MemorySpanStart	1.31
Slice	MemoryStartLengthSpan	1.3
Slice	MemorySpanStartLength	1.21
Slice	ReadOnlyMemorySpanStart	1.12
Slice	ReadOnlyMemorySpanStartLength	1.03
Slice	MemorySpanStart	1.02

Actionable item: Need to double check the numbers if we did anything in .NET 6.0 to address https://github.com/dotnet/runtime/issues/41704#issuecomment-687392166.

Here is my analysis after comparing the assembly of .NET 6 vs. .NET 3.1. Here is the diff for ReadOnlyMemory.Slice<byte> benchmark.

No zero register used

            mov     x1, #0
            str     x1, [fp,#48]    // [V133 tmp130]
            str     w1, [fp,#56]    // [V134 tmp131]
            str     w1, [fp,#60]    // [V135 tmp132]
            b       G_M31903_IG05

Update: This PR is out - #52269

Redundant ldr from same location

I noticed redundant ldr [fp, #48] and they could have been replaced with mov x1, x19. It can vary from case to case depending on the register availability, but clearly in below case, CSE could have avoided extra load from memory.

            blo     G_M31903_IG88
            ldr     x19, [fp,#48]   // [V133 tmp130]
            ldr     w1, [fp,#56]    // [V134 tmp131]
            add     w20, w1, #5
            ldr     w1, [fp,#60]    // [V135 tmp132]
            sub     w21, w1, #5
            ldr     x1, [fp,#48]    // [V133 tmp130]
            cbz     x1, G_M31903_IG13

Update: Related issue: https://github.com/dotnet/runtime/issues/6761

Repeated loading of `._object` field

We load value of ._object again and again in .NET6

            ldr     x19, [fp,#48]   // [V133 tmp130]
            ...
            ...
            str     x19, [fp,#32]   // [V136 tmp133]
            ...
            ; occurs 16 times

But in .NET 3.1 we load it in x20 once and do mov x21, x20 to store it in final destination, although instead of mov, we could have just done str x20, [fp, #24].

            ldr     x20, [fp,#40]   // [V01 loc0]
            mov     x21, x20
            ...
            str     x21, [fp,#24]   // [V149 tmp146]
            ...
            ...
            mov     x21, x20
            ...
            str     x21, [fp,#24]   // [V149 tmp146]
            ...
            occurs 16 times.

Constant propagation

Next, we still have problem of not const proping 5 in .NET 6 as discussed in https://github.com/dotnet/runtime/issues/41704#issuecomment-687392166.

            ldr     w1, [fp,#56]    // [V134 tmp131]
            add     w20, w1, #5

            ...
            occurs 16 times.

But in .NEt3.1, we const prop and directly move 5:

            mov     w0, #5
            str     x21, [fp,#24]   // [V149 tmp146]  ; ._object
            str     w0, [fp,#32]    // [V150 tmp147]

Repeatative `sub` operation

Lastly, in .NET 6, we do not CSE _length - start and do sub every time.

            ldr     w1, [fp,#60]    // [V135 tmp132]
            sub     w21, w1, #5
            ...
            str     w21, [fp,#44]   // [V138 tmp135]
            ...
            ...
            sub     w21, w1, #5
            ...
            str     w21, [fp,#44]   // [V138 tmp135]
            ... 
            occurs 15 times

But in .NET3.1, we do subtraction once and CSE the result.

            sub     w22, w19, #5
            mov     w23, w22
            ...
            str     w23, [fp,#36]   // [V151 tmp148]
            ...
            ...
            mov     w23, w22
            ...
            str     w23, [fp,#36]   // [V151 tmp148]
            ...
            15 times

I will try to investigate little more and open separate issues for each of them.

So most of the regression is happening due to CSE not working for add and sub. During value numbering, we name V22 with unique VN making it not possible to CSE Add(V22, 5) operation.


***** BB05, STMT00027(before)
N005 (  3,  4) [000138] -A------R---              *  ASG       int   
N004 (  1,  1) [000137] D------N----              +--*  LCL_VAR   int    V08 tmp5         d:2
N003 (  3,  4) [000069] ------------              \--*  ADD       int   
N001 (  1,  1) [000067] ------------                 +--*  LCL_VAR   int    V22 tmp19        
N002 (  1,  2) [000068] ------------                 \--*  CNS_INT   int    5

N001 [000067]   LCL_VAR   V22 tmp19         => $242 {242}
N002 [000068]   CNS_INT   5 => $42 {IntCns 5}
N003 [000069]   ADD       => $206 {ADD($42, $242)}
N004 [000137]   LCL_VAR   V08 tmp5         d:2 => $206 {ADD($42, $242)}
N005 [000138]   ASG       => $206 {ADD($42, $242)}

***** BB05, STMT00027(after)
N005 (  3,  4) [000138] -A------R---              *  ASG       int    $206
N004 (  1,  1) [000137] D------N----              +--*  LCL_VAR   int    V08 tmp5         d:2 $206
N003 (  3,  4) [000069] ------------              \--*  ADD       int    $206
N001 (  1,  1) [000067] ------------                 +--*  LCL_VAR   int    V22 tmp19         $242
N002 (  1,  2) [000068] ------------                 \--*  CNS_INT   int    5 $42

====================================================================================================================================================================

***** BB12, STMT00048(before)
N005 (  3,  4) [000250] -A------R---              *  ASG       int   
N004 (  1,  1) [000249] D------N----              +--*  LCL_VAR   int    V16 tmp13        d:2
N003 (  3,  4) [000181] ------------              \--*  ADD       int   
N001 (  1,  1) [000179] ------------                 +--*  LCL_VAR   int    V22 tmp19        
N002 (  1,  2) [000180] ------------                 \--*  CNS_INT   int    5

N001 [000179]   LCL_VAR   V22 tmp19         => $24d {24d}
N002 [000180]   CNS_INT   5 => $42 {IntCns 5}
N003 [000181]   ADD       => $20f {ADD($42, $24d)}
N004 [000249]   LCL_VAR   V16 tmp13        d:2 => $20f {ADD($42, $24d)}
N005 [000250]   ASG       => $20f {ADD($42, $24d)}

***** BB12, STMT00048(after)
N005 (  3,  4) [000250] -A------R---              *  ASG       int    $20f
N004 (  1,  1) [000249] D------N----              +--*  LCL_VAR   int    V16 tmp13        d:2 $20f
N003 (  3,  4) [000181] ------------              \--*  ADD       int    $20f
N001 (  1,  1) [000179] ------------                 +--*  LCL_VAR   int    V22 tmp19         $24d
N002 (  1,  2) [000180] ------------                 \--*  CNS_INT   int    5 $42

The reason is what I mentioned in https://github.com/dotnet/runtime/issues/41704#issuecomment-687392166. Since it is related to the struct multireg return, I would assign this to @sandreenko .

Multi-def SSA is a large work item that won't happen in 6.0. Marking it as Future for now.

dotnet / runtime