dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.24k stars 4.73k forks source link

peephole optimization for `lea` command #67187

Open ShuiRuTian opened 2 years ago

ShuiRuTian commented 2 years ago

Description

I am not sure whether this is a bug or known issue or just that my environment is out of time. Anyway, let the code talks:

        public struct Foo
        {
            public int a;
            public int b;
        }

        public class StructContainer
        {
            Foo foo;

            public void Update()
            {
                this.foo.a += 1;
                this.foo.b += 1;
            }
        }

        public class NormalContainer
        {
            int a;
            int b;
            public void Update()
            {
                this.a += 1;
                this.b += 1;
            }
        }

        [Benchmark]
        public StructContainer StructContainerUpdate()
        {
            var sc = new StructContainer();
            sc.Update();
            return sc;
        }

        [Benchmark]
        public NormalContainer NormalContainerUpdate()
        {
            var nc = new NormalContainer();
            nc.Update();
            return nc;
        }

Generated code:

## .NET 6.0.3 (6.0.322.12309), X64 RyuJIT
; Benchmarks.StructInline.StructContainerUpdate()
       sub       rsp,28
       mov       rcx,offset MT_Benchmarks.StructInline+StructContainer
       call      CORINFO_HELP_NEWSFAST
       lea       rdx,[rax+8]    ; we do not really need this line
       inc       dword ptr [rdx]
       lea       rdx,[rax+0C]
       inc       dword ptr [rdx]
       add       rsp,28
       ret
; Total bytes of code 36

; Benchmarks.StructInline.NormalContainerUpdate()
       sub       rsp,28
       mov       rcx,offset MT_Benchmarks.StructInline+NormalContainer
       call      CORINFO_HELP_NEWSFAST
       inc       dword ptr [rax+8]
       inc       dword ptr [rax+0C]
       add       rsp,28
       ret
; Total bytes of code 30

I know little about CLR, so I may have seriously underestimated the complexity of it. But it feels like the lea command is bit of not clever, right? The memory layout will not change. So something like this.struct1.struct2.struct3.struct4.property could always be calculated by [rax+offset](if the offset is not too large), isn't it? We does have this, so this problem is pretty limited, should be another peephole optimization I think.

Configuration

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000 Intel Core i9-10900X CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores .NET SDK=6.0.201

Regression?

Data

Analysis

category:cq theme:optimization

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to this area: @JulieLeeMSFT See info in area-owners.md if you want to be subscribed.

Issue Details
### Description I am not sure whether this is a bug or known issue or just that my environment is out of time. Anyway, let the code talks: ``` cs public struct Foo { public int a; public int b; } public class StructContainer { Foo foo; public void Update() { this.foo.a += 1; this.foo.b += 1; } } public class NormalContainer { int a; int b; public void Update() { this.a += 1; this.b += 1; } } [Benchmark] public StructContainer StructContainerUpdate() { var sc = new StructContainer(); sc.Update(); return sc; } [Benchmark] public NormalContainer NormalContainerUpdate() { var nc = new NormalContainer(); nc.Update(); return nc; } ``` Generated code: ```assembly ## .NET 6.0.3 (6.0.322.12309), X64 RyuJIT ; Benchmarks.StructInline.StructContainerUpdate() sub rsp,28 mov rcx,offset MT_Benchmarks.StructInline+StructContainer call CORINFO_HELP_NEWSFAST lea rdx,[rax+8] ; we do not really need this line inc dword ptr [rdx] lea rdx,[rax+0C] inc dword ptr [rdx] add rsp,28 ret ; Total bytes of code 36 ; Benchmarks.StructInline.NormalContainerUpdate() sub rsp,28 mov rcx,offset MT_Benchmarks.StructInline+NormalContainer call CORINFO_HELP_NEWSFAST inc dword ptr [rax+8] inc dword ptr [rax+0C] add rsp,28 ret ; Total bytes of code 30 ``` I know little about CLR, so I may have seriously underestimated the complexity of it. But it feels like the `lea` command is bit of not clever, right? The memory layout will not change. So something like `this.struct1.struct2.struct3.struct4.property` could always be calculated by `[rax+offset]`(if the offset is not too large), isn't it? ### Configuration BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000 Intel Core i9-10900X CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores .NET SDK=6.0.201 ### Regression? ### Data ### Analysis This might also be related to peephole optimization.
Author: ShuiRuTian
Assignees: -
Labels: `tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone: -
huoyaoyuan commented 2 years ago

This is indeed a question for the x86_64 architecture: is [rax + constant] addressing mode available for the INC instruction?

huoyaoyuan commented 2 years ago

The answer is yes: INC DWORD PTR [RAX + 8] is valid, confirmed with C++ inline asm. But I'm not sure the performance.

ShuiRuTian commented 2 years ago

@huoyaoyuan Oops, sorry for not pointing it out. We could just take a look at the generated code Benchmarks.StructInline.NormalContainerUpdate, which uses the pattern [base + offset].


edit:

Delete the example, which is wrong, it should use ref to work correctly. And after the change, the result is not impressive.


edit:

And not only inc, but also add, imul, and and or. I believe there are more.

SingleAccretion commented 2 years ago

The trees in question:

***** BB01
STMT00005 ( INL03 @ 0x000[E-] ... ??? ) <- INLRT @ 0x005[--]
               [000018] -A--G+------              *  ASG       byref 
               [000017] D----+-N----              +--*  LCL_VAR   byref  V03 tmp2         
               [000048] -----+------              \--*  ADD       byref 
               [000046] -----+------                 +--*  LCL_VAR   ref    V02 tmp1         
               [000047] -----+------                 \--*  CNS_INT   long   8 field offset Fseq[foo, a]

***** BB01
STMT00006 ( INL03 @ ??? ... ??? ) <- INLRT @ 0x005[--]
               [000025] -A-XG+------              *  ASG       int   
               [000024] *--X-+-N----              +--*  IND       int   
               [000019] -----+------              |  \--*  LCL_VAR   byref  V03 tmp2         
               [000023] ---XG+------              \--*  ADD       int   
               [000021] *--XG+------                 +--*  IND       int   
               [000020] -----+------                 |  \--*  LCL_VAR   byref  V03 tmp2         
               [000022] -----+------                 \--*  CNS_INT   int    1

And the proposition is to forward-substitute V03 into both of its uses.

This will be a little tricky because in general it is not simple to know whether it'll be profitable or not early (e. g. on targets that don't have an RMW form of ADD with support for the [base + offset] addressing mode it would be a pessimization), when the ordinary forward substitution is run, so it'd have to be done late (in lowering), probably as part of forming the RMW form itself. This in turn will run into the problem of generally not knowing whether the local in question has any downstream uses (perhaps early lowering or rationalization could do ref counting to aid in that).