dotnet / runtime


Addressing of memory located instruction operands for x86/x64 targets should avoid using more than 2 args #12347

Open 4creators opened 5 years ago

4creators commented 5 years ago

Recently @CarolEidt improved handling of HW Intrinsics addressing modes.

See https://github.com/dotnet/coreclr/issues/19550#issuecomment-476898428

However, there are still some remaining issues of a more general nature affecting code performance. Namely, address calculations should not use more than 2 operands. Currently, the generated code may contain inefficient instructions like:

       vmovupd  ymm0, ymmword ptr[rsi]
       vsubps   ymm0, ymm0, ymmword ptr[rcx+4*rdx+4]  // address calculation should be reduced to 2 ops instead of 4
       vmovups  ymmword ptr[rcx+4], ymm0
       vmovups  ymm6, ymmword ptr[rcx+4]

category:cq theme:addressing-modes skill-level:expert cost:large

4creators commented 5 years ago

@CarolEidt @fiigii @AndyAyersMS @mikedn @tannergooding

fiigii commented 5 years ago

Seems like a CSE opportunity? But that would be a trade-off between AGU pressure and register pressure.

tannergooding commented 5 years ago

@4creators Can you please provide the C# code that reproduces this, and the memory encoding you are expecting?

Also, possibly for clarity, I believe the memory encoding given above is more readable as [rcx+rdx*4+0x4] (where rcx is base, rdx is index, 4 is the scale, and 0x4 is the offset).
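
For readers less familiar with operand encoding, here is a minimal illustration (the method below is hypothetical, not from the issue) of the unsafe C# shape that one would typically expect to fold into a single [base+index*scale+disp] operand such as [rcx+rdx*4+0x4]:

    // Hypothetical example, not from the issue: pointer arithmetic of this
    // shape typically lowers to one [base + index*scale + disp] operand.
    static unsafe float LoadAt(float* p, long i)
    {
        // p supplies the base register, i the index register,
        // sizeof(float) == 4 the scale, and "+ 1" the 4-byte displacement.
        return *(p + i + 1);
    }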

4creators commented 5 years ago

@tannergooding The problem was reported with the tests indicated in @CarolEidt's comment: https://github.com/dotnet/coreclr/issues/19550#issuecomment-476898428

The code which reproduces this problem is as follows:

Sse2.StoreScalar(p + alignmentOffset + 2, Sse2.Subtract(v, Sse2.LoadAlignedVector128(p + offset + alignmentOffset + 4)));
...
v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle();

This is a fragment of tests written for dotnet/coreclr#22944
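
For completeness, a self-contained sketch of that fragment (the class, method, and parameter names below are assumptions, not the actual dotnet/coreclr#22944 test code; the float overloads of these intrinsics live on Sse/Sse41):

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static unsafe class AddressingRepro // hypothetical name
    {
        public static Vector128<float> Run(float* p, long offset, long alignmentOffset, Vector128<float> v)
        {
            // p + offset + alignmentOffset + 4 is the multi-part address the
            // JIT must fold into a base+index*scale+disp operand.
            Sse.StoreScalar(p + alignmentOffset + 2,
                Sse.Subtract(v, Sse.LoadAlignedVector128(p + offset + alignmentOffset + 4)));

            return Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle();
        }
    }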

4creators commented 5 years ago

The performance problem which arises below is due to the use of more than 2 arguments for address calculation. AFAIR Intel has indicated that as long as only 2 arguments are used, the additional calculation will not affect instruction performance.

[rcx+rdx*4+0x4] 

@fiigii could perhaps provide more insight into the nature of the architecture limitations affecting performance here.

Regarding the possibility of using CSE to eliminate unnecessary calculations, it could be one of the options. However, after going through the LLVM code I have noticed that there is a separate optimization stage for LEA optimization. It is executed late, after the whole frame is emitted including lea instructions, and aims at creating addressing based on a 2-argument calculation where the common argument should be the frame base.

In RyuJIT the frame is created after CSE, so perhaps we could add an additional simple optimization stage to eliminate/optimize addressing modes.
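
To make the idea concrete, here is a heavily simplified model of such a pass (all types and names below are hypothetical illustrations, not RyuJIT or LLVM internals): given an earlier lea dst, [base+index*scale], a later [base+index*scale+disp] operand can be rewritten into the 2-argument form [dst+disp]:

    using System.Collections.Generic;

    // Hypothetical model types, not RyuJIT data structures.
    record Addr(string Base, string? Index, int Scale, int Disp);
    record Lea(string Dst, Addr Value);

    static class LeaReuse // hypothetical
    {
        // If a prior lea already computes base+index*scale(+disp), reduce the
        // address mode to the 2-argument form [leaDst + remainingDisp].
        public static Addr Simplify(Addr addr, IEnumerable<Lea> priorLeas)
        {
            foreach (var lea in priorLeas)
            {
                var v = lea.Value;
                if (v.Base == addr.Base && v.Index == addr.Index && v.Scale == addr.Scale)
                    return new Addr(lea.Dst, null, 0, addr.Disp - v.Disp);
            }
            return addr; // no matching lea: keep the original address mode
        }
    }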

tannergooding commented 5 years ago

The problem was reported with the tests indicated in @CarolEidt's comment

Thanks. Reiterating the information in the issue just helps with tracking and triaging this. It also avoids the need for people to go digging back through and guessing at the context if they stumble upon the issue.

Could you also state what codegen you are expecting here? ~The code you provided above and the code in the comment are different, so it isn't clear what you are expecting (the code in the issue shows rcx+4 reused in three consecutive instructions, while the code in the comment shows distinct encodings for all instructions).~ Edit: Nevermind, I was looking at a different part of the comment (the one for Sse2.Store, while this looks to be for Avx2.Store)

tannergooding commented 5 years ago

For reference, I managed to find the following sections that look related to the "2 argument calculation" comments above.

Address calculations (specifically LEA): [screenshots omitted]

Addressing Modes: [screenshot omitted]

fiigii commented 5 years ago

AFAIR Intel has indicated that as long as only 2 arguments are used, the additional calculation will not affect instruction performance.

Actually, address calculation performance is not described very clearly in the Intel optimization guide. But I am sure that the AGU issue is about "indexed addresses" rather than "2 arguments", which means memory instructions would also have lower instruction-level parallelism with a [baseReg + indexReg] address, especially stores. LEA seems to be another story (I am not sure). Compilers should always avoid generating the slow LEAs that @tannergooding referred to above, unless register pressure or code size becomes a more critical issue.

mikedn commented 5 years ago

The performance problem which arises below is due to the use of more than 2 arguments for address calculation. AFAIR Intel has indicated that as long as only 2 arguments are used, the additional calculation will not affect instruction performance.

I don't see any performance problem in this issue. It's all just speculation, without a single benchmark that shows this is a problem.

However, after going through the LLVM code I have noticed that there is a separate optimization stage for LEA optimization. It is executed late, after the whole frame is emitted including lea instructions, and aims at creating addressing based on a 2-argument calculation where the common argument should be the frame base.

As far as I can tell LLVM and all other compilers happily emit such address modes: https://gcc.godbolt.org/z/cQBz4n
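
(The exact C source behind that link isn't reproduced here; a rough C# analogue of the pattern in question, which one would likewise expect compilers to fold into [base+index*scale+disp] operands, is:)

    // Rough analogue, not the actual godbolt source: indexed accesses that
    // compilers routinely fold into [base + index*scale + disp] operands.
    static unsafe float Diff(float* p, long i) => p[i + 1] - p[i];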

CarolEidt commented 5 years ago

To be clear about my original comments on dotnet/runtime#10923, I was not talking about the number of addressing operands, but rather redundancy of computation. There's an unnecessary lea in this sequence, as well as an addressing mode that duplicates what's already in r14:

       lea      r14, [rdi+4*rbx]
       movsxd   rcx, r8d
       lea      r15, [rdi+4*rcx]
       vmovaps  xmm0, xmmword ptr [r15+4*rbx+16]
       vmovupd  xmm1, xmmword ptr [rsi]
       vsubps   xmm0, xmm1, xmm0
       vmovss   xmmword ptr [r14+8], xmm0
       vmovntdqa xmm6, xmmword ptr [rdi+4*rbx]

This could be:

       lea      r14, [rdi+4*rbx]
       movsxd   rcx, r8d
       vmovaps  xmm0, xmmword ptr [r14+4*rcx+16]
       vmovupd  xmm1, xmmword ptr [rsi]
       vsubps   xmm0, xmm1, xmm0
       vmovss   xmmword ptr [r14+8], xmm0
       vmovntdqa xmm6, xmmword ptr [r14]

It's unclear to me how to interpret the "avoid using more than two operands" advice. If the computation is necessary, it doesn't seem that it would be more efficient to break it into multiple instructions. That said, perhaps the general forming (and CSE'ing) of addressing expressions could be optimized to distribute the computation such that fewer of them have more than two operands.

But, to reiterate, what I was getting at with my original post is that there's redundancy in the addressing modes.

CarolEidt commented 5 years ago

Like @mikedn I'd appreciate having an example that illustrates an actual (avoidable) performance issue.

mikedn commented 5 years ago

It's unclear to me how to interpret the "avoid using more than two operands" advice.

That basically only applies to LEA, not to any other memory operands. On loads, complex address modes do require an extra cycle but there's not much that can be done about that. Obviously

       add      ecx, 4
       mov      ebx, [ecx+edx*4]

also requires an extra cycle, for the add.

mikedn commented 5 years ago

There's an unnecessary lea in this sequence, as well as an addressing mode that duplicates what's already in r14:

I'll have to take a closer look at that; I'm chasing this kind of extra LEA anyway. Most likely this is caused by a lack of forward substitution; I've experimented with that a bit and it can get rid of some LEAs.

mikedn commented 5 years ago

There's an unnecessary lea in this sequence, as well as an addressing mode that duplicates what's already in r14:

That looks like an unfortunate interaction between CSE and address modes. The address produced by the LEA is actually a CSE that's used in 2 places. The third potential use, in vmovntdqa xmm6, xmmword ptr [rdi+4*rbx], is missed by CSE because it's marked DONT_CSE, exactly because it's considered an address mode candidate.

The use of DONT_CSE on address mode candidates is probably a rather blunt tool. At least in this case, it shouldn't block a CSE use of an existing CSE def.

But it's not obvious that this is something that actually matters (so you use the address already computed by the LEA - and? what's the gain in doing that?) and getting it truly right is probably not trivial. At the end of the day it's also possible to go the other way - eliminate the LEA by folding it into the 2 address modes that use it. One instruction fewer. What's the trade-off? One register will end up having a longer live range.
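
To illustrate that opposite direction, a sketch continuing the earlier hypothetical Addr/Lea model (again, not compiler internals): fold the LEA's computation into each address mode that uses its result, so the LEA itself can be dropped:

    static class LeaFold // hypothetical, reuses the Addr/Lea records above
    {
        // Rewrite a [leaDst + disp] use as [base + index*scale + (leaDisp + disp)]
        // so the lea can be removed, at the cost of keeping the base and index
        // registers live longer.
        public static Addr Fold(Lea lea, Addr use)
        {
            var v = lea.Value;
            return use.Base == lea.Dst && use.Index is null
                ? new Addr(v.Base, v.Index, v.Scale, v.Disp + use.Disp)
                : use; // can't fold if the use already has its own index register
        }
    }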