4creators opened this issue 5 years ago
@CarolEidt @fiigii @AndyAyersMS @mikedn @tannergooding
Seems like a CSE opportunity? But that would be a trade-off between AGU pressure and register pressure.
@4creators Can you please provide the C# code that reproduces this, and the memory encoding you are expecting?
Also, possibly for clarity, I believe the memory encoding given above is more readable as `[rcx+rdx*4+0x4]` (where `rcx` is the base, `rdx` is the index, `4` is the scale, and `0x4` is the offset).
@tannergooding The problem was reported with the tests indicated in @CarolEidt's comment: https://github.com/dotnet/coreclr/issues/19550#issuecomment-476898428
The code which reproduces this problem is as follows:
```csharp
Sse2.StoreScalar(p + alignmentOffset + 2, Sse2.Subtract(v, Sse2.LoadAlignedVector128(p + offset + alignmentOffset + 4)));
...
v2 = Sse41.LoadAlignedVector128NonTemporal((byte*)(p + alignmentOffset)).AsSingle();
```
This is a fragment of tests written for dotnet/coreclr#22944
The performance problem which arises here is due to the usage of more than 2 arguments for address calculation. AFAIR Intel has indicated that as long as there are only 2 arguments used, the additional calculation will not affect instruction performance.

`[rcx+rdx*4+0x4]`
@fiigii could perhaps provide more insight on the nature of architecture limitations affecting performance here.
Regarding the possibility of using CSE to eliminate unnecessary calculations: it could be one of the options. However, after going through the LLVM code I have noticed that there is a separate optimization stage for LEA optimization. It is executed late, after the whole frame is emitted, including `lea` instructions, and it aims at creating addressing based on a 2-argument calculation where one common argument should be a frame base.

In RyuJIT the frame is created after CSE, so perhaps we could add an additional simple optimization stage to eliminate/optimize addressing modes.
> The problem was reported with tests indicated in @CarolEidt's comment
Thanks. Reiterating the information in the issue just helps with tracking and triaging this. It also avoids the need for people to go digging back through and guessing at the context if they stumble upon the issue.
Could you also reiterate what codegen you are expecting here? ~The code you provided above and the code in the comment are different, so it isn't clear what you are expecting (the code in the issue shows `rcx+4` reused in three consecutive instructions, while the code in the comment shows distinct encodings for all instructions).~ Edit: Nevermind, I was looking at a different part of the comment (the one for `Sse2.Store`, while this looks to be for `Avx2.Store`).
For reference, I managed to find the following sections that look related to the "2 argument calculation" comments above.
Address calculations (specifically `LEA`):

Addressing Modes:
> AFAIR Intel has indicated that as long as there are only 2 arguments used the additional calculation will not affect instruction performance.
Actually, address calculation performance is not very clear in the Intel optimization guide. But I am sure that the AGU issue is about "indexed addresses" rather than "2 arguments", which means memory instructions would also have lower instruction-level parallelism with a `[baseReg + indexReg]` address, especially in stores.
`LEA` seems to be another story (I am not sure). Compilers should always avoid generating the slow LEAs that @tannergooding referred to above, unless register pressure or code size becomes a more critical issue.
> The performance problem which arises here is due to the usage of more than 2 arguments for address calculation. AFAIR Intel has indicated that as long as there are only 2 arguments used the additional calculation will not affect instruction performance.
I don't see any performance problem in this issue. It's all just speculation, without a single benchmark that shows this is a problem.
> However, after going through the LLVM code I have noticed that there is a separate optimization stage for LEA optimization. It is executed late, after the whole frame is emitted, including lea instructions, and aims at creating addressing based on a 2-argument calculation where one common argument should be a frame base.
As far as I can tell LLVM and all other compilers happily emit such address modes: https://gcc.godbolt.org/z/cQBz4n
To be clear about my original comments on dotnet/runtime#10923, I was not talking about the number of addressing operands, but rather about redundancy of computation. There's an unnecessary `lea` in this sequence, as well as an addressing mode that duplicates what's already in `r14`:
```asm
lea r14, [rdi+4*rbx]
movsxd rcx, r8d
lea r15, [rdi+4*rcx]
vmovaps xmm0, xmmword ptr [r15+4*rbx+16]
vmovupd xmm1, xmmword ptr [rsi]
vsubps xmm0, xmm1, xmm0
vmovss xmmword ptr [r14+8], xmm0
vmovntdqa xmm6, xmmword ptr [rdi+4*rbx]
```
This could be:
```asm
lea r14, [rdi+4*rbx]
movsxd rcx, r8d
vmovaps xmm0, xmmword ptr [r14+4*rcx+16]
vmovupd xmm1, xmmword ptr [rsi]
vsubps xmm0, xmm1, xmm0
vmovss xmmword ptr [r14+8], xmm0
vmovntdqa xmm6, xmmword ptr [r14]
```
It's unclear to me how to interpret the "avoid using more than two operands" advice. If the computation is necessary, it doesn't seem that it would be more efficient to break the computation into multiple instructions. That said, perhaps general forming (and CSE'ing) of addressing expressions could be optimized to distribute the computation such that fewer have more than two operands.
But, to reiterate, what I was getting at with my original post is that there's redundancy in the addressing modes.
Like @mikedn I'd appreciate having an example that illustrates an actual (avoidable) performance issue.
> It's unclear to me how to interpret the "avoid using more than two operands" advice.
That basically only applies to `LEA`, not to any other memory operands. On loads, complex address modes do require an extra cycle, but there's not much that can be done about that. Obviously

```asm
add ecx, 4
mov ebx, [ecx+edx*4]
```

also requires an extra cycle, for the `add`.
> There's an unnecessary lea in this sequence, as well as an addressing mode that duplicates what's already in r14:
I'll have to take a closer look at that; I'm chasing these kinds of extra LEAs anyway. Most likely this is caused by a lack of forward substitution. I've experimented with that a bit and it can get rid of some LEAs.
> There's an unnecessary lea in this sequence, as well as an addressing mode that duplicates what's already in r14:
That looks like an unfortunate interaction between CSE and address modes. The address produced by the LEA is actually a CSE that's used in 2 places. The third potential use, in `vmovntdqa xmm6, xmmword ptr [rdi+4*rbx]`, is missed by CSE because it's marked `DONT_CSE`, exactly because it's considered an address mode candidate.

The use of `DONT_CSE` on address mode candidates is probably a rather blunt tool. At least in this case, it shouldn't block a CSE use of an existing CSE def.
But it's not obvious that this is something that actually matters (so you use the address already computed by the `LEA` - and? what's the gain in doing that?) and getting it truly right is probably not trivial. At the end of the day it's also possible to go the other way: eliminate the LEA by folding it into the 2 address modes that use it. One instruction less. What's the trade-off? One register will end up having a longer live range.
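To make that second option concrete, folding the `lea` into the address modes that use it in the first sequence might look like this (a hand-written sketch, not actual RyuJIT output):

```asm
movsxd rcx, r8d
lea r15, [rdi+4*rcx]
vmovaps xmm0, xmmword ptr [r15+4*rbx+16]
vmovupd xmm1, xmmword ptr [rsi]
vsubps xmm0, xmm1, xmm0
; rdi+4*rbx is folded directly into the two former r14 uses
vmovss xmmword ptr [rdi+4*rbx+8], xmm0
vmovntdqa xmm6, xmmword ptr [rdi+4*rbx]
```

One instruction shorter, but `rdi` and `rbx` must now stay live through both stores, which is exactly the live-range trade-off described above.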
Recently @CarolEidt improved handling of HW Intrinsics addressing modes.
See https://github.com/dotnet/coreclr/issues/19550#issuecomment-476898428
However, there are still some remaining issues of a more general nature affecting code performance. Namely, address calculations should not use more than 2 operands. Currently generated code may contain some inefficient instructions like:
category:cq theme:addressing-modes skill-level:expert cost:large