[RegAllocGreedy] Folding memory operand alone gives many more uncoalesced copies #36385

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR37412
Status NEW
Importance P enhancement
Reported by Jonas Paulsson (paulsson@linux.vnet.ibm.com)
Reported on 2018-05-11 00:05:58 -0700
Last modified on 2020-06-10 11:22:34 -0700
Version trunk
Hardware PC Linux
CC aivchenk@gmail.com, llvm-bugs@lists.llvm.org, marina.yatsina@intel.com, paulsson@linux.vnet.ibm.com, quentin.colombet@gmail.com, uweigand@de.ibm.com, wmi@google.com
Attachments latest.patch (7122 bytes, text/plain)
asidiff_trim (811102 bytes, text/plain)
AHIMuxK_foldmem.patch (7658 bytes, text/plain)
AHIMuxK_foldmem.patch (7693 bytes, text/plain)
foldMemImp.patch (7684 bytes, text/plain)
Recently it was discovered that an opcode (AHIMux) was not handled in
SystemZInstrInfo::foldMemoryOperandImpl() the same way as AHI, even though it
should be: AHIMux (add immediate) can also be replaced with ASI (add immediate
to memory) in case the register operand is spilled. AHIMux is essentially a
sibling opcode of AHI that can use either high or low 32-bit registers.

This was simple to fix by just adding a check for that opcode in this method.
However, when doing so I found that even though my test case now had a few ASIs
instead of reload-add-store sequences (which was nice), there were now 20+ more
COPYs that were not coalesced! That is quite considerable, given the small test
case.
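
For reference, the added check amounts to something like the sketch below. This
is paraphrased from memory of the tree at the time; the exact code is in the
attached patch, and my reconstruction of the surrounding
foldMemoryOperandImpl() context may differ in detail:

// Sketch: fold an add-immediate whose register operand is being spilled
// into an add-immediate-to-memory on the spill slot.
//   AHI/AHIMux %reg, CONST  ->  ASI  %stack-slot, CONST
//   AGHI %reg, CONST        ->  AGSI %stack-slot, CONST
if ((Opcode == SystemZ::AHI || Opcode == SystemZ::AGHI ||
     Opcode == SystemZ::AHIMux) &&   // AHIMux is the newly handled sibling.
    MI.getOperand(0).getReg() == MI.getOperand(1).getReg() &&
    isInt<8>(MI.getOperand(2).getImm())) {
  unsigned MemOpcode = (Opcode == SystemZ::AGHI ? SystemZ::AGSI
                                                : SystemZ::ASI);
  return BuildMI(*InsertPt->getParent(), InsertPt, MI.getDebugLoc(),
                 get(MemOpcode))
      .addFrameIndex(FrameIndex)
      .addImm(0)
      .addImm(MI.getOperand(2).getImm());
}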

This happened with the very test case that was made for the ASI folding, which
was intended to use enough registers to cause spilling, so that ASI would be
used. Basically, it has 32 volatile loads and a conditional branch past a block
that adds an immediate to the loaded values; in the final block the resulting
values are stored:

entry:
  %val0 = load volatile i32, i32 *%ptr
  ...
  %val31 = load volatile i32, i32 *%ptr
  %test = icmp ne i32 %sel, 0
  br i1 %test, label %add, label %store

add:
  %add0 = add i32 %val0, 127
  ...
  %add31 = add i32 %val31, 127
  br label %store

store:
  %new0 = phi i32 [ %val0, %entry ], [ %add0, %add ]
  ...
  %new31 = phi i32 [ %val31, %entry ], [ %add31, %add ]
  store volatile i32 %new0, i32 *%ptr
  ...
  store volatile i32 %new31, i32 *%ptr
  ret void

After "Simple Register Coalescing", the MachineFunction (note: no COPYs) looks
pretty much like

bb.0.entry:
%98:grx32bit = LMux %96:addr64bit, 0, $noreg :: (volatile load 4 from %ir.ptr)
...
%129:grx32bit = LMux %96:addr64bit, 0, $noreg :: (volatile load 4 from %ir.ptr)
CHIMux %97:gr32bit, 0, implicit-def $cc
BRC 14, 6, %bb.1, implicit killed $cc
J %bb.2

bb.1:
%98:grx32bit = AHIMux %98:grx32bit, 127, implicit-def dead $cc
...
%129:grx32bit = AHIMux %129:grx32bit, 127, implicit-def dead $cc

bb.2:
STMux %98:grx32bit, %96:addr64bit, 0, $noreg :: (volatile store 4 into %ir.ptr)
...
STMux %129:grx32bit, %96:addr64bit, 0, $noreg :: (volatile store 4 into %ir.ptr)
Return

Greedy will now run out of registers, split live ranges, and insert COPYs like
the following (without the ASI patch):

bb.0.entry:
%132:grx32bit = LMux %96:addr64bit, 0, $noreg :: (volatile load 4 from %ir.ptr)
...
%221:grx32bit = LMux %96:addr64bit, 0, $noreg :: (volatile load 4 from %ir.ptr)
CHIMux %97:gr32bit, 0, implicit-def $cc
BRC 14, 6, %bb.1, implicit killed $cc
J %bb.1

bb.3:
%137:grx32bit = COPY %136:grx32bit
... (23 COPYs)
%203:grx32bit = COPY %202:grx32bit
J %bb.2

bb.1:
%137:grx32bit = COPY %136:grx32bit
...
%203:grx32bit = COPY %202:grx32bit
%137:grx32bit = AHIMux %137:grx32bit, 127, implicit-def dead $cc
...
%222:grx32bit = AHIMux %222:grx32bit, 127, implicit-def dead $cc

Note that the COPYs in bb.1 here lie before the AHIMuxes. For some reason,
when ASIs are used in bb.1, the COPYs are instead inserted *after* the
AHIMuxes:

bb.1.add:
ASI %stack.0, 0, 127, implicit-def dead $cc :: (store 4 into %stack.0)
%135:grx32bit = AHIMux %135:grx32bit, 127, implicit-def dead $cc
... (also 3 more ASIs)
%216:grx32bit = AHIMux %216:grx32bit, 127, implicit-def dead $cc
%220:grx32bit = COPY %219:grx32bit
...
%136:grx32bit = COPY %135:grx32bit

Unfortunately, after "Virtual Register Rewriter" all those COPYs are still
there, while without the ASIs they are gone!

I find this odd and would like some advice on what to do about it. Is there a
reason that the COPYs are now inserted at a different position, or is this
just a bug?

Attached is a side-by-side diff with '-print-after-all -debug-only=regalloc'
(master on the left, ASIs on the right).
Quuxplusone commented 6 years ago

Attached asidiff_trim (811102 bytes, text/plain): side-by-side diff of debug dumps (unpatched <> patched)

Quuxplusone commented 6 years ago
It seems the interesting difference comes in RS_Split cascade 28, where the
original code makes this decision:

Split for compact region in 2 bundles, intv 1.
splitAroundRegion with 2 globals.

while the patched code makes this decision:

Best split candidate of vreg %129  may not cause bad eviction chain
Split for $r3l in 1 bundles, intv 1.
Split for compact region in 1 bundles, intv 2.
splitAroundRegion with 3 globals.

Note the debug output says "may not cause bad eviction chain", but the
resulting set of COPYs does in fact look suspiciously like a bad eviction chain
as described in the comments in RegAllocGreedy.cpp ...
Quuxplusone commented 6 years ago
@Marina: this test case does indeed seem related to splitCanCauseEvictionChain,
and since you are the expert here, I would like to ask if you could share some
advice?

I have not yet been able to figure out exactly which evictions are bad here,
but one observation about this function is that it isn't really sensible to
try to evict any register for the sake of another, since they all overlap and
are practically identically structured.

The spill weights themselves are extremely small and very similar. I tried
adding a heuristic along the lines of "don't evict if the intervals are very
similar":
diff --git a/lib/CodeGen/RegAllocGreedy.cpp b/lib/CodeGen/RegAllocGreedy.cpp
index 182d966..daf8ed7 100644
--- a/lib/CodeGen/RegAllocGreedy.cpp
+++ b/lib/CodeGen/RegAllocGreedy.cpp
@@ -842,7 +842,7 @@ bool RAGreedy::shouldEvict(LiveInterval &A, bool IsHint,
   if (CanSplit && IsHint && !BreaksHint)
     return true;

-  if (A.weight > B.weight) {
+  if (A.weight > B.weight && ((A.weight - B.weight) > 0.001)) {
     LLVM_DEBUG(dbgs() << "should evict: " << B << " w= " << B.weight << '\n');
     return true;
   }

With this, the evictions for this test case never happened in the first place,
and the result was a general improvement, as expected with more folded
operands.

Since checking for specialized eviction cases is quite a task, I wonder if my
experiment could be useful somehow as a simplistic improvement? Is there a
limit (or level of similarity) at which it doesn't make sense to go through
the trouble of evicting one live range for the other?
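
For illustration, a relative version of that guard, instead of the fixed 0.001
delta above, might look like the sketch below. Note that weightsNearlyEqual is
a hypothetical helper, not an existing LLVM function:

#include <algorithm>
#include <cmath>

// Hypothetical helper: treat two live intervals as "too similar to be
// worth an eviction" when their spill weights differ by less than a
// relative tolerance (here 1%), rather than a fixed absolute delta.
static bool weightsNearlyEqual(float AWeight, float BWeight,
                               float RelTol = 0.01f) {
  float Max = std::max(std::fabs(AWeight), std::fabs(BWeight));
  return Max == 0.0f || std::fabs(AWeight - BWeight) <= RelTol * Max;
}

// In RAGreedy::shouldEvict the check would then become:
//   if (A.weight > B.weight && !weightsNearlyEqual(A.weight, B.weight))
//     return true;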

Either way, this further indicates, as Uli suggested, that these evictions are
bad and should not happen.
Quuxplusone commented 6 years ago
Currently splitCanCauseEvictionChain identifies 2 very specific scenarios I've
encountered; the scenario you're seeing here is probably not one of them.

From what I see in the log, both the "before" and "after" code seem to have a
bad eviction chain (the chain of copies) that moved from one place to another
in %bb.2 due to the changes in the memory folding.
I hope to look into this a bit deeper in the next few days and identify the
exact scenario that causes this chain.

I fear the "shouldEvict" heuristic you suggested may be too intrusive and
affect many other decisions. We must first pinpoint the scenario and then
decide how to avoid it.

Regarding the difference for vreg %129: the code "before" could not find a way
to split for $r3l because of previous assignments, so it chose to do a split
that tries to create compact regions rather than "fit" the range as best it
can into the "empty" parts of $r3l (more accurately, splitting for $r3l wasn't
profitable).
In the "after" code the assignments to $r3l have changed, and now there is a
way to split %129 to fit $r3l (in a profitable way). When Greedy tries to
estimate the split cost, it checks whether the split for $r3l is one of the
known "bad" scenarios, but your root cause is probably different; this is why
it says "Best split candidate of vreg %129 may not cause bad eviction chain".

The best way to approach it is to identify the first "bad" COPY created, which
seems to come from the decision to split vreg %99.
That split creates the %136 and %137 artifacts in the "before" code and the
%135 and %136 artifacts in the "after" code.
Then we need to investigate the chain effect that happened here and understand
how we can better influence this decision.

I'll keep you updated in the next few days with what I'm able to find.
Quuxplusone commented 6 years ago

Any news on this, Marina?

Quuxplusone commented 6 years ago
Unfortunately I have not had a chance to look at the bug in depth yet.
I hope to find the time next week.
Quuxplusone commented 6 years ago

Hi Marina, I don't want to rush you or anything, just a friendly reminder that we are looking forward to resolving this issue... /Jonas

Quuxplusone commented 6 years ago
Hi Jonas,

Sorry for not being able to address this issue yet. Usually these RA issues
are quite complex and take some time to figure out in depth. I will resume
looking at this over the weekend, but unfortunately my current main focus is a
non-LLVM-related project; therefore, it may take a while until I can upload a
patch that identifies this bad eviction scenario.

There's another way to partially address such issues - you can add a cleanup
pass.
At one time Quentin Colombet suggested doing a cleanup of such copies in the
MachineCopyPropagation pass, and this may indeed be a good idea regardless of
improving RA. It does not fix RA's bad decisions, but if these copies are
contained in a single basic block, they can be cleaned up there.

Thanks,
Marina
Quuxplusone commented 6 years ago

Hi Marina,

I certainly understand the complexity of this issue!

I wouldn't mind trying to improve the MachineCopyPropagation pass, except I worry that even if I have a working patch, it will get rejected with the reasoning that "RegAlloc should handle this". So before trying anything there, I think we should have general agreement that the test case is such that we don't expect RegAlloc to always succeed on it...

In this case, though, it actually looks a bit like RegAlloc should handle this better, no?

Quuxplusone commented 6 years ago
Jonas, can you try https://reviews.llvm.org/D50064 to see whether it helps with
your case? The patch is not reviewed yet, but it can mitigate some of the bad
eviction decisions that were fixed before in RA, e.g.:

    renamable $rbx = COPY renamable killed $rax
    renamable $rcx = COPY renamable killed $rbx
    renamable $rax = COPY renamable killed $rcx
    NOOP implicit $rax
=>
    NOOP implicit $rax
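
To make the intended cleanup concrete, here is a toy, self-contained sketch -
not LLVM code - of the intra-block copy-forwarding idea: walk the block in
order, track which original value each register currently holds, and flag a
COPY as redundant when its destination would receive the value it already
holds, as in the $rax chain above.

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
  // The block from the example above: each pair is "dst = COPY src".
  std::vector<std::pair<std::string, std::string>> Block = {
      {"rbx", "rax"}, {"rcx", "rbx"}, {"rax", "rcx"}};

  std::map<std::string, std::string> ValueOf; // reg -> value it holds now
  for (const auto &[Dst, Src] : Block) {
    // The original value Src currently holds (itself if never copied into).
    std::string V = ValueOf.count(Src) ? ValueOf[Src] : Src;
    bool Redundant = (ValueOf.count(Dst) ? ValueOf[Dst] : Dst) == V;
    ValueOf[Dst] = V;
    std::cout << Dst << " = COPY " << Src
              << (Redundant ? "   ; redundant\n" : "\n");
  }
  // The final "rax = COPY rcx" is flagged: rax would get back its own
  // value. Once uses are rewritten to the forwarded source, the remaining
  // copies become dead and a dead-copy sweep can erase them as well.
}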
Quuxplusone commented 6 years ago

Ugh, those COPYs are in different basic blocks, so the current MachineCopyPropagation pass cannot do anything here, because it is BB-local only.

Quuxplusone commented 6 years ago
(In reply to Alexander Ivchenko from comment #11)
> Jonas, can you try https://reviews.llvm.org/D50064 to see whether it helps
> with your case? The patch is not reviewed yet, but it can mitigate some of
> the bad eviction decisions that were fixed before in RA e.g:
>
>     renamable $rbx = COPY renamable killed $rax
>     renamable $rcx = COPY renamable killed $rbx
>     renamable $rax = COPY renamable killed $rcx
>     NOOP implicit $rax
> =>
>     NOOP implicit $rax

Hi Alexander, thanks for suggesting this, but it unfortunately did not seem to
help my test case (no improvement). /Jonas

(In reply to Alexander Ivchenko from comment #12)
> Ugh, those COPYs are in different basic blocks; So current
> MachineCopyPropagation pass cannot do anything here, because it is BB-local
> only

Hi Alexander,

I was just about to say thanks for suggesting this, but this unfortunately did
not help my test case, as you predicted.

/Jonas
Quuxplusone commented 6 years ago

ping. Does anyone have time for this?

Quuxplusone commented 5 years ago

ping! The same test case still shows the same problem...

Quuxplusone commented 5 years ago

I might be able to look at it next week, unless Marina beats me at it.

Quuxplusone commented 5 years ago

It'll have to wait one or two more weeks :(

Quuxplusone commented 5 years ago

Hi all,

I finally had a look, but I am puzzled, as I am not sure what to look at!

The compiler with and without the attached patch produces exactly the same assembly for test/CodeGen/SystemZ/fold-memory-op-impl-02.ll, and neither of them ends up producing asi.

Failing Tests (1):
    LLVM :: CodeGen/SystemZ/fold-memory-op-impl-02.ll

  Unexpected Failures: 1

Does the patch need to be rebased?

Quuxplusone commented 6 years ago

Attached latest.patch (7122 bytes, text/plain): patch: ASI enabled for AHIMux + the test case

Quuxplusone commented 5 years ago

Attached AHIMuxK_foldmem.patch (7658 bytes, text/plain): Updated patch for AHIMuxK instruction

Quuxplusone commented 4 years ago
ping

It would still be nice if someone with regalloc-greedy know-how could address
this strange regression. I see on SystemZ/SPEC'17 that this patch does add
value if applied:

         master    patched     diff
asi       13226      13686     +460
ahi       97161      96712     -449
l        229532     229086     -446
st       177344     176900     -444
Quuxplusone commented 4 years ago

I'll try to take a look at this one this week or next week.

Quuxplusone commented 4 years ago

Attached AHIMuxK_foldmem.patch (7693 bytes, text/plain): rebased patch with test case

Quuxplusone commented 4 years ago

Attached foldMemImp.patch (7684 bytes, text/plain): rebased patch with test case

Quuxplusone commented 4 years ago
TL;DR: I didn't find anything useful yet!

Quick update here.
The fact that we get a different allocation is expected, given that the
folding creates different live-intervals and thus changes the order in which
things are processed.

Looking at the resulting code, we do indeed seem to hit a case of an eviction
chain, but I haven't dug in enough to see what we should do to get the
mechanism that Marina added to kick in.

At first glance, it seems the live-intervals don't have the shape that is
considered to create an eviction chain. In particular, it seems the candidates
for splitting are never both live-in *and* live-out of a bundle.
Quuxplusone commented 4 years ago
(In reply to Marina Yatsina from comment #5)
> Currently splitCanCauseEvictionChain identifies 2 very specific scenarios
> I've encountered, the scenario you're seeing here is probably not one of
> them.

I think this is exactly what is happening. I think Marina's heuristic detects
the pattern where copies go back and forth:
dst1 = src1
dst2 = src2
dst3 = src3
...
src3 = dst3
src2 = dst2
src1 = dst1

Whereas in this case, the copies are there to join live ranges:

then:
dst1 = src1
dst2 = src2
dst3 = src3
br end

else:
dst1 = src1
dst2 = src2
dst3 = src3
br end

end:
use of dst1, dst2, dst3

At this point, I would need to spend some time getting into Marina's heuristic
to see how we can fix this. I unfortunately don't have the time for that now.

Marina, if you can take a look that would be great.