Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Regalloc failure due to attempted reload of full register pair where only subreg is needed #33582

Closed Quuxplusone closed 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR34610
Status RESOLVED FIXED
Importance P normal
Reported by Ulrich Weigand (uweigand@de.ibm.com)
Reported on 2017-09-14 13:16:19 -0700
Last modified on 2017-09-29 07:32:46 -0700
Version trunk
Hardware Other Linux
CC llvm-bugs@lists.llvm.org, matze@braunis.de, paulsson@linux.vnet.ibm.com, quentin.colombet@gmail.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
Building the following test case with llc -mtriple=s390x-linux-gnu result in:
LLVM ERROR: ran out of registers during register allocation

define void @test(i64 %dividend, i64 %divisor) {
  %rem = urem i64 %dividend, %divisor
  call void asm sideeffect "", "{r0},{r1},{r2},{r3},{r4},{r5},{r6},{r7},{r8},{r9},{r10},{r11},{r12},{r13},{r14}"(i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i64 0, i
64 0, i64 0, i64 0, i64 0, i64 %rem)
  ret void
}

The register allocator sees the following code:

%vreg3<def,tied1> = DLGR %vreg3<tied0>, %vreg1; GR128Bit:%vreg3 GR64Bit:%vreg1
%R0D<def> = LGHI 0
%R1D<def> = LGHI 0
%R2D<def> = LGHI 0
%R3D<def> = LGHI 0
%R4D<def> = LGHI 0
%R5D<def> = LGHI 0
%R6D<def> = LGHI 0
%R7D<def> = LGHI 0
%R8D<def> = LGHI 0
%R9D<def> = LGHI 0
%R10D<def> = LGHI 0
%R11D<def> = LGHI 0
%R12D<def> = LGHI 0
%R13D<def> = LGHI 0
%R14D<def> = COPY %vreg3:subreg_h64; GR128Bit:%vreg3
INLINEASM <es:> [sideeffect] [attdialect], $0:[reguse], %R0D<kill>,
$1:[reguse], %R1D<kill>, $2:[reguse], %R2D, $3:[reguse], %R3D, $4:[reguse],
%R4D<kill>, $5:[reguse], %R5D<kill>, $6:[reguse], %R6D<kill>, $7:[reguse],
%R7D<kill>, $8:[reguse], %R8D<kill>, $9:[reguse], %R9D<kill>, $10:[reguse],
%R10D<kill>, $11:[reguse], %R11D<kill>, $12:[reguse], %R12D<kill>,
$13:[reguse], %R13D<kill>, $14:[reguse], %R14D<kill>

Note that the DLGR instruction uses a 128-bit register pair (a register with an
even number concatenated with the immediately succeeding register) as input,
holding the 128-bit dividend, and also as output, where the two registers of
the pair hold the 64-bit quotient and remainder values respectively.  These
pairs are modeled via the GR128Bit register class in SystemZ.  (The test case
does not use the quotient at all, but it will use the remainder as input to the
inline asm.)

What happens is that the (greedy) register allocator sees that GR128Bit:%vreg3
is used in both the DLGR and the COPY, and notices that no register pair can be
allocated for that whole range, as registers 0..13 are clobbered in between and
14/15 cannot be used as a pair since r15 is reserved (the stack pointer).

So the greedy allocator decides that %vreg3 needs to be spilled, and tries to
insert a reload immediately before the COPY.  However, it still reloads the
full 128-bit value, and therefore needs a register pair to temporarily hold
that reloaded value -- but again, at the point of the COPY, no register pair is
available, since registers 0..13 are live (used by the inline asm) and 14/15
cannot be used.

At this point, the allocator gives up and the error occurs.

However, looking at the actual code, it would be trivial to allocate registers,
since only the high half of %vreg3 is needed, which would fit into a single
register, in particular r14.  Unfortunately, it seems the register allocator
somehow never even considers subregs at this point.

Now I'm wondering how this is supposed to be handled.  Is this a deficiency in
the common parts of the register allocator?  Should e.g. live range splitting
split %vreg3 into one range where its full 128 bits are used, and then a second
range where only the high 64 bits are used?

Or is this a problem in the target back-end code?  For example, I notice that
there is a TII.isSubregFoldable hook, which we currently do not define -- would
this help?  (OTOH, spilling shouldn't actually be necessary at all here ...)

Any help or suggestions from register allocation experts would be welcome :-)

(Note that this particular test using the inline asm is a bit artificial, but
we actually saw this error in a large real-world test case which we cannot
share publicly as-is, but which ran into the same problem.  Instead of the
inline asm making all other registers impossible to use, they all ended up
impossible to use due to a variety of other factors caused by the surrounding
code and prior decisions.)
Quuxplusone commented 6 years ago
- This specific example seems impossible to allocate because of all those pre-
assigned registers (which we never spill AFAIK). Unless you have a very strange
calling convention or inline assembly I don't see how you can up with that many
pre-assigned registers live at the same time.
- Without all those preassigned register live at the same time, regalloc should
be able to spill some other stuff away to make room for the reload.
- You could enable TargetSubtargetInfo::enableSubRegLiveness() so the register
allocator does finer grained liveness tracking; this could make this
allocatable however it enables a whole bunch of machinery and in my experience
register tuples are so rare on CPU targets that it isn't worth the compiletime.
(It's a different stories on GPUs).
- In general you could try giving tuples register classes a higher allocation
priority so the register allocator assigns them first making it more likely
that pairs/tuples are still free. (Though this case I think is impossible to
allocate without subreg liveness as explained in point 1).
Quuxplusone commented 6 years ago

Hi Ulrich,

No, you got it right. This is a deficiency in the current reg alloc model. The spilling algorithm always works on the full live-range.

The way to work around it is to use the TargetRegisterInfo::shouldCoalesce hook to avoid creating live-ranges that are too "wide".

Quuxplusone commented 6 years ago
As Matthias explained, by returning true in subRegLivenessEnabled(), regalloc
succeeds:

********** MACHINEINSTRS **********
# Machine code for function test: NoPHIs, TracksLiveness
Function Live Ins: %R2D in %vreg0, %R3D in %vreg1

0B      BB#0: derived from LLVM BB %init
            Live Ins: %R2D %R3D
16B             %vreg1<def> = COPY %R3D; GR64Bit:%vreg1
32B             %vreg3:subreg_l64<def,read-undef> = COPY %R2D; GR128Bit:%vreg3
48B             %vreg3:subreg_h64<def> = LLILL 0; GR128Bit:%vreg3
128B            %vreg3<def,tied1> = DLGR %vreg3<tied0>, %vreg1; GR128Bit:%vreg3
GR64Bit:%vreg1
176B            %R1D<def> = LGHI 0
192B            %R3D<def> = LGHI 0
208B            %R5D<def> = LGHI 0
224B            %R7D<def> = LGHI 0
240B            %R9D<def> = LGHI 0
256B            %R11D<def> = LGHI 0
272B            %R12D<def> = LGHI 0
288B            %R13D<def> = COPY %vreg3:subreg_h64; GR128Bit:%vreg3
304B            INLINEASM <es:> [sideeffect] [attdialect], $0:[reguse],
%R1D<kill>, $1:[reguse], %R3D, $2:[reguse], %R5D<kill>, $3:[reguse],
%R7D<kill>, $4:[reguse], %R9D<kill>, $5:[reguse], %R11D<kill>, $6:[reguse],
%R12D<kill>, $7:[reguse], %R13D<kill>
320B            %R2D<def> = COPY %vreg3:subreg_h64; GR128Bit:%vreg3
336B            Return %R2D<imp-use>

[%vreg1 -> %R0D] GR64Bit
[%vreg3 -> %R2Q] GR128Bit

# *** IR Dump After Virtual Register Rewriter ***:
# Machine code for function test: NoPHIs, TracksLiveness, NoVRegs
Function Live Ins: %R2D, %R3D

0B      BB#0: derived from LLVM BB %init
            Live Ins: %R2D %R3D
16B             %R0D<def> = COPY %R3D
32B             %R3D<def> = COPY %R2D
48B             %R2D<def> = LLILL 0
128B            %R2Q<def,tied1> = DLGR %R2Q<tied0>, %R0D<kill>
176B            %R1D<def> = LGHI 0
192B            %R3D<def> = LGHI 0
208B            %R5D<def> = LGHI 0
224B            %R7D<def> = LGHI 0
240B            %R9D<def> = LGHI 0
256B            %R11D<def> = LGHI 0
272B            %R12D<def> = LGHI 0
288B            %R13D<def> = COPY %R2D
304B            INLINEASM <es:> [sideeffect] [attdialect], $0:[reguse],
%R1D<kill>, $1:[reguse], %R3D, $2:[reguse], %R5D<kill>, $3:[reguse],
%R7D<kill>, $4:[reguse], %R9D<kill>, $5:[reguse], %R11D<kill>, $6:[reguse],
%R12D<kill>, $7:[reguse], %R13D<kill>
336B            Return %R2D<imp-use>

With the TargetRegisterInfo::shouldCoalesce() hook implemented to not coalesce
the COPY from GR128 to GR64, the test case also compiles.
(NOTE: the SubReg / DstSubReg may have been flipped by the coalescer so that
DstSubReg argument to shouldCoalesce() actually designates the subregidx of the
source operand... Is this intentional?)

%vreg1 [16r,128r:0)  0@16r
%vreg3 [32r,48r:0)[48r,128r:1)[128r,144r:2)  0@32r 1@48r 2@128r
%vreg4 [144r,320r:0)  0@144r
RegMasks:
********** MACHINEINSTRS **********
# Machine code for function test: NoPHIs, TracksLiveness
Function Live Ins: %R2D in %vreg0, %R3D in %vreg1

0B      BB#0: derived from LLVM BB %init
            Live Ins: %R2D %R3D
16B             %vreg1<def> = COPY %R3D; GR64Bit:%vreg1
32B             %vreg3:subreg_l64<def,read-undef> = COPY %R2D; GR128Bit:%vreg3
48B             %vreg3:subreg_h64<def> = LLILL 0; GR128Bit:%vreg3
128B            %vreg3<def,tied1> = DLGR %vreg3<tied0>, %vreg1; GR128Bit:%vreg3
GR64Bit:%vreg1
144B            %vreg4<def> = COPY %vreg3:subreg_h64; GR64Bit:%vreg4
GR128Bit:%vreg3
176B            %R1D<def> = LGHI 0
192B            %R3D<def> = LGHI 0
208B            %R5D<def> = LGHI 0
224B            %R7D<def> = LGHI 0
240B            %R9D<def> = LGHI 0
256B            %R11D<def> = LGHI 0
272B            %R12D<def> = LGHI 0
288B            %R13D<def> = COPY %vreg4; GR64Bit:%vreg4
304B            INLINEASM <es:> [sideeffect] [attdialect], $0:[reguse],
%R1D<kill>, $1:[reguse], %R3D, $2:[reguse], %R5D<kill>, $3:[reguse],
%R7D<kill>, $4:[reguse], %R9D<kill>, $5:[reguse], %R11D<kill>, $6:[reguse],
%R12D<kill>, $7:[reguse], %R13D<kill>
320B            %R2D<def> = COPY %vreg4; GR64Bit:%vreg4
336B            Return %R2D<imp-use>

[%vreg1 -> %R3D] GR64Bit
[%vreg3 -> %R0Q] GR128Bit
[%vreg4 -> %R13D] GR64Bit

 *** IR Dump After Virtual Register Rewriter ***:
# Machine code for function test: NoPHIs, TracksLiveness, NoVRegs
Function Live Ins: %R2D, %R3D

0B      BB#0: derived from LLVM BB %init
            Live Ins: %R2D %R3D
32B             %R1D<def> = COPY %R2D, %R0Q<imp-def>
48B             %R0D<def> = LLILL 0, %R0Q<imp-use,kill>, %R0Q<imp-def>
128B            %R0Q<def,tied1> = DLGR %R0Q<kill,tied0>, %R3D<kill>
144B            %R13D<def> = COPY %R0D, %R0Q<imp-use,kill>
176B            %R1D<def> = LGHI 0
192B            %R3D<def> = LGHI 0
208B            %R5D<def> = LGHI 0
224B            %R7D<def> = LGHI 0
240B            %R9D<def> = LGHI 0
256B            %R11D<def> = LGHI 0
272B            %R12D<def> = LGHI 0
304B            INLINEASM <es:> [sideeffect] [attdialect], $0:[reguse], %R1D,
$1:[reguse], %R3D, $2:[reguse], %R5D<kill>, $3:[reguse], %R7D<kill>,
$4:[reguse], %R9D<kill>, $5:[reguse], %R11D<kill>, $6:[reguse], %R12D<kill>,
$7:[reguse], %
R13D
320B            %R2D<def> = COPY %R13D<kill>
336B            Return %R2D<imp-use>

In this case this is actually the same number of final COPYs as with subreg
liveness.

Proposed patch: https://reviews.llvm.org/D37899
Quuxplusone commented 6 years ago
(In reply to Quentin Colombet from comment #2)
> The way to work around it is to use the TargetRegisterInfo::shouldCoalesce
hook > to avoid creating live-ranges that are too "wide".

That seems to do the trick.  Thanks, Quentin!

(In reply to Jonas Paulsson from comment #3)
> As Matthias explained, by returning true in subRegLivenessEnabled(),
> regalloc succeeds:

Note that this a less-restricted test, which is indeed fixed.  However, the
more restricted test that is actually in this bugzilla still fails with
subRegLivenessEnabled (but is anyway fixed by your patch).  See the Phabricator
for more comments ...
Quuxplusone commented 6 years ago

trunk@314516