bottom-up register-reduction scheduling actually increases register pressure

Quuxplusone commented 17 years ago


Bugzilla Link	PR1075
Status	RESOLVED FIXED
Importance	P normal
Reported by	Dan Gohman (llvm@sunfishcode.online)
Reported on	2007-01-04 10:35:07 -0800
Last modified on	2010-02-22 12:44:26 -0800
Version	trunk
Hardware	PC Linux
CC	clattner@nondot.org, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

A high Sethi-Ullman number for an operation indicates it should be scheduled
earlier rather than later -- from a top-down perspective. However, in bottom-up
scheduling currently Seth-Ullman numbers are translated directly to priority
values, so such operations end up being scheduled earlier from a *bottom-up*
perspective. The result is that register pressure is actually increased instead
of reduced. The effect is somewhat hidden because the special case values are
also inverted -- for example, a store is given a low Sethi-Ullman number, but it
ends up having the desired effect. In larger test cases, there are fewer nodes
that use the special cases, and the effect is much more visible.

I've experimented a little with having the priority function take the inverse of
the SethiUllman value, and inverting the special-case values, and I've seen
spill counts drop significantly in large test cases on x86-64.

Here's a simple case that shows the problem:

float %foo(float %x) {
        %tmp1 = mul float %x, 3.000000e+00
        %tmp3 = mul float %x, 5.000000e+00
        %tmp5 = mul float %x, 7.000000e+00
        %tmp7 = mul float %x, 1.100000e+01
        %tmp10 = add float %tmp1, %tmp3
        %tmp12 = add float %tmp10, %tmp5
        %tmp14 = add float %tmp12, %tmp7
        ret float %tmp14
}

With -march=x86-64, the list-burr (default) schedule places all the multiplies
before all the adds. With the change mentioned above, the multiplies are
intermixed with the adds, so the register pressure is lower.

Quuxplusone commented 17 years ago

I am going to take a closer look at his. But I don't see the problem you are
describing with the given example:

*** Final schedule ***
SU(0): 0x7b048a0: ch = EntryToken
SU(11): 0x7b055a0: f32,ch = MOVSSrm 0x7b04eb0, 0x7b05c20, 0x7b04e50, 0x7b053e0,
0x7b048a0
SU(9): 0x7b04fe0: f32,ch = MULSSrm 0x7b055a0, 0x7b04e50, 0x7b05c20, 0x7b04e50,
0x7b057c0, 0x7b048a0
SU(10): 0x7b04e00: f32,ch = MULSSrm 0x7b055a0, 0x7b04e50, 0x7b05c20, 0x7b04e50,
0x7b05540, 0x7b048a0
SU(1): 0x7b05090: f32 = ADDSSrr 0x7b04e00, 0x7b04fe0
SU(8): 0x7b05de0: f32,ch = MULSSrm 0x7b055a0, 0x7b04e50, 0x7b05c20, 0x7b04e50,
0x7b05a80, 0x7b048a0
SU(2): 0x7b05110: f32 = ADDSSrr 0x7b05090, 0x7b05de0
SU(7): 0x7b06100: f32,ch = MULSSrm 0x7b055a0, 0x7b04e50, 0x7b05c20, 0x7b04e50,
0x7b05d80, 0x7b048a0
SU(3): 0x7b05190: f32 = ADDSSrr 0x7b05110, 0x7b06100
SU(6): 0x7b05440: ch = MOVSSmr 0x7b04f50, 0x7b05c20, 0x7b04e50, 0x7b053e0,
0x7b05190, 0x7b048a0
SU(5): 0x7b06240: f64,ch = FpLD32m 0x7b04f50, 0x7b05c20, 0x7b04e50, 0x7b053e0,
0x7b05440
SU(4): 0x7b04ba0: ch = RET 0x7b05200, 0x7b05200:1
    0x7b05200: ch,flag = FpSETRESULT 0x7b06240, 0x7b06240:1

foo:
        subl $4, %esp
        movss 8(%esp), %xmm0
        movaps %xmm0, %xmm1
        mulss .LCPI1_0, %xmm1
        movaps %xmm0, %xmm2
        mulss .LCPI1_1, %xmm2
        addss %xmm1, %xmm2
        movaps %xmm0, %xmm1
        mulss .LCPI1_2, %xmm1
        addss %xmm1, %xmm2
        mulss .LCPI1_3, %xmm0
        addss %xmm0, %xmm2
        movss %xmm2, (%esp)
        flds (%esp)
        addl $4, %esp
        ret

Multiplies are intermixed with the adds. Dan, are you working with tot?

Quuxplusone commented 17 years ago

Hmmm... I see there are some issues. Hopefully I can fix the general cases
without breaking the special ones.

Quuxplusone commented 17 years ago

I used -march=x86-64 above; the code you posted looks like 32-bit code. Offhand
I don't know why it schedules differently on x86 and x86-64; maybe the special
cases are applying differently.

Here's a case that shows the problem on both x86 and x86-64. In this test, the
div instructions should precede the mul instructions, according to pure
Sethi-Ullman numbering.

float %foo(float* %a) {
entry:
        %tmp = load float* %a
        %tmp4 = getelementptr float* %a, int 1
        %tmp5 = load float* %tmp4
        %tmp6 = mul float %tmp, %tmp5
        %tmp8 = getelementptr float* %a, int 2
        %tmp9 = load float* %tmp8
        %tmp11 = getelementptr float* %a, int 3
        %tmp12 = load float* %tmp11
        %tmp13 = mul float %tmp9, %tmp12
        %tmp14 = mul float %tmp6, %tmp13
        %tmp16 = getelementptr float* %a, int 4
        %tmp17 = load float* %tmp16
        %tmp19 = getelementptr float* %a, int 5
        %tmp20 = load float* %tmp19
        %tmp21 = fdiv float %tmp17, %tmp20
        %tmp23 = getelementptr float* %a, int 6
        %tmp24 = load float* %tmp23
        %tmp26 = getelementptr float* %a, int 7
        %tmp27 = load float* %tmp26
        %tmp28 = fdiv float %tmp24, %tmp27
        %tmp29 = fdiv float %tmp21, %tmp28
        %tmp31 = getelementptr float* %a, int 8
        %tmp32 = load float* %tmp31
        %tmp34 = getelementptr float* %a, int 9
        %tmp35 = load float* %tmp34
        %tmp36 = fdiv float %tmp32, %tmp35
        %tmp38 = getelementptr float* %a, int 10
        %tmp39 = load float* %tmp38
        %tmp41 = getelementptr float* %a, int 11
        %tmp42 = load float* %tmp41
        %tmp43 = fdiv float %tmp39, %tmp42
        %tmp44 = fdiv float %tmp36, %tmp43
        %tmp45 = fdiv float %tmp29, %tmp44
        %tmp46 = add float %tmp14, %tmp45
        ret float %tmp46
}

Quuxplusone commented 17 years ago

Yeah, x86 and x86-64 cases scheduled differently because of calling convention
differences (x86 case has an extra store).

I have fixed this. Doing some tests now.

Quuxplusone commented 17 years ago

Fixed.

Quuxplusone commented 17 years ago

Thanks Evan, please add a test for this. :)

Quuxplusone commented 17 years ago

ah, you did, it's Regression/CodeGen/X86/2007-01-08-InstrSched.ll

Thanks :)

Quuxplusone / LLVMBugzillaTest

bottom-up register-reduction scheduling actually increases register pressure #1075