inefficient floating point register ld/st instructions are generated when -mcpu=power8 is specified

Quuxplusone commented 9 years ago


Bugzilla Link	PR24817
Status	CONFIRMED
Importance	P normal
Reported by	Carrot (carrot@google.com)
Reported on	2015-09-14 17:08:17 -0700
Last modified on	2017-05-24 09:56:40 -0700
Version	trunk
Hardware	PC Linux
CC	hfinkel@anl.gov, kit.barton@gmail.com, llvm-bugs@lists.llvm.org, nemanja.i.ibm@gmail.com, wschmidt@linux.vnet.ibm.com
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

The simple test case is:

int foo(float *p)
{
  float f1 = p[1];
  float f2 = p[2];
  return f1 > f2;
}

Compiler command line and output

clang -S t8.c --target=powerpc -m64 -mcpu=power8 -O2
    li 4, 4             // A
    li 5, 8             // B
    lxsspx 0, 3, 4      // C
    lxsspx 1, 3, 5      // D
    li 3, 0
    li 4, 1
    fcmpu 0, 0, 1
    isel 3, 4, 3, 1
    blr

If I change -mcpu=power8 to -mcpu=power7, I got

    lfs 0, 4(3)        // E
    lfs 1, 8(3)        // F
    li 3, 0
    li 4, 1
    fcmpu 0, 0, 1
    isel 3, 4, 3, 1
    blr

Instructions AC are replaced by E
Instructions BD are replaced by F

Quuxplusone commented 8 years ago

The reason this happens is that we have the AddedComplexity set on VSX
instructions to favour them and the VSX indexed loads/stores are defined within
that block.
Of course, the exact same thing happens with double precision values. However,
this isn't P8 specific since we've had indexed double precision VSX loads in P7
as well.

The easy solution for this problem is to extract the VSX indexed loads from the
AddedComplexity block. However, we will then never load such values in the
upper VSX registers.

Quuxplusone commented 8 years ago

(In reply to comment #1)
> The reason this happens is that we have the AddedComplexity set on VSX
> instructions to favour them and the VSX indexed loads/stores are defined
> within that block.
> Of course, the exact same thing happens with double precision values.
> However, this isn't P8 specific since we've had indexed double precision VSX
> loads in P7 as well.
>
> The easy solution for this problem is to extract the VSX indexed loads from
> the AddedComplexity block. However, we will then never load such values in
> the upper VSX registers.

It might be better to peephole this after register allocation. You want the
larger register classes to be available during RA for high-register-pressure
situations. If that turns out to be unnecessary, we could relax after the fact.
This approach is not exactly optimal, but seems better than the other
relatively-easy alternatives.

Quuxplusone commented 8 years ago

Since Power9 has scalar D-Form loads and stores that can operate on the full
VSX register set, this is not an issue on Power9. The code produced for this on
Power9 is the same as the code on Power7.
I think that given the limitation of Power8 in terms of not having D-Form loads
for all the VSX registers along with the intent to favour VSX due to the larger
register set, we should leave the Power8 code as is and close this PR. Please
let me know what you think.

Quuxplusone commented 7 years ago

Carrot, are you OK with closing this PR as a limitation on P8?

Quuxplusone commented 7 years ago

It's OK to me.

Quuxplusone / LLVMBugzillaTest

inefficient floating point register ld/st instructions are generated when -mcpu=power8 is specified #24816