Several Mips, X86 CodeGen tests FAIL on Solaris/sparcv9

Quuxplusone commented 4 years ago


Bugzilla Link	PR47731
Status	RESOLVED FIXED
Importance	P normal
Reported by	Rainer Orth (ro@gcc.gnu.org)
Reported on	2020-10-05 09:29:08 -0700
Last modified on	2020-11-09 06:37:23 -0800
Version	trunk
Hardware	Sun Solaris
CC	craig.topper@gmail.com, efriedma@quicinc.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com
Fixed by commit(s)
Attachments	`cg86.patch` (4328 bytes, text/plain) `getConstVector.tar.bz2` (4633 bytes, application/x-bzip)
Blocks
Blocked by
See also

Several Mips and X86 CodeGen tests FAIL on Solaris/sparcv9:

  LLVM :: CodeGen/Mips/cconv/vector.ll
  LLVM :: CodeGen/Mips/msa/basic_operations.ll
  LLVM :: CodeGen/Mips/msa/i5-b.ll
  LLVM :: CodeGen/Mips/msa/immediates.ll
  LLVM :: CodeGen/X86/fp-intrinsics.ll
  LLVM :: CodeGen/X86/known-signbits-vector.ll
  LLVM :: CodeGen/X86/vec_shift5.ll
  LLVM :: CodeGen/X86/vector-shift-ashr-128.ll
  LLVM :: CodeGen/X86/vector-shift-ashr-256.ll

This only happens in 2-stage Release and RelWithDebInfo builds, 2-stage Debug
builds and 1-stage Release builds with gcc are fine.

Sometimes the differences are quite small, e.g. for
CodeGen/Mips/cconv/vector.ll:

--- vector.s.amd64  2020-10-05 18:21:13.102487784 +0000
+++ vector.s.sparcv9    2020-10-05 18:21:05.775611488 +0000
@@ -1548,7 +1548,7 @@
    sw  $ra, 36($sp)                    # 4-byte Folded Spill
    .cfi_offset 31, -4
    lui $1, 6
-   ori $1, $1, 7
+   ori $1, $1, 10
    lui $2, 9
    ori $2, $2, 10
    fill.w  $w0, $2

In other cases the differences are quite substantial.  I have no idea where even
to start investigating this, so I'll need much guidance here.

Quuxplusone commented 4 years ago

Standard tips for debugging miscompiles: bisect object files to find the specific object file containing the bug, use https://llvm.org/docs/OptBisect.html to narrow down from there.

(The fact that the program being miscompiled is the compiler itself doesn't change the general process.)

Quuxplusone commented 4 years ago

Please can you list the specific test functions within each CodeGen/X86* test
file that breaks and I'll take a look.

Or you could use the update script:

llvm-project\llvm\utils\update_llc_test_checks.py

to regenerate those files and attach the diffs.

Quuxplusone commented 4 years ago

(In reply to Simon Pilgrim from comment #2)
> Please can you list the specific test functions within each CodeGen/X86*
> test file that breaks and I'll take a look.

That would have been easy e.g. for fp-intrinsics.ll (f19), but ...

> Or you could use the update script:
>
> llvm-project\llvm\utils\update_llc_test_checks.py
>
> to regenerate those files and attach the diffs.

... this is way easier and the diffs are much smaller than those of the
assembler
output.

Quuxplusone commented 4 years ago

Attached cg86.patch (4328 bytes, text/plain): Diffs to affected llvm/test/CodeGen/X86 .ll files

Quuxplusone commented 4 years ago

(In reply to Rainer Orth from comment #6)
>
> As a first step, I've found that replacing LegalizeDAG.cpp.o in
> libLLVMSelectionDAG.a with the gcc-compiled version fixes the
> X86/fp-intrinsics.ll test, but none of the others, neither X86 nor Mips.

SelectionDAGLegalize::OptimizeFloatStore is a highly likely candidate to check
here

Quuxplusone commented 4 years ago

(In reply to Simon Pilgrim from comment #7)

> SelectionDAGLegalize::OptimizeFloatStore is a highly likely candidate to
> check here

Thanks, I'll have a look.

Meanwhile I've also identified the objects that cause the remaining failures:

* MipsSEISelDAGToDAG.cpp.o in libLLVMMipsCodeGen.a is responsible for the Mips
  ones.

* X86ISelLowering.cpp.o in libLLVMX86CodeGen.a causes the remaining X86
failures.

Quuxplusone commented 4 years ago

A common code pattern I'm seeing is

APInt x; // 64 bits
unsigned lo32 = x.trunc(32);
unsigned hi32 = x.lshr(32).trunc(32);

I've added some extra unit tests for these at rG3cb8347c94a0, its unlikely that
these will fail but if you can confirm that'd be great.

Quuxplusone commented 4 years ago

Following Eli's suggestion, I've now identified the passes that cause the
miscompilations:

* MipsSEISelDAGToDAG.cpp:

  BISECT: running pass (32328) Stack Slot Coloring on function (_ZN4llvm18MipsSEDAGToDAGISel9trySelectEPNS_6SDNodeE)

  llvm::MipsSEDAGToDAGISel::trySelect(llvm::SDNode*)

* LegalizeDAG.cpp:

  BISECT: running pass (51758) Stack Slot Coloring on function (_ZN12_GLOBAL__N_120SelectionDAGLegalize10LegalizeOpEPN4llvm6SDNodeE)

  (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode*)

* X86ISelLowering.cpp:

  BISECT: running pass (301313) Stack Slot Coloring on function (_ZL14getConstVectorN4llvm8ArrayRefINS_5APIntEEERS1_NS_3MVTERNS_12SelectionDAGERKNS_5SDLocE)

  getConstVector(llvm::ArrayRef<llvm::APInt>, llvm::APInt&, llvm::MVT, llvm::SelectionDAG&, llvm::SDLoc const&)

Quuxplusone commented 4 years ago

(In reply to Rainer Orth from comment #10)
> Following Eli's suggestion, I've now identified the passes that cause the
> miscompilations:

> * X86ISelLowering.cpp:
>   getConstVector(llvm::ArrayRef<llvm::APInt>, llvm::APInt&, llvm::MVT,
> llvm::SelectionDAG&, llvm::SDLoc const&)

This makes even more suspicious of the APInt operations.

Please can you try replacing

      Ops.push_back(DAG.getConstant(V.trunc(32), dl, EltVT));
      Ops.push_back(DAG.getConstant(V.lshr(32).trunc(32), dl, EltVT));

with:

      Ops.push_back(DAG.getConstant(V.extractBitsAsZExtValue(32, 0), dl, EltVT));
      Ops.push_back(DAG.getConstant(V.extractBitsAsZExtValue(32, 32), dl, EltVT));

And see if that makes a difference?

Quuxplusone commented 4 years ago

And please also try this variant as well:

      Ops.push_back(DAG.getConstant(V.extractBits(32, 0), dl, EltVT));
      Ops.push_back(DAG.getConstant(V.extractBits(32, 32), dl, EltVT));

Quuxplusone commented 4 years ago

Both variants alike fix the same four tests

-  LLVM :: CodeGen/X86/known-signbits-vector.ll
-  LLVM :: CodeGen/X86/vec_shift5.ll
-  LLVM :: CodeGen/X86/vector-shift-ashr-128.ll
-  LLVM :: CodeGen/X86/vector-shift-ashr-256.ll

without any other effect on test results.

Quuxplusone commented 4 years ago

Stack slot coloring, really? Hmm. I can't think of any way a target could misbehave to make that cause a miscompile.

Maybe try diffing the assembly of one of the functions with/without stackslotcoloring? getConstVector seems like the smallest of those three. There's also an option "-no-stack-slot-sharing" you could try experimenting with.

Quuxplusone commented 4 years ago

(In reply to Eli Friedman from comment #14)
> Stack slot coloring, really?  Hmm.  I can't think of any way a target could
> misbehave to make that cause a miscompile.

Don't hit the messenger ;-)  I'm just reporting what I found, and it's
consistent
for all miscompilations seen.

> Maybe try diffing the assembly of one of the functions with/without
> stackslotcoloring?  getConstVector seems like the smallest of those three.

That's what I tried (I'm attaching both versions of the assembly for reference).
However, at first blush the changes seem benign (just stack slot renumbering,
though I didn't check in detail that there are no overlaps, e.g.).  Once change
seems strange at first:

@@ -515,9 +515,8 @@
        stx %g0, [%fp+1487]
 .LBB603_46:                             ! %if.else.i6.i
                                         !   in Loop: Header=BB603_10 Depth=1
-       ld [%fp+1487], %g2
        ba .LBB603_48
-       stx %g2, [%fp+1487]
+       nop
 .LBB603_47:                             ! %if.end7.i.i
                                         !   in Loop: Header=BB603_10 Depth=1
        add %fp, 1487, %o0
@@ -526,9 +525,9 @@
 .LBB603_48:                             ! %llvm::APInt::lshr(unsigned int) const [clone .exit]

but the whole operation is a no-op AFAICS: first 64-bit 0 is stored in the stack
slot, then 32-bit loaded into %g2 and stored (as 64-bit, but still 0, of course)
back into the same slot.

I've also tried to run both versions of llc side by side in gdb, but that got
me nowhere.

> There's also an option "-no-stack-slot-sharing" you could try experimenting
> with.

Tried that, too, but it didn't make a difference in any of the three cases.

I can't help but feel that this is a total waste of my time: this is a spare-
time
activity and the result won't even be seen in the Solaris/sparcv9 buildbot
which is native-only.  I think this is something for some SPARC maintainer to
look into.

Quuxplusone commented 4 years ago

Attached getConstVector.tar.bz2 (4633 bytes, application/x-bzip): assembly output for miscompiled getConstantVector

Quuxplusone commented 4 years ago

Rainer - do the additional tests I added to APIntTest pass?

The only other places I can see the .lshr(32).trunc(32) pattern are:

ARMTargetLowering::LowerConstantFP
RISCVTargetLowering::PerformDAGCombine

Are they passing?

Quuxplusone commented 4 years ago

(In reply to Simon Pilgrim from comment #17)
> Rainer - do the additional tests I added to APIntTest pass?

They do.

> The only other places I can see the .lshr(32).trunc(32) pattern are:
>
> ARMTargetLowering::LowerConstantFP
> RISCVTargetLowering::PerformDAGCombine
>
> Are they passing?

They pass as well: the only non-native tests currently FAILing are the ones in
this PR.

Quuxplusone commented 4 years ago

I'm trying very hard not to just update the code and get the tests passing again :-)

Eli - any suggestions for the next step?

Quuxplusone commented 4 years ago

It looks like StackSlotColoring has one other possibly relevant flag: you could
try "-ssc-dce-limit=0".  I think that will remove the diff noted in the comment.

Beyond that, I think you could add a mechanism to limit the number of stack
slots that are colored by StackSlotColoring, and bisect using that?

Thinking about it a bit more, I have another theory for what might be going
wrong: some code could be using uninitialized data, and we're getting
lucky/unlucky. This can be nasty to diagnose... one trick I've used in the past
is hacking the assembly to memset the stack to zero before returning.

(In reply to Rainer Orth from comment #15)
> I can't help but feel that this is a total waste of my time: this is a
> spare-time
> activity and the result won't even be seen in the Solaris/sparcv9 buildbot
> which is native-only.  I think this is something for some SPARC maintainer to
> look into.

There is no SPARC maintainer, really; it's just various people keeping it
limping in their spare time.  So if you don't look, nobody is going to look.
(The primary reason we haven't just killed off the backend is that SPARC is
widely used in academia.)

(In reply to Simon Pilgrim from comment #19)
> I'm trying very hard not to just update the code and get the tests passing
> again :-)

I don't think trying to work around a SPARC backend miscompile is a good use of
anyone's time.

Quuxplusone commented 3 years ago

(In reply to Eli Friedman from comment #20)
> It looks like StackSlotColoring has one other possibly relevant flag: you
> could try "-ssc-dce-limit=0".  I think that will remove the diff noted in
> the comment.
>
> Beyond that, I think you could add a mechanism to limit the number of stack
> slots that are colored by StackSlotColoring, and bisect using that?
>
> Thinking about it a bit more, I have another theory for what might be going
> wrong: some code could be using uninitialized data, and we're getting
> lucky/unlucky. This can be nasty to diagnose... one trick I've used in the
> past is hacking the assembly to memset the stack to zero before returning.

I didn't get back to this in a while, but recently found that the bug has
gone away.  Looing at the new Linux/sparc64 buildbot
(http://lab.llvm.org:8014/#/builders/134)
which unlike my Solaris/sparcv9 one runs 2-stage builds), one can see that the
failures went away betweens builds 38 and 40.  Given the large number of changes
included in those two, it's difficult to tell exactly which patch fixed the
issue,
but it seems plausible that one of Simon's did.

Given that the failures haven't resurfaced since, I guess we can safely close
this PR.

> (In reply to Rainer Orth from comment #15)
> > I can't help but feel that this is a total waste of my time: this is a
> > spare-time
> > activity and the result won't even be seen in the Solaris/sparcv9 buildbot
> > which is native-only.  I think this is something for some SPARC maintainer
to
> > look into.
>
> There is no SPARC maintainer, really; it's just various people keeping it
> limping in their spare time.  So if you don't look, nobody is going to look.
> (The primary reason we haven't just killed off the backend is that SPARC is
> widely used in academia.)

I wasn't aware of that, given that the the llvm/CODE_OWNERS.TXT still lists
a SPARC maintainer.  Fortunately, the GCC situation is different here: we do
have an active SPARC maintainer and Oracle contributed several large patches to
support newer SPARC CPUs in recent years.

Quuxplusone commented 3 years ago

Its unlikely any of my patches affected this - I'm wondering if 7ba3293691beb is the cause.

Quuxplusone commented 3 years ago

(In reply to Simon Pilgrim from comment #22)
> Its unlikely any of my patches affected this - I'm wondering if
> 7ba3293691beb is the cause.

I had seen that one, too, but concluded that it was more about Windows EH and
looked elsewhere.  I've now run a full build at top-of-trunk with just that
patch revered: the failures are still gone.

I guess it's a waste of time to run a full reghunt to identify the patch that
actually fixed this.

Quuxplusone commented 3 years ago

(In reply to Rainer Orth from comment #23)
> I guess it's a waste of time to run a full reghunt to identify the patch that
> actually fixed this.

Agreed - if we now have suitable buildbot test coverage and its currently
green, then we can move on. Thanks!

Quuxplusone commented 3 years ago

(In reply to Simon Pilgrim from comment #24)

> Agreed - if we now have suitable buildbot test coverage and its currently
> green, then we can move on. Thanks!

We're not green yet, unfortunately.  The two important offenders are PRs 47729
(another miscompilation) and 42493 (not supporting 128-bit long double on 32-bit
SPARC).

I'm slowly working on the latter, but have no real grip on the former yet.

Quuxplusone / LLVMBugzillaTest

Several Mips, X86 CodeGen tests FAIL on Solaris/sparcv9 #46700