Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

[CorrelatedValuePropagation / LazyValueInfo] wrong code generation #44354

Closed Quuxplusone closed 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR45384
Status RESOLVED FIXED
Importance P release blocker
Reported by Jonas Paulsson (paulsson@linux.vnet.ibm.com)
Reported on 2020-04-01 05:39:15 -0700
Last modified on 2020-04-01 10:19:48 -0700
Version 10.0
Hardware PC Linux
CC ditaliano@apple.com, efriedma@quicinc.com, florian_hahn@apple.com, lebedev.ri@gmail.com, listmail@philipreames.com, llvm-bugs@lists.llvm.org, nikita.ppv@gmail.com, paulsson@linux.vnet.ibm.com, uweigand@de.ibm.com
Fixed by commit(s)
Attachments tc_corrprop.ll (344 bytes, text/plain)
Blocks
Blocked by
See also
Created attachment 23301
reduced testcase

Originates from https://bugs.launchpad.net/ubuntu/+source/postgresql-
12/+bug/1867173, where Ubuntu is building (bootstrapped) clang-10 and runs into
a seg fault with llvm-lto.

I believe I have found a bug in clang-10 which seems to likely be involved in
this problem. This is a reduced test case, which involves an i8 PHI with an
undef entry operand, which is and:ed with 127, clearing the sign bit:

define i32 @fun() {
entry:
  br label %bb1

bb1:
  %P = phi i8 [ undef, %entry ], [ %Or, %bb1 ]
  %Ld = load i8, i8* inttoptr (i64 8 to i8*), align 8
  %And1 = and i8 %Ld, -128
  %And2 = and i8 %P, 127
  %Or = or i8 %And2, %And1
  %C = icmp sgt i8 %Or, -1
  br i1 %C, label %bb1, label %exit

exit:
  %res = zext i1 %C to i32
  ret i32 %res
}

All the bits of %And2 are undef except the sign bit which is used by the icmp.
CorrelatedValuePropagation incorrectly replaces %And2 with %P which makes the
sign bit undefined.

This is observable at llvmorg-10.0.0-rc5 (3562703 / March 19), without the
bootstrap. The problem however (at least with my reduced test case) goes away
with 4878aa3 (March 14). That commit seems to be related exactly to undef
constants in this context. I don't know if that eliminates this bug, or if it
just gets hidden.

./bin/opt -S -mtriple=s390x-linux-gnu -mcpu=z10 ./tc_corrprop.ll --correlated-
propagation -o -
; ModuleID = './tc_corrprop.ll'
source_filename = "./tc_corrprop.ll"
target triple = "s390x-unknown-linux-gnu"

define i32 @fun() #0 {
entry:
  br label %bb1

bb1:                                              ; preds = %bb1, %entry
  %P = phi i8 [ undef, %entry ], [ %Or, %bb1 ]
  %Ld = load i8, i8* inttoptr (i64 8 to i8*), align 8
  %And1 = and i8 %Ld, -128
  %Or = or i8 %P, %And1
  %C = icmp sgt i8 %Or, -1
  br i1 %C, label %bb1, label %exit

exit:                                             ; preds = %bb1
  %res = zext i1 %C to i32
  ret i32 0
}

attributes #0 = { "target-cpu"="z10" }
Quuxplusone commented 4 years ago

Attached tc_corrprop.ll (344 bytes, text/plain): reduced testcase

Quuxplusone commented 4 years ago

I think this was indeed what 4878aa3 was supposed to fix. Jonas, do you know why non of the public SystemZ bots caught the issue?

_This bug has been marked as a duplicate of bug 44949_

Quuxplusone commented 4 years ago
(In reply to Florian Hahn from comment #1)
> I think this was indeed what 4878aa3 was supposed to fix. Jonas, do you know
> why non of the public SystemZ bots caught the issue?
>
> *** This bug has been marked as a duplicate of bug 44949 ***

Ah, I see. Great that you fixed it :-)

I guess this didn't show up on SystemZ until the postgresql-12 package was
built (via clang bootstrapping) and llvm-lto crashed. Is there any
configuration in particular you have in mind?

I tried to cherry-pick your patch (4878aa3) onto llvm10-rc5, but it did not
apply cleanly. I think the Ubuntu release that ships with llvm-10 needs this
patch so I wonder what to do... Could you perhaps help with the merge since you
have worked with it?
Quuxplusone commented 4 years ago
There's already a bug to port it to 10.0.1 (PR45272) with a back ported patch
https://reviews.llvm.org/D76596.

> I guess this didn't show up on SystemZ until the postgresql-12 package was
built (via clang bootstrapping) and llvm-lto crashed. Is there any
configuration in particular you have in mind?

Are there ppc/s390 public bootstrap bots? If there are, I am curious why they
did not catch the problem. The issue was raised just a few days before the
final release and because the public ppc/390 bots seemed fine it was thought
not to be too critical.
Quuxplusone commented 4 years ago
(In reply to Florian Hahn from comment #3)
> There's already a bug to port it to 10.0.1 (PR45272) with a back ported
> patch https://reviews.llvm.org/D76596.
>
> > I guess this didn't show up on SystemZ until the postgresql-12 package was
> built (via clang bootstrapping) and llvm-lto crashed. Is there any
> configuration in particular you have in mind?
>
> Are there ppc/s390 public bootstrap bots? If there are, I am curious why
> they did not catch the problem. The issue was raised just a few days before
> the final release and because the public ppc/390 bots seemed fine it was
> thought not to be too critical.

I see a "multistage" buildbot, but looking at the logs I can't see -
DCLANG_ENABLE_BOOTSTRAP=ON being passed to cmake. I thought there were supposed
to be a bot for expensive checks and also for bootstrapping, but I am not
sure...

This is however not a SystemZ backend bug, so it could have shown up on any
machine, right? I just checked and it seems at least that a bootstrapped build
on that commit passes the SystemZ CodeGen tests. So to me it seems that rc5
builds just fine with a bootstrap, but on a particular input llvm-lto crashes.
That input was in this case an Ubuntu package, which is not part of any testing.

But it's a good point - I will try to set up some more testing involving a
bootstrapped clang shortly.
Quuxplusone commented 4 years ago
We are running a multistage build bot:
http://lab.llvm.org:8011/builders/clang-s390x-linux-multistage
which does indeed perform a two-stage clang build.

However, the second-stage clang is then only used to run the unit test suite,
and that may not be enough to expose the bug.  As I understand, the symptom is
a crash in the bootstrapped clang executable, but that crash doesn't happen
always, but only when compiling certain input files.  (In the original bug
report, this was some file in the postgres package.)

We also have a separate LNT build bot that runs all of test-suite, but this
doesn't use a bootstrapped clang, but just a regular stage one build.  Maybe we
should combine the two?  (Of course, it's not definite that even building all
of test-suite would have exposed the same bug as when building postgres ...)
Quuxplusone commented 4 years ago
(In reply to Ulrich Weigand from comment #5)
> ... a crash in the bootstrapped clang executable ...

Sorry, I mis-remembered.  Jonas is correct, it was llvm-lto not clang.
Quuxplusone commented 4 years ago
(In reply to Ulrich Weigand from comment #5)
> We are running a multistage build bot:
> http://lab.llvm.org:8011/builders/clang-s390x-linux-multistage
> which does indeed perform a two-stage clang build.
>
> However, the second-stage clang is then only used to run the unit test
> suite, and that may not be enough to expose the bug.  As I understand, the
> symptom is a crash in the bootstrapped clang executable, but that crash
> doesn't happen always, but only when compiling certain input files.  (In the
> original bug report, this was some file in the postgres package.)
>

I think this issue would have showed up on a 3-stage bootstrap with LTO.
PR45272 mentioned that something bin BitcodeReader got mis-optimized which
should be used during LTO. Running a 3-stage bootstrap would ensure that the
stage-2 compiler is run on a larger codebase than just the unit tests, but such
a setup takes quite a long time to run :(