JumpThreading pass causes wrong optimization results in combination with indirect branches (indirectbr)

Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

JumpThreading pass causes wrong optimization results in combination with indirect branches (indirectbr) #39962

Closed Quuxplusone closed 5 years ago

Quuxplusone commented 5 years ago


Bugzilla Link	PR40992
Status	RESOLVED FIXED
Importance	P normal
Reported by	Matthias Liedtke (matthias.liedtke@sap.com)
Reported on	2019-03-07 07:05:59 -0800
Last modified on	2019-10-17 08:37:03 -0700
Version	trunk
Hardware	All All
CC	brzycki@gmail.com, hfinkel@anl.gov, htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org, llvm@insonuit.org, matthias.liedtke@sap.com
Fixed by commit(s)	rL357930
Attachments	`indirectbr2.ll` (1648 bytes, text/plain) `graph_states.jpg` (208705 bytes, image/jpeg) `testJT.sh` (2173 bytes, text/x-sh) `graph_patch_PredWithKnownDest.jpg` (76448 bytes, image/jpeg) `reduced.c` (518 bytes, text/plain)
Blocks
Blocked by
See also

Created attachment 21574
Testcase

Hi,

it seems that I have encountered a bug in the JumpThreading pass that is caused
by the fact that predecessors which terminate with an indirect branch
instruction are skipped in the pass.

I added the corresponding testcase as an attachment.

This is how the testcase works:

There is a condition "%cond" that is evaluated in the beginning. The
interesting path is the one where this condition is true.
In the true case we jump to "condTrue" where we have a select to select between
two different targets. The first target "indirectA" isn't the interesting one
but it seems to be necessary to cause the failure. The second target
"indirectB" jumps to the same block that would also be reachable if "%cond" was
false in the beginning which is labeled "condFalse".
In condFalse we have a branch on "%cond".
What happens here: The JumpThreading sees the branch and checks the
predecessors and finds out that there is one predecessor where we know that
"%cond" is false (the one if coming from basicblock "start"). The other one has
an indirect branch instruction and is skipped because of the indirect branch.
The JumpThreading thinks "Great, only one predecessor to take care of and that
one knows that %cond is false, so I can remove the block condFalse and set all
predecessors to condStillFalse."
It just doesn't notice the indirect branch that then will point to
"condStillFalse" instead of the expected result "condFalse" or "condStillTrue"
if optimization would work correctly.

I have encountered this issue on LLVM5.0 but it also exists in the current
master branch on github.

I was able to apply a fix for the problem:
In the JumpThreading there is the following coding:
    // If the predecessor ends with an indirect goto, we can't change its
    // destination. Same for CallBr.
    if (isa<IndirectBrInst>(Pred->getTerminator()) ||
        isa<CallBrInst>(Pred->getTerminator()))
      continue;

If the "continue;" is replaced by a "return false;" the problem is gone because
we completely skip the JumpThreading optimization as soon as one of the
predecessors finishes with an indirect branch instruction. (The CallBrInst has
been added recently for asm-goto support and I don't know if it has the same
issue but I'd expect so.)
Obviously this will prevent optimizations that beforehand had been possible and
might have been correct.
This fix will break the test callbr-edge-split.ll which has been added
recently. I am not sure if this failure is caused by a missed optimization.

If you agree on the issue and agree with the fix, I can contribute it.
Otherwise I'd be interested in details how this issue could be fixed in a way
that doesn't completely stop the JumpThreading optimization when indirect
branches are involved.

Thanks in advance and best regards,
Matthias

Quuxplusone commented 5 years ago

Attached indirectbr2.ll (1648 bytes, text/plain): Testcase

Quuxplusone commented 5 years ago

Hi Craig,

you added the callbr instruction to LLVM that might also be affected by this
bug if I understood it correctly.
Could you take a short look at this and assign it accordingly?

Thanks a lot in advance and best regards,
Matthias

Quuxplusone commented 5 years ago

Hello Matthias, did you make indirectbr2.ll by hand or was it emitted from clang?

The reason I ask is because the JumpThreading pass is very much tied to its position when called in the pass manager. Several of the code shape assumptions in ProcessBlock() are dependent on prior passes working on the IR. There are a few valid IR shapes that can make JumpThreading unhappy when manually passed in via opt and I've tried to fix them as I find them.

Quuxplusone commented 5 years ago

Hello Matthias, I don't believe this is a bug.

The JumpThreading pass in LLVM is a function-level transform. What this means
is it does not have knowledge beyond the boundary of a function. What happens
when you run this through opt -jump-threading is it examines @test1 with no
knowledge of @main, and then is run again on @main with no knowledge of @test1.

In your example you are inferring extra non-function logic, at the *module*
level, to see that @test1 is only called from @main and its parameters are
constants.

This is information JumpThreading does not have.  It has to assume @test1 is
called in multiple instances where parameters are not necessarily fixed. In
essence your test emits the exact same code if you remove @main.

However, in conjunction with other passes, clang as a whole is able to find the
optimization at -O3.  If you run:

$ clang -O3 -c indirectbr2.ll -emit-llvm -S -o out.ll
$ cat out.ll
; ModuleID = 'indirectbr2.ll'
source_filename = "indirectbr2.ll"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: norecurse nounwind readnone
define i32 @test1(i1 %val1, i1 %val2, i1 %val3) local_unnamed_addr #0 {
entry:
  %cond = and i1 %val1, %val2
  %brmerge.demorgan = and i1 %cond, %val3
  br i1 %brmerge.demorgan, label %start.us, label %condFalse

start.us:                                         ; preds = %entry, %start.us
  br label %start.us

condFalse:                                        ; preds = %entry
  %not.cond = xor i1 %cond, true
  %spec.select = zext i1 %not.cond to i32
  ret i32 %spec.select
}

; Function Attrs: norecurse nounwind readnone
define i32 @main() local_unnamed_addr #0 {
  ret i32 0
}

attributes #0 = { norecurse nounwind readnone }

You'll see a few things. First, @main returns 0 and never calls @test1. Through
several layers of transform and analysis it has determined the call to @test1
with the given constants eventually lands on %CondStillTrue and returns 0.

The second thing you'll notice is @test1 still exists. This is in case external
objects call @test1. The indirect branch is also removed (not sure by which
pass) and the code is simplified. But at the end of the day, the parameters of
@test1 are still not treated as constants, because they can't be.

Quuxplusone commented 5 years ago

Attached graph_states.jpg (208705 bytes, image/jpeg): cfg graph picture

Quuxplusone commented 5 years ago

Hi again Matthias,

I've re-read your comment and realize the test-case is less important than the idea of attacking the problem it showcases for IndirectBr. Apologies for the explanation about function passes and JT. :)

I'm a bit confused by your proposed patch and why you think it fixes the problem of a missed opportunity. I've attached graph_states.jpg showing the original (left), JumpThreading today (top right), and your proposed patch (bottom right). Of the three the one in the top right shows the most threaded opportunities discovered which is what we have today. Using your proposed fix reduces the number of jumpthreaded blocks so I'm struggling to understand why you prefer the bottom-right CFG.

If you are talking about the edge path %start -> %condTrue -> %condFalse that cannot be threaded by JumpThreading today but it's not because of the IndirectBr terminator. It's because %condTrue is inside a loop (%start (header), %condTrue, %indirectA (latch)) and is exiting the loop to %condFalse on one path. I saw you mentioned that the problem doesn't occur when %indirectA isn't there (creating this loop).

Getting JT to comprehend loops is a non-trivial undertaking and is why I worked to add up-to-date dominance information to JumpThreading. There is still considerable work in preserving loops across all of JT, which isn't true today. Until that happens we cannot use LI to comprehend loop levels and attempt threading an edge into, or out of, a loop.

I hope I've understood your question so please let me know if I still don't get your point. :)

Quuxplusone commented 5 years ago

Hi Brian,

thanks a a lot for your support.

> did you make indirectbr2.ll by hand or was it emitted from clang?
It is a simplified example of what is generated by a compiler we have
implemented that uses the LLVM as a backend. I am not sure and haven't
evaluated if a similar IR could somehow be generated by Clang using C.

> The reason I ask is because the JumpThreading pass is very much tied to its
position when called in the pass manager.
> However, in conjunction with other passes, clang as a whole is able to find
the optimization at -O3.
> I'm a bit confused by your proposed patch and why you think it fixes the
problem of a missed opportunity.

This might be a misunderstanding as my bug report isn't about whether the
JumpThreading pass is able to simplify unneeded branches, my concern is that it
seems to change the behavior of the generated code or in other words the
optimization changes the program in a way that changes its behavior when
executed. This seems to be caused by wrong assumptions about whether an
expression result is known for specific paths in the CFG or not.

So in the example I would expect that the compiled program terminates with 0,
while after running the JumpThreading pass it will terminate with 1 because a
different BasicBlock is reached as the evaluation of %cond in the BasicBlock
"condFalse" is simplified to false which is correct for the direct predecessor
"start" but it is wrong for the predecessor "condTrue" which seems to be
undetected by the optimization step and results in the indirect branch
instruction to jump to "condStillFalse" while this should be "condStillTrue" or
not optimized at all.
The "Early return false" I implemented skips the JumpThreading optimization for
a BasicBlock that is reached by an indirect branch, so the optimization doesn't
trigger and doesn't cause the changed behavior.

I don't fully understand what causes the wrong optimization and where it
exactly occurs in the coding and I am not familiar with the coding.

Thanks once again and best regards,
Matthias

Quuxplusone commented 5 years ago

Attached testJT.sh (2173 bytes, text/x-sh): Test script to show different behavior

Quuxplusone commented 5 years ago

Hello Matthias, I appreciate your the extended explanation. I did not initially notice the optimized JT version completely eliminated any chance of "ret i32 0". I will need to step through JT with GDB to see the IR shape, and state, that decided this thread was allowed when it should not be. I will do so and report back my findings.

Thank you for testJS.sh. I was able to reproduce the error with your shell script with a minor change: llc required the argument "-relocation-model=pic".

It's interesting, but not entirely surprising, the "clang -O3 -c indirectbr2.ll -emit-llvm -S -o out.ll" version is correct. I strongly suspect this bug only happens when you call via opt directly into JumpThreading. As I mentioned before, JT's position in the pass pipeline can mask bugs because certain code shapes are never emitted prior to JTs execution. I strongly suspect you will not be able to cause this error via clang and a c/c++ file.

Nevertheless, it is a bug, and an optimization is performed on the IR that should not be.

Quuxplusone commented 5 years ago

Attached graph_patch_PredWithKnownDest.jpg (76448 bytes, image/jpeg): cfg graph picture with ++PredWithKnownDest patch

Quuxplusone commented 5 years ago

Hello Matthias, I did some tracing in GDB.

First I wanted to make sure the analysis by LVI was correct. It was:

ComputeValueKnownInPredecessors():

P = %condTrue
BB = %condFalse
LVI determines on this path %cond is true

P = %start
BB = %condFalse
LVI determines on this path %cond is false

We know we have two predecessors and we (correctly) know each predecessor has a
different value for %cond.

We return from ComputeValueKnownInPredecessors() and come back into
ProcessThreadableEdges().

We then enter the predecessor analysis loop:

for (const auto &PredValue : PredValues) {
...

And the first of our problems start:

    // If we have exactly one destination, remember it for efficiency below.
    if (PredToDestList.empty()) {
      OnlyDest = DestBB;
      OnlyVal = Val;
    ....

This is true on the first pass through, giving us the following state:

Pred = %condTrue
BB = %condFalse
DestBB = %condStillTrue
Val = true
PredToDestList is empty
OnlyDest = %condStillTrue
OnlyVal = true

Next, we increment PredWithKnownDest. This is the second of our problems.

On this iteration we hit the continue state: %condTrue ends with an IndirectBr.

The next iteration, we hit this check again:
    // If we have exactly one destination, remember it for efficiency below.
    if (PredToDestList.empty()) {
      OnlyDest = DestBB;
      OnlyVal = Val;
    ....

And we reassign OnlyDest and OnlyVal to our new state because PredToDestList is
still empty (because of the continue). Here's what it looks like now:

Pred = %start
BB = %condFalse
DestBB = %condStillFalse
Val = false
PredToDestList is STILL empty
OnlyDest = %condStillFalse
OnlyVal = false

We increment PredWithKnownDest again and then push the pair of (%start,
%condStillFalse) to PredToDestList.

Our state is now invalid. The early continue incremented PredWithKnownDest
twice but only inserted to PredWithKnownDest _once_.

Right after this loop we have the check to see if folding can occur:

if (OnlyDest && OnlyDest != MultipleDestSentinel) {
    if (PredWithKnownDest == (size_t)pred_size(BB)) {
    ....

And sure enough we pass every one of those checks. We pre-saved OnlyDest to a
valid BB pointer and PredWithKnownDest == 2 which is the same as pred_size(BB).
We _incorrectly_ fold the false case, discarding the true path.

The goal here is to pessimize folding but still allow for JumpThreading of the
false path. We want to thread %start -> %condFalse -> %condStillFalse but leave
the %condTrue path alone.

A quick grep of the source shows PredWithKnownDest is being used as a proxy for
PredToDestList.size(). It's _only_ checked when attempting to fold.

With this small patch:

diff --git a/llvm/lib/Transforms/Scalar/JumpThreading.cpp
b/llvm/lib/Transforms/Scalar/JumpThreading.cpp
index 264ea3aa22a..3c6d10ba0d6 100644
--- a/llvm/lib/Transforms/Scalar/JumpThreading.cpp
+++ b/llvm/lib/Transforms/Scalar/JumpThreading.cpp
@@ -1643,15 +1643,15 @@ bool JumpThreadingPass::ProcessThreadableEdges(Value
*Cond, BasicBlock *BB,
         OnlyVal = MultipleVal;
     }

-    // We know where this predecessor is going.
-    ++PredWithKnownDest;
-
     // If the predecessor ends with an indirect goto, we can't change its
     // destination. Same for CallBr.
     if (isa<IndirectBrInst>(Pred->getTerminator()) ||
         isa<CallBrInst>(Pred->getTerminator()))
       continue;

+    // We know where this predecessor is going.
+    ++PredWithKnownDest;
+
     PredToDestList.push_back(std::make_pair(Pred, DestBB));
   }

We correctly synchronize the count of PredWithKnownDest and the size of
PredToDestList. I could have called PredToDestList.size() directly but I don't
know the history here, maybe it's faster to count while looping and this is
expensive code.

This change produces the optimum correct JT result for your indirectBr.ll test.
It also does not break callbr-edge-split.ll.

I am about to run it on test-suite to see if there are any issues. If not I'll
open a Phabricator issue.

Quuxplusone commented 5 years ago

By the way, the reason we don't see this with clang -O3 is the indirectbr is
optimized out before the first call to jumpthreading occurs. Here's the state
of @test1() the first time JT see it:

Breakpoint 1, llvm::JumpThreadingPass::runImpl (this=0x6a42a0, F=...,
TLI_=0x6ac358,
    LVI_=0x6a4660, AA_=0x72d950, DTU_=0x7fffffffa4f8, HasProfileData_=false,
    BFI_=std::unique_ptr<llvm::BlockFrequencyInfo> = {...},
    BPI_=std::unique_ptr<llvm::BranchProbabilityInfo> = {...})
    at /work/b.rzycki/upstream/llvm-project/llvm/lib/Transforms/Scalar/JumpThreading.cpp:343
(gdb) p F.dump()

; Function Attrs: norecurse nounwind readnone
define i32 @test1(i1 %val1, i1 %val2, i1 %val3) local_unnamed_addr #0 {
entry:
  br label %start

start:                                            ; preds = %start, %entry
  %cond = and i1 %val1, %val2
  %brmerge.demorgan = and i1 %cond, %val3
  br i1 %brmerge.demorgan, label %start, label %condFalse

condFalse:                                        ; preds = %start
  %not.cond = xor i1 %cond, true
  %spec.select = zext i1 %not.cond to i32
  ret i32 %spec.select
}

Another pass identified the indirectBr as unnecessary and converted it to a
standard conditional br.

Quuxplusone commented 5 years ago

Phabricator review created: https://reviews.llvm.org/D60284

Quuxplusone commented 5 years ago

(In reply to Brian Rzycki from comment #8)
> I strongly suspect you will not be able to cause this error via clang and a
c/c++
> file.

As it happens, you can, though it requires the use of the 'indirect goto' GNU
extension.  We ran into this as part of upgrading our product's OS base to a
newer FreeBSD version, which uses clang 5.0.0.  I verified that the error is
reproducible on clang 6.0.0 as well (FreeBSD 12's shipping compiler), and on
Apple's current XCode compiler ("Apple clang version 11.0.0").

Backporting this fix to clang 5.0.0 solved our miscompilation.

I'll attach a reduced test case.  As Matthias had found, two indirect targets
are necessary, together with a boolean which takes different values along one
path with an indirect jump and a second path of normal execution.

Quuxplusone commented 5 years ago

Attached reduced.c (518 bytes, text/plain): clang test case which reproduces the bug