Closed Quuxplusone closed 5 years ago
Attached indirectbr2.ll
(1648 bytes, text/plain): Testcase
Hi Craig,
you added the callbr instruction to LLVM that might also be affected by this
bug if I understood it correctly.
Could you take a short look at this and assign it accordingly?
Thanks a lot in advance and best regards,
Matthias
Hello Matthias, did you make indirectbr2.ll by hand or was it emitted from clang?
The reason I ask is because the JumpThreading pass is very much tied to its position when called in the pass manager. Several of the code shape assumptions in ProcessBlock() are dependent on prior passes working on the IR. There are a few valid IR shapes that can make JumpThreading unhappy when manually passed in via opt and I've tried to fix them as I find them.
Hello Matthias, I don't believe this is a bug.
The JumpThreading pass in LLVM is a function-level transform. What this means
is it does not have knowledge beyond the boundary of a function. What happens
when you run this through opt -jump-threading is it examines @test1 with no
knowledge of @main, and then is run again on @main with no knowledge of @test1.
In your example you are inferring extra non-function logic, at the *module*
level, to see that @test1 is only called from @main and its parameters are
constants.
This is information JumpThreading does not have. It has to assume @test1 is
called in multiple instances where parameters are not necessarily fixed. In
essence your test emits the exact same code if you remove @main.
However, in conjunction with other passes, clang as a whole is able to find the
optimization at -O3. If you run:
$ clang -O3 -c indirectbr2.ll -emit-llvm -S -o out.ll
$ cat out.ll
; ModuleID = 'indirectbr2.ll'
source_filename = "indirectbr2.ll"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
; Function Attrs: norecurse nounwind readnone
define i32 @test1(i1 %val1, i1 %val2, i1 %val3) local_unnamed_addr #0 {
entry:
%cond = and i1 %val1, %val2
%brmerge.demorgan = and i1 %cond, %val3
br i1 %brmerge.demorgan, label %start.us, label %condFalse
start.us: ; preds = %entry, %start.us
br label %start.us
condFalse: ; preds = %entry
%not.cond = xor i1 %cond, true
%spec.select = zext i1 %not.cond to i32
ret i32 %spec.select
}
; Function Attrs: norecurse nounwind readnone
define i32 @main() local_unnamed_addr #0 {
ret i32 0
}
attributes #0 = { norecurse nounwind readnone }
You'll see a few things. First, @main returns 0 and never calls @test1. Through
several layers of transform and analysis it has determined the call to @test1
with the given constants eventually lands on %condStillTrue and returns 0.
The second thing you'll notice is @test1 still exists. This is in case external
objects call @test1. The indirect branch is also removed (not sure by which
pass) and the code is simplified. But at the end of the day, the parameters of
@test1 are still not treated as constants, because they can't be.
Attached graph_states.jpg
(208705 bytes, image/jpeg): cfg graph picture
Hi again Matthias,
I've re-read your comment and realize the test-case is less important than the idea of attacking the problem it showcases for IndirectBr. Apologies for the explanation about function passes and JT. :)
I'm a bit confused by your proposed patch and why you think it fixes the problem of a missed opportunity. I've attached graph_states.jpg showing the original (left), JumpThreading today (top right), and your proposed patch (bottom right). Of the three the one in the top right shows the most threaded opportunities discovered which is what we have today. Using your proposed fix reduces the number of jumpthreaded blocks so I'm struggling to understand why you prefer the bottom-right CFG.
If you are talking about the edge path %start -> %condTrue -> %condFalse, that path cannot be threaded by JumpThreading today, but not because of the IndirectBr terminator. It's because %condTrue is inside a loop (%start (header), %condTrue, %indirectA (latch)) and exits the loop to %condFalse on one path. I saw you mentioned that the problem doesn't occur when %indirectA, which creates this loop, isn't there.
Getting JT to comprehend loops is a non-trivial undertaking and is why I worked to add up-to-date dominance information to JumpThreading. There is still considerable work in preserving loops across all of JT, which isn't true today. Until that happens we cannot use LI to comprehend loop levels and attempt threading an edge into, or out of, a loop.
I hope I've understood your question so please let me know if I still don't get your point. :)
Hi Brian,
thanks a lot for your support.
> did you make indirectbr2.ll by hand or was it emitted from clang?
It is a simplified example of what is generated by a compiler we have
implemented that uses the LLVM as a backend. I am not sure and haven't
evaluated if a similar IR could somehow be generated by Clang using C.
> The reason I ask is because the JumpThreading pass is very much tied to its
> position when called in the pass manager.
> However, in conjunction with other passes, clang as a whole is able to find
> the optimization at -O3.
> I'm a bit confused by your proposed patch and why you think it fixes the
> problem of a missed opportunity.
This might be a misunderstanding as my bug report isn't about whether the
JumpThreading pass is able to simplify unneeded branches, my concern is that it
seems to change the behavior of the generated code or in other words the
optimization changes the program in a way that changes its behavior when
executed. This seems to be caused by wrong assumptions about whether an
expression result is known for specific paths in the CFG or not.
So in the example I would expect the compiled program to terminate with 0,
while after running the JumpThreading pass it terminates with 1 because a
different BasicBlock is reached. The evaluation of %cond in the BasicBlock
"condFalse" is simplified to false. That is correct for the direct predecessor
"start" but wrong for the predecessor "condTrue", which the optimization step
does not seem to detect. As a result the indirect branch instruction jumps to
"condStillFalse", while it should jump to "condStillTrue" or not be optimized
at all.
The "Early return false" I implemented skips the JumpThreading optimization for
a BasicBlock that is reached by an indirect branch, so the optimization doesn't
trigger and doesn't cause the changed behavior.
I don't fully understand what causes the wrong optimization or where exactly
it occurs in the code, as I am not familiar with the code base.
Thanks once again and best regards,
Matthias
Attached testJT.sh
(2173 bytes, text/x-sh): Test script to show different behavior
Hello Matthias, I appreciate the extended explanation. I did not initially notice the optimized JT version completely eliminated any chance of "ret i32 0". I will need to step through JT with GDB to see the IR shape, and state, that decided this thread was allowed when it should not be. I will do so and report back my findings.
Thank you for testJT.sh. I was able to reproduce the error with your shell script with a minor change: llc required the argument "-relocation-model=pic".
It's interesting, but not entirely surprising, that the "clang -O3 -c indirectbr2.ll -emit-llvm -S -o out.ll" version is correct. I strongly suspect this bug only happens when you call via opt directly into JumpThreading. As I mentioned before, JT's position in the pass pipeline can mask bugs because certain code shapes are never emitted prior to JT's execution. I strongly suspect you will not be able to cause this error via clang and a C/C++ file.
Nevertheless, it is a bug, and an optimization is performed on the IR that should not be.
Attached graph_patch_PredWithKnownDest.jpg
(76448 bytes, image/jpeg): cfg graph picture with ++PredWithKnownDest patch
Hello Matthias, I did some tracing in GDB.
First I wanted to make sure the analysis by LVI was correct. It was:
ComputeValueKnownInPredecessors():
P = %condTrue
BB = %condFalse
LVI determines on this path %cond is true
P = %start
BB = %condFalse
LVI determines on this path %cond is false
We know we have two predecessors and we (correctly) know each predecessor has a
different value for %cond.
We return from ComputeValueKnownInPredecessors() and come back into
ProcessThreadableEdges().
We then enter the predecessor analysis loop:
for (const auto &PredValue : PredValues) {
...
And the first of our problems starts:
// If we have exactly one destination, remember it for efficiency below.
if (PredToDestList.empty()) {
OnlyDest = DestBB;
OnlyVal = Val;
....
This is true on the first pass through, giving us the following state:
Pred = %condTrue
BB = %condFalse
DestBB = %condStillTrue
Val = true
PredToDestList is empty
OnlyDest = %condStillTrue
OnlyVal = true
Next, we increment PredWithKnownDest. This is the second of our problems.
On this iteration we hit the continue state: %condTrue ends with an IndirectBr.
The next iteration, we hit this check again:
// If we have exactly one destination, remember it for efficiency below.
if (PredToDestList.empty()) {
OnlyDest = DestBB;
OnlyVal = Val;
....
And we reassign OnlyDest and OnlyVal to our new state because PredToDestList is
still empty (because of the continue). Here's what it looks like now:
Pred = %start
BB = %condFalse
DestBB = %condStillFalse
Val = false
PredToDestList is STILL empty
OnlyDest = %condStillFalse
OnlyVal = false
We increment PredWithKnownDest again and then push the pair of (%start,
%condStillFalse) to PredToDestList.
Our state is now invalid. Because of the early continue, PredWithKnownDest was
incremented twice but PredToDestList was only inserted into _once_.
Right after this loop we have the check to see if folding can occur:
if (OnlyDest && OnlyDest != MultipleDestSentinel) {
if (PredWithKnownDest == (size_t)pred_size(BB)) {
....
And sure enough we pass every one of those checks. We pre-saved OnlyDest to a
valid BB pointer and PredWithKnownDest == 2 which is the same as pred_size(BB).
We _incorrectly_ fold the false case, discarding the true path.
The goal here is to pessimize folding but still allow for JumpThreading of the
false path. We want to thread %start -> %condFalse -> %condStillFalse but leave
the %condTrue path alone.
A quick grep of the source shows PredWithKnownDest is being used as a proxy for
PredToDestList.size(). It's _only_ checked when attempting to fold.
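The trace above can be replayed with a small standalone model. This is only a sketch, not the LLVM code: ToyPred, ToyResult and analyzeBeforePatch are hypothetical names, blocks are plain strings, and the else branch that sets the sentinel is an assumption about the part the excerpts elide with "....". It keeps the pre-patch ordering (count first, then skip indirect predecessors) so the counter and the list can be seen drifting apart.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a predecessor block; not an LLVM type.
struct ToyPred {
  std::string Name;
  bool EndsWithIndirectBr; // true if its destination cannot be rewritten
  std::string Dest;        // successor implied by the known value of %cond
};

struct ToyResult {
  std::size_t PredWithKnownDest = 0;
  std::vector<std::pair<std::string, std::string>> PredToDestList;
  std::string OnlyDest;
  bool WouldFold = false;
};

// Mirrors the pre-patch ordering described in the trace.
ToyResult analyzeBeforePatch(const std::vector<ToyPred> &Preds) {
  const std::string MultipleDestSentinel = "<multiple>";
  ToyResult R;
  for (const ToyPred &P : Preds) {
    if (R.PredToDestList.empty())
      R.OnlyDest = P.Dest; // overwritten again if the list is still empty
    else if (R.OnlyDest != P.Dest)
      R.OnlyDest = MultipleDestSentinel; // assumed; elided in the excerpt

    ++R.PredWithKnownDest; // counted even when we skip the push below

    if (P.EndsWithIndirectBr)
      continue;
    R.PredToDestList.emplace_back(P.Name, P.Dest);
  }
  // Same shape as the fold check: one destination, every predecessor counted.
  R.WouldFold = !R.OnlyDest.empty() && R.OnlyDest != MultipleDestSentinel &&
                R.PredWithKnownDest == Preds.size();
  return R;
}
```

Feeding it the two predecessors from the trace, {"condTrue", true, "condStillTrue"} followed by {"start", false, "condStillFalse"}, leaves the counter at 2 with a single list entry, OnlyDest set to "condStillFalse", and WouldFold true.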
With this small patch:
diff --git a/llvm/lib/Transforms/Scalar/JumpThreading.cpp b/llvm/lib/Transforms/Scalar/JumpThreading.cpp
index 264ea3aa22a..3c6d10ba0d6 100644
--- a/llvm/lib/Transforms/Scalar/JumpThreading.cpp
+++ b/llvm/lib/Transforms/Scalar/JumpThreading.cpp
@@ -1643,15 +1643,15 @@ bool JumpThreadingPass::ProcessThreadableEdges(Value *Cond, BasicBlock *BB,
OnlyVal = MultipleVal;
}
- // We know where this predecessor is going.
- ++PredWithKnownDest;
-
// If the predecessor ends with an indirect goto, we can't change its
// destination. Same for CallBr.
if (isa<IndirectBrInst>(Pred->getTerminator()) ||
isa<CallBrInst>(Pred->getTerminator()))
continue;
+ // We know where this predecessor is going.
+ ++PredWithKnownDest;
+
PredToDestList.push_back(std::make_pair(Pred, DestBB));
}
We correctly synchronize the count of PredWithKnownDest and the size of
PredToDestList. I could have called PredToDestList.size() directly but I don't
know the history here, maybe it's faster to count while looping and this is
expensive code.
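The effect of the reordering on the counter can be sketched in isolation. This is a simplified model under stated assumptions, not the patched LLVM function: SketchPred, SketchState, collectAfterPatch and allPredsKnown are hypothetical names, blocks are strings, and the OnlyDest bookkeeping is left out because only the counter moves in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a predecessor block; not an LLVM type.
struct SketchPred {
  std::string Name;
  bool DestIsFixed; // ends with indirectbr or callbr, so it cannot be retargeted
  std::string Dest; // successor implied by the known value of the condition
};

struct SketchState {
  std::size_t PredWithKnownDest = 0;
  std::vector<std::pair<std::string, std::string>> PredToDestList;
};

// Patched ordering: skip fixed-destination predecessors before counting,
// so the counter always equals PredToDestList.size().
SketchState collectAfterPatch(const std::vector<SketchPred> &Preds) {
  SketchState S;
  for (const SketchPred &P : Preds) {
    if (P.DestIsFixed)
      continue;
    ++S.PredWithKnownDest;
    S.PredToDestList.emplace_back(P.Name, P.Dest);
  }
  return S;
}

// Counter half of the fold check: every predecessor must have been counted.
bool allPredsKnown(const SketchState &S, std::size_t NumPreds) {
  return S.PredWithKnownDest == NumPreds;
}
```

With the predecessors {"condTrue", true, "condStillTrue"} and {"start", false, "condStillFalse"}, the counter stays at 1, so the comparison against the two predecessors fails and folding is skipped, while the ("start", "condStillFalse") entry is still available for threading.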
This change produces the optimum correct JT result for your indirectbr2.ll test.
It also does not break callbr-edge-split.ll.
I am about to run it on test-suite to see if there are any issues. If not I'll
open a Phabricator issue.
By the way, the reason we don't see this with clang -O3 is the indirectbr is
optimized out before the first call to JumpThreading occurs. Here's the state
of @test1() the first time JT sees it:
Breakpoint 1, llvm::JumpThreadingPass::runImpl (this=0x6a42a0, F=...,
TLI_=0x6ac358,
LVI_=0x6a4660, AA_=0x72d950, DTU_=0x7fffffffa4f8, HasProfileData_=false,
BFI_=std::unique_ptr<llvm::BlockFrequencyInfo> = {...},
BPI_=std::unique_ptr<llvm::BranchProbabilityInfo> = {...})
at /work/b.rzycki/upstream/llvm-project/llvm/lib/Transforms/Scalar/JumpThreading.cpp:343
(gdb) p F.dump()
; Function Attrs: norecurse nounwind readnone
define i32 @test1(i1 %val1, i1 %val2, i1 %val3) local_unnamed_addr #0 {
entry:
br label %start
start: ; preds = %start, %entry
%cond = and i1 %val1, %val2
%brmerge.demorgan = and i1 %cond, %val3
br i1 %brmerge.demorgan, label %start, label %condFalse
condFalse: ; preds = %start
%not.cond = xor i1 %cond, true
%spec.select = zext i1 %not.cond to i32
ret i32 %spec.select
}
Another pass identified the indirectBr as unnecessary and converted it to a
standard conditional br.
Phabricator review created: https://reviews.llvm.org/D60284
(In reply to Brian Rzycki from comment #8)
> I strongly suspect you will not be able to cause this error via clang and a
> c/c++ file.
As it happens, you can, though it requires the use of the 'indirect goto' GNU
extension. We ran into this as part of upgrading our product's OS base to a
newer FreeBSD version, which uses clang 5.0.0. I verified that the error is
reproducible on clang 6.0.0 as well (FreeBSD 12's shipping compiler), and on
Apple's current Xcode compiler ("Apple clang version 11.0.0").
Backporting this fix to clang 5.0.0 solved our miscompilation.
I'll attach a reduced test case. As Matthias had found, two indirect targets
are necessary, together with a boolean which takes different values along one
path with an indirect jump and a second path of normal execution.
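For readers without the attachment, the code shape described here looks roughly like the sketch below. It is not the attached reduced.c and has not been checked against an affected compiler. toyIndirect and its labels are hypothetical names chosen to echo the blocks in the IR test case, and it relies on the GNU "labels as values" extension, so it needs GCC or Clang.

```cpp
#include <cassert>

// Two indirect targets (check, again) plus a boolean that is false on the
// direct path into "check" and true on the path through the indirect jump.
int toyIndirect(bool val1, bool val2, bool val3) {
  bool cond = false;
  void *target = nullptr;
start:
  cond = val1 && val2;
  if (!cond)
    goto check;       // normal path: cond is false when "check" is reached
  target = &&check;   // GNU extension: take the address of a label
  if (val3)
    target = &&again;
  goto *target;       // indirect path: cond is true when "check" is reached
again:
  val3 = false;       // take the loop back edge once, then leave
  goto start;
check:
  if (cond)
    return 0;         // corresponds to condStillTrue
  return 1;           // corresponds to condStillFalse
}
```

A correct compile returns 0 whenever val1 and val2 are both true and 1 otherwise. The miscompile discussed in this thread corresponds to the check on cond being folded to the false arm for every path into that block.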
Attached reduced.c
(518 bytes, text/plain): clang test case which reproduces the bug