Closed Quuxplusone closed 8 years ago
(In reply to comment #0)
> The following Rust playpen snippet (possible to see IR or ASM interactively)
>
> http://is.gd/FrGoDm
I see that I can specify debug or release, but how do I mimic the variations
that you are testing (SSE, SSE2, etc)?
I copy/pasted the debug IR and ran it through opt, but I'm not seeing the same
optimized IR that shows up on that web page.
If you can attach the various IR cases, that would probably be the best way for
someone to investigate further. Thanks!
Right, playpen is still limited to just x86_64.
I'm attaching the respective IR's.
Attached IR_files.zip
(4699 bytes, application/zip): IR output
The closest to what you wanted to do is available via the excellent
http://rust.godbolt.org/ site (other languages are available too).
Even though it's still just x86_64 at the moment, all compiler options can be
changed interactively.
http://goo.gl/ozlAEr
I may be missing something: isn't the IR for your p3.ll and p4.ll cases identical for folds2()? If that's correct, then how do you explain the perf difference?
Apologies for being Rust illiterate: what flags do you specify on the godbolt site to see the difference between SSE1/SSE2?
The flags on 32-bit (godbolt seems to accept them but it shouldn't work on x86_64) would be these:
-C target-feature=-sse
to disable SSE or
-C target-cpu=pentium2
, etc.
As for the IR files, my apologies, it seems I just generated them for the main.rs and not the actual benchmarks. See the latest attachment.
Attached IR_bench_files.zip
(15608 bytes, application/zip): Benchmarks IR output
(In reply to comment #6)
> The flags on 32-bit (godbolt seems to accept them but it shouldn't work on
> x86_64) would be these:
> -C target-feature=-sse
> to disable SSE or
> -C target-cpu=pentium2
> , etc.
Great! I just took a quick look, but it seems that with SSE - but without SSE2 - the code is getting unrolled needlessly and harmfully. This would suggest that something in the unroller is off; there's no sign of vectorization with that config.
Note: I filed bug 26859 while looking at the codegen for the vectorized case because I thought there was a serious bug there, but that turned out to be false. At most, the use of a horizontal add op on a CPU that has a slow implementation would cost a few extra cycles, so that's not the main problem or possibly not a problem at all for this example. We should note for the record which CPU you're benchmarking on though.
Ah, yes - it's a Core2 T6600 @ 2.2 GHz in 32-bit mode.
When that CPU is targeted directly (up to sse4.1), we get this result:
running 2 tests
test folds1 ... bench: 249 ns/iter (+/- 0)
test folds2 ... bench: 499 ns/iter (+/- 2)
which for the first time manages to vectorise folds2, too.
Attached core2.ll
(38743 bytes, application/octet-stream): Core2 IR, for completeness
(In reply to comment #8)
> Great! I just took a quick look, but it seems that with SSE - but without
> SSE2 - the code is getting unrolled needlessly and harmfully. This would
> suggest that something in the unroller is off; there's no sign of
> vectorization with that config.
While there was no sign of vectorization in the asm, the IR was in fact
vectorized. So that's the problem for the SSE1 case: the vectorizer believes
that vectorization of integer math is useful with any SSE available. So when
the backend gets the integer vector IR for a target with no SSE2, it has to
legalize all of that as scalar ops which looks like an insane unrolling effort.
So I think the bug is in the vectorizer: it shouldn't try to vectorize integer
math without SSE2.
For reference and since I'm a C dinosaur: this should be the C equivalent of
"folds1":
int accum(int *x, int N) {
  int sum = 0;
  for (int i = 0; i < N; ++i)
    sum += x[i];
  return sum;
}
Yup, thanks for reminding everyone C is still the golden standard Rust is trying to improve on. And besides, with Moore's law slowing down, the dinosaurs will rule again! ;)
So this is quite an interesting problem...the vectorizer's cost model is
actually very good here: it *knows* that integer vector math is illegal for the
SSE1 target and still decides to vectorize using <4 x i32> and an interleave
factor of 2 (that's similar to an unroll factor):
LV: Found an estimated cost of 1 for VF 1 For instruction: %add = add nsw i32
%0, %sum.02
LV: Found an estimated cost of 4 for VF 4 For instruction: %add = add nsw i32
%0, %sum.02
The cost equation is tipped in favor of vectorization because vectorization
reduces the per-loop overhead of the loop increment and comparison.
The thing that makes the final codegen look crazy is, in fact, the loop
unroller! It takes the vectorized loop with 2 adds and fattens that up by a
factor of 4 (because it doesn't know/care that the vector ops are illegal). And
still, this is not really *that* wrong - there's no spilling.
So we make a series of locally correct decisions that end up producing a big
pile of code that ends up being a perf loser (at least on Penryn).
I'm not sure what the answer should be. cc'ing some vectorizer folks.
Here's the IR test case for an SSE1 target. Test with:
$ opt -basicaa -loop-vectorize -instcombine -loop-unroll accum2.ll -S
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.11.0"
define i32 @accum(i32* nocapture readonly %x, i32 %N) #0 {
entry:
%cmp1 = icmp sgt i32 %N, 0
br i1 %cmp1, label %for.inc.preheader, label %for.end
for.inc.preheader:
br label %for.inc
for.inc:
%indvars.iv = phi i64 [ %indvars.iv.next, %for.inc ], [ 0, %for.inc.preheader ]
%sum.02 = phi i32 [ %add, %for.inc ], [ 0, %for.inc.preheader ]
%arrayidx = getelementptr inbounds i32, i32* %x, i64 %indvars.iv
%0 = load i32, i32* %arrayidx, align 4
%add = add nsw i32 %0, %sum.02
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%lftr.wideiv = trunc i64 %indvars.iv.next to i32
%exitcond = icmp eq i32 %lftr.wideiv, %N
br i1 %exitcond, label %for.end.loopexit, label %for.inc
for.end.loopexit:
%add.lcssa = phi i32 [ %add, %for.inc ]
br label %for.end
for.end:
%sum.0.lcssa = phi i32 [ 0, %entry ], [ %add.lcssa, %for.end.loopexit ]
ret i32 %sum.0.lcssa
}
attributes #0 = { "target-cpu"="core2" "target-features"="+sse,-avx,-avx2,-sse2" }
So that it's clear what the extremes are for this case...this is the simple
scalar loop:
LBB0_1:
addl (%rdi), %eax
addq $4, %rdi
decl %esi
jne LBB0_1
And this is the main loop body with vectorization and unrolling for the machine
that doesn't have SSE2:
LBB0_9:
addl -100(%rdx), %ecx
addl -104(%rdx), %r11d
addl -108(%rdx), %ebp
addl -112(%rdx), %r10d
addl -84(%rdx), %eax
addl -88(%rdx), %ebx
addl -92(%rdx), %r14d
addl -96(%rdx), %r13d
addl -80(%rdx), %r10d
addl -76(%rdx), %ebp
addl -72(%rdx), %r11d
addl -68(%rdx), %ecx
addl -64(%rdx), %r13d
addl -60(%rdx), %r14d
addl -56(%rdx), %ebx
addl -52(%rdx), %eax
addl -36(%rdx), %ecx
addl -40(%rdx), %r11d
addl -44(%rdx), %ebp
addl -48(%rdx), %r10d
addl -20(%rdx), %eax
addl -24(%rdx), %ebx
addl -28(%rdx), %r14d
addl -32(%rdx), %r13d
addl -16(%rdx), %r10d
addl -12(%rdx), %ebp
addl -8(%rdx), %r11d
addl -4(%rdx), %ecx
addl (%rdx), %r13d
addl 4(%rdx), %r14d
addl 8(%rdx), %ebx
addl 12(%rdx), %eax
subq $-128, %rdx
addq $-32, %r12
jne LBB0_9
That's bad. It seems that our 'user cost' cost model, which we use for unrolling (and inlining), does not account for scalarization costs? If so, that certainly needs to be fixed.
(In reply to comment #16)
> That's bad. It seems that our 'user cost' cost model, which we use for
> unrolling (and inlining) does not account for scalarization costs? If so,
> that certainly needs to be fixed.
I'll double check that to be sure, but that's what I saw when stepping through.
getOperationCost() in TargetTransformInfoImpl doesn't distinguish based on type
(or much of anything really!), so everything is cost = TCC_Basic. I never hit
my breakpoint in getVectorInstrCost().
(In reply to comment #17)
> I'll double check that to be sure, but that's what I saw when stepping
> through. getOperationCost() in TargetTransformInfoImpl doesn't distinguish
> based on type (or much of anything really!), so everything is cost =
> TCC_Basic. I never hit my breakpoint in getVectorInstrCost().
I've confirmed that. Here's a simpler test case to show the problem. The
"NumInsts" reported by CodeMetrics is "6": the gep, vecload, vecadd, add,
icmp, and br each have a cost of "1" (TCC_Basic) based on TTI.getUserCost().
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.11.0"
define <4 x i32> @unroll_illegal_vector_ops(<4 x i32>* nocapture readonly %x, i64 %N) #0 {
entry:
%cmp1 = icmp sgt i64 %N, 0
br i1 %cmp1, label %vector.body, label %end
vector.body:
%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %entry ], [ %add, %vector.body ]
%gep = getelementptr inbounds <4 x i32>, <4 x i32>* %x, i64 %index
%load = load <4 x i32>, <4 x i32>* %gep, align 4
%add = add nsw <4 x i32> %load, %vec.phi
%index.next = add i64 %index, 1
%cmp2 = icmp eq i64 %index.next, %N
br i1 %cmp2, label %end, label %vector.body
end:
%sum = phi <4 x i32> [ zeroinitializer, %entry ], [ %add, %vector.body ]
ret <4 x i32> %sum
}
attributes #0 = { "target-cpu"="core2" "target-features"="+sse,-avx,-avx2,-sse2" }
To fix this, we need to unify the cost models? It seems like we're going to need LoopVectorizationCostModel::getInstructionCost() or something very close to it.
Sanjay, I don't understand your suggestion. The cost model is designed to handle the cost of vectors and scalars. This is how the vectorizer is able to decide if it is profitable to vectorize. It sounds like we need to fix the Basic cost model (which is the default cost model if no hardware information is available).
(In reply to comment #20)
> Sanjay, I don't understand your suggestion. The cost model is designed to
> handle the cost of vectors and scalars. This is how the vectorizer is able
> to decide if it is profitable to vectorize. It sounds like we need to fix
> the Basic cost model (which is the default cost model if no hardware
> information is available).
I haven't thought this through, so it might be completely wrong. :)
If we look at LoopVectorizationCostModel::getInstructionCost() :
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/Vectorize/LoopVectorize.cpp#L5645
We have a switch over all of the instruction opcodes, and each opcode uses the
appropriate TTI.get***Cost() function to calculate the cost. That's very good!
It is of course specialized for the vectorizer because it's also taking into
account the interleave factor.
Now if we look at TTI's getOperationCost() which is called by getUserCost():
https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/TargetTransformInfoImpl.h#L47
The problem is that this isn't specialized at all to use the target hooks like
the vectorizer. In order to calculate costs correctly (for example, when a
vector type is not legal), won't we end up using all of the same hooks that the
vectorizer is currently using?
(In reply to comment #19)
> To fix this, we need to unify the cost models? It seems like we're going to
> need LoopVectorizationCostModel::getInstructionCost() or something very
> close to it.
I think that we should unify the cost models, but we need to be careful because
the two cost models don't compute the same thing. The vectorization cost model
is mainly concerned with throughput estimates, whereas the cost models used by
the inliner and unroller are more concerned with code size (or, looking at it a
different way, the number of uops). Both do need to handle scalarization,
however, and need a lot of the same basic inputs.
From my perspective, the cost models are more the same than different (on a
simple in-order core, they might be *exactly* the same), but we definitely need
to preserve the distinctions.
(In reply to comment #22)
> From my perspective, the cost models are more the same than different (on a
> simple in-order core, they might be *exactly* the same), but we definitely
> need to preserve the distinctions.
Understood. I'll start by using the vectorizer's switch as a model and copying
liberally. The refactoring potential should become more obvious as I go along.
(In reply to comment #23)
> Understood. I'll start by using the vectorizer's switch as a model and
> copying liberally.
Well, that was too ambitious...
Even a simple patch as pasted below (and which actually does reasonably well on
the test case for this bug report - the unroll factor is reduced from '4' to
'2') causes regression test failures in:
LLVM :: Transforms/BBVectorize/X86/loop1.ll
LLVM :: Transforms/BBVectorize/loop1.ll
LLVM :: Transforms/CodeGenPrepare/X86/select.ll
LLVM :: Transforms/SimplifyCFG/speculate-math.ll
Can we nuke BBVectorize yet? :)
One of the problems arises from inconsistency about FP op costs:
// Assume that floating point arithmetic operations cost twice as much as
// integer operations.
unsigned OpCost = (IsFloat ? 2 : 1);
...but somewhere else:
unsigned getFPOpCost(Type *Ty) {
// By default, FP instructions are no more expensive since they are
// implemented in HW. Target specific TTI can override this.
return TargetTransformInfo::TCC_Basic;
}
...but somewhere else again:
case Instruction::FDiv:
return TTI::TCC_Expensive;
When I go to straighten this out, more tests fail. It will take a bit of work
to untangle everything.
Index: include/llvm/Analysis/TargetTransformInfoImpl.h
===================================================================
--- include/llvm/Analysis/TargetTransformInfoImpl.h (revision 262928)
+++ include/llvm/Analysis/TargetTransformInfoImpl.h (working copy)
@@ -506,6 +506,25 @@
return TTI::TCC_Free;
}
+ if (const auto *BO = dyn_cast<BinaryOperator>(U)) {
+ return static_cast<T *>(this)->getArithmeticInstrCost(
+ BO->getOpcode(), BO->getType(), TargetTransformInfo::OK_AnyValue,
+ TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None,
+ TargetTransformInfo::OP_None);
+ }
+
+ if (const auto *Load = dyn_cast<LoadInst>(U)) {
+ return static_cast<T *>(this)->getMemoryOpCost(
+ Load->getOpcode(), Load->getType(), Load->getAlignment(),
+ Load->getPointerAddressSpace());
+ }
+
+ if (const auto *Store = dyn_cast<StoreInst>(U)) {
+ return static_cast<T *>(this)->getMemoryOpCost(
+ Store->getOpcode(), Store->getValueOperand()->getType(),
+ Store->getAlignment(), Store->getPointerAddressSpace());
+ }
+
return static_cast<T *>(this)->getOperationCost(
Operator::getOpcode(U), U->getType(),
U->getNumOperands() == 1 ? U->getOperand(0)->getType() : nullptr);
Somewhat related question: would it make sense to have both costs: a code-size cost and an execution-time cost? I encountered a similar issue when working on the unroller - we want to reason about the tradeoff between code-size increase and execution-time improvement, but we estimate both of them with the same cost.
(In reply to comment #24)
> (In reply to comment #23)
> > Understood. I'll start by using the vectorizer's switch as a model and
> > copying liberally.
>
> Well, that was too ambitious...
>
> Even a simple patch as pasted below (and which actually does reasonably well
> on the test case for this bug report - the unroll factor is reduced from '4'
> to '2') causes regression test failures in:
> LLVM :: Transforms/BBVectorize/X86/loop1.ll
> LLVM :: Transforms/BBVectorize/loop1.ll
> LLVM :: Transforms/CodeGenPrepare/X86/select.ll
> LLVM :: Transforms/SimplifyCFG/speculate-math.ll
>
> Can we nuke BBVectorize yet? :)
>
> One of the problems arises from inconsistency about FP op costs:
>
> // Assume that floating point arithmetic operations cost twice as much as
> // integer operations.
> unsigned OpCost = (IsFloat ? 2 : 1);
This is a hack that we should have gotten rid of long ago. It is not even close
to true on many targets (including some of mine). If this is reasonable for any
given set of target configurations, this logic belongs in those targets.
>
> ...but somewhere else:
>
> unsigned getFPOpCost(Type *Ty) {
> // By default, FP instructions are no more expensive since they are
> // implemented in HW. Target specific TTI can override this.
> return TargetTransformInfo::TCC_Basic;
> }
>
> ...but somewhere else again:
>
> case Instruction::FDiv:
> return TTI::TCC_Expensive;
>
> When I go to straighten this out, more tests fail. It will take a bit of
> work to untangle everything.
Interesting; this is probably fast-math-dependent, because fdivs are much
cheaper when we can use the reciprocal estimates. Regardless, estimating the
size of the resulting code sequence will require some dedicated logic.
>
...
(In reply to comment #25)
> Somewhat related question: would it make sense to have both costs: a code-size
> cost and an execution-time cost? I encountered a similar issue when working on
> the unroller - we want to reason about the tradeoff between code-size increase
> and execution-time improvement, but we estimate both of them with the same cost.
Ideally, I think we need four costs:
1. Throughput [what the vectorizer currently uses]
2. Latency [what the vectorizer really should use to determine its interleaving factor]
3. Size [what the inliner and full unroller should use, at least most of the time]
4. uop Count [what should be used for partial unrolling, at least in x86 cores with an LSD]
In the short term, in addition to the rationalization being investigated, I
think that we should group what we have behind a common interface. It is
plausible that (3) and (4) are sufficiently correlated to not require a
distinction. Thoughts?
(In reply to comment #27)
> (In reply to comment #25)
> > Somewhat related question: would it make sense to have both costs: a code-size
> > cost and an execution-time cost? I encountered a similar issue when working on
> > the unroller - we want to reason about the tradeoff between code-size increase
> > and execution-time improvement, but we estimate both of them with the same cost.
>
> Ideally, I think we need four costs:
>
> 1. Throughput [what the vectorizer currently uses]
> 2. Latency [what the vectorizer really should use to determine its
> interleaving factor]
> 3. Size [what the inliner and full unroller should use, at least most of
> the time]
> 4. uop Count [what should be used for partial unrolling, at least in x86
> cores with an LSD]
>
> In the short term, in addition to the rationalization being investigated, I
> think that we should group what we have behind a common interface. It is
> plausible that (3) and (4) are sufficiently correlated to not require a
> distinction. Thoughts?
Yes, I like this idea. At least for me, the interface itself is confusing
because it's not explicit about exactly which "cost" I am getting.
I think there may even be some finer distinctions that we want to surface from
the cost model, and this bug report raises it.
If we know, for example, that a vector operation does not exist for the target,
we probably shouldn't vectorize the IR regardless of whether the cost model
shows a win. Effectively, the vectorizer is encroaching on the unroller's job
because we know that legalization is going to have to "unroll" the vector op
for us.
For the case shown here, we're harming the unroller because the vectorizer has
created its own unroll factor. This limits the multipliers that the unroller
can use. AFAICT, there is no way that the unroller can overcome this currently.
If the cost model told the unroller that the vector add was illegal, it could
take that into account, although I would think it's still better that the
vectorizer didn't fire in the first place for this particular example.
(In reply to comment #28)
> (In reply to comment #27)
> > (In reply to comment #25)
> > > Somewhat related question: would it make sense to have both costs: a code-size
> > > cost and an execution-time cost? I encountered a similar issue when working on
> > > the unroller - we want to reason about the tradeoff between code-size increase
> > > and execution-time improvement, but we estimate both of them with the same cost.
> >
> > Ideally, I think we need four costs:
> >
> > 1. Throughput [what the vectorizer currently uses]
> > 2. Latency [what the vectorizer really should use to determine its
> > interleaving factor]
> > 3. Size [what the inliner and full unroller should use, at least most of
> > the time]
> > 4. uop Count [what should be used for partial unrolling, at least in x86
> > cores with an LSD]
> >
> > In the short term, in addition to the rationalization being investigated, I
> > think that we should group what we have behind a common interface. It is
> > plausible that (3) and (4) are sufficiently correlated to not require a
> > distinction. Thoughts?
>
> Yes, I like this idea. At least for me, the interface itself is confusing
> because it's not explicit about exactly which "cost" I am getting.
>
> I think there may even be some finer distinctions that we want to surface
> from the cost model, and this bug report raises it.
>
> If we know, for example, that a vector operation does not exist for the
> target, we probably shouldn't vectorize the IR regardless of whether the
> cost model shows a win. Effectively, the vectorizer is encroaching on the
> unroller's job because we know that legalization is going to have to
> "unroll" the vector op for us.
I think we need to be careful here, because for large loops, even if we have to
scalarize some small piece of it, the vectorization overall can be a huge win.
On some architectures, scalarization is not actually all that expensive. On
others, especially where it involves trips through the stack, it can kill you.
We might, however, want to be a little more conservative for edge cases (at
least once our cost model is more sane).
>
> For the case shown here, we're harming the unroller because the vectorizer
> has created its own unroll factor. This limits the multipliers that the
> unroller can use. AFAICT, there is no way that the unroller can overcome
> this currently.
>
> If the cost model told the unroller that the vector add was illegal, it
> could take that into account, although I would think it's still better that
> the vectorizer didn't fire in the first place for this particular example.
(In reply to comment #27)
> (In reply to comment #25)
> > Somewhat related question: would it make sense to have both costs: a code-size
> > cost and an execution-time cost? I encountered a similar issue when working on
> > the unroller - we want to reason about the tradeoff between code-size increase
> > and execution-time improvement, but we estimate both of them with the same cost.
>
> Ideally, I think we need four costs:
>
> 1. Throughput [what the vectorizer currently uses]
> 2. Latency [what the vectorizer really should use to determine its
> interleaving factor]
> 3. Size [what the inliner and full unroller should use, at least most of
> the time]
> 4. uop Count [what should be used for partial unrolling, at least in x86
> cores with an LSD]
>
> In the short term, in addition to the rationalization being investigated, I
> think that we should group what we have behind a common interface. It is
> plausible that (3) and (4) are sufficiently correlated to not require a
> distinction. Thoughts?
Yeah, having these data would be very valuable for a number of passes. But it's
important to remember that it's still just a rough estimate, so I don't know
how far we should go in providing details about it.
For instance, latency and throughput are quite low-level attributes; they
obviously apply well to actual assembler instructions, but using them for IR
instructions might be a bit inaccurate. My point is that while we do need more
granular costs, I'm not completely sure that we need to model (or pretend that
we're modeling) such properties as 'latency', 'throughput', and especially 'uop
Count'. What do you think?
Hi everyone,
The piece of code in question seems to be jinxed - could someone have a look at issue 26882 or the link below to see why it makes llvm crash?
Thx
@PeteVine, can you please attach the LLVM IR and a stack trace?
If the code crashes then it would be a good idea to file a new bug report. Please CC me on the new bug report. I'd like to try to help.
http://reviews.llvm.org/D18537
Looks like that patch will prevent the unwanted vectorization for this test
case.
Thanks, Hal!
Note that the unroller will simply stamp out the same (dependent) op over and
over:
addl -28(%rcx), %eax
addl -24(%rcx), %eax
addl -20(%rcx), %eax
...
So we get no real parallelization after that patch, but I don't think that
really matters here.
In theory, the "Pentium3 (+sse)" perf case from the original problem
description should be the same as "Pentium2" now.
Thanks - the next time I build rustc, it will be with this patch. I'll let you know about the result in a few days.
(In reply to comment #34)
> Thanks - the next time I build rustc, it will be with this patch. I'll let
> you know about the result in a few days.
Committed (with Sanjay's test case from comment 14) in r264904.
Yes, fully fixed, as predicted.