Closed Quuxplusone closed 8 years ago
(In reply to comment #0)
> The following Rust playpen snippet (possible to see IR or ASM interactively)
>
> http://is.gd/FrGoDm
I see that I can specify debug or release, but how do I mimic the variations
that you are testing (SSE, SSE2, etc)?
I copy/pasted the debug IR and ran it through opt, but I'm not seeing the same
optimized IR that shows up on that web page.
If you can attach the various IR cases, that would probably be the best way for
someone to investigate further. Thanks!
Right, playpen is still limited to just x86_64.
I'm attaching the respective IR's.
Attached IR_files.zip
(4699 bytes, application/zip): IR output
The closest to what you wanted to do is available via the excellent
http://rust.godbolt.org/ site (other languages are available too).
Even though it's still just x86_64 at the moment, all compiler options can be
changed interactively.
http://goo.gl/ozlAEr
I may be missing something: isn't the IR for your p3.ll and p4.ll cases identical for folds2()? If that's correct, then how do you explain the perf difference?
Apologies for being Rust illiterate: what flags do you specify on the godbolt site to see the difference between SSE1/SSE2?
The flags on 32-bit (godbolt seems to accept them but it shouldn't work on x86_64) would be these:
-C target-feature=-sse
to disable SSE or
-C target-cpu=pentium2
, etc.
As for the IR files, my apologies, it seems I just generated them for the main.rs and not the actual benchmarks. See the latest attachment.
Attached IR_bench_files.zip
(15608 bytes, application/zip): Benchmarks IR output
(In reply to comment #6)
> The flags on 32-bit (godbolt seems to accept them but it shouldn't work on
> x86_64) would be these:
> -C target-feature=-sse
> to disable SSE or
> -C target-cpu=pentium2
> , etc.
Great! I just took a quick look, but it seems that with SSE - but without SSE2 - the code is getting unrolled needlessly and harmfully. This would suggest that something in the unroller is off; there's no sign of vectorization with that config.
Note: I filed bug 26859 while looking at the codegen for the vectorized case because I thought there was a serious bug there, but that turned out to be false. At most, the use of a horizontal add op on a CPU that has a slow implementation would cost a few extra cycles, so that's not the main problem or possibly not a problem at all for this example. We should note for the record which CPU you're benchmarking on though.
Ah, yes - it's a Core2 T6600 @ 2.2 GHz in 32-bit mode.
When that CPU is targeted directly (up to sse4.1), we get this result:
running 2 tests
test folds1 ... bench: 249 ns/iter (+/- 0)
test folds2 ... bench: 499 ns/iter (+/- 2)
which for the first time manages to vectorise folds2, too.
Attached core2.ll
(38743 bytes, application/octet-stream): Core2 IR, for completeness
(In reply to comment #8)
> Great! I just took a quick look, but it seems that with SSE - but without
> SSE2 - the code is getting unrolled needlessly and harmfully. This would
> suggest that something in the unroller is off; there's no sign of
> vectorization with that config.
While there was no sign of vectorization in the asm, the IR was in fact
vectorized. So that's the problem for the SSE1 case: the vectorizer believes
that vectorization of integer math is useful with any SSE available. So when
the backend gets the integer vector IR for a target with no SSE2, it has to
legalize all of that as scalar ops which looks like an insane unrolling effort.
So I think the bug is in the vectorizer: it shouldn't try to vectorize integer
math without SSE2.
For reference and since I'm a C dinosaur: this should be the C equivalent of
"folds1":
int accum(int *x, int N) {
  int sum = 0;
  for (int i = 0; i < N; ++i)
    sum += x[i];
  return sum;
}
Yup, thanks for reminding everyone C is still the golden standard Rust is trying to improve on. And besides, with Moore's law slowing down, the dinosaurs will rule again! ;)
So this is quite an interesting problem...the vectorizer's cost model is
actually very good here: it *knows* that integer vector math is illegal for the
SSE1 target and still decides to vectorize using <4 x i32> and an interleave
factor of 2 (that's similar to an unroll factor):
LV: Found an estimated cost of 1 for VF 1 For instruction: %add = add nsw i32
%0, %sum.02
LV: Found an estimated cost of 4 for VF 4 For instruction: %add = add nsw i32
%0, %sum.02
The cost equation is tipped in favor of vectorization because vectorization
reduces the per-loop overhead of the loop increment and comparison.
The thing that makes the final codegen look crazy is, in fact, the loop
unroller! It takes the vectorized loop with 2 adds and fattens that up by a
factor of 4 (because it doesn't know/care that the vector ops are illegal). And
still, this is not really *that* wrong - there's no spilling.
So we make a series of locally correct decisions that end up producing a big
pile of code that ends up being a perf loser (at least on Penryn).
I'm not sure what the answer should be. cc'ing some vectorizer folks.
Here's the IR test case for an SSE1 target. Test with:
$ opt -basicaa -loop-vectorize -instcombine -loop-unroll accum2.ll -S
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.11.0"
define i32 @accum(i32* nocapture readonly %x, i32 %N) #0 {
entry:
%cmp1 = icmp sgt i32 %N, 0
br i1 %cmp1, label %for.inc.preheader, label %for.end
for.inc.preheader:
br label %for.inc
for.inc:
%indvars.iv = phi i64 [ %indvars.iv.next, %for.inc ], [ 0, %for.inc.preheader ]
%sum.02 = phi i32 [ %add, %for.inc ], [ 0, %for.inc.preheader ]
%arrayidx = getelementptr inbounds i32, i32* %x, i64 %indvars.iv
%0 = load i32, i32* %arrayidx, align 4
%add = add nsw i32 %0, %sum.02
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%lftr.wideiv = trunc i64 %indvars.iv.next to i32
%exitcond = icmp eq i32 %lftr.wideiv, %N
br i1 %exitcond, label %for.end.loopexit, label %for.inc
for.end.loopexit:
%add.lcssa = phi i32 [ %add, %for.inc ]
br label %for.end
for.end:
%sum.0.lcssa = phi i32 [ 0, %entry ], [ %add.lcssa, %for.end.loopexit ]
ret i32 %sum.0.lcssa
}
attributes #0 = { "target-cpu"="core2" "target-features"="+sse,-avx,-avx2,-sse2" }
So that it's clear what the extremes are for this case...this is the simple
scalar loop:
LBB0_1:
addl (%rdi), %eax
addq $4, %rdi
decl %esi
jne LBB0_1
And this is the main loop body with vectorization and unrolling for the machine
that doesn't have SSE2:
LBB0_9:
addl -100(%rdx), %ecx
addl -104(%rdx), %r11d
addl -108(%rdx), %ebp
addl -112(%rdx), %r10d
addl -84(%rdx), %eax
addl -88(%rdx), %ebx
addl -92(%rdx), %r14d
addl -96(%rdx), %r13d
addl -80(%rdx), %r10d
addl -76(%rdx), %ebp
addl -72(%rdx), %r11d
addl -68(%rdx), %ecx
addl -64(%rdx), %r13d
addl -60(%rdx), %r14d
addl -56(%rdx), %ebx
addl -52(%rdx), %eax
addl -36(%rdx), %ecx
addl -40(%rdx), %r11d
addl -44(%rdx), %ebp
addl -48(%rdx), %r10d
addl -20(%rdx), %eax
addl -24(%rdx), %ebx
addl -28(%rdx), %r14d
addl -32(%rdx), %r13d
addl -16(%rdx), %r10d
addl -12(%rdx), %ebp
addl -8(%rdx), %r11d
addl -4(%rdx), %ecx
addl (%rdx), %r13d
addl 4(%rdx), %r14d
addl 8(%rdx), %ebx
addl 12(%rdx), %eax
subq $-128, %rdx
addq $-32, %r12
jne LBB0_9
That's bad. It seems that our 'user cost' cost model, which we use for unrolling (and inlining), does not account for scalarization costs? If so, that certainly needs to be fixed.
(In reply to comment #16)
> That's bad. It seems that our 'user cost' cost model, which we use for
> unrolling (and inlining) does not account for scalarization costs? If so,
> that certainly needs to be fixed.
I'll double check that to be sure, but that's what I saw when stepping through.
getOperationCost() in TargetTransformInfoImpl doesn't distinguish based on type
(or much of anything really!), so everything is cost = TCC_Basic. I never hit
my breakpoint in getVectorInstrCost().
(In reply to comment #17)
> I'll double check that to be sure, but that's what I saw when stepping
> through. getOperationCost() in TargetTransformInfoImpl doesn't distinguish
> based on type (or much of anything really!), so everything is cost =
> TCC_Basic. I never hit my breakpoint in getVectorInstrCost().
I've confirmed that. Here's a simpler test case to show the problem. The
"NumInsts" reported by CodeMetrics is "6": the gep, vecload, vecadd, add,
icmp, and br each have a cost of "1" (TCC_Basic) based on TTI.getUserCost().
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.11.0"
define <4 x i32> @unroll_illegal_vector_ops(<4 x i32>* nocapture readonly %x, i64 %N) #0 {
entry:
%cmp1 = icmp sgt i64 %N, 0
br i1 %cmp1, label %vector.body, label %end
vector.body:
%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
%vec.phi = phi <4 x i32> [ zeroinitializer, %entry ], [ %add, %vector.body ]
%gep = getelementptr inbounds <4 x i32>, <4 x i32>* %x, i64 %index
%load = load <4 x i32>, <4 x i32>* %gep, align 4
%add = add nsw <4 x i32> %load, %vec.phi
%index.next = add i64 %index, 1
%cmp2 = icmp eq i64 %index.next, %N
br i1 %cmp2, label %end, label %vector.body
end:
%sum = phi <4 x i32> [ zeroinitializer, %entry ], [ %add, %vector.body ]
ret <4 x i32> %sum
}
attributes #0 = { "target-cpu"="core2" "target-features"="+sse,-avx,-avx2,-sse2" }
To fix this, we need to unify the cost models? It seems like we're going to need LoopVectorizationCostModel::getInstructionCost() or something very close to it.
Sanjay, I don't understand your suggestion. The cost model is designed to handle the cost of vectors and scalars. This is how the vectorizer is able to decide if it is profitable to vectorize. It sounds like we need to fix the Basic cost model (which is the default cost model if no hardware information is available).
(In reply to comment #20)
> Sanjay, I don't understand your suggestion. The cost model is designed to
> handle the cost of vectors and scalars. This is how the vectorizer is able
> to decide if it is profitable to vectorize. It sounds like we need to fix
> the Basic cost model (which is the default cost model if no hardware
> information is available).
I haven't thought this through, so it might be completely wrong. :)
If we look at LoopVectorizationCostModel::getInstructionCost() :
https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/Vectorize/LoopVectorize.cpp#L5645
We have a switch over all of the instruction opcodes, and each opcode uses the
appropriate TTI.get***Cost() function to calculate the cost. That's very good!
It is of course specialized for the vectorizer because it's also taking into
account the interleave factor.
Now if we look at TTI's getOperationCost() which is called by getUserCost():
https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/TargetTransformInfoImpl.h#L47
The problem is that this isn't specialized at all to use the target hooks like
the vectorizer. In order to calculate costs correctly (for example, when a
vector type is not legal), won't we end up using all of the same hooks that the
vectorizer is currently using?
(In reply to comment #19)
> To fix this, we need to unify the cost models? It seems like we're going to
> need LoopVectorizationCostModel::getInstructionCost() or something very
> close to it.
I think that we should unify the cost models, but we need to be careful because
the two cost models don't compute the same thing. The vectorization cost model
is mainly concerned with throughput estimates, whereas the cost models used by
the inliner and unroller are more concerned with code size (or, looking at it a
different way, the number of uops). Both do need to handle scalarization,
however, and need a lot of the same basic inputs.
From my perspective, the cost models are more the same than different (on a
simple in-order core, they might be *exactly* the same), but we definitely need
to preserve the distinctions.
(In reply to comment #22)
> From my perspective, the cost models are more the same than different (on a
> simple in-order core, they might be *exactly* the same), but we definitely
> need to preserve the distinctions.
Understood. I'll start by using the vectorizer's switch as a model and copying
liberally. The refactoring potential should become more obvious as I go along.
(In reply to comment #23)
> Understood. I'll start by using the vectorizer's switch as a model and
> copying liberally.
Well, that was too ambitious...
Even a simple patch as pasted below (and which actually does reasonably well on
the test case for this bug report - the unroll factor is reduced from '4' to
'2') causes regression test failures in:
LLVM :: Transforms/BBVectorize/X86/loop1.ll
LLVM :: Transforms/BBVectorize/loop1.ll
LLVM :: Transforms/CodeGenPrepare/X86/select.ll
LLVM :: Transforms/SimplifyCFG/speculate-math.ll
Can we nuke BBVectorize yet? :)
One of the problems arises from inconsistency about FP op costs:
// Assume that floating point arithmetic operations cost twice as much as
// integer operations.
unsigned OpCost = (IsFloat ? 2 : 1);
...but somewhere else:
unsigned getFPOpCost(Type *Ty) {
// By default, FP instructions are no more expensive since they are
// implemented in HW. Target specific TTI can override this.
return TargetTransformInfo::TCC_Basic;
}
...but somewhere else again:
case Instruction::FDiv:
return TTI::TCC_Expensive;
When I go to straighten this out, more tests fail. It will take a bit of work
to untangle everything.
Index: include/llvm/Analysis/TargetTransformInfoImpl.h
===================================================================
--- include/llvm/Analysis/TargetTransformInfoImpl.h (revision 262928)
+++ include/llvm/Analysis/TargetTransformInfoImpl.h (working copy)
@@ -506,6 +506,25 @@
return TTI::TCC_Free;
}
+ if (const auto *BO = dyn_cast<BinaryOperator>(U)) {
+ return static_cast<T *>(this)->getArithmeticInstrCost(
+ BO->getOpcode(), BO->getType(), TargetTransformInfo::OK_AnyValue,
+ TargetTransformInfo::OK_AnyValue, TargetTransformInfo::OP_None,
+ TargetTransformInfo::OP_None);
+ }
+
+ if (const auto *Load = dyn_cast<LoadInst>(U)) {
+ return static_cast<T *>(this)->getMemoryOpCost(
+ Load->getOpcode(), Load->getType(), Load->getAlignment(),
+ Load->getPointerAddressSpace());
+ }
+
+ if (const auto *Store = dyn_cast<StoreInst>(U)) {
+ return static_cast<T *>(this)->getMemoryOpCost(
+ Store->getOpcode(), Store->getValueOperand()->getType(),
+ Store->getAlignment(), Store->getPointerAddressSpace());
+ }
+
return static_cast<T *>(this)->getOperationCost(
Operator::getOpcode(U), U->getType(),
U->getNumOperands() == 1 ? U->getOperand(0)->getType() : nullptr);
Somewhat related question: would it make sense to have both costs: a code-size cost and an execution-time cost? I encountered a similar issue when working on the unroller - we want to reason about the tradeoff between code-size increase and execution-time improvement, but we estimate both of them with the same cost.
(In reply to comment #24)
> (In reply to comment #23)
> > Understood. I'll start by using the vectorizer's switch as a model and
> > copying liberally.
>
> Well, that was too ambitious...
>
> Even a simple patch as pasted below (and which actually does reasonably well
> on the test case for this bug report - the unroll factor is reduced from '4'
> to '2') causes regression test failures in:
> LLVM :: Transforms/BBVectorize/X86/loop1.ll
> LLVM :: Transforms/BBVectorize/loop1.ll
> LLVM :: Transforms/CodeGenPrepare/X86/select.ll
> LLVM :: Transforms/SimplifyCFG/speculate-math.ll
>
> Can we nuke BBVectorize yet? :)
>
> One of the problems arises from inconsistency about FP op costs:
>
> // Assume that floating point arithmetic operations cost twice as much as
> // integer operations.
> unsigned OpCost = (IsFloat ? 2 : 1);
This is a hack that we should have gotten rid of long ago. It is not even close
to true on many targets (including some of mine). If this is reasonable for any
given set of target configurations, this logic belongs in those targets.
>
> ...but somewhere else:
>
> unsigned getFPOpCost(Type *Ty) {
> // By default, FP instructions are no more expensive since they are
> // implemented in HW. Target specific TTI can override this.
> return TargetTransformInfo::TCC_Basic;
> }
>
> ...but somewhere else again:
>
> case Instruction::FDiv:
> return TTI::TCC_Expensive;
>
> When I go to straighten this out, more tests fail. It will take a bit of
> work to untangle everything.
Interesting; this is probably fast-math-dependent, because fdivs are much
cheaper when we can use the reciprocal estimates. Regardless, estimating the
size of the resulting code sequence will require some dedicated logic.
>
...
(In reply to comment #25)
> Somewhat related question: would it make sense to have both costs: a code-size
> cost and an execution-time cost? I encountered a similar issue when working on
> the unroller - we want to reason about the tradeoff between code-size increase
> and execution-time improvement, but we estimate both of them with the same cost.
Ideally, I think we need four costs:
1. Throughput [what the vectorizer currently uses]
2. Latency [what the vectorizer really should use to determine its interleaving factor]
3. Size [what the inliner and full unroller should use, at least most of the time]
4. uop Count [what should be used for partial unrolling, at least in x86 cores with an LSD]
In the short term, in addition to the rationalization being investigated, I
think that we should group what we have behind a common interface. It is
plausible that (3) and (4) are sufficiently correlated to not require a
distinction. Thoughts?
(In reply to comment #27)
> (In reply to comment #25)
> > Somewhat related question: would it make sense to have both costs: a code-size
> > cost and an execution-time cost? I encountered a similar issue when working on
> > the unroller - we want to reason about the tradeoff between code-size increase
> > and execution-time improvement, but we estimate both of them with the same cost.
>
> Ideally, I think we need four costs:
>
> 1. Throughput [what the vectorizer currently uses]
> 2. Latency [what the vectorizer really should use to determine its
> interleaving factor]
> 3. Size [what the inliner and full unroller should use, at least most of
> the time]
> 4. uop Count [what should be used for partial unrolling, at least in x86
> cores with an LSD]
>
> In the short term, in addition to the rationalization being investigated, I
> think that we should group what we have behind a common interface. It is
> plausible that (3) and (4) are sufficiently correlated to not require a
> distinction. Thoughts?
Yes, I like this idea. At least for me, the interface itself is confusing
because it's not explicit about exactly which "cost" I am getting.
I think there may even be some finer distinctions that we want to surface from
the cost model, and this bug report raises it.
If we know, for example, that a vector operation does not exist for the target,
we probably shouldn't vectorize the IR regardless of whether the cost model
shows a win. Effectively, the vectorizer is encroaching on the unroller's job
because we know that legalization is going to have to "unroll" the vector op
for us.
For the case shown here, we're harming the unroller because the vectorizer has
created its own unroll factor. This limits the multipliers that the unroller
can use. AFAICT, there is no way that the unroller can overcome this currently.
If the cost model told the unroller that the vector add was illegal, it could
take that into account, although I would think it's still better that the
vectorizer didn't fire in the first place for this particular example.
(In reply to comment #28)
> (In reply to comment #27)
> > (In reply to comment #25)
> > > Somewhat related question: would it make sense to have both costs: a code-size
> > > cost and an execution-time cost? I encountered a similar issue when working on
> > > the unroller - we want to reason about the tradeoff between code-size increase
> > > and execution-time improvement, but we estimate both of them with the same cost.
> >
> > Ideally, I think we need four costs:
> >
> > 1. Throughput [what the vectorizer currently uses]
> > 2. Latency [what the vectorizer really should use to determine its
> > interleaving factor]
> > 3. Size [what the inliner and full unroller should use, at least most of
> > the time]
> > 4. uop Count [what should be used for partial unrolling, at least in x86
> > cores with an LSD]
> >
> > In the short term, in addition to the rationalization being investigated, I
> > think that we should group what we have behind a common interface. It is
> > plausible that (3) and (4) are sufficiently correlated to not require a
> > distinction. Thoughts?
>
> Yes, I like this idea. At least for me, the interface itself is confusing
> because it's not explicit about exactly which "cost" I am getting.
>
> I think there may even be some finer distinctions that we want to surface
> from the cost model, and this bug report raises it.
>
> If we know, for example, that a vector operation does not exist for the
> target, we probably shouldn't vectorize the IR regardless of whether the
> cost model shows a win. Effectively, the vectorizer is encroaching on the
> unroller's job because we know that legalization is going to have to
> "unroll" the vector op for us.
I think we need to be careful here, because for large loops, even if we have to
scalarize some small piece of it, the vectorization overall can be a huge win.
On some architectures, scalarization is not actually all that expensive. On
others, especially where it involves trips through the stack, it can kill you.
We might, however, want to be a little more conservative for edge cases (at
least once our cost model is more sane).
>
> For the case shown here, we're harming the unroller because the vectorizer
> has created its own unroll factor. This limits the multipliers that the
> unroller can use. AFAICT, there is no way that the unroller can overcome
> this currently.
>
> If the cost model told the unroller that the vector add was illegal, it
> could take that into account, although I would think it's still better that
> the vectorizer didn't fire in the first place for this particular example.
(In reply to comment #27)
> (In reply to comment #25)
> > Somewhat related question: would it make sense to have both costs: a code-size
> > cost and an execution-time cost? I encountered a similar issue when working on
> > the unroller - we want to reason about the tradeoff between code-size increase
> > and execution-time improvement, but we estimate both of them with the same cost.
>
> Ideally, I think we need four costs:
>
> 1. Throughput [what the vectorizer currently uses]
> 2. Latency [what the vectorizer really should use to determine its
> interleaving factor]
> 3. Size [what the inliner and full unroller should use, at least most of
> the time]
> 4. uop Count [what should be used for partial unrolling, at least in x86
> cores with an LSD]
>
> In the short term, in addition to the rationalization being investigated, I
> think that we should group what we have behind a common interface. It is
> plausible that (3) and (4) are sufficiently correlated to not require a
> distinction. Thoughts?
Yeah, having these data would be very valuable for a number of passes. But it's
important to remember that it's still just a rough estimate, so I don't know
how far we should go in providing details about it.
For instance, latency and throughput are quite low-level attributes; they
obviously apply well to actual assembler instructions, but using them for IR
instructions might be a bit inaccurate. My point is that while we do need more
granular costs, I'm not completely sure that we need to model (or pretend that
we're modeling) such properties as 'latency', 'throughput', and especially 'uop
Count'. What do you think?
Hi everyone,
The piece of code in question seems to be jinxed - could someone have a look at issue 26882 or the link below to see why it makes llvm crash?
Thx
@PeteVine, can you please attach the LLVM IR and a stack trace?
If the code crashes then it would be a good idea to file a new bug report. Please CC me on the new bug report. I'd like to try to help.
http://reviews.llvm.org/D18537
Looks like that patch will prevent the unwanted vectorization for this test
case.
Thanks, Hal!
Note that the unroller will simply stamp out the same (dependent) op over and
over:
addl -28(%rcx), %eax
addl -24(%rcx), %eax
addl -20(%rcx), %eax
...
So we get no real parallelization after that patch, but I don't think that
really matters here.
In theory, the "Pentium3 (+sse)" perf case from the original problem
description should be the same as "Pentium2" now.
Thanks - the next time I build rustc, it will be with this patch. I'll let you know about the result in a few days.
(In reply to comment #34)
> Thanks - the next time I build rustc, it will be with this patch. I'll let
> you know about the result in a few days.
Committed (with Sanjay's test case from comment 14) in r264904.
Yes, fully fixed, as predicted.