_mm_dp_ps generating 55% more inefficient instructions

Quuxplusone commented 11 years ago


Bugzilla Link	PR14268
Status	NEW
Importance	P enhancement
Reported by	ramihg@gmail.com
Reported on	2012-11-05 21:09:25 -0800
Last modified on	2014-06-01 22:36:38 -0700
Version	trunk
Hardware	PC All
CC	chris.a.ferguson@gmail.com, clattner@nondot.org, craig.topper@gmail.com, geek4civic@gmail.com, llvm-bugs@lists.llvm.org, pawel@32bitmicro.com, rafael@espindo.la
Fixed by commit(s)
Attachments	`dotps.cpp` (2475 bytes, application/octet-stream)
Blocks
Blocked by
See also

Created attachment 9499
Simple test case

OS used: Mac OS X Mountain Lion
Clang/LLVM Used: 3.2-r167157
Processor: Intel Core i7

I was recently implementing some SSE dot products when I noticed a
severe performance drop after using _mm_dp_ps. This didn't make sense, as the
x86 instruction latency manual lists the dpps instruction at only 12 cycles
of latency.

Further investigation showed that dpps with a memory source operand (e.g. dpps
$15, (%rex, %rdx), %xmm0) is about 55% slower than issuing a movaps followed by
a dpps with register source and destination operands!

I've created and attached a simple test case that does dpps three ways: using
intrinsics, using hand-written assembly with register operands, and using
hand-written assembly with a memory source operand. Clang compiled the
intrinsics version to the memory-operand form.
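For readers without the attachment, here is a minimal sketch of the operation under test. It is not the contents of dotps.cpp. The names `Vec4`, `dpps_reference` and `dot4` are hypothetical. `dpps_reference` is a scalar model of the dpps immediate: bits 7..4 select which lane products enter the sum, and bits 3..0 select which result lanes receive it. `dot4` uses `_mm_dp_ps` when the compiler targets SSE4.1 (for example with -march=native) and falls back to the scalar model otherwise.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#if defined(__SSE4_1__)
#include <smmintrin.h>  // SSE4.1 intrinsics, including _mm_dp_ps
#endif

using Vec4 = std::array<float, 4>;  // hypothetical alias for this sketch

// Scalar model of dpps for an 8-bit immediate.
// Bits 7..4: which lane products are added into the sum.
// Bits 3..0: which result lanes receive the sum (the rest become 0).
// Note: the summation order here may differ from the hardware's, so
// results can differ in the last bits for general floating-point inputs.
inline Vec4 dpps_reference(const Vec4& a, const Vec4& b, std::uint8_t imm) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        if (imm & (0x10u << i)) sum += a[i] * b[i];
    }
    Vec4 out{};
    for (int i = 0; i < 4; ++i) {
        out[i] = (imm & (1u << i)) ? sum : 0.0f;
    }
    return out;
}

// Four-element dot product. With SSE4.1 enabled this goes through
// _mm_dp_ps; whether the compiler then folds one load into dpps as a
// memory operand is exactly the code generation choice this report is about.
inline float dot4(const Vec4& a, const Vec4& b) {
#if defined(__SSE4_1__)
    const __m128 va = _mm_loadu_ps(a.data());
    const __m128 vb = _mm_loadu_ps(b.data());
    return _mm_cvtss_f32(_mm_dp_ps(va, vb, 0xFF));
#else
    return dpps_reference(a, b, 0xFF)[0];
#endif
}
```

With the immediate 0xFF all four products are summed and the sum is written to every lane. Under the same encoding, an immediate of 15 (0x0F) selects no products, so the `$15` in the example instruction above presumably illustrates the addressing form rather than a useful mask.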

Here is how I compiled the test:
clang++ -march=native -std=c++11 -stdlib=libc++ -O3 dotps.cpp

Here are the results I got (running each version 100000000 times):
DotPsIntrin: 249.332 ms
DotPsFast: 159.916 ms
DotPsSlow: 249.076 ms
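As a quick check of the arithmetic behind the "about 55%" figure, the sketch below derives the relative slowdown and the per-iteration cost from the timings above. The helper names `slowdown_percent` and `ns_per_iteration` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Extra time taken by the slow variant relative to the fast one, in percent.
constexpr double slowdown_percent(double slow_ms, double fast_ms) {
    return (slow_ms / fast_ms - 1.0) * 100.0;
}

// Average cost of a single iteration in nanoseconds
// (1 ms = 1e6 ns, spread over `iterations` runs).
constexpr double ns_per_iteration(double total_ms, double iterations) {
    return total_ms * 1e6 / iterations;
}
```

Calling `slowdown_percent(249.332, 159.916)` gives roughly 56%, in line with the figure quoted in the report, and `ns_per_iteration` puts the two variants at about 2.49 ns and 1.60 ns per call over 100000000 iterations.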

I also tried the intrinsics version with Visual Studio 2012 and the Intel
compiler. Both generated the efficient movaps/dpps version even when specifying
"Optimize for Space".

Quuxplusone commented 11 years ago

What does gcc do?

Quuxplusone commented 11 years ago

(In reply to comment #1)
> What does gcc do?

It seems that gcc 4.7.2 (Ubuntu 64-bit) is also producing the instruction with
a memory operand:

g++ -O3 -std=c++11 -march=native dotps.cpp

movaps (%rdx),%xmm0
dpps $0xff,0x4(%rdi,%rcx,4),%xmm0

I wonder whether this slowdown is limited to dpps or also affects other SSE
instructions. I also wonder whether it happens on AMD processors. Unfortunately
I do not have access to any AMD processors to test on.

Quuxplusone commented 11 years ago

Which Core i7, penryn or sandybridge?

AFAIK the AMD K8-based microarchitecture might have a stronger address generator than Intel's. Not sure about Bulldozer.

Quuxplusone commented 11 years ago

(In reply to comment #3)
> Which Core i7, penryn or sandybridge?
>
> AFAIK the AMD K8-based microarchitecture might have a stronger address
> generator than Intel's. Not sure about Bulldozer.

This is a Sandy Bridge.

Quuxplusone / LLVMBugzillaTest

_mm_dp_ps generating 55% more inefficient instructions #14292