Open Quuxplusone opened 12 years ago
Bugzilla Link | PR14076 |
Status | NEW |
Importance | P enhancement |
Reported by | Hal Finkel (hfinkel@anl.gov) |
Reported on | 2012-10-12 16:59:24 -0700 |
Last modified on | 2013-04-08 12:09:07 -0700 |
Version | trunk |
Hardware | PC All |
CC | clattner@nondot.org, daniel@zuster.org, llvm-bugs@lists.llvm.org, michael.hliao@gmail.com, pawel@32bitmicro.com, rafael@espindo.la |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also |
What machine are you running this on? Do you have SSE2 enabled?
-Chris
(In reply to comment #1)
> What machine are you running this on? Do you have SSE2 enabled?
>
> -Chris
The numbers I quoted are from the Darwin x86_64 builder:
http://llvm.org/perf/db_default/v4/nts/4826
(the build options for that builder are just -O3, but I'm pretty sure that
enables SSE).
I've seen very similar relative numbers on my x86_64 linux machines, and on
those, SSE is certainly enabled.
This isn't actually a code gen problem. The benchmarks are written in such a way that they hit a specific x86 problem caused by the exact address layout of the global variables. This is known as the 4K-aliasing problem on Intel; see http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/ug_docs/GUID-C801145A-A066-4C1A-B744-2B51AD89EFF6.htm
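For illustration, here is a rough sketch of the address condition usually meant by 4K aliasing: a load and an earlier store go to different addresses that share the same low 12 bits, i.e. the same offset within different 4 KiB pages. The helper name and the self-check are hypothetical and are not taken from TSVC or from any patch; the 12-bit mask follows the usual description of the problem, and real hardware behavior also depends on access size and ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not from TSVC): two distinct addresses are treated
 * as 4K-aliasing candidates when their low 12 bits match, i.e. they lie at
 * the same offset within different 4 KiB pages. */
bool may_4k_alias(const void *store_addr, const void *load_addr)
{
    uintptr_t s = (uintptr_t)store_addr;
    uintptr_t l = (uintptr_t)load_addr;
    return s != l && ((s ^ l) & (uintptr_t)0xFFF) == 0;
}

/* Small self-check on a static buffer spanning two 4 KiB blocks. */
void check_4k_alias_example(void)
{
    static char buf[2 * 4096];
    assert(may_4k_alias(&buf[0], &buf[4096]));  /* exactly 4096 bytes apart */
    assert(!may_4k_alias(&buf[0], &buf[64]));   /* different page offset */
    assert(!may_4k_alias(&buf[0], &buf[0]));    /* same address, not aliasing */
}
```

Global arrays whose start addresses happen to differ by a multiple of 4096 bytes would satisfy this check element for element, which is why small changes in data layout can move the numbers so much.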
I am proposing a patch on the mailing list to address this and otherwise make the benchmarks a bit more predictable across platforms.
Improved here:
http://llvm.org/viewvc/llvm-project?view=revision&revision=178968
Here is an LNT run for the change:
http://llvm.org/perf/db_default/v4/nts/10308
As you can see, the tests got significantly faster just from changing the data
layout, which shows that these tests are incredibly susceptible to address
layout, at least on Intel.
The following benchmarks are now more in line w.r.t. the dbl/flt ratio (this
can be seen from the bigger % improvement on the dbl benchmark):
--
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl 4.1955 -36.76%
MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt 2.9747 -11.50%
MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl 4.2053 -48.97%
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt 3.8445 -8.54%
MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl 4.5808 -46.36%
MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt 4.2371 -23.84%
--
There is one test remaining which still seems out of line:
--
MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl 4.8740 -9.98%
MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt 8.1706 -6.02%
--
And Symbolics still has a high dbl/flt ratio:
--
MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl 4.5023 -
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt 2.9224 -1.30%
--
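To make the dbl/flt comparison concrete, the sketch below computes the ratio for each pair listed above. It assumes the first numeric column in those rows is the measured execution time; the struct, array, and function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One dbl/flt benchmark pair; values copied from the rows above,
 * assuming the first numeric column is execution time. */
struct tsvc_pair {
    const char *name;
    double dbl;
    double flt;
};

const struct tsvc_pair tsvc_pairs[] = {
    {"Expansion",         4.1955, 2.9747},
    {"LoopRestructuring", 4.2053, 3.8445},
    {"NodeSplitting",     4.5808, 4.2371},
    {"Reductions",        4.8740, 8.1706},
    {"Symbolics",         4.5023, 2.9224},
};
const size_t tsvc_pair_count = sizeof tsvc_pairs / sizeof tsvc_pairs[0];

/* Ratio of the double-precision time to the single-precision time. */
double dbl_flt_ratio(double dbl, double flt)
{
    assert(flt > 0.0);
    return dbl / flt;
}

/* Print one line per benchmark pair with its dbl/flt ratio. */
void print_dbl_flt_ratios(void)
{
    for (size_t i = 0; i < tsvc_pair_count; ++i)
        printf("%-18s %.2f\n", tsvc_pairs[i].name,
               dbl_flt_ratio(tsvc_pairs[i].dbl, tsvc_pairs[i].flt));
}
```

With these numbers, Reductions comes out at roughly 0.60 (the flt variant is slower than the dbl one), while Symbolics comes out at roughly 1.54.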
I will take a quick look at Reductions and Symbolics to see if I notice
anything fishy.
From my "quick look", something seems off with Reductions-flt: icc with the vectorizer disabled, at least, gets similar performance on Reductions-{flt,dbl}. I don't see anything fishy with Symbolics.
It's probably worth investigating the Reductions-flt performance issue further, so I will leave this open. My main interest here was just in getting the benchmarks to stop being noisy on Sandy Bridge, and they have stabilized, so I'm not personally planning to do any more investigation for the time being.