Quuxplusone / LLVMBugzillaTest


X86 code quality problems highlighted by the TSVC loop benchmarks #14111

Open Quuxplusone opened 12 years ago

Quuxplusone commented 12 years ago
Bugzilla Link PR14076
Status NEW
Importance P enhancement
Reported by Hal Finkel (hfinkel@anl.gov)
Reported on 2012-10-12 16:59:24 -0700
Last modified on 2013-04-08 12:09:07 -0700
Version trunk
Hardware PC All
CC clattner@nondot.org, daniel@zuster.org, llvm-bugs@lists.llvm.org, michael.hliao@gmail.com, pawel@32bitmicro.com, rafael@espindo.la
On the general assumption that running the TSVC loops with floats should be
faster than running them with doubles, I suspect we have suboptimal code
generation for the following tests:

[from: http://llvm.org/perf/db_default/v4/nts/4826]

MultiSource/Benchmarks/TSVC/ControlLoops-dbl/ControlLoops-dbl   5.3408
MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt   6.6713

MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl   5.4195
MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt   8.6970

Also, while the double-precision code should be somewhat slower, can it
legitimately be 2x slower given that we're not vectorizing? If not, these also
indicate a problem (see the sketch after the listings below):

MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl     7.1848
MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt     4.3780

MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl     8.3210
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt     4.2134

These also seem questionable:

MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl     8.5494
MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt     5.5331

MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl     4.9180
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt     3.5667
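
For context, here is a minimal sketch (illustrative only; not code taken from
TSVC, and the array names are made up) of the kind of loop these benchmarks
time. Without vectorization, the float and double versions compile to nearly
identical scalar SSE code (mulss/addss vs. mulsd/addsd), so the double version
should pay mainly for the extra memory traffic, not a 2x penalty:

    /* Illustrative TSVC-style reduction loops. */
    #define N 32000

    float  a_flt[N], b_flt[N];
    double a_dbl[N], b_dbl[N];

    float dot_flt(void) {
        float s = 0.0f;
        for (int i = 0; i < N; ++i)
            s += a_flt[i] * b_flt[i];   /* scalar mulss + addss */
        return s;
    }

    double dot_dbl(void) {
        double s = 0.0;
        for (int i = 0; i < N; ++i)
            s += a_dbl[i] * b_dbl[i];   /* scalar mulsd + addsd */
        return s;
    }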

When these tests are run on my POWER7 (powerpc64) machine, the timing on the
tests is as I would expect: the float and double versions take approximately
the same amount of time to execute, with the double-precision version generally
taking slightly more time. As a result, I suspect that these problems are
specific to x86 codegen.
Quuxplusone commented 12 years ago

What machine are you running this on? Do you have SSE2 enabled?

-Chris

Quuxplusone commented 12 years ago
(In reply to comment #1)
> What machine are you running this on?  Do you have SSE2 enabled?
>
> -Chris

The numbers I quoted are from the Darwin x86_64 builder:
http://llvm.org/perf/db_default/v4/nts/4826
(the build options for that builder are just -O3; in any case, SSE2 is enabled
by default when targeting x86-64).

I've seen very similar relative numbers on my x86_64 Linux machines, and on
those, SSE is certainly enabled.
Quuxplusone commented 11 years ago

This isn't actually a codegen problem; the benchmarks are written in such a way that they hit a specific x86 performance issue caused by the exact address layout of the global variables. This is known as the 4K-aliasing problem on Intel; see http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/amplifierxe/win/ug_docs/GUID-C801145A-A066-4C1A-B744-2B51AD89EFF6.htm

I'm proposing a patch on the mailing list to address this and otherwise make the benchmarks a bit more predictable across platforms.
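
To make the mechanism concrete, here is a minimal standalone sketch of the
effect (illustrative only; the offset is chosen by hand, whereas in TSVC
similar spacings arise by accident from the linker's layout of the global
arrays). Intel's store-to-load disambiguation initially compares only address
bits 11:0, so a load that agrees with a recent, still-pending store in those
bits but differs above is falsely flagged as conflicting and replayed:

    enum { N = 4096 };

    /* dst begins N+1 doubles past src, so the load of src[i] and the
       previous iteration's store to dst[i-1] sit exactly 8*4096 bytes
       apart: identical in address bits 11:0, different above. */
    static double storage[2 * N + 1];

    int main(void) {
        double *src = storage;
        double *dst = storage + N + 1;

        for (int rep = 0; rep < 100000; ++rep)
            for (int i = 0; i < N; ++i)
                dst[i] = 2.0 * src[i];  /* load src[i] falsely matches the
                                           pending store to dst[i-1] */

        return (int)dst[0];  /* keep the loops from being optimized away */
    }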

Quuxplusone commented 11 years ago
Improved here:
  http://llvm.org/viewvc/llvm-project?view=revision&revision=178968

Here is an LNT run for the change:
  http://llvm.org/perf/db_default/v4/nts/10308

As you can see, the tests got significantly faster just from changing the data
layout, which shows that these tests are extremely sensitive to address
layout, at least on Intel.
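
The kind of change involved is simply perturbing the layout of the globals;
one plausible (hypothetical) form of such a change, which may differ from what
r178968 actually does:

    #define N 32000

    /* Padding between the arrays shifts their relative addresses so
       that streams read and written in the same loop no longer sit a
       near-multiple of 4 KiB apart (the padding sizes are made up). */
    float a[N];
    char  pad_ab[64];
    float b[N];
    char  pad_bc[64];
    float c[N];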

The following benchmarks are now more in line w.r.t. the dbl/flt ratio (this
can be seen from the bigger % improvement on the dbl benchmark):
--
MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl 4.1955  -36.76%
MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt 2.9747  -11.50%

MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl 4.2053  -48.97%
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt 3.8445  -8.54%

MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl 4.5808  -46.36%
MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt 4.2371  -23.84%
--

There is one test remaining which still seems out of line:
--
MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl   4.8740  -9.98%
MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt   8.1706  -6.02%
--

And Symbolics still has a high dbl/flt ratio:
--
MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl 4.5023  -
MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt 2.9224  -1.30%
--

I will take a quick look at Reductions and Symbolics to see if I notice
anything fishy.
Quuxplusone commented 11 years ago

From my "quick look", it looks like something is off with Reductions-flt; at least, icc with the vectorizer disabled gets similar performance on Reductions-{flt,dbl}. I don't see anything fishy with Symbolics.

It's probably worth investigating the Reductions-flt performance issue further, so I will leave this open. My main interest here, though, was just getting the benchmarks not to be noisy on Sandy Bridge; they have stabilized, so I'm not personally planning to do any more investigation for the time being.