Quuxplusone / LLVMBugzillaTest

Cortex-A5 codegen suboptimal? #26134

Closed Quuxplusone closed 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR26135
Status RESOLVED WORKSFORME
Importance P normal
Reported by PeteVine (tulipawn@gmail.com)
Reported on 2016-01-13 17:06:39 -0800
Last modified on 2016-09-02 13:44:33 -0700
Version 3.7
Hardware Other Linux
CC james@jamesmolloy.co.uk, llvm-bugs@lists.llvm.org
Fixed by commit(s)
Attachments fannkuch.rs (3592 bytes, application/octet-stream)
spectral.rs (3264 bytes, application/octet-stream)
kmeans.rs (4005 bytes, application/octet-stream)
raytrace-files.zip (219528 bytes, application/zip)
Blocks
Blocked by
See also
It seems code generation for Cortex-A5 and armv7 doesn't provide much benefit,
sometimes even coming out a little slower (compared to default and v6
respectively).

Firstly, our old friend from issue #26106, w/o NEON and unrolling:

default cpu, v7 vs. v6:

                              v7                        v6
test sum_deque   ... bench:   5,219 ns/iter (+/- 56)    4,967 ns/iter (+/- 50)
test sum_deque_2 ... bench:   3,272 ns/iter (+/- 40)    3,112 ns/iter (+/- 22)

It seems v6 code is faster on this cpu.

Secondly, a few benchmarks where the cortex-a5 target is either on par or
slower (run on 4 cores).

Spectral-norm benchmark, v7:
$ time ./spectral 5500

        default cpu      cortex-a5

real    0m9.106s         0m9.051s
user    0m34.110s        0m34.240s
sys     0m0.040s         0m0.020s

Fannkuch benchmark:
$ time ./fannkuch 12

        default cpu      cortex-a5

real    0m30.017s        0m30.645s
user    1m55.570s        1m56.350s
sys     0m0.030s         0m0.010s

Command used to compile:
rustc -C opt-level=3 -C target-feature=+v7 -C target-cpu=cortex-a5

Those were just a few examples off the top of my head, probably not the best
ones. I'm sure I've also seen an example or two benefiting from the cortex-a5
target but can't remember which.
Quuxplusone commented 8 years ago

Attached fannkuch.rs (3592 bytes, application/octet-stream): fannkuch benchmark

Quuxplusone commented 8 years ago

Attached spectral.rs (3264 bytes, application/octet-stream): spectral-norm benchmark

Quuxplusone commented 8 years ago
Another case from kmeans benchmark (https://github.com/andreaferretti/kmeans)
running on Cortex-A5.

Average time over 10 runs, by target-cpu:

Cortex-A5 1027 ms
Cortex-A9 1012 ms

Compiled with rustc 1.8.0 (LLVM 3.8) using the following flags:

-C opt-level=3 -C target-cpu=cortex-a5(9) -C target-feature=+vfp4,+neon,+v7
Quuxplusone commented 8 years ago

Attached kmeans.rs (4005 bytes, application/octet-stream): Kmeans benchmark source

Quuxplusone commented 8 years ago

Hi again,

I think I've finally found something worth looking at :) The following Rust benchmark:

https://github.com/nikomatsakis/rust-runtime-benchmarks/tree/master/runtime-benchmarks/raytrace

produces the following results (using MIR translation via -Zorbit):

ARMv6 vfp2 test bench ... bench: 239,727,289 ns/iter (+/- 274,873)

ARMv7 cortex-a5 vfp4 test bench ... bench: 264,764,759 ns/iter (+/- 159,791)

Adding +neon degrades performance a little further, but that's probably just noise. I hope you'll take a look at the files I'm about to attach. Thanks!
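
To put a number on the gap between the two results above, here is a minimal Rust sketch. The helper `slowdown_percent` is a hypothetical name introduced for illustration; only the two ns/iter figures come from the benchmark output.

```rust
/// Relative slowdown of `slow_ns` versus `fast_ns`, in percent.
/// (Hypothetical helper, not part of the benchmark.)
fn slowdown_percent(fast_ns: u64, slow_ns: u64) -> f64 {
    (slow_ns as f64 / fast_ns as f64 - 1.0) * 100.0
}

fn main() {
    // ns/iter figures quoted in the results above.
    let armv6_vfp2: u64 = 239_727_289;
    let armv7_cortex_a5_vfp4: u64 = 264_764_759;

    println!(
        "cortex-a5 build is {:.1}% slower than the ARMv6 build",
        slowdown_percent(armv6_vfp2, armv7_cortex_a5_vfp4)
    );
}
```

That works out to a slowdown of roughly 10%, far larger than the reported +/- spread of either run.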

Quuxplusone commented 8 years ago

Attached raytrace-files.zip (219528 bytes, application/zip): raytrace benchmark assembly + IR

Quuxplusone commented 8 years ago

Hi Pete,

That disassembly is huge. Where's the hot part?

Cheers,

James

Quuxplusone commented 8 years ago
Oops, I must have started believing in your powers of clairvoyance ;)

The slower version produces a trace like this:

Profiling raytrace-a8fc715418690b23 with callgrind...

** Total Instructions....342,773,768 **

141,645,505 (41.3%)
model.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
110,769,356 (32.3%)
vec.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
67,971,353 (19.8%) model.rs:mallocx'2
-----------------------------------------------------------------------
12,524,673 (3.7%) slice.rs:mallocx'2
-----------------------------------------------------------------------
6,248,154 (1.8%) ptr.rs:mallocx'2
-----------------------------------------------------------------------
1,279,177 (0.4%) vec.rs:mallocx'2
-----------------------------------------------------------------------
669,986 (0.2%) render.rs:mallocx'2
-----------------------------------------------------------------------
399,334 (0.1%) main.rs:mallocx'2
-----------------------------------------------------------------------
307,841 (0.1%) iterator.rs:mallocx'2
-----------------------------------------------------------------------
260,430 (0.1%) lib.rs:mallocx'2
-----------------------------------------------------------------------
244,210 (0.1%)
lib.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
238,745 (0.1%)
vec.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
215,004 (0.1%)
f32.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------

and the faster version:

** Total Instructions...281,513,893 **

117,094,561 (41.6%)
model.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
73,884,689 (26.2%)
vec.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
67,926,746 (24.1%) model.rs:mallocx'2
-----------------------------------------------------------------------
12,506,985 (4.4%) slice.rs:mallocx'2
-----------------------------------------------------------------------
6,246,225 (2.2%) ptr.rs:mallocx'2
-----------------------------------------------------------------------
997,208 (0.4%) vec.rs:mallocx'2
-----------------------------------------------------------------------
653,355 (0.2%) render.rs:mallocx'2
-----------------------------------------------------------------------
399,504 (0.1%) main.rs:mallocx'2
-----------------------------------------------------------------------
308,487 (0.1%) iterator.rs:mallocx'2
-----------------------------------------------------------------------
257,294 (0.1%) lib.rs:mallocx'2
-----------------------------------------------------------------------
243,601 (0.1%)
lib.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
215,004 (0.1%)
f32.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
213,231 (0.1%)
vec.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
153,709 (0.1%)
materials.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
153,084 (0.1%)
wrapping.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
130,364 (0.0%)
materials.rs:_..raytrace..materials..Metal..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
129,846 (0.0%) dl-tls.c:__tls_get_addr
-----------------------------------------------------------------------

Cheers!
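
For a quick sense of where the extra instructions go, the counts from the two traces above can be subtracted directly. This is a small sketch: the `Entry` struct and the shortened entry names are illustrative, and only the numbers are taken from the traces.

```rust
/// One callgrind line, with instruction counts from the slower and faster build.
/// (Illustrative type; the names are abbreviated from the traces.)
struct Entry {
    name: &'static str,
    slow: i64,
    fast: i64,
}

impl Entry {
    /// Extra instructions executed by the slower build.
    fn delta(&self) -> i64 {
        self.slow - self.fast
    }
}

static TOTAL: Entry = Entry { name: "total", slow: 342_773_768, fast: 281_513_893 };

static HOT: [Entry; 3] = [
    Entry { name: "model.rs Sphere::hit", slow: 141_645_505, fast: 117_094_561 },
    Entry { name: "vec.rs Sphere::hit", slow: 110_769_356, fast: 73_884_689 },
    Entry { name: "model.rs mallocx'2", slow: 67_971_353, fast: 67_926_746 },
];

fn main() {
    println!("{:<24} {:>14}", "entry", "extra instrs");
    for e in HOT.iter().chain(std::iter::once(&TOTAL)) {
        println!("{:<24} {:>14}", e.name, e.delta());
    }
}
```

Going by these figures, the two Sphere::hit entries together account for roughly 61 million extra instructions, which is essentially the whole gap between the builds, while the mallocx entry barely moves.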
Quuxplusone commented 8 years ago

OK, the fault seems pretty obvious. We're not forming VMLA/VMLS when targeting Cortex-A5.

This appears to be because we've set the "FeatureHasSlowFPVMLx" feature flag, which is almost certainly because the A5 model was based directly on Cortex-A8 which had a slowdown when using VMLA/VMLS.

The fix is simply to remove this flag.

James
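
For readers unfamiliar with the pattern, here is a minimal sketch of the kind of source code that is a candidate for VMLA/VMLS: a multiply whose result feeds straight into an add or subtract. The functions `dot3` and `discriminant` are hypothetical and are not taken from the raytrace benchmark. Whether the backend actually emits VMLA/VMLS for them depends on the LLVM version and the target features in effect.

```rust
/// Dot product written as a chain of multiply-adds.
/// Each `acc += x * y` is a multiply followed by an add, the shape that
/// VMLA.F32 (multiply-accumulate) covers in a single instruction.
fn dot3(a: [f32; 3], b: [f32; 3]) -> f32 {
    let mut acc = a[0] * b[0];
    acc += a[1] * b[1];
    acc += a[2] * b[2];
    acc
}

/// `b*b - a*c`: a multiply followed by a subtract from an accumulator,
/// the shape that VMLS.F32 (multiply-subtract) covers.
fn discriminant(b: f32, a: f32, c: f32) -> f32 {
    b * b - a * c
}

fn main() {
    let oc = [1.0_f32, 2.0, 3.0];
    let dir = [4.0_f32, 5.0, 6.0];
    println!("dot3 = {}", dot3(oc, dir));
    println!("discriminant = {}", discriminant(3.0, 1.0, 2.0));
}
```

One way to check is to compile with `--emit asm` alongside the flags from the original report and see whether separate VMUL/VADD pairs or VMLA instructions appear in the output.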

Quuxplusone commented 8 years ago

Excellent! Is this feature settable from the command line or are we talking about a patch?

Quuxplusone commented 8 years ago

Ideally a patch, but give -mattr=-slowfpvmlx a whirl (note the minus sign instead of a plus).

Quuxplusone commented 8 years ago

Indeed, that brings performance completely in line with the fastest version.

Thanks!

Quuxplusone commented 8 years ago

Forgot to ask: is the current generation of Cortex-A8s still affected, or can this flag be enabled safely everywhere? TIA!