Closed Quuxplusone closed 8 years ago
Attached fannkuch.rs
(3592 bytes, application/octet-stream): fannkuch benchmark
Attached spectral.rs
(3264 bytes, application/octet-stream): spectral-norm benchmark
Another case, from the kmeans benchmark (https://github.com/andreaferretti/kmeans),
running on Cortex-A5.
Average time over 10 runs:
Cortex-A5 1027 ms
Cortex-A9 1012 ms
Compiled with rustc 1.8.0 (LLVM 3.8) using the following flags:
-C opt-level=3 -C target-cpu=cortex-a5(9) -C target-feature=+vfp4,+neon,+v7
Attached kmeans.rs
(4005 bytes, application/octet-stream): Kmeans benchmark source
Hi again,
I think I've finally found something worth looking at :) The following Rust benchmark:
https://github.com/nikomatsakis/rust-runtime-benchmarks/tree/master/runtime-benchmarks/raytrace
produces the following results (using MIR translation via -Zorbit):
ARMv6 vfp2
test bench ... bench: 239,727,289 ns/iter (+/- 274,873)
ARMv7 cortex-a5 vfp4
test bench ... bench: 264,764,759 ns/iter (+/- 159,791)
Adding +neon degrades performance a little further, but that's probably just noise. I hope you'll take a look at the files I'm about to attach. Thanks!
Attached raytrace-files.zip
(219528 bytes, application/zip): raytrace benchmark assembly + IR
Hi Pete,
That disassembly is huge. Where's the hot part?
Cheers,
James
Oops, I must have started believing in your powers of clairvoyance ;)
The slower version produces a trace like this:
Profiling raytrace-a8fc715418690b23 with callgrind...
** Total Instructions....342,773,768 **
141,645,505 (41.3%)
model.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
110,769,356 (32.3%)
vec.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
67,971,353 (19.8%) model.rs:mallocx'2
-----------------------------------------------------------------------
12,524,673 (3.7%) slice.rs:mallocx'2
-----------------------------------------------------------------------
6,248,154 (1.8%) ptr.rs:mallocx'2
-----------------------------------------------------------------------
1,279,177 (0.4%) vec.rs:mallocx'2
-----------------------------------------------------------------------
669,986 (0.2%) render.rs:mallocx'2
-----------------------------------------------------------------------
399,334 (0.1%) main.rs:mallocx'2
-----------------------------------------------------------------------
307,841 (0.1%) iterator.rs:mallocx'2
-----------------------------------------------------------------------
260,430 (0.1%) lib.rs:mallocx'2
-----------------------------------------------------------------------
244,210 (0.1%)
lib.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
238,745 (0.1%)
vec.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
215,004 (0.1%)
f32.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
and the faster version:
** Total Instructions...281,513,893 **
117,094,561 (41.6%)
model.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
73,884,689 (26.2%)
vec.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
67,926,746 (24.1%) model.rs:mallocx'2
-----------------------------------------------------------------------
12,506,985 (4.4%) slice.rs:mallocx'2
-----------------------------------------------------------------------
6,246,225 (2.2%) ptr.rs:mallocx'2
-----------------------------------------------------------------------
997,208 (0.4%) vec.rs:mallocx'2
-----------------------------------------------------------------------
653,355 (0.2%) render.rs:mallocx'2
-----------------------------------------------------------------------
399,504 (0.1%) main.rs:mallocx'2
-----------------------------------------------------------------------
308,487 (0.1%) iterator.rs:mallocx'2
-----------------------------------------------------------------------
257,294 (0.1%) lib.rs:mallocx'2
-----------------------------------------------------------------------
243,601 (0.1%)
lib.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
215,004 (0.1%)
f32.rs:_..raytrace..model..Sphere..as..raytrace..model..Model..::hit
-----------------------------------------------------------------------
213,231 (0.1%)
vec.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
153,709 (0.1%)
materials.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
153,084 (0.1%)
wrapping.rs:_..raytrace..materials..Lambertian..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
130,364 (0.0%)
materials.rs:_..raytrace..materials..Metal..as..raytrace..materials..Material..::scatter
-----------------------------------------------------------------------
129,846 (0.0%) dl-tls.c:__tls_get_addr
-----------------------------------------------------------------------
Cheers!
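To make the comparison between the two callgrind summaries above easier to read, here is a small sketch that subtracts the instruction counts. The numbers are copied from the profiles; the constant names and the `extra` helper are made up for this illustration. It suggests that the two `Sphere::hit` rows account for roughly the whole gap, while the `mallocx'2` rows barely move.

```rust
// Instruction counts copied from the two callgrind summaries in this thread.
// All names below are hypothetical and exist only for this illustration.
const SLOW_TOTAL: u64 = 342_773_768;
const FAST_TOTAL: u64 = 281_513_893;
// model.rs lines attributed to Sphere::hit
const SLOW_HIT_MODEL: u64 = 141_645_505;
const FAST_HIT_MODEL: u64 = 117_094_561;
// vec.rs lines attributed to Sphere::hit
const SLOW_HIT_VEC: u64 = 110_769_356;
const FAST_HIT_VEC: u64 = 73_884_689;

/// Extra instructions executed by the slower build for one profile row.
fn extra(slow: u64, fast: u64) -> u64 {
    slow - fast
}

fn main() {
    println!("total extra:        {}", extra(SLOW_TOTAL, FAST_TOTAL));
    println!("model.rs hit extra: {}", extra(SLOW_HIT_MODEL, FAST_HIT_MODEL));
    println!("vec.rs hit extra:   {}", extra(SLOW_HIT_VEC, FAST_HIT_VEC));
}
```

The sum of the two per-row differences is slightly larger than the total difference because a few of the smaller rows are a little cheaper in the slower build.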
OK, the fault seems pretty obvious. We're not forming VMLA/VMLS when targeting Cortex-A5.
This appears to be because we've set the "FeatureHasSlowFPVMLx" feature flag, almost certainly because the A5 model was based directly on the Cortex-A8 one, and the A8 had a slowdown when using VMLA/VMLS.
The fix is simply to remove this flag.
James
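For readers unfamiliar with the pattern in question, the following is a minimal sketch of the kind of source code that gives the backend a chance to form VMLA. It is not taken from the raytrace benchmark: the `Vec3` type and `dot` function are hypothetical stand-ins for what a ray tracer's vector code typically looks like. Each `acc + a * b` step is a multiply-accumulate candidate; whether it becomes a single VMLA or a separate VMUL and VADD depends on the target CPU's feature flags, as described above.

```rust
// Hypothetical 3-component vector, loosely modelled on a ray tracer's vector type.
#[derive(Clone, Copy)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

// Dot product: one multiply followed by two multiply-then-add steps.
// On ARM VFP each of the two additions is a candidate for VMLA,
// provided the backend considers VMLA profitable for the selected CPU.
#[inline(never)] // keep the function visible in the generated assembly
fn dot(a: Vec3, b: Vec3) -> f32 {
    a.x * b.x + a.y * b.y + a.z * b.z
}

fn main() {
    let a = Vec3 { x: 1.0, y: 2.0, z: 3.0 };
    let b = Vec3 { x: 4.0, y: 5.0, z: 6.0 };
    println!("{}", dot(a, b));
}
```

Compiling a function like this for both CPU settings and comparing the emitted assembly is one way to check whether VMLA is being formed.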
Excellent! Is this feature settable from the command line or are we talking about a patch?
Ideally a patch, but give -mattr=-slowfpvmlx a whirl (note the minus sign instead of a plus).
Indeed, that brings performance completely in line with the fastest version.
Thanks!
Forgot to ask: are current-generation Cortex-A8s still affected, or can this flag be enabled safely everywhere? TIA!