Rewriting VMLA.F32 instructions as VMUL+VADD is not a feature, it's a bug!

Quuxplusone commented 8 years ago


Bugzilla Link	PR27219
Status	RESOLVED FIXED
Importance	P normal
Reported by	Benoit Jacob (jacob.benoit.1@gmail.com)
Reported on	2016-04-05 10:18:51 -0700
Last modified on	2017-06-30 08:33:26 -0700
Version	3.8
Hardware	All Linux
CC	echristo@gmail.com, james@jamesmolloy.co.uk, kristof.beyls@arm.com, llvm-bugs@lists.llvm.org, pirama@google.com, spatel+llvm@rotateright.com, srhines@google.com
Fixed by commit(s)
Attachments	`vmlaq_f32_testcase.cc` (123 bytes, text/x-c++src)
Blocks
Blocked by
See also

Created attachment 16172
testcase

For some values of -mcpu, at least -mcpu=cortex-a8 and -mcpu=cortex-a7, LLVM
replaces VMLA.F32 instructions by a (VMUL, VADD) pair.

That much seems to be well-known:
https://groups.google.com/d/msg/llvm-dev/N9u8Kv1m5do/GCyge4kZSnwJ

Apparently, the idea is that on some old Cortex A8 CPUs, there was a
performance problem with VMLA, so replacing it with (VMUL, VADD) was a work-
around for that.

However, that is missing two facts:

Fact #1:

A (VMUL, VADD) pair needs a register to hold the temporary result of the VMUL.
In fully register-tight code making use of all NEON registers, that means
spilling.

Concretely, matrix multiplication (GEMM) kernels are an example of critical
code using all available NEON registers and doing mostly VMLA. That's how I
stumbled upon this bug: Eigen (http://eigen.tuxfamily.org) was generating
unexplainably bad code, with massive register spillage, running 10x slower than
normal.

Eigen needs to know the number of available registers, and whether a single-
instruction multiply-accumulate (thus not requiring an intermediate temporary
register) is available.
https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-23

The LLVM behavior of silently replacing VMLA by VMUL+VADD breaks what was
supposed to be an architecture invariant, and breaks Eigen's assumptions.

For now, Eigen works around this by reimplementing the vmlaq_f32 intrinsic in
inline assembly:
https://bitbucket.org/eigen/eigen/src/78884e16715fc9a7b726db39195ac8bb17103181/Eigen/src/Core/arch/NEON/PacketMath.h?at=default&fileviewer=file-view-default#PacketMath.h-192

One problem in the microbenchmarks that people have been discussing in the
above llvm mailing list thread, is that they measured isolated VMLA
instructions, not accounting for the side effects on register pressure, which
quickly become dominant in real-world register-tight numerical code.

Fact #2: Most software compiled with this VMLA rewriting, isn't actually
intended to run on a cortex-a8 device specifically.

I'm getting the VMLA rewriting even without passing any -mcpu flag, probably
because -mcpu=cortex-a8 (or some such) is the default:

~/android/toolchains/arm-linux-androideabi-clang3.5/bin/arm-linux-androideabi-
clang++ ~/vrac/vmlaq_f32_testcase.cc -S -o v.s -march=armv7-a -mfloat-
abi=softfp -mfpu=neon -O3

In this command line, I didn't say that I was interested in cortex-a8, so why
would I be getting a cortex-a8 workaround that's detrimental on every other
device, and potentially catastrophic on register-tight code?

Quuxplusone commented 8 years ago

Attached vmlaq_f32_testcase.cc (123 bytes, text/x-c++src): testcase

Quuxplusone commented 8 years ago

Note: this bug exists at least in clang versions 3.5 and 3.8.

Quuxplusone commented 8 years ago

Kristof, I know that this particular issue has been discussed before, but perhaps there is a reason to revisit the original decision. The extra register usage does seem particularly troubling.

Quuxplusone commented 8 years ago

Hi,

It's not a (it's not a feature it's a bug), it's a feature! :)

Seriously though, Clang has no concept of a blended code generation for all cores implementing an architecture variant. Clang compiles for some cpu and that's it.

If you don't set -mcpu, it will be set under your feet. For -march=armv7a, the default is indeed -mcpu=cortex-a8 which does enable this performance erratum workaround.

When this was brought up before, we suggested using a more sane -mcpu argument in Android, such as -mcpu=cortex-a15 or -mcpu=cortex-a57 for -march=armv8a. Is that not an option still?

The alternative is that we implement some blended mode and introduce a fake CPU target for it (like "generic" or something).

James

Quuxplusone commented 8 years ago

I think Benoit has already worked around this issue by setting a more appropriate cpu target. I think the concern is that generic users of clang for the NDK might not be setting mcpu at all, and that will lead them to low performance.

Quuxplusone commented 8 years ago

Hi Steve,

In fact, it looks like we implemented this for clang-3.8:

$ clang-3.8 test.c -o - -S -mfloat-abi=hard -march=armv7a -O3
...
    vmul.f32    q8, q0, q0
    vadd.f32    q0, q8, q0
    bx  lr
...

$ clang-3.8 test.c -o - -S -mfloat-abi=hard -march=armv7a -O3 -mcpu=generic
...
    vmla.f32    q0, q0, q0
    bx  lr
...

So cortex-a8 is still the default, but -mcpu=generic will get you an
architectural target with no core-specific workarounds.

Changing the default from cortex-a8 is probably a no-go given how long it's
been that way, I think.

James

Quuxplusone commented 7 years ago

Coming back to this, I'm going to weigh in that I also think that this might not be the expected behavior. While some specific cortex-a8 cpus need the workaround not all of them do and so the workaround should only be enabled when targeting those specific cpus and not the generic core.

James: Thoughts?

Quuxplusone commented 7 years ago

By this point I'd be perfectly happy to kill -mcpu=cortex-a8 being the default, but I'm really worried about the inevitable code permutation users will notice, and that may even cause regressions.

Unless someone wants to take on a lot of bugzilla traiging pain, I think we really need to consider this hard-baked by now. What do you think?

As an aside, I wasn't aware that there were A8's that weren't affected by this bug? But a8 was somewhat before my time.

Quuxplusone commented 7 years ago

Honestly I think -mcpu=generic is probably the right way to go here if nothing is set for the cpu. While it seems like it's been baked for a long time in the grand scheme of things it might not be so bad to change it.

Quuxplusone commented 7 years ago

Fixed in r304390, by making -mcpu=generic the default.
In r306514, -mcpu=generic was made to schedule instructions in the same way as
when targeting Cortex-A8, to overcome a small performance loss observed from
making -mcpu=generic the default.
More details are available at http://lists.llvm.org/pipermail/llvm-dev/2017-
May/113525.html

Quuxplusone commented 7 years ago

Many thanks for the good analysis and resolution.

Quuxplusone / LLVMBugzillaTest

Rewriting VMLA.F32 instructions as VMUL+VADD is not a feature, it's a bug! #27218