ARM NEON vectorization miscompiles code in 5.0

Quuxplusone commented 7 years ago


Bugzilla Link	PR33921
Status	RESOLVED FIXED
Importance	P enhancement
Reported by	Martin Storsjö (martin@martin.st)
Reported on	2017-07-25 04:17:55 -0700
Last modified on	2017-08-21 12:54:40 -0700
Version	5.0
Hardware	PC All
CC	ahmed@bougacha.org, ayal.zaks@intel.com, charles.baylis@linaro.org, diana.picus@linaro.org, hans@chromium.org, james@jamesmolloy.co.uk, jeroen.ketema@yahoo.com, llvm-bugs@lists.llvm.org, rengolin@gmail.com
Fixed by commit(s)
Attachments	`clang-5.0-vectorize-bug.tar.gz` (1262 bytes, application/x-gzip) `clang-5.0-vectorize-bug.tar.gz` (1031 bytes, application/x-gzip) `vp9dsp.i` (22813 bytes, text/plain) `vp9dsp.c` (1603 bytes, text/plain) `vp9dsp.ll` (15394 bytes, text/plain) `vp9dsp.ll` (5952 bytes, text/plain) `opt-vectorize` (645857 bytes, text/plain) `opt-novectorize` (465918 bytes, text/plain) `vp9dsp-vectorized.ll` (15309 bytes, text/plain) `shuffle.ll` (676 bytes, text/plain)
Blocks	PR33849
Blocked by
See also

Starting with SVN r306803, some C code that wasn't vectorized before now is vectorized. This seems to break e.g. some VP9 inverse transforms in libavcodec, in code like this:

https://git.libav.org/?p=libav.git;a=blob;f=libavcodec/vp9dsp.c;h=7f863943284b8970744c81b0b7d749e5c9cf8bdf;hb=2b1324bd167553f49736e4eaa94f96da9982925e#l947 with the idct8_1d and iadst8_1d functions. If this file is built with -fno-tree-vectorize, the issue disappears.

This regressed when upgrading from SVN r305659 to the 5.0 branch point: https://fate.libav.org/armv7-win32-clang-5.0

Bisecting this leads to SVN r306803. The same issue can also be reproduced when targeting linux, not only windows. It should be reproducable by building libav and running "make fate-checkasm-vp9dsp", which should pass unless this is miscompiled.

I will dig into it further, hopefully soon, to fill in with a better reproducable testcase of the issue.

Quuxplusone commented 7 years ago

+Ayal

Quuxplusone commented 7 years ago

> with -O3 -fno-tree-vectorize

you mean -O3 -fno-vectorize (or else the bug may be in GCC's auto-vectorizer? ;-
)

The reference version not-to-be-vectorized can instead be decorated with

#pragma clang loop vectorize(disable)
  for (i = 0; i < 8; i++)
    ...

Can the reproducer be made target independent?

Quuxplusone commented 7 years ago

Attached clang-5.0-vectorize-bug.tar.gz (1262 bytes, application/x-gzip): Vectorizer bug reproduction case

Quuxplusone commented 7 years ago

Attached clang-5.0-vectorize-bug.tar.gz (1031 bytes, application/x-gzip): Vectorizer bug reproduction case

Quuxplusone commented 7 years ago

(In reply to ayal.zaks from comment #3)
> > with -O3 -fno-tree-vectorize
>
> you mean -O3 -fno-vectorize (or else the bug may be in GCC's
> auto-vectorizer? ;-)

Sure - updated the example to use that option name.

> The reference version not-to-be-vectorized can instead be decorated with
>
> #pragma clang loop vectorize(disable)
>   for (i = 0; i < 8; i++)
>     ...

Hmm, I could try that later.

> Can the reproducer be made target independent?

I got it stripped down a bit further now and shouldn't be arm specific any
longer. I do think the bug is limited to ARM though, because I wasn't able to
reproduce it on e.g. aarch64, where the SIMD support otherwise are pretty
similar to ARM.

Quuxplusone commented 7 years ago

I got it stripped down a bit further now and shouldn't be arm specific any longer. I do think the bug is limited to ARM though, because I wasn't able to reproduce it on e.g. aarch64, where the SIMD support otherwise are pretty similar to ARM.

I tried it natively on x86 with a recent version of trunk, replacing "-target armv7-linux-gnueabihf" with "-Rpass=loop-vectorize". The test passes, noting that (only) the first compilation vectorized the two loops in vp9dsp.c (lines 45 and 47), with VF=8 and UF=1, presumably as intended. So this indeed looks like an ARM/NEON specific issue, as the title indicates.

Quuxplusone commented 7 years ago

Weird, as off SVN@308006, I am seeing no difference between:

$ clang -target armv7-linux-gnueabihf -S vp9dsp.c -o vp9dsp.s -O3

and

$ clang -target armv7-linux-gnueabihf -S vp9dsp.c -o vp9dsp-novec.s -O3 -fno-vectorize

ie. it can't vectorise either.

Was the patch that introduced the vectorisation reverted?

Quuxplusone commented 7 years ago

(In reply to Renato Golin from comment #7)
> Weird, as off SVN@308006, I am seeing no difference between:
>
> $ clang -target armv7-linux-gnueabihf -S vp9dsp.c -o vp9dsp.s -O3
>
> and
>
> $ clang -target armv7-linux-gnueabihf -S vp9dsp.c -o vp9dsp-novec.s -O3
> -fno-vectorize
>
> ie. it can't vectorise either.
>
> Was the patch that introduced the vectorisation reverted?

No, I don't think so. I still manage to reproduce it, both using the latest
trunk, and with the latest from the release_50 branch, and with SVN r308006
that I just built.

Quuxplusone commented 7 years ago

$ clang -target armv7a-linux-gnueabihf -mcpu=cortex-a15 -mfloat-abi=hard -
mfpu=neon -S vp9dsp.c -o /dev/null -O3 -Rpass-analysis=loop-vectorize
vp9dsp.c:45:5: remark: loop not vectorized: vectorization is not beneficial and
is not explicitly forced [-Rpass-analysis=loop-vectorize]
    for (i = 0; i < 8; i++)
    ^
vp9dsp.c:47:5: remark: loop not vectorized: vectorization is not beneficial and
is not explicitly forced [-Rpass-analysis=loop-vectorize]
    for (i = 0; i < 8; i++)

I'm building trunk at r310649...

Quuxplusone commented 7 years ago

(In reply to Renato Golin from comment #9)
> $ clang -target armv7a-linux-gnueabihf -mcpu=cortex-a15 -mfloat-abi=hard
> -mfpu=neon -S vp9dsp.c -o /dev/null -O3 -Rpass-analysis=loop-vectorize
> vp9dsp.c:45:5: remark: loop not vectorized: vectorization is not beneficial
> and is not explicitly forced [-Rpass-analysis=loop-vectorize]
>     for (i = 0; i < 8; i++)
>     ^
> vp9dsp.c:47:5: remark: loop not vectorized: vectorization is not beneficial
> and is not explicitly forced [-Rpass-analysis=loop-vectorize]
>     for (i = 0; i < 8; i++)
>
> I'm building trunk at r310649...

With those options, I'm getting these messages:

$ clang -target armv7a-linux-gnueabihf -mcpu=cortex-a15 -mfloat-abi=hard -
mfpu=neon -S vp9dsp.c -o vp9dsp.s -O3 -Rpass-analysis=loop-vectorize
vp9dsp.c:45:5: remark: the cost-model indicates that interleaving is not
      beneficial [-Rpass-analysis=loop-vectorize]
    for (i = 0; i < 8; i++)
    ^
vp9dsp.c:47:5: remark: the cost-model indicates that interleaving is not
      beneficial [-Rpass-analysis=loop-vectorize]
    for (i = 0; i < 8; i++)

I'm attaching the preprocessed source, in case that helps you reproduce the
issue, if the system headers differ in a way that makes a difference here.

Quuxplusone commented 7 years ago

Right, this was not a loop vectoriser issue, but an SLP issue. Updating to trunk made it show up with the same Rpass warnings you got (which are not from SLP).

Though, I have to say, the output is huge and it'll be hard to identify what went wrong just by looking at the assembly.

Can you run a creduce on vp9dsp.c with the make file you have?

Quuxplusone commented 7 years ago

How would I ideally run creduce in this case? When it's two different translation units - creduce would either need to not touch the external interface, or update it in sync.

Quuxplusone commented 7 years ago

I think that's where your hack to change the function name comes in handy. :)

If you compile the same function from the same file in different ways (with and without vectorize), using it in the main file (which you don't update), then running make, then running the program and outputting errors if they differ should suffice, no?

Means whatever change you do for one, does it for both, which is what we want.

Quuxplusone commented 7 years ago

Yes, in that sense it'd be straightforward. But AFAIK with creduce, I can only give two return values, whether the version is interesting or not - I can't indicate that the modified vp9dsp.c fails to maintain the interface to main.c so that it will fail for some unrelated reason (or, say, fail to link, if it'd rename the entry point function).

Quuxplusone commented 7 years ago

Ok, if I just sort all such cases under "not interesting", I guess it'd work. I need to try it out, but that'll be tomorrow or later, unfortunately.

Quuxplusone commented 7 years ago

I managed to run creduce on it, but it soon degenetated into just copying the uninitialized tmp stack buffer into dst, giving the same effect. So I guess I'd at least need to check e.g. that the vectorizer did something...

Quuxplusone commented 7 years ago

Can you attach the reduced case? I'm guessing this has nothing to do with the vectorisation itself...

Remember that the vectorisers have dependency passes, so when you disable it, some other passes also don't run in the same way. It could be an artefact of these other passes.

It'd be good to see the debug output from opt for both "-O3" and "-O3 -fno-vectorise" and see where they diverge.

Quuxplusone commented 7 years ago

Attached vp9dsp.i (22813 bytes, text/plain): Preprocessed C source

Quuxplusone commented 7 years ago

Attached vp9dsp.c (1603 bytes, text/plain): Slightly simplified reproduction case

Quuxplusone commented 7 years ago

$ opt -print-after-all file.ll

should give you the function's IR after every pass.

Normally, you'd generate -O0 IR, then pass opt on it with different options and see the IR change depending on them. If you pass on an IR that has already been optimised, you might not see many differences.

Quuxplusone commented 7 years ago

Thanks, now we might be getting somewhere.

You're right that I shouldn't have used -O3 to produce the IR using clang, since it turns out that IR (that I attached) already was vectorized. Then on the other hand, if I produce the IR using -O0 with clang, opt won't vectorize it either. So I'd need to have clang optimize the IR it outputs, to the point that opt actually can vectorize it (or whichever other step breaks it).

Quuxplusone commented 7 years ago

You need to figure out what options clang is passing to make it vectorise that
opt doesn't assume from the triple in the IR.

If you call clang with "-###" it'll tell you all the target options it passes
down, which may clue on what options you need to pass to opt.

Common ones that make it not vectorise is the lack of "-mfpu=neon", "-mfloat-
abi=hard" and to make sure you're getting a A-class core, either by "-
march=armv7a" or "-mcpu=cortex-a15" or similar.

Unfortunately, not all target features are passed down to the IR. even if most
of the logic is persisted as function attributes, they may not be enough to
help the passes to infer what you wanted in clang.

This may sound unfortunate, but it's sort of a design decision. Since opt and
llc are internal tools (not user facing), you must repeat some of the options
to make sure you keep the intended behaviour. This way, you can *change* the
intended behaviour and debug optimisation passes in different ways, that
wouldn't be possible with just clang.

Quuxplusone commented 7 years ago

Attached vp9dsp.ll (15394 bytes, text/plain): LLVM IR for the testcase

Quuxplusone commented 7 years ago

Attached vp9dsp.ll (5952 bytes, text/plain): LLVM IR for the testcase, prior to vectorization

Quuxplusone commented 7 years ago

Attached opt-vectorize (645857 bytes, text/plain): opt output for a run with vectorization

Quuxplusone commented 7 years ago

Attached opt-novectorize (465918 bytes, text/plain): opt output for a run without vectorization

Quuxplusone commented 7 years ago

Right, at least you have isolated everything *before* the vectorizer. I thought
the "stop-after" would work, but that's ok, if you just copy and paste the IR
after the loop vectorizer back into the file, it should work, as long as you
only have one function.

Another option is to use opt to force specific passes, instead of "-O3", so
that you can play with them more freely. If you're feeling adventurous, you can
also try "bugpoint", which does a similar IR handling. Unfortunately, I haven't
used it that much, so can't help you set it up.

Quuxplusone commented 7 years ago

Yeah, I tested copypasting the IR from the step output, but it fails to build due to "error: use of undefined metadata '!7'" (The original input only has got metadata !1 - !6.) Not sure if it's ok to just fill in empty metadata entries for the ones I'm missing? opt only seems to give me bitcode, not textual IR, for the final result - would the metadata entries match for that one as well in case I find how to convert that bitcode back into textual IR?

Quuxplusone commented 7 years ago

Attached vp9dsp-vectorized.ll (15309 bytes, text/plain): IR output after the loop vectorizer pass

Quuxplusone commented 7 years ago

I got some more progress with figuring out what's going wrong...

The error seems to lie in the following segment of the input IR:

  %120 = shufflevector <4 x i16> %96, <4 x i16> %98, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %121 = shufflevector <4 x i16> %101, <4 x i16> %104, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %122 = shufflevector <4 x i16> %107, <4 x i16> %110, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %123 = shufflevector <4 x i16> %113, <4 x i16> %116, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %124 = shufflevector <8 x i16> %120, <8 x i16> %121, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %125 = shufflevector <8 x i16> %122, <8 x i16> %123, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %126 = shufflevector <16 x i16> %124, <16 x i16> %125, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
  %interleaved.vec = shufflevector <32 x i16> %126, <32 x i16> undef, <32 x i32> <i32 0, i32 4, i32 8, i32 12, i32 16, i32 20, i32 24, i32 28, i32 1, i32 5, i32 9, i32 13, i32 17, i32 21, i32 25, i32 29, i32 2, i32 6, i32 10, i32 14, i32 18, i32 22, i32 26, i32 30, i32 3, i32 7, i32 11, i32 15, i32 19, i32 23, i32 27, i32 31>

One small segment of that translates into this MIR:

    dead %d8, %d1 = VZIPd16 killed %d8, killed %d1, 14, _
    %d28 = VEXTd16 %d7, killed %d28, 2, 14, _
    %d29 = VEXTd16 %d7, killed %d29, 2, 14, _
    %d7 = VEXTd16 killed %d7, %d6, 2, 14, _
    %d8 = VEXTd16 %d26, %d3, 2, 14, _
    %d31 = VEXTd16 %d4, killed %d31, 2, 14, _
    %d2 = VEXTd16 %d26, killed %d2, 2, 14, _
    dead %d30, %d18 = VZIPd16 killed %d30, killed %d18, 14, _
    %d19 = VEXTd16 killed %d26, killed %d19, 2, 14, _
    %d26 = VEXTd16 killed %d3, killed %d5, 2, 14, _
    %d28 = VEXTd16 killed %d28, %d1, 2, 14, _
    %d29 = VEXTd16 killed %d29, killed %d1, 2, 14, _          ---- this should use d8 instead of d1

If I manually change that last vext instruction in the output assembly to use
d8 instead of d1 (and move it up, since d8 is clobbered inbetween), and build
using that, I at least get this particular testcase to run correctly.

How do I dig into what happens inbetween the vectorized IR and getting it into
this form? I guess one would have to add lots of printf debugging in
ARMISelLowering.cpp to figure that out? I found that if I add -pre-RA-
sched=linearize I get code that is much easier to follow, which also exhibits
the same issue.

I haven't yet looked at the code in that form yet though - I guess that'd be
the next step.

Quuxplusone commented 7 years ago

Hi Martin,

Great progress!

If the bug can be reproduced in QEMU, you could bisect on a fast machine and
just run it in user-mode QEMU. Seeing the commit that caused it, now that we
know what's going on with great accuracy, hopefully, would be much easier to
pinpoint the fix.

cheers,
--renato

Quuxplusone commented 7 years ago

I actually already did bisect it, and found that it started breaking from this
commit, SVN r306803:

commit bfae62c2cbac460e01f5d4bbbaae459e17ad2437
Author: Ayal Zaks <ayal.zaks@intel.com>
Date:   Fri Jun 30 08:02:35 2017 +0000

    [LV] Optimize for size when vectorizing loops with tiny trip count

This commit itself only touches the generic vectorization logic (IIRC this
piece of code wasn't vectorized at all before that). So the issue itself has
been there all along (or at least for some time), but only started getting
exposed now.

I'll try to pick it up, digging further into how the shuffles actually are
lowered, and which part does a vzip16 but picks up the wrong half of the result.

Quuxplusone commented 7 years ago

(In reply to Martin Storsjö from comment #32)

> I'll try to pick it up, digging further into how the shuffles actually are
> lowered, and which part does a vzip16 but picks up the wrong half of the
> result.

I forgot half the sentence - I'll try to pick it up later - in case someone
else wants to have a look at the shuffles and vzips.

Quuxplusone commented 7 years ago

Right, sorry, I thought that was a commit that you noticed, not the actual first commit.

In that case, it may be more complicated, because it seems that the problem is a mix of vectorisation, lowering and register allocation.

Neither the IR nor the MIR look particularly bad. Do you have any reason why it should use D8 instead of D1, other than "it works that way"?

At that time, D1 is dead but D8 is not, so the allocator is doing the right thing. As you said, moving that instruction up would pick d8 before it was clobbered, but that's the kind of decision you can't rely on.

So, there seems to be something missing in between. Either the lowering to MIR is not linearising correctly (missing kills?), or the allocator is messing things up.

Regardless, that patch is an optimisation for code size only, which is a use case narrow enough that I think we should revert it until we can sort this problem.

Ayal, is there any other patches that depend on that one? Can we revert until we figure out what the problem is?

--renato

Quuxplusone commented 7 years ago

That patch is an optimization for performance in general. It leverages the code-size objective internally, to prevent potential slowdowns due to overheads associated with increasing the size.

It should be standalone. Would you like to introduce some target-dependent flag to disable it?

Quuxplusone commented 7 years ago

Hi Ayal, no, this problem doesn't seem to be strictly NEON-related, even though it just happens on ARM. I think there's more at play here, and we don't know if this have actually affected any other arch.

Given that this will go into 5.0, and it means not able to compile VP9, I think it would be safer to revert it for now, investigate, then bring it back again. Being it stand-alone, there won't be a huge head-ache to revert/reapply.

Makes sense?

Quuxplusone commented 7 years ago

(In reply to Renato Golin from comment #36)
> Given that this will go into 5.0, and it means not able to compile VP9

FWIW, in libavcodec where I noticed it, it hits the reference C implementation
of this function - but in practice, there's also a hand-written assembly
version of this function which will be used anywhere llvm would vectorize this
anyway. So the only thing it actually breaks is the test of the assembly
function, where it points out that the assembly version differs from the
reference C version.

That said, I don't mind if this is reverted for 5.0 - that at least gives
cleaner testsuite results for libavcodec :-) But it looks like the root cause
is pretty much independent of the commit that triggered it; I guess other
functions could be vectorized with similar shuffles - and break, even if that
particular commit is reverted.

Quuxplusone commented 7 years ago

Martin, that is true. If the codecs still work (even if the compiler is
actually broken), I'm ok with it staying in tree until we fix.

I agree the patch itself is not a sure way of making it not happen again, I was
just making sure we don't regress seriously on known programs.

If you're ok with us investigating while in tree, I think it will be better
this way.

cheers,
--renato

Quuxplusone commented 7 years ago

Yeah, as long as this is the only case where it triggers so far, there's no urgent need to revert. And if lucky, we'll still find the root cause before 5.0 ships (and have enough confidence in the fix to backport it with good confidence).

Quuxplusone commented 7 years ago

This is the only issue I'm aware of that is raised by this patch.
Agreed, it should be quite easy to revert/reapply being a small standalone
patch. But it does provide some high single-digit percent of improvements for a
few benchmarks.

Quuxplusone commented 7 years ago

Attached shuffle.ll (676 bytes, text/plain): Almost minimal IR that shows the problematic shuffles

Quuxplusone commented 7 years ago

CC:ing Ahmed, who wrote that particular piece of code that produces the surprising/incorrect results here.

Quuxplusone commented 7 years ago

So the real culprit seems to be in SVN r240118:

commit ac655060b4d484911579ae663e231da8c23d4211
Author: Ahmed Bougacha <ahmed.bougacha@gmail.com>
Date:   Fri Jun 19 02:32:35 2015 +0000

    [ARM] Look through concat when lowering in-place shuffles (VZIP, ..)

If I disable that particular code block
    if (V1->getOpcode() == ISD::CONCAT_VECTORS && V2->isUndef()) {
in LowerVECTOR_SHUFFLE in ARMISelLowering.cpp, the C level test case also runs
correctly now, and the full case in libavcodec also seems to be built correctly.

Quuxplusone commented 7 years ago

Just for reference... With the problematic optimization disabled, shuffle.ll
gets compiled correctly into this:

$ llc shuffle.ll -o -
        vldr    d16, [r1]
        vldr    d19, [r0]
        vtrn.16 d19, d16
        vst1.64 {d18, d19}, [r0]

Quuxplusone commented 7 years ago

I posted a first attempt at a fix in https://reviews.llvm.org/D36899, although I'm not at all sure if this is correct. Since it at most skips some case of optimization, it should be rather safe though.

Quuxplusone commented 7 years ago

After some more thinking about the issue, I came up with something I thought was a good fix, and when trying to expand that fix to all related functions, I noticed that it already was fixed there in isVTRNMask (fixed in SVN r247254, "[ARM] Do not use vtrn for vectorshuffle if the order is reversed", by Jeroen Ketema, applied by James Molloy).

So I updated the fix in https://reviews.llvm.org/D36899, which should now be a serious patch that I hope can be reviewed, and later backported, soon.

Quuxplusone commented 7 years ago

The fix was committed in r311258 - Hans, can you backport?

Quuxplusone commented 7 years ago

Hi Hans, patch looks good for backporting.

Cheers,
Renato

Quuxplusone commented 7 years ago

(In reply to Renato Golin from comment #48)
> Hi Hans, patch looks good for backporting.
>
> Cheers,
> Renato

Merged in r311369. Thanks!

Quuxplusone / LLVMBugzillaTest

ARM NEON vectorization miscompiles code in 5.0 #32893