Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Possible vectorization or CodeGen bug #28134

Open Quuxplusone opened 8 years ago

Quuxplusone commented 8 years ago
Bugzilla Link PR28135
Status NEW
Importance P normal
Reported by Pirama Arumuga Nainar (pirama@google.com)
Reported on 2016-06-15 01:35:37 -0700
Last modified on 2016-08-05 10:40:48 -0700
Version trunk
Hardware All All
CC arnaud.degrandmaison@arm.com, hfinkel@anl.gov, kristof.beyls@arm.com, llvm-bugs@lists.llvm.org, rengolin@gmail.com, silviu.baranga@arm.com, spatel+llvm@rotateright.com, srhines@google.com
Fixed by commit(s)
Attachments codec-loop.c (1944 bytes, application/octet-stream)
bug.ll (2366 bytes, application/octet-stream)
Blocks PR21420
Blocked by
See also
Created attachment 16545
Simplified function for possible vectorization bug

[Tentatively marking "loop optimization" component this repros in both AArch64
and ARM targets]

This relates to the Android issue mentioned in https://android-
review.googlesource.com/#/c/235875/.

The function 'loop' in the attached 'codec-loop.c' function produces incorrect
output when the 'for' loop in it is vectorized and called from that source file
(https://android-
review.googlesource.com/#/c/235875/2/media/libstagefright/codecs/amrwbenc/src/voAMRWBEnc.c).
The incorrect output causes corruption in the codec's output.

The failure reproduces on both AArch64 and ARM32.  Various workarounds fix the
failure (adding prints in the loop body, turning on integer sanitization etc)
but they all succeed only because vectorization and vector instructions don't
get used.

I am trying to find the right set of inputs that trigger the issue in the
reduced "loop" functiono.  I'll post it here if I am successful.

PS: I tested this on Android's Clang, which is a bit behind trunk.  However,
LLVM IR from both ToT and Android's clang was the same, except for very minor
differences).
Quuxplusone commented 8 years ago

Attached codec-loop.c (1944 bytes, application/octet-stream): Simplified function for possible vectorization bug

Quuxplusone commented 8 years ago
Hi Pirama,

Thanks for the reproducible case. Though I'm at a loss of how to test this.

As we discussed earlier, it seems that the L_mult is the problem here, as if
you change line:

  L_var_out *= 2;

to

  L_var_out>>= 1;

it works with NEON instructions, right?

As I said, in the interim, keep the pragma vectorize(disable) so we can fix
this in the best way.

But for now, what we really need is a main function that will show different
output with a training dataset. Looking at the IR may not give us much to go
for, I'm afraid.
Quuxplusone commented 8 years ago

Arnaud, this is the NEON bug that I mentioned on today's call. Note that the reduced case here doesn't quite work yet, but we can help with getting someone running this in AOSP if you like. It reproduces there 100% of the time, and we can tweak small options in the source to make it disappear/reappear, so it might just be simpler to work from that direction.

Quuxplusone commented 8 years ago

Should we link it to the PR21420 (Android meta bug) ?

Quuxplusone commented 8 years ago

Yes, done.

Quuxplusone commented 8 years ago
(In reply to comment #2)
> and we can tweak small options in the source to make it
> disappear/reappear, so it might just be simpler to work from that direction.

Steve, I don't recommend you tweak the code to make it disappear, as that'll
also be unstable and could catch you in the future.

The most sensible work around for now is to use the #pragma, and add a FIXME
pointing to this bug.
Quuxplusone commented 8 years ago

Sorry, my comment wasn't complete last time. We aren't hacking this to make the bug disappear. As the code is today, the bug no longer reproduces (because it is using safe shift operations instead of the original code that exhibited UB). The reason is that the new code doesn't generate NEON at all, and instead just generates a simple non-vectorized loop. Pirama found a way to get it to generate NEON in this case again (without UB), and that still exhibits the original bug (corrupt audio data). I will let him explain further about how to trigger this under the latest code.

Quuxplusone commented 8 years ago
(In reply to comment #6)
> Sorry, my comment wasn't complete last time. We aren't hacking this to make
> the bug disappear. As the code is today, the bug no longer reproduces
> (because it is using safe shift operations instead of the original code that
> exhibited UB). The reason is that the new code doesn't generate NEON at all,
> and instead just generates a simple non-vectorized loop. Pirama found a way
> to get it to generate NEON in this case again (without UB), and that still
> exhibits the original bug (corrupt audio data). I will let him explain
> further about how to trigger this under the latest code.

An unrelated patch in the vectorizer to make that shift work would break it
again... Regardless of the solution or current state, I still strongly
recommend the pragma as a stop-gap work around.
Quuxplusone commented 8 years ago

Renato, your concern is valid. Our plan is to fix this the right way before the next compiler update in Android. We'll either include the proper fix or use the pragma disabling vectorization if we are not able to fix this by then.

I'll reduce the AOSP patch (https://android-review.googlesource.com/#/c/235875/) and add an explanation here. Like Steve mentioned earlier, I've been unsuccessful in reproducing this in a standalone fashion.

Quuxplusone commented 8 years ago
(In reply to comment #8)
> Renato, your concern is valid.  Our plan is to fix this the right way before
> the next compiler update in Android.  We'll either include the proper fix or
> use the pragma disabling vectorization if we are not able to fix this by
> then.

Sounds like a plan. :)

> I'll reduce the AOSP patch
> (https://android-review.googlesource.com/#/c/235875/) and add an explanation
> here.  Like Steve mentioned earlier, I've been unsuccessful in reproducing
> this in a standalone fashion.

I suspected as much. :(
Quuxplusone commented 8 years ago

I also tried to reproduce this by using LLVM ToT and (a lot of) randomly generated data for the loop. As far as I can see, the results of the vectorized version are the same as the non-vectorized version. I suspect there's either some alignment issue with the data that we're not seeing here or the vectorized version of the loop is correct but it somehow triggers a bug somewhere else?

The IR also looks correct after vectorization (at least as far as I can tell).

The best way to progress this seems to be to investigate the source.

Quuxplusone commented 8 years ago
Here's my findings so far.  Please refer to https://android-
review.googlesource.com/#/c/235875 for the actual sources.  Repro steps are at
the end of this comment.

Function 'coder' in voAMRWBEnc.c has a loop that is miscompiled.
Miscompilation is detected by comparing against the output of non-vectorized
loop processing the same data.  (See line 1254 onwards in voAMRWBEnc.c)

I've extracted the loop out to function 'loop_incorrect' in the same file.  I
also copied the function to a separate file (loop_correct.c).  Calling
loop_incorrect causes the output to be different than the non-vectorized loop.
OTOH calling loop_correct from 'coder' passes (i.e. gives the same result as
the non-vectorized loop).

I've included additional information in https://android-
review.googlesource.com/#/c/235875/.  loop_correct.ll and loop_incorrect.ll
have the LLVM IR for the two functions.  One notable difference between the two
is that the inlined-version of L_shl2 gets compiled differently.  AFAICT, the
two differences are sematically equivalent.  (cmd-loop_incorrect.sh and cmd-
loop_correct.sh can be used from root of AOSP tree to compile the sources and
emit .bc files).

I've also added another file, standalone_main.c that calls loop_incorrect and
loop_correct from a standalone executable on an input that fails when called
within coder.  Unfortunately, the two functions produce identical output as the
non-vectorized loop.

It is remotely possible that when called from standalone_main.c, the checks for
array overlaps at the top of the vectorized loop fail and we fall back to the
scalar loop but I can't imagine why this should happen.

===================================
Repro steps:

Run the following for an Android device:
- adb root
- adb shell chmod 777 /data/local/tmp
- adb shell setenforce 0

Apply https://android-review.googlesource.com/#/c/235875/ and rebuild:
$ cd $ANDROID_BUILD_TOP/frameworks/av/media/libstagefright/codecs/amrwbenc
$ git fetch https://android.googlesource.com/platform/frameworks/av
refs/changes/75/235875/3 && git cherry-pick FETCH_HEAD
$ mma # Add -jn for parallel build w/ n processes
$ adb sync

Apply https://android-review.googlesource.com/#/c/235874/:
$ cd $ANDROID_BUILD_TOP/cts
$ git fetch https://android.googlesource.com/platform/cts
refs/changes/74/235874/1 && git cherry-pick FETCH_HEAD
$ m cts # Add -jn for parallel build w/ n processes

Run test:
$ cts-tradefed run cts --k --class android.media.cts.EncoderTest --method
'testAMRWBEncoders' --force-abi 64

# Follow 'adb logcat' output in another shell
$ adb logcat | grep MEDIA

Successful runs will print several "outputs were identical" while failed runs
will log "different @..."
Quuxplusone commented 7 years ago

Attached bug.ll (2366 bytes, application/octet-stream): Reproduce with opt -O3 bug.ll -S -o -