[Loop vectorizer] Run-time failures in test-suite with AVX512F

Quuxplusone commented 7 years ago


Bugzilla Link	PR31671
Status	RESOLVED FIXED
Importance	P normal
Reported by	Elad Cohen (elad2.cohen@intel.com)
Reported on	2017-01-17 08:27:35 -0800
Last modified on	2017-01-21 23:42:45 -0800
Version	trunk
Hardware	PC Windows NT
CC	ayal.zaks@intel.com, elena.demikhovsky@intel.com, llvm-bugs@lists.llvm.org, mkuper@google.com, mssimpso@codeaurora.org, zvirack@gmail.com
Fixed by commit(s)
Attachments	`tsc.ll` (3239970 bytes, application/octet-stream) `tsc.revert.ll` (3239853 bytes, application/octet-stream)
Blocks
Blocked by
See also

Created attachment 17849 test compiled with TOT clang -marh=knl -S -emit-llvm

The tests:

test-suite MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl test-suite MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt

fail in runtime (wrong output) when built with AVX512 (and executed on KNL target) - specifically the function s351() returns a wrong result.

Initially, it seems that is started failing due to the combination of r290810 and r284779. But these are actually just changes to the cost model and enabling interleaved memory accesses by default for X86. Bisecting the erroneous function while compiling with force-vector-width and enable-interleaved-mem-accesses seems to lead to r282418.

I've attached the .ll files produced by compiling with -march=knl using TOT clang VS TOT+revert-r282418.

The C code that contains the diff in IR is: int s351() { ... TYPE alpha = c[0]; for (int nl = 0; nl < 8ntimes; nl++) { for (int i = 0; i < LEN; i += 5) { a[i] += alpha b[i]; a[i + 1] += alpha b[i + 1]; a[i + 2] += alpha b[i + 2]; a[i + 3] += alpha b[i + 3]; a[i + 4] += alpha b[i + 4]; } dummy(a, b, c, d, e, aa, bb, cc, 0.); } .. check(1); return 0; }

Putting aside the fact the the loop wasn't re-rolled, from the diff it looks like the generated IR is doing something wrong(See that attached files for a full context):

TOT: %10 = add <16 x i64> %broadcast.splat, <i64 1, i64 6, i64 11, i64 16, i64 21, i64 26, i64 31, i64 36, i64 41, i64 46, i64 51, i64 56, i64 61, i64 66, i64 71, i64 76> ... %VectorGep86 = getelementptr inbounds %struct.GlobalData, %struct.GlobalData @global_data, i64 0, i32 0, <16 x i64> %10 call void @llvm.masked.scatter.v16f32(<16 x float> %15, <16 x float> %VectorGep86

TOT+revert: %10 = or i64 %offset.idx, 1 %broadcast.splatinsert86 = insertelement <16 x i64> undef, i64 %10, i32 0 %broadcast.splat87 = shufflevector <16 x i64> %broadcast.splatinsert86, <16 x i64> undef, <16 x i32> zeroinitializer ... %VectorGep88 = getelementptr inbounds %struct.GlobalData, %struct.GlobalData @global_data, i64 0, i32 0, <16 x i64> %broadcast.splat87 call void @llvm.masked.scatter.v16f32(<16 x float> %14, <16 x float> %VectorGep88

In the latter it seems that the scatter writes to a splat of the i+1 instead of <i+1,i+6,i+11,..,i+76>

Quuxplusone commented 7 years ago

Attached tsc.ll (3239970 bytes, application/octet-stream): test compiled with TOT clang -marh=knl -S -emit-llvm

Quuxplusone commented 7 years ago

Attached tsc.revert.ll (3239853 bytes, application/octet-stream): test compiled with TOT clang + revert r282418 -march=knl -S -emit-llvm

Quuxplusone commented 7 years ago

Thanks for reporting, Elad. I'll take a look.

Quuxplusone commented 7 years ago

Patch submitted for review: https://reviews.llvm.org/D28819.

Quuxplusone commented 7 years ago

Patch committed in r292254. Please let me know if this doesn't resolve the problem.

Quuxplusone / LLVMBugzillaTest

[Loop vectorizer] Run-time failures in test-suite with AVX512F #30644