| Bugzilla Link | PR46888 |
| --- | --- |
| Status | NEW |
| Importance | P enhancement |
| Reported by | Joel Hutton (joel.hutton@arm.com) |
| Reported on | 2020-07-29 03:31:35 -0700 |
| Last modified on | 2021-08-04 14:05:45 -0700 |
| Version | trunk |
| Hardware | Other Linux |
| CC | david.bolvansky@gmail.com, david.green@arm.com, efriedma@quicinc.com, florian_hahn@apple.com, lebedev.ri@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com |
| Fixed by commit(s) | |
| Attachments | |
| Blocks | PR46929 |
| Blocked by | |
| See also | |
From clang, we get the values in the inner loop extended to i32 (note that this example is without the abs() call):
```llvm
%20 = load i8*, i8** %5, align 8
%21 = load i32, i32* %11, align 4
%22 = sext i32 %21 to i64
%23 = getelementptr inbounds i8, i8* %20, i64 %22
%24 = load i8, i8* %23, align 1
%25 = zext i8 %24 to i32
%26 = load i8*, i8** %7, align 8
%27 = load i32, i32* %11, align 4
%28 = sext i32 %27 to i64
%29 = getelementptr inbounds i8, i8* %26, i64 %28
%30 = load i8, i8* %29, align 1
%31 = zext i8 %30 to i32
%32 = sub nsw i32 %25, %31
%33 = load i32, i32* %9, align 4
%34 = add nsw i32 %33, %32
store i32 %34, i32* %9, align 4
br label %35
```
I think we could narrow the width of the extends in SLPVectorizer and extend the result: extend the operands to i16, compute on i16, then extend the result to i32 and store. It seems like we have to drop the nsw flags, though:
https://alive2.llvm.org/ce/z/bFjAHs
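A minimal sketch of the narrowed scalar computation (value names are illustrative, not taken from the clang output above):

```llvm
; Widen the i8 loads only to i16, subtract at i16, then do a single
; sext of the difference before the i32 accumulate. Note the plain
; `sub` -- the nsw flag from the original i32 computation is dropped.
%a = load i8, i8* %pa, align 1
%a.ext = zext i8 %a to i16
%b = load i8, i8* %pb, align 1
%b.ext = zext i8 %b to i16
%diff = sub i16 %a.ext, %b.ext
%diff.ext = sext i16 %diff to i32
%acc.next = add i32 %acc, %diff.ext
```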
I think the entire sum in the original testcase fits into 16-bit arithmetic, since it's the sum of 256 8-bit numbers: the maximum is 256 * 255 = 65280, which still fits in an unsigned 16-bit value. Granted, it's a little tricky to detect that.
https://godbolt.org/z/3e9hY5GEa
On trunk we're now recognising the reductions (aarch64 is unrolling the outer loop; x86_64 is not, but the inner loop is where the fun is):
```llvm
%12 = bitcast i8* %10 to <16 x i8>*
%13 = load <16 x i8>, <16 x i8>* %12, align 1
%14 = zext <16 x i8> %13 to <16 x i32>
%15 = bitcast i8* %11 to <16 x i8>*
%16 = load <16 x i8>, <16 x i8>* %15, align 1
%17 = zext <16 x i8> %16 to <16 x i32>
%18 = sub nsw <16 x i32> %14, %17
%19 = call <16 x i32> @llvm.abs.v16i32(<16 x i32> %18, i1 true)
%20 = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> %19)
```
I think that SAD pattern is clear enough now that we could get value tracking to recognise that the upper bits of the reduction result aren't critical, and reduce the width we extend to; a sketch of the narrowed form is below.
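A hypothetical narrowed version of the vectorised loop above, assuming value tracking can prove the reduction fits in 16 bits (each absolute difference fits in 8 bits, and 16 lanes sum to at most 16 * 255 = 4080); value names are illustrative:

```llvm
; Extend only to i16 lanes; the reduction result is widened once at the end.
; The abs int_min_poison flag is false here because i16 lanes built from
; i8 zexts can never hold INT16_MIN.
%wa = zext <16 x i8> %va to <16 x i16>
%wb = zext <16 x i8> %vb to <16 x i16>
%d  = sub <16 x i16> %wa, %wb
%ad = call <16 x i16> @llvm.abs.v16i16(<16 x i16> %d, i1 false)
%r  = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16> %ad)
%rx = zext i16 %r to i32
```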
So I guess the next step is to add llvm.vector.reduce.add support to the
computeKnownBits/computeNumSignBits implementations.
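For illustration, this is the kind of fact a computeKnownBits implementation for the reduction would need to derive (names are hypothetical):

```llvm
; Every lane of %v has its top 8 bits known zero (zext from i8), so a
; 16-lane add reduction is at most 16 * 255 = 4080 < 2^12, i.e. the top
; 4 bits of %r could be reported as known zero.
%v = zext <16 x i8> %bytes to <16 x i16>
%r = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16> %v)
```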