Open Quuxplusone opened 6 years ago
Bugzilla Link | PR38330 |
Status | NEW |
Importance | P enhancement |
Reported by | David Bolvansky (david.bolvansky@gmail.com) |
Reported on | 2018-07-26 05:01:37 -0700 |
Last modified on | 2019-10-29 08:50:42 -0700 |
Version | trunk |
Hardware | PC Linux |
CC | craig.topper@gmail.com, efriedma@quicinc.com, florian_hahn@apple.com, hfinkel@anl.gov, hideki.saito@intel.com, listmail@philipreames.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk |
Fixed by commit(s) | |
Attachments | |
Blocks | |
Blocked by | |
See also |
LLVM can vectorize the loop; it just chooses not to on x86 without AVX-512. You'd probably have to benchmark to see whether the cost model is correct.
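#include <stdlib.h>  /* atoi */

int vec(int *restrict A, int *restrict B, int *restrict C, int N) {
    int sum = 0;
    /* Force vectorization even if the cost model says it is not profitable. */
#pragma clang loop vectorize(enable)
    for (int i = 0; i < N; i += 4) {  /* stride-4 access */
        C[i] = A[i] + B[i];
        sum += C[i];
    }
    return sum;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    /* Array contents are not initialized; only the loop timing matters here. */
    int A[1024];
    int B[1024];
    int C[1024];
    int s = 0;
    for (int i = 0; i < atoi(argv[1]); ++i) {
        s += vec(A, B, C, 1024);
    }
    return s;
}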
PC (Intel Core i7 4720HQ):
With pragma - 6.4 sec
Without pragma - 8.3 sec
Looks like the improvement comes from interleaving: e.g., "#pragma clang loop interleave_count(2)" by itself improves the performance, even though no actual vector instructions are involved.
I don't think we have a cost model for that?
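For anyone who wants to reproduce the interleave-only observation, here is a minimal sketch of what that variant might look like. The function names vec_interleaved and vec_plain are made up for illustration and are not from the report; the loop body is the same strided loop as above. Only the interleave_count(2) hint is applied, so whether the vectorizer also chooses to vectorize is still left to the cost model.

```c
/* Hypothetical name: same strided loop as in the report, but only an
   interleave hint is given (no vectorize(enable)). */
int vec_interleaved(int *restrict A, int *restrict B, int *restrict C, int N) {
    int sum = 0;
#pragma clang loop interleave_count(2)
    for (int i = 0; i < N; i += 4) {
        C[i] = A[i] + B[i];
        sum += C[i];
    }
    return sum;
}

/* Hypothetical baseline without any loop pragma, for timing comparison. */
int vec_plain(int *restrict A, int *restrict B, int *restrict C, int N) {
    int sum = 0;
    for (int i = 0; i < N; i += 4) {
        C[i] = A[i] + B[i];
        sum += C[i];
    }
    return sum;
}
```

Both functions should return the same result; only the generated code differs. Swapping one or the other into the driver loop in main above should show whether interleaving alone accounts for the difference. Building with clang's -Rpass=loop-vectorize remark flag should report the vectorization and interleave factors that were actually chosen.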