Quuxplusone / LLVMBugzillaTest

Missing vectorization for step 4 #37303

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR38330
Status NEW
Importance P enhancement
Reported by David Bolvansky (david.bolvansky@gmail.com)
Reported on 2018-07-26 05:01:37 -0700
Last modified on 2019-10-29 08:50:42 -0700
Version trunk
Hardware PC Linux
CC craig.topper@gmail.com, efriedma@quicinc.com, florian_hahn@apple.com, hfinkel@anl.gov, hideki.saito@intel.com, listmail@philipreames.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
Hello,

Code:
int vec_step_four(int *restrict A, int *restrict B, int *restrict C, int N) {
    int sum = 0;
    for (int i = 0; i < N; i += 4) {
        C[i] = A[i] + B[i];
        sum += C[i];
    }

    return sum;
}

Clang -O3 currently cannot vectorize this loop.
https://godbolt.org/g/NSHVVL
Quuxplusone commented 6 years ago

LLVM can vectorize the loop; it just chooses not to on x86 without AVX-512. You'd probably have to benchmark to see whether the cost model is correct.
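To make the cost-model question concrete, here is a minimal sketch of the work a 4-wide vectorization of this loop would have to do. It is an illustration in plain C, not LLVM's actual output: the function names `step_four_scalar` and `step_four_lanes` are hypothetical, and the "lanes" are emulated with small arrays. The assumption behind it is that stride-4 accesses cannot use contiguous vector loads and stores, so each vector iteration has to gather four elements from A and B and scatter four results into C (or fall back to scalar memory operations). On x86, scatter-store instructions only arrive with AVX-512, which is one plausible reason the cost model rejects vectorization on older targets.

```c
/* Scalar reference: the loop from the report. */
int step_four_scalar(const int *A, const int *B, int *C, int N) {
    int sum = 0;
    for (int i = 0; i < N; i += 4) {
        C[i] = A[i] + B[i];
        sum += C[i];
    }
    return sum;
}

/* Lane-by-lane emulation of a hypothetical 4-wide vectorized version.
 * Each outer iteration covers four original iterations, i.e. the
 * indices i, i+4, i+8 and i+12. */
int step_four_lanes(const int *A, const int *B, int *C, int N) {
    int acc[4] = {0, 0, 0, 0};   /* one partial sum per lane */
    int i = 0;

    /* Main loop: runs only while all four strided indices are in range. */
    for (; i + 12 < N; i += 16) {
        int a[4], b[4];
        for (int l = 0; l < 4; ++l) {   /* "gather" from A and B */
            a[l] = A[i + 4 * l];
            b[l] = B[i + 4 * l];
        }
        for (int l = 0; l < 4; ++l) {
            int c = a[l] + b[l];        /* element-wise add */
            C[i + 4 * l] = c;           /* "scatter" into C */
            acc[l] += c;                /* per-lane reduction */
        }
    }

    /* Horizontal reduction of the four lanes. */
    int sum = acc[0] + acc[1] + acc[2] + acc[3];

    /* Scalar remainder loop for the leftover iterations. */
    for (; i < N; i += 4) {
        C[i] = A[i] + B[i];
        sum += C[i];
    }
    return sum;
}
```

Both functions return the same sum and write the same elements of C as long as the additions do not overflow. The sketch only shows the shape of the memory traffic; whether it beats the scalar loop on a given CPU is exactly what would need benchmarking.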

Quuxplusone commented 6 years ago
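#include <stdlib.h>  /* atoi */

int vec(int *restrict A, int *restrict B, int *restrict C, int N) {
    int sum = 0;
    /* Ask the loop vectorizer to vectorize regardless of its profitability estimate. */
    #pragma clang loop vectorize(enable)
    for (int i = 0; i < N; i += 4) {
        C[i] = A[i] + B[i];
        sum += C[i];
    }

    return sum;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    /* Zero-initialized so the kernel never reads indeterminate values. */
    int A[1024] = {0};
    int B[1024] = {0};
    int C[1024] = {0};
    int s = 0;
    /* argv[1] is the number of times the kernel is repeated. */
    for (int i = 0; i < atoi(argv[1]); ++i) {
        s += vec(A, B, C, 1024);
    }
    return s;
}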

PC (Intel Core i7 4720HQ):
With pragma - 6.4 sec
Without pragma - 8.3 sec
Quuxplusone commented 6 years ago

Looks like the improvement comes from interleaving; e.g., "#pragma clang loop interleave_count(2)" by itself improves the performance even though there aren't any actual vector instructions involved.

I don't think we have a cost model for that?
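For reference, a rough sketch of what interleaving by two means for this loop, assuming it behaves like unrolling by two with independent accumulators. The function names `step_four_pragma` and `step_four_interleaved2` are hypothetical, and the hand-written version is only an approximation of what the compiler might emit, not its actual output. The first function uses the pragma quoted above; the second spells out the transformation in scalar code.

```c
/* The loop from the report, with only the interleave hint applied.
 * Compilers other than Clang may ignore or warn about the pragma. */
int step_four_pragma(int *restrict A, int *restrict B, int *restrict C, int N) {
    int sum = 0;
    #pragma clang loop interleave_count(2)
    for (int i = 0; i < N; i += 4) {
        C[i] = A[i] + B[i];
        sum += C[i];
    }
    return sum;
}

/* Hand-written approximation of interleaving by two: two original
 * iterations per trip, each with its own accumulator, so the two
 * add chains do not depend on each other. No vector instructions. */
int step_four_interleaved2(int *restrict A, int *restrict B, int *restrict C, int N) {
    int sum0 = 0, sum1 = 0;
    int i = 0;

    /* Main loop: runs only while both i and i + 4 are in range. */
    for (; i + 4 < N; i += 8) {
        int c0 = A[i] + B[i];
        int c1 = A[i + 4] + B[i + 4];
        C[i] = c0;
        C[i + 4] = c1;
        sum0 += c0;
        sum1 += c1;
    }

    /* Combine the partial sums, then finish any leftover iteration. */
    int sum = sum0 + sum1;
    for (; i < N; i += 4) {
        C[i] = A[i] + B[i];
        sum += C[i];
    }
    return sum;
}
```

Splitting the reduction into `sum0` and `sum1` shortens the loop-carried dependency chain, which may be where the speedup comes from. The two functions agree as long as the additions do not overflow.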