clang-omp / clang

clang with OpenMP 3.1 and some elements of OpenMP 4.0 support
clang-omp.github.com

pointer loads from scalar values not hoisted out of loops #30

Closed hfinkel closed 10 years ago

hfinkel commented 10 years ago

Compiling this with clang-omp:

    void tuned_STREAM_Scale(STREAM_TYPE scalar)
    {
        ssize_t j;

    #pragma omp parallel for
        for (j=0; j<STREAM_ARRAY_SIZE; j++)
            b[j] = scalar*c[j];
    }

results in IR that looks like this for the main loop:

omp.lb_ub.check_pass:                             ; preds = %omp.lb.le.global_ub.
  %17 = load double* %ref3, align 8, !tbaa !6
  %18 = load i64* %j.private., align 8, !tbaa !8
  %arrayidx = getelementptr inbounds [10000000 x double]* @c, i32 0, i64 %18
  %19 = load double* %arrayidx, align 8, !tbaa !6
  %mul5 = fmul double %17, %19
  %20 = load i64* %j.private., align 8, !tbaa !8
  %arrayidx6 = getelementptr inbounds [10000000 x double]* @b, i32 0, i64 %20
  store double %mul5, double* %arrayidx6, align 8, !tbaa !6
  br label %omp.cont.block

Please note that the load of the captured parameter, which corresponds to 'scalar' in the original source:

  %17 = load double* %ref3, align 8, !tbaa !6

is executed in every loop iteration. Just as with the other loads that needed hoisting in issue #27, this load also needs to be hoisted.

alexey-bataev commented 10 years ago

Hal, this is a common problem in code generated by clang. See the IR generated by clang without OpenMP:

for.body:                                         ; preds = %for.cond
  %1 = load double* %scalar.addr, align 8
  %2 = load i64* %j, align 8
  %arrayidx = getelementptr inbounds [10000000 x double]* @c, i32 0, i64 %2
  %3 = load double* %arrayidx, align 8
  %mul = fmul double %1, %3
  %4 = load i64* %j, align 8
  %arrayidx1 = getelementptr inbounds [10000000 x double]* @b, i32 0, i64 %4
  store double %mul, double* %arrayidx1, align 8
  br label %for.inc

Look at the first line after the for.body label: it is exactly the same load. We need an additional optimization pass that hoists invariants out of the loop body. This should be done in the backend, not in the frontend.
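
For illustration only, here is a source-level sketch of the transformation being asked for. The function names `scale_reload` and `scale_hoisted` and the `scalar_addr` parameter are hypothetical and do not appear in STREAM or in clang. The first function reads the scalar through its address on every iteration, as the IR above does; the second performs that load once before the loop. The two are equivalent only if no store in the loop can modify the scalar, which is what the optimizer has to prove.

```c
#include <assert.h>
#include <stddef.h>

/* Unhoisted form: the scalar is re-read through its address on
   every iteration, mirroring the load at the top of the loop body. */
void scale_reload(double *dst, const double *src, size_t n,
                  const double *scalar_addr)
{
    assert(dst && src && scalar_addr);
    for (size_t j = 0; j < n; j++)
        dst[j] = *scalar_addr * src[j];
}

/* Hoisted form: the loop-invariant load is done once, before the loop.
   Assumption: dst does not overlap *scalar_addr. */
void scale_hoisted(double *dst, const double *src, size_t n,
                   const double *scalar_addr)
{
    assert(dst && src && scalar_addr);
    const double s = *scalar_addr;   /* single load, outside the loop */
    for (size_t j = 0; j < n; j++)
        dst[j] = s * src[j];
}
```

Calling both with the same inputs, for example `scale_reload(a, src, n, &k)` and `scale_hoisted(b, src, n, &k)`, should fill `a` and `b` with identical values.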

hfinkel commented 10 years ago

Okay; I think that I see what you mean. Generally, Clang will generate a local alloca to hold a local variable, and load/store to that local stack space on sequence-point boundaries. When you generate the outlined OpenMP regions, you simply transport the pointer to the original alloca through the dispatch interface (thus, "capturing" the variable). That being the case, I'm afraid that I agree with your analysis ;)
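
A rough sketch of the capture mechanism described above, written as plain C. Everything here is an assumption made for illustration: `captured_vars`, `outlined_scale`, `dispatch_chunks`, and `scale_outlined` are hypothetical names, the chunk size is arbitrary, and the serial dispatcher merely stands in for the OpenMP runtime rather than reproducing its real entry points. The point is only that the outlined body receives the address of the caller's local and reads through it inside the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capture record: one pointer per captured variable. */
struct captured_vars {
    const double *scalar;  /* address of the caller's local stack slot */
    const double *c;
    double *b;
};

/* Hypothetical outlined loop body for iterations [lb, ub). */
static void outlined_scale(void *ctx, size_t lb, size_t ub)
{
    const struct captured_vars *cap = ctx;
    for (size_t j = lb; j < ub; j++)
        cap->b[j] = *cap->scalar * cap->c[j];  /* load via captured pointer */
}

/* Serial stand-in for a runtime dispatch: hands out fixed-size chunks. */
static void dispatch_chunks(void (*fn)(void *, size_t, size_t),
                            void *ctx, size_t n, size_t chunk)
{
    assert(fn && chunk > 0);
    for (size_t lb = 0; lb < n; lb += chunk) {
        size_t ub = (n - lb < chunk) ? n : lb + chunk;
        fn(ctx, lb, ub);
    }
}

/* Caller: 'scalar' lives in local stack space; only its address is passed on. */
void scale_outlined(double *b, const double *c, size_t n, double scalar)
{
    struct captured_vars cap = { &scalar, c, b };
    dispatch_chunks(outlined_scale, &cap, n, 4);
}
```

In this sketch the read of `*cap->scalar` sits inside the loop, just like the load from `%ref3` in the IR; copying it into a local before the loop would be the hoisted form.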