clang-omp / clang

clang with OpenMP 3.1 and some elements of OpenMP 4.0 support
clang-omp.github.com

Improve(/correct) Matrix Multiplication example. #53

Closed jepio closed 9 years ago

jepio commented 9 years ago

After taking a look at the Matrix Multiplication gist linked from the GitHub pages (https://gist.github.com/clang-omp/89efe59c6f79ae2bc204) I was puzzled for a moment when I came to the parallel implementation:

#pragma omp parallel for collapse(2)
for (i = 0; i < N; ++i) {
    for (j = 0; j < N; ++j) {
        int Sum = 0;
#pragma omp parallel for
        for (k = 0; k < N; ++k) {
            Sum += A[i][k] * B[k][j];
        }
        C1[i][j] = Sum;
    }
} 

I asked myself: what is a #pragma omp parallel for doing inside another parallel for? After a while I understood that the point is to tell the OpenMP parser that this is a for loop (and to privatize the loop variable), but a side effect is that each time a thread reaches that line it tries to start a new team of threads, which does not work and only introduces overhead. Since nested parallelism is disabled by default, and even if it weren't it would probably hurt performance, I think this example should be changed:

  • remove the pragma at line 40
  • add private(k) to the pragma at line 36 (this is effectively what the second pragma is currently doing)

The real consideration here is performance. I have tested the current version and the new one with clang-omp on a 16 core machine. For a matrix size of 1000x1000 I consistently see the following results:

Mode                     Time (s)
Serial                      18.70
Parallel (with pragma)       1.45
Parallel (with private)      1.20

A second reason for the change is simply to encourage good style.

alexey-bataev commented 9 years ago

Hi, thanks for the report. The example is updated.

Best regards,

Alexey Bataev

Software Engineer, Intel Compiler Team, Intel Corp.
