Closed jepio closed 9 years ago
Hi, thanks for the report. The example is updated.
Best regards,
Software Engineer Intel Compiler Team Intel Corp.
23.11.2014 18:39, jepio пишет:
After taking a look at the Matrix Multiplication gist linked from the github pages (link https://gist.github.com/clang-omp/89efe59c6f79ae2bc204) I was puzzled for a moment when I came to the parallel implementation:
pragma omp parallel for collapse(2)
for (i =0; i < N; ++i) { for (j =0; j < N; ++j) { int Sum =0;
pragma omp parallel for
for (k =0; k < N; ++k) { Sum += A[i][k] \* B[k][j]; } C1[i][j] = Sum; }
}
I asked myself: what is a |#pragma omp parallel for| doing inside another parallel for ? After a while I understood that the point is to tell the omp parser that this is a for loop (privatize loop variable), but a side effect is that each time a thread arrives at that line it tries to start a new team of threads (which doesn't work and only introduces overhead). Since nested parallelism is disabled by default and even if it weren't it would probably not be good for performance, I think this example should be changed:
- remove the pragma at line 40
- add |private(k)| to the pragma at line 36 <-- /this is effectively what the second pragma is currently doing/
The real consideration here is performance. I have tested the current version and the new one with clang-omp on a 16 core machine. For a matrix size of 1000x1000 I consistently see the following results:
Mode Time (s) Serial 18.70 Parallel (with pragma) 1.45 Parallel (with private) 1.20
A second reason for the change is simply to encourage good style.
— Reply to this email directly or view it on GitHub https://github.com/clang-omp/clang/issues/53.
After taking a look at the Matrix Multiplication gist linked from the github pages (link) I was puzzled for a moment when I came to the parallel implementation:
I asked myself: what is a
#pragma omp parallel for
doing inside another parallel for ? After a while I understood that the point is to tell the omp parser that this is a for loop (privatize loop variable), but a side effect is that each time a thread arrives at that line it tries to start a new team of threads (which doesn't work and only introduces overhead). Since nested parallelism is disabled by default and even if it weren't it would probably not be good for performance, I think this example should be changed:private(k)
to the pragma at line 36 <-- this is effectively what the second pragma is currently doingThe real consideration here is performance. I have tested the current version and the new one with clang-omp on a 16 core machine. For a matrix size of 1000x1000 I consistently see the following results:
A second reason for the change is simply to encourage good style.