Closed martin-ueding closed 7 years ago
Hi Martin, I think the issue here was that I ended up dead-ending the automatic tuning of BLAS threads. The idea was that it may be useful to have a different number of threads in the BLAS than in the Dslash (where we fixed everything with the sy and sz parameters). But in the initial phases, I got lots of segfaults from the BLAS. So when moving to KNL we had to rework some of that, and I think I got rid of the concept of having a separate number of BLAS threads from Dslash threads.
Best, B
On Apr 9, 2017, at 6:28 AM, Martin Ueding notifications@github.com wrote:
While looking through the code, Peter found that the tuning in the CG is disabled:
void tune()
{
  int iters = 100;
#if 0
  tuneCopyThreads(iters);
  tuneAypxThreads(iters);
  tuneNorm2Threads(iters);
  tuneXMYNorm2Threads(iters);
  tuneRXUpdateThreads(iters);
  reportTuning();
#endif
}
Looking at git blame, I found that most of the code has been copied from the predecessor, cpp_wilson_dslash. However, the #if 1 has been replaced with #if 0 in commit 939060e97df2d12e39dc87970a2dddfd44fa02c3. Also, the number of threads is now taken from the -sy and -sz command line arguments via the Geometry object.
To me it seems better to have this auto-tuned for the individual kernels instead of using a fixed number of threads for all kernels. Was the auto-tuning not worth the effort? Or is using a fixed number of threads better for performance?
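For illustration, per-kernel tuning of this kind usually amounts to timing each kernel for a few candidate thread counts and keeping the fastest one. The following is only a minimal sketch under that assumption: `tuneThreads`, its parameters, and the candidate list are hypothetical names made up for this example, not the functions from the code quoted above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical helper (not taken from the code discussed in this thread):
// runs `kernel(threads)` `iters` times for every candidate thread count
// and returns the candidate with the smallest total wall-clock time.
int tuneThreads(const std::function<void(int)> &kernel,
                const std::vector<int> &candidates,
                int iters)
{
  assert(!candidates.empty());

  int best_threads = candidates.front();
  double best_time = std::numeric_limits<double>::max();

  for (int threads : candidates) {
    // One untimed call so that first-touch effects do not skew the result.
    kernel(threads);

    auto const start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
      kernel(threads);
    }
    std::chrono::duration<double> const elapsed =
        std::chrono::steady_clock::now() - start;

    if (elapsed.count() < best_time) {
      best_time = elapsed.count();
      best_threads = threads;
    }
  }

  return best_threads;
}
```

Each BLAS-like kernel would get its own call, for example `tuneThreads(my_kernel, {1, 2, 4, 8}, 100)`, where `my_kernel` is a placeholder for a callable that runs one kernel with the given number of threads. Whether the extra tuning time pays off compared to one fixed thread count is exactly the open question here.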