JeffersonLab / qphix

QCD for Intel Xeon Phi and Xeon processors
http://jeffersonlab.github.io/qphix/

Tuning within CG is disabled. Why? #29

Closed. martin-ueding closed this issue 7 years ago

martin-ueding commented 7 years ago

While looking through the code, Peter found that the tuning in the CG is disabled:

    void tune()
    {
      int iters=100;
#if 0
      tuneCopyThreads(iters);
      tuneAypxThreads(iters);
      tuneNorm2Threads(iters);
      tuneXMYNorm2Threads(iters);
      tuneRXUpdateThreads(iters);

      reportTuning();
#endif
    }

Looking at git blame, I found that most of the code has been copied from the predecessor cpp_wilson_dslash. However, the #if 1 was replaced with #if 0 in commit 939060e97df2d12e39dc87970a2dddfd44fa02c3. Also, the number of threads is now taken from the -sy and -sz command line arguments via the Geometry object.

To me it seems better to have this auto-tuned for the individual kernels instead of using a fixed number of threads for all kernels. Was the auto-tuning not worth the effort? Or is using a fixed number of threads better for performance?
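To make the question concrete, here is a minimal sketch of what per-kernel tuning could look like. It is not the QPhiX implementation: `tuneThreads`, its parameters, and the idea of passing the kernel as a callable are all assumptions made for illustration. It times a kernel for each candidate thread count and returns the fastest one.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical helper, not part of QPhiX: run `kernel` with each candidate
// thread count, time `iters` repetitions, and return the fastest count.
int tuneThreads(const std::function<void(int)>& kernel,
                const std::vector<int>& candidates,
                int iters)
{
  assert(!candidates.empty() && iters > 0);

  int best = candidates.front();
  double best_time = std::numeric_limits<double>::max();

  for (int n_threads : candidates) {
    kernel(n_threads); // warm-up call, not timed

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
      kernel(n_threads);
    }
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    // Keep the candidate with the smallest total time.
    if (elapsed.count() < best_time) {
      best_time = elapsed.count();
      best = n_threads;
    }
  }
  return best;
}
```

Each BLAS-like kernel (copy, aypx, norm2, ...) would get its own call and its own stored result, as opposed to one fixed count shared by all kernels.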

bjoo commented 7 years ago

Hi Martin, I think the issue here was that I ended up dead-ending the automatic tuning of BLAS threads. The idea was that it might be useful to have a different number of threads in the BLAS than in the Dslash (where we fixed everything with the -sy and -sz parameters). But in the initial phases, I got lots of segfaults from the BLAS. So when moving to KNL we had to rework some of that, and I think I got rid of the concept of having a separate number of BLAS threads from Dslash threads.

Best, B


Dr Balint Joo High Performance Computational Scientist Jefferson Lab 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA Tel: +1-757-269-5339, Fax: +1-757-269-5427 email: bjoo@jlab.org