RRZE-HPC / stempel

Stencil TEMPlate Engineering Library
GNU Affero General Public License v3.0

Likwid calls in compilable file #17

Closed sguera closed 7 years ago

sguera commented 7 years ago

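Replace the LIKWID function calls with the corresponding macros, i.e. replace

  likwid_markerStartRegion("Sweep");

with

  LIKWID_MARKER_START("Sweep");

This allows the code to be compiled without LIKWID being available, because the macros expand to nothing when LIKWID_PERFMON is not defined.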

TomTheBear commented 7 years ago

It is best to put an #ifdef around each LIKWID call and around the include of the LIKWID header:

#ifdef LIKWID_PERFMON
#include <likwid.h>
#endif
[...]
#ifdef LIKWID_PERFMON
LIKWID_MARKER_START("Sweep");
#endif
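
An alternative to guarding every call site is to guard only the include and define empty fallback macros. The following is a minimal sketch of that pattern, not the code from the stempel repository. The `sweep` function is a hypothetical stand-in for the measured kernel, and the header name follows the `<likwid.h>` used in this thread (depending on the LIKWID version, the marker macros may live in a different header).

```c
#ifdef LIKWID_PERFMON
#include <likwid.h>            /* real Marker API macros */
#else
/* LIKWID not requested: every marker macro expands to nothing. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical stand-in for the measured kernel: sums 0 .. n-1. */
static int sweep(int n)
{
    int sum = 0;

    LIKWID_MARKER_START("Sweep");   /* no #ifdef needed at the call site */
    for (int i = 0; i < n; ++i)
        sum += i;
    LIKWID_MARKER_STOP("Sweep");

    return sum;
}
```

Built without `-DLIKWID_PERFMON`, this compiles on a machine that has no LIKWID installation at all. The trade-off is that the stub list has to be kept in sync with the macros actually used in the code.
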
sguera commented 7 years ago

Thanks @TomTheBear, I'll do it that way.

sguera commented 7 years ago

@TomTheBear Does it need to be this way (with the omp parallel)?

#ifdef LIKWID_PERFMON
  #pragma omp parallel
  {
    LIKWID_MARKER_START("Sweep");
  }
#endif
TomTheBear commented 7 years ago

Yes, if the call would otherwise sit outside of a parallel region. Placing the calls outside of the loop has the advantage of less overhead, but there are also disadvantages: no overflow detection, no possibility to switch groups at runtime and, of course, no call count detection. As a rule, INIT and CLOSE belong in serial regions; THREADINIT, START, STOP, GET and SWITCH belong inside parallel regions.
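
The placement rule can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the stempel code: `run_measurement` is a hypothetical driver, the loop body is a placeholder for `kernel_loop(a, b, W)`, and the fallback macros are only there so the sketch builds without LIKWID.

```c
#ifdef LIKWID_PERFMON
#include <likwid.h>
#else
/* Fallback stubs so the sketch builds without LIKWID. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical driver; returns the number of sweeps performed. */
static int run_measurement(int sweeps)
{
    int done = 0;

    LIKWID_MARKER_INIT;               /* serial region */

    #pragma omp parallel
    {
        LIKWID_MARKER_THREADINIT;     /* once per thread */
        LIKWID_MARKER_START("Sweep"); /* every thread starts the region */
    }

    for (int n = 0; n < sweeps; ++n)
        ++done;                       /* placeholder for kernel_loop(a, b, W) */

    #pragma omp parallel
    {
        LIKWID_MARKER_STOP("Sweep");  /* every thread stops the region */
    }

    LIKWID_MARKER_CLOSE;              /* serial region */
    return done;
}
```

Without OpenMP enabled the pragmas are ignored and the sketch runs serially, which is enough to check that the structure compiles.
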

sguera commented 7 years ago

fixed in https://github.com/RRZE-HPC/stempel/commit/f89935a782fea93a46eb433781af16e487adac43

sguera commented 7 years ago

To be discussed: is it preferable to have code that looks like this:

  #ifdef LIKWID_PERFMON
  #pragma omp parallel
  {
    LIKWID_MARKER_START("Sweep");
  }
  #endif

  while (runtime < 0.5)
  {
    timing(&wct_start, &cput_start);
    for (int n = 0; n < repeat; ++n)
    {
      kernel_loop(a, b, W);
      /* swap the input and output grids */
      tmp = a;
      a = b;
      b = tmp;
    }

    timing(&wct_end, &cput_end);
    runtime = wct_end - wct_start;
    repeat *= 2;
  }

  #ifdef LIKWID_PERFMON
  #pragma omp parallel
  {
    LIKWID_MARKER_STOP("Sweep");
  }
  #endif

or like this:

  while (runtime < 0.5)
  {
    timing(&wct_start, &cput_start);
    for (int n = 0; n < repeat; ++n)
    {
      #pragma omp parallel
      {
        #ifdef LIKWID_PERFMON
        LIKWID_MARKER_START("Sweep");
        #endif

        kernel_loop(a, b, W);

        #ifdef LIKWID_PERFMON
        LIKWID_MARKER_STOP("Sweep");
        #endif
      }
      /* swap the input and output grids */
      tmp = a;
      a = b;
      b = tmp;
    }

    timing(&wct_end, &cput_end);
    runtime = wct_end - wct_start;
    repeat *= 2;
  }
TomTheBear commented 7 years ago

Since you have the while loop that limits the runtime to 0.5 seconds or a single call of kernel_loop, it is probably the better choice to put the calls outside of the while loop.

Another version with one region per repeat value would be (just for discussion):

  char rname[100];
  while (runtime < 0.5)
  {
    /* one region name per repeat value */
    snprintf(rname, sizeof(rname), "Sweep_%d_repeats", repeat);
    #ifdef LIKWID_PERFMON
    #pragma omp parallel
    {
      LIKWID_MARKER_START(rname);
    }
    #endif
    timing(&wct_start, &cput_start);
    for (int n = 0; n < repeat; ++n)
    {
      #pragma omp parallel
      {
        kernel_loop(a, b, W);
      }
      /* swap the input and output grids */
      tmp = a;
      a = b;
      b = tmp;
    }

    timing(&wct_end, &cput_end);
    #ifdef LIKWID_PERFMON
    #pragma omp parallel
    {
      LIKWID_MARKER_STOP(rname);
    }
    #endif
    runtime = wct_end - wct_start;
    repeat *= 2;
  }
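
The per-repeat region name from the variant above can be built in a small helper that also reports truncation. This is only a sketch: `make_region_name` is a hypothetical helper, not part of LIKWID or stempel.

```c
#include <stdio.h>

/* Hypothetical helper: writes "Sweep_<repeat>_repeats" into buf.
 * Returns 0 on success, -1 if the name did not fit into len bytes. */
static int make_region_name(char *buf, size_t len, int repeat)
{
    int n = snprintf(buf, len, "Sweep_%d_repeats", repeat);

    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

Checking the return value matters here because a truncated name could make two different repeat values share one region.
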
sguera commented 7 years ago

If I put it inside the while loop, but without your approach of using several names (the char array plus snprintf), the values would be overwritten every time, so the only values I would get are those of the run with runtime > 0.5, right? In that case it would be fine.

I do not think it is correct to get the counters from all the runs, which is what should happen if I leave the LIKWID_MARKER_START("Sweep"); outside the while loop. Am I wrong?

TomTheBear commented 7 years ago

If you don't change the names, the calls are accumulated, not overwritten. So you get the summed-up values of all runs until the runtime is > 0.5.

Why would it be incorrect to get the counters of all runs? As long as you don't change the sizes and/or the algorithm, there is no difference between the runs except the runtime of each region call.
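
The accumulation behaviour can be pictured with a toy model. This is not LIKWID code and not its internal data structure: `struct region`, `region_record` and `simulate` are hypothetical, and the cost of 1000 events per sweep is an arbitrary assumption chosen to make the sums easy to follow.

```c
#include <string.h>

/* Toy model of one named marker region; NOT LIKWID's internal layout. */
struct region {
    char name[32];
    long total_events;  /* counter value summed over all start/stop pairs */
    int  call_count;    /* number of start/stop pairs seen */
};

/* Record one start/stop pair: values are added, never overwritten. */
static void region_record(struct region *r, long events)
{
    r->total_events += events;
    r->call_count   += 1;
}

/* Simulate the doubling benchmark loop with one fixed region name. */
static struct region simulate(int rounds)
{
    struct region r;
    int repeat = 1;

    memset(&r, 0, sizeof r);
    strcpy(r.name, "Sweep");

    for (int i = 0; i < rounds; ++i) {
        /* assumption: every sweep produces 1000 events */
        region_record(&r, 1000L * repeat);
        repeat *= 2;
    }
    return r;
}
```

After three rounds with repeat values 1, 2 and 4, the model holds a call count of 3 and 7000 events, i.e. the sum over all rounds rather than the value of the last one.
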

sguera commented 7 years ago

Yes, I know there is no difference, but then you would get the runtime of one run (the longest one, where runtime > 0.5) together with cumulative counter values. In addition, the number of repetitions and the statistics refer to a single run. That was my only "fear". Anyway, I'll keep it outside for now. Thanks for contributing.
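
To put numbers on this concern, here is a small sketch under the assumption that `repeat` starts at `first` and doubles after every round, as in the loops above. Both function names are hypothetical. It compares the number of sweeps covered by the last timed round with the number covered by a marker region that spans all rounds.

```c
/* Sweeps executed in the final round, assuming repeat starts at
 * `first` and doubles after every round. */
static long last_round_sweeps(long first, int rounds)
{
    long repeat = first;

    for (int i = 1; i < rounds; ++i)
        repeat *= 2;
    return repeat;
}

/* Sweeps executed over all rounds together. */
static long total_sweeps(long first, int rounds)
{
    long repeat = first;
    long total  = 0;

    for (int i = 0; i < rounds; ++i) {
        total  += repeat;
        repeat *= 2;
    }
    return total;
}
```

With `first = 1` and four rounds, the last round runs 8 sweeps while the whole loop runs 15. Cumulative counters would therefore have to be normalised by the total number of sweeps, not by the final `repeat` value, before being compared with per-sweep figures derived from the last round's runtime.
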