JeffersonLab / qphix

QCD for Intel Xeon Phi and Xeon processors
http://jeffersonlab.github.io/qphix/

Code duplication #40

Closed martin-ueding closed 7 years ago

martin-ueding commented 7 years ago

In the Dslash classes, there are pairs of functions for the plus and minus case. This sign is the isign parameter, which states whether the Dslash or its Hermitian conjugate is to be applied. These pairs look like this:

void NDTMDyzPlus(int tid,
                 const FourSpinorBlock *psi,
                 FourSpinorBlock *res,
                 const SU3MatrixBlock *u,
                 double mu,
                 double mu_inv,
                 int cb);

void NDTMDyzMinus(int tid,
                  const FourSpinorBlock *psi,
                  FourSpinorBlock *res,
                  const SU3MatrixBlock *u,
                  double mu,
                  double mu_inv,
                  int cb);

Looking through both implementations with diff, I found that they differ only in a single line, namely in the kernel they call. In the NDTM case, the caller looked like this:

template <typename FT, int veclen, int soalen, bool compress12>
void TMClovDslash<FT, veclen, soalen, compress12>::dslash(
    FourSpinorBlock *res,
    const FourSpinorBlock *psi,
    const SU3MatrixBlock *u,
    const FullCloverBlock *clov[2],
    int isign,
    int cb)
{
    if (isign == 1) {
        DPsiPlus(u, clov[0], psi, res, cb);
    }

    if (isign == -1) {
        DPsiMinus(u, clov[1], psi, res, cb);
    }
}

Now it just passes the isign parameter onward and chooses the right part of the clover term:

template <typename FT, int veclen, int soalen, bool compress12>
void NDTMClovDslash<FT, veclen, soalen, compress12>::dslash(
    FourSpinorBlock *res,
    const FourSpinorBlock *psi,
    const SU3MatrixBlock *u,
    const FullCloverBlock *invclov[2],
    int isign,
    int cb)
{
    DPsi(u, invclov[isign == 1 ? 0 : 1], psi, res, cb, isign);
}

The functions DPsiPlus and DPsiMinus have been unified to a single DPsi function where the kernel to use is chosen depending on isign:

auto kernel =
    (isign == 1 ? ndtm_clov_dslash_plus_vec<FT, veclen, soalen, compress12>
                : ndtm_clov_dslash_minus_vec<FT, veclen, soalen, compress12>);

kernel(xyBase + X, zbBase + X, ..., forw_t_coeff_T, back_t_coeff_T);

This way several thousand lines of replicated code can be removed. There should not really be any change in performance, because the branching has to happen at some point anyway: the user calls the dslash function, which in turn calls the dependent functions.
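
As an illustration, the NDTMDyzPlus/NDTMDyzMinus pair from the beginning could then collapse into a single declaration that takes the sign as an argument. This is only a sketch of the signature, not the actual code in the branch:

template <typename FT, int veclen, int soalen, bool compress12>
void NDTMDyz(int tid,
             const FourSpinorBlock *psi,
             FourSpinorBlock *res,
             const SU3MatrixBlock *u,
             double mu,
             double mu_inv,
             bool is_plus, // replaces the Plus/Minus function pair
             int cb);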


Further, there seems to be a large overlap between the following:

martin-ueding commented 7 years ago

For the Wilson dslash, I have removed 800 lines of code which were plainly redundant. I started with the DPsiPlus and DPsiMinus functions, which are pretty much the same:

[image: dpsiplus]

Then the same can be done for DyzPlus and DyzMinus:

[image: dyzplus]

Then I did the same for the “A chi - b D psi” functions. The same could easily be done for Wilson clover, TM and TM clover as well. In total that would get rid of 3200 lines of code.

But one does not have to stop there. The functions DPsi and DPsiAChiMinusBDPsi show a significant similarity:

[image: dpsi]

There are some caveats, like the number of arguments passed. Also the Dyz and DyzAChiMinusBDPsi functions share a lot of code, though they also differ:

[image: dyz]

There one would have to refactor the common code and isolate the differing parts, perhaps into a function object. That would make it possible to re-use more of the common code, even across the various Dslash variants.
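
To make the function-object idea concrete, here is a small stand-alone sketch (not QPhiX code; all names are invented for illustration): the traversal is written once and the per-site work is passed in as a callable, so two loop variants can share the same body.

#include <cstddef>

// Illustration only: the loop is written once, the variant-specific
// work is injected as a callable.
template <typename SiteKernel>
void traverse_sites(float *res, const float *psi, std::size_t n_sites,
                    SiteKernel &&kernel)
{
  for (std::size_t i = 0; i < n_sites; ++i) {
    kernel(res + i, psi + i); // variant-specific part
  }
}

// Two "variants" that differ only in the kernel they apply.
void apply_plus(float *res, const float *psi, std::size_t n)
{
  traverse_sites(res, psi, n, [](float *r, const float *p) { *r = *p + 1.0f; });
}

void apply_minus(float *res, const float *psi, std::size_t n)
{
  traverse_sites(res, psi, n, [](float *r, const float *p) { *r = *p - 1.0f; });
}

In QPhiX the callable would carry the variant-specific kernel and its extra arguments, while the index arithmetic, prefetching and blocking would live in the shared traversal.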

martin-ueding commented 7 years ago

I have removed the duplicated code from all four variants, eliminating over 3000 lines of code. The clover tests fail, so I need to fix that. The other, clover-free variants seem to be just fine.

martin-ueding commented 7 years ago

It was just a copy-paste mistake; all non-MPI tests work now. I'll have to test that on JURECA and Marconi eventually to see whether MPI works correctly in a realistic environment.

kostrzewa commented 7 years ago

The functions DPsiPlus and DPsiMinus have been unified to a single DPsi function where the kernel to use is chosen depending on isign:

Can you make sure that this really doesn't affect performance in a solver? It seems like there will be a lot of instantiation going on, or am I wrong?

martin-ueding commented 7 years ago

Remaining duplication

D psi — A chi - b D psi

There is still a lot of common code between each dslash and dslashAChiMinusBDPsi implementation. Those differ only slightly. The most important difference is the additional spinor argument.
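
One way to merge them, sketched here purely as an illustration (this is not the current API), would be to pass the extra spinor and its coefficients as optional arguments and to select the plain D psi path when the extra spinor is null:

template <typename FT, int veclen, int soalen, bool compress12>
void dslash_common(FourSpinorBlock *res,
                   const FourSpinorBlock *psi,
                   const SU3MatrixBlock *u,
                   int isign,
                   int cb,
                   const FourSpinorBlock *chi = nullptr, // only used for A chi - b D psi
                   double alpha = 0.0,
                   double beta = 0.0);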

Wilson -- TM Wilson

There are a few differences between the DPsi functions of Wilson and TM Wilson. They are largely similar, except that one polls a receive queue while the other uses a static communication scheme.

@@ -1,10 +1,11 @@

 template <typename FT, int veclen, int soalen, bool compress12>
-void Dslash<FT, veclen, soalen, compress12>::DPsi(const SU3MatrixBlock *u,
-                                                  const FourSpinorBlock *psi_in,
-                                                  FourSpinorBlock *res_out,
-                                                  bool is_plus,
-                                                  int cb)
+void TMDslash<FT, veclen, soalen, compress12>::TMDPsi(
+    const SU3MatrixBlock *u,
+    const FourSpinorBlock *psi_in,
+    FourSpinorBlock *res_out,
+    bool const is_plus,
+    int cb)
 {
   double beta_s = -aniso_coeff_S;
   double beta_t_f = -aniso_coeff_T;
@@ -32,14 +33,12 @@
       comms->startSendDir(2 * d + 1);
       comms->startSendDir(2 * d + 0);
-      comms->recv_queue.push(2 * d + 1);
-      comms->recv_queue.push(2 * d + 0);
     }
   }
 #endif // QPHIX_DO_COMMS
@@ -48,24 +47,40 @@
 #pragma omp parallel
   {
     int tid = omp_get_thread_num();
-    Dyz(tid, psi_in, res_out, u, is_plus, cb);
+    TMDyz(tid, psi_in, res_out, u, is_plus, cb);
   }

 #ifdef QPHIX_DO_COMMS
-  while (!comms->recv_queue.empty()) {
-    int d = comms->recv_queue.front();
-    comms->recv_queue.pop();
-    if (comms->testSendToDir(d) && comms->testRecvFromDir(d)) {
+  for (int d = 3; d >= 0; d--) {
+    if (!comms->localDir(d)) {
+      comms->finishSendDir(2 * d + 1);
+      comms->finishRecvFromDir(2 * d + 0);
+      comms->finishSendDir(2 * d + 0);
+      comms->finishRecvFromDir(2 * d + 1);
+
 #pragma omp parallel
       {
         int tid = omp_get_thread_num();
-
-        double bet = (d / 2 == 3 ? (d % 2 == 0 ? beta_t_b : beta_t_f) : beta_s);
-        completeFaceDir(
-            tid, comms->recvFromDir[d], res_out, u, bet, cb, d / 2, d % 2, 1);
+        completeTMFaceDir(tid,
+                          comms->recvFromDir[2 * d + 0],
+                          res_out,
+                          u,
+                          (d == 3 ? beta_t_b : beta_s),
+                          cb,
+                          d,
+                          0,
+                          is_plus);
+        completeTMFaceDir(tid,
+                          comms->recvFromDir[2 * d + 1],
+                          res_out,
+                          u,
+                          (d == 3 ? beta_t_f : beta_s),
+                          cb,
+                          d,
+                          1,
+                          is_plus);
       }
-    } else
-      comms->recv_queue.push(d);
+    }
   }
 #endif // QPHIX_DO_COMMS
 }

The Dyz functions are very similar; the TM version additionally passes the twisted mass arguments to the kernel:

@@ -1,11 +1,11 @@

 template <typename FT, int veclen, int soalen, bool compress12>
-void Dslash<FT, veclen, soalen, compress12>::Dyz(int tid,
-                                                 const FourSpinorBlock *psi,
-                                                 FourSpinorBlock *res,
-                                                 const SU3MatrixBlock *u,
-                                                 bool is_plus,
-                                                 int cb)
+void TMDslash<FT, veclen, soalen, compress12>::TMDyz(int tid,
+                                                     const FourSpinorBlock *psi,
+                                                     FourSpinorBlock *res,
+                                                     const SU3MatrixBlock *u,
+                                                     bool const is_plus,
+                                                     int cb)
 {
   const int Nxh = s->Nxh();
   const int Nx = s->Nx();
@@ -28,7 +28,9 @@
   int smtid_z = smtid / Sy;
   int smtid_y = smtid - Sy * smtid_z;

+  // unsigned int accumulate[8] = { -1, -1, -1, -1, -1, -1, -1, -1 };
   unsigned int accumulate[8] = {~0U, ~0U, ~0U, ~0U, ~0U, ~0U, ~0U, ~0U};
+
   int nvecs = s->nVecs();

   const int gauge_line_in_floats =
@@ -56,11 +58,9 @@
   int soprefdist = 0;

 #if defined(__GNUG__) && !defined(__INTEL_COMPILER)
-  int *tmpspc __attribute__((aligned(QPHIX_LLC_CACHE_ALIGN))) =
-      &(tmpspc_all[veclen * 16 * tid]);
+  int *tmpspc __attribute__((aligned(64))) = &(tmpspc_all[veclen * 16 * tid]);
 #else
-  __declspec(align(QPHIX_LLC_CACHE_ALIGN)) int *tmpspc =
-      &(tmpspc_all[veclen * 16 * tid]);
+  __declspec(align(64)) int *tmpspc = &(tmpspc_all[veclen * 16 * tid]);
 #endif

   int *offs, *xbOffs, *xfOffs, *ybOffs, *yfOffs, *gOffs, *pfyOffs;
@@ -241,6 +241,7 @@
             accumulate[3] = (yi == Ny - nyg ? yfmask_yn : -1);
 #endif

+// using Cilk array notations:
 #ifdef QPHIX_USE_CEAN
             pfyOffs [0:(veclen + 1) / 2] = ybOffs [0:(veclen + 1) / 2];
             pfyOffs [(veclen + 1) / 2:veclen / 2] =
@@ -264,8 +265,8 @@
             FT back_t_coeff_T = rep<FT, double>(back_t_coeff);

             auto kernel =
-                (is_plus ? dslash_plus_vec<FT, veclen, soalen, compress12>
-                         : dslash_minus_vec<FT, veclen, soalen, compress12>);
+                (is_plus ? tm_dslash_plus_vec<FT, veclen, soalen, compress12>
+                         : tm_dslash_minus_vec<FT, veclen, soalen, compress12>);

             kernel(xyBase + X,
                    zbBase + X,
@@ -292,7 +293,9 @@
                    accumulate,
                    aniso_coeff_s_T,
                    forw_t_coeff_T,
-                   back_t_coeff_T);
+                   back_t_coeff_T,
+                   derived_mu,
+                   derived_mu_inv);
           }
         } // End for over scanlines y
       } // End for over scalines z

A lot of common code is in the ctor:

@@ -1,16 +1,21 @@

 template <typename FT, int veclen, int soalen, bool compress12>
-Dslash<FT, veclen, soalen, compress12>::Dslash(
+TMDslash<FT, veclen, soalen, compress12>::TMDslash(
     Geometry<FT, veclen, soalen, compress12> *geom_,
     double t_boundary_,
     double aniso_coeff_S_,
-    double aniso_coeff_T_)
+    double aniso_coeff_T_,
+    double Mass_,
+    double TwistedMass_)
     : s(geom_), comms(new Comms<FT, veclen, soalen, compress12>(geom_)),
       By(geom_->getBy()), Bz(geom_->getBz()), NCores(geom_->getNumCores()),
       Sy(geom_->getSy()), Sz(geom_->getSz()), PadXY(geom_->getPadXY()),
       PadXYZ(geom_->getPadXYZ()), MinCt(geom_->getMinCt()),
       n_threads_per_core(geom_->getSy() * geom_->getSz()), t_boundary(t_boundary_),
-      aniso_coeff_S(aniso_coeff_S_), aniso_coeff_T(aniso_coeff_T_)
+      aniso_coeff_S(aniso_coeff_S_), aniso_coeff_T(aniso_coeff_T_), Mass(Mass_),
+      TwistedMass(TwistedMass_), derived_mu(TwistedMass / (4.0 + Mass)),
+      derived_mu_inv((4.0 + Mass) /
+                     (TwistedMass * TwistedMass + (4.0 + Mass) * (4.0 + Mass)))
 {
   // OK we need to set up log of veclen
   log2veclen = 0;
@@ -104,8 +109,7 @@
   // Set up block info array. These are thread local
   // Indices run as (phases fastest and threads slowest)
   int num_blockinfo = num_phases * s->getNumCores() * n_threads_per_core;
-  block_info = (BlockPhase *)ALIGNED_MALLOC(num_blockinfo * sizeof(BlockPhase),
-                                            QPHIX_LLC_CACHE_ALIGN);
+  block_info = (BlockPhase *)ALIGNED_MALLOC(num_blockinfo * sizeof(BlockPhase), 64);
   if (block_info == 0x0) {
     fprintf(stderr, "Could not allocate Block Info array\n");
     abort();
@@ -146,7 +150,7 @@
   // Alloc tmpspc. It is thread local so we need one for
   // every thread
   size_t tmpspc_size = num_cores * n_threads_per_core * veclen * 16 * sizeof(int);
-  tmpspc_all = (int *)ALIGNED_MALLOC(tmpspc_size, QPHIX_LLC_CACHE_ALIGN);
+  tmpspc_all = (int *)ALIGNED_MALLOC(tmpspc_size, 64);
   if (tmpspc_all == 0x0) {
     printf("Failed to allocate xy offset tmpspc\n");
     abort();

The dtor is exactly the same:

@@ -1,6 +1,6 @@

 template <typename FT, int veclen, int soalen, bool compress12>
-Dslash<FT, veclen, soalen, compress12>::~Dslash()
+TMDslash<FT, veclen, soalen, compress12>::~TMDslash()
 {
   masterPrintf("Freeing BlockInfo\n");
   ALIGNED_FREE(block_info);

A sensible option would therefore be to extract a common base class and use non-virtual inheritance (so that there is no runtime penalty).
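
A minimal stand-alone sketch of what that could look like (the names are invented for illustration and do not match the actual class layout): the base class holds the shared members and the shared setup/teardown, the derived class adds only the twisted-mass specifics, and since no virtual functions are involved there is no dispatch overhead.

// Illustration only, not the actual QPhiX classes.
template <typename FT, int veclen, int soalen, bool compress12>
class DslashBaseSketch
{
public:
  DslashBaseSketch(double t_boundary_, double aniso_coeff_S_, double aniso_coeff_T_)
      : t_boundary(t_boundary_), aniso_coeff_S(aniso_coeff_S_),
        aniso_coeff_T(aniso_coeff_T_)
  {
    // shared setup: blocking, comms, block_info / tmpspc allocation, ...
  }

  ~DslashBaseSketch()
  {
    // shared teardown: freeing the buffers allocated above
  }

protected:
  double t_boundary;
  double aniso_coeff_S;
  double aniso_coeff_T;
};

template <typename FT, int veclen, int soalen, bool compress12>
class TMDslashSketch : public DslashBaseSketch<FT, veclen, soalen, compress12>
{
public:
  TMDslashSketch(double t_boundary_, double aniso_coeff_S_, double aniso_coeff_T_,
                 double Mass_, double TwistedMass_)
      : DslashBaseSketch<FT, veclen, soalen, compress12>(
            t_boundary_, aniso_coeff_S_, aniso_coeff_T_),
        Mass(Mass_), TwistedMass(TwistedMass_),
        derived_mu(TwistedMass / (4.0 + Mass)),
        derived_mu_inv((4.0 + Mass) /
                       (TwistedMass * TwistedMass + (4.0 + Mass) * (4.0 + Mass)))
  {
    // only the twisted-mass specific setup goes here
  }

private:
  double Mass;
  double TwistedMass;
  double derived_mu;
  double derived_mu_inv;
};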

Wilson — Clover

Also there is a lot of shared code between Wilson and Clover (and presumably between TM Wilson and TM Clover).


What would you think of pulling the common parts into a base class and using non-virtual inheritance? That would mean that there is no polymorphism, but we could reuse the implementations across the four Dslash variants.

kostrzewa commented 7 years ago

It was just a copy-paste mistake; all non-MPI tests work now. I'll have to test that on JURECA and Marconi eventually to see whether MPI works correctly in a realistic environment.

One surprising thing I noticed is that QPhiX is almost stupidly efficient when N_tasks >> N_Cores. I've run as many as 64 MPI processes on my dual-core laptop, with an overhead of only around 30% compared to a single MPI process with two threads... It seems like all the MPI testing can really be done easily on a development machine.

kostrzewa commented 7 years ago

@bjoo As you have noticed, with these changes we might be making some significant architectural modifications in QPhiX, some of which may affect client code, depending on how QPhiX is being used. I think there's no question that what Martin is doing is improving legibility and maintainability of the codebase, but it would be good to ensure that we don't step on anybody's toes too much...

martin-ueding commented 7 years ago

Can you make sure that this really doesn't affect performance in a solver?

I can definitely test the performance. If it is worse than before, I'll try to fix that.

It seems like there will be a lot of instantiation going on, or am I wrong?

The control flow has not changed that much:

[image: control_flow]

The auto kernel should just be a function pointer, so it might not really hurt in the inner loop. And since is_plus is marked const, I hope that the compiler evaluates it only once. If that is not the case, I could imagine changing it like this:

template <typename FT, int veclen, int soalen, bool compress12>
template <bool is_plus> // the sign becomes a template parameter instead of a runtime argument
void Dslash<FT, veclen, soalen, compress12>::Dyz(int tid,
                                                 const FourSpinorBlock *psi,
                                                 FourSpinorBlock *res,
                                                 const SU3MatrixBlock *u,
                                                 int cb)

Then I would also pull the auto kernel = … line out to the top of the function. With is_plus being a template parameter, the compiler is forced to generate two copies of the function, one plus and one minus version, just as before. Alternatively, the call could be done without the function pointer, by listing the arguments twice inside an if (is_plus). That would still remove the bulk of the duplicated code, and I cannot imagine how performance could be any worse.
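
A stand-alone sketch of that template variant (illustrative names only, not the actual QPhiX functions): the entry point branches once on isign and calls an implementation that is instantiated separately for the plus and minus case, so inside the hot loop is_plus is a compile-time constant.

#include <cstddef>

// Illustration only: two instantiations are generated, and the branch
// inside the loop is folded away at compile time in each of them.
template <bool is_plus>
void dyz_impl(float *res, const float *psi, std::size_t n_sites)
{
  for (std::size_t i = 0; i < n_sites; ++i) {
    res[i] = is_plus ? psi[i] + 1.0f : psi[i] - 1.0f;
  }
}

void dyz(float *res, const float *psi, std::size_t n_sites, int isign)
{
  if (isign == 1) {
    dyz_impl<true>(res, psi, n_sites); // "plus" version
  } else {
    dyz_impl<false>(res, psi, n_sites); // "minus" version
  }
}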

It seems like all the MPI testing can really be done easily on a development machine.

Good to know, I'll then try to compile all that with GCC and MPICH on my laptop.

[…] with these changes we might be making some significant architectural modifications in QPhiX […]

Everything I have done here was just in private functions. No other code should be affected.

[…] but it would be good to ensure that we don't step on anybody's toes too much

Definitely! Chroma should always link against QPhiX; tmLQCD should be able to do so as well. Then there is MILC which uses QPhiX. Do we know of other uses that we should keep in mind?

kostrzewa commented 7 years ago

Everything I have done here was just in private functions. No other code should be affected.

I mean cases where there is client code which relies heavily on calling DPsiPlus/Minus directly, or something like that. We might also have to think about the MILC people who [are, will be?] implementing the staggered Dirac operator in QPhiX and how they interface their code with QPhiX.

kostrzewa commented 7 years ago

The branching for dslash_[minus,plus]_vec inside the loop might really hurt, but I don't know what compilers actually make of this, so I'm just repeating what one reads about branching inside threaded loops.

kostrzewa commented 7 years ago

Everything I have done here was just in private functions. No other code should be affected.

Oh, I see; indeed, even the two DPsi functions are private. I hadn't realised that.

martin-ueding commented 7 years ago

Performance on JURECA (Haswell) with the devel branch is 128.85 ± 1.70 GF/s; the redundancy-reduction branch gives 128.71 ± 1.45 GF/s. Both on a single node with two MPI ranks. The lattice is 16^4, double precision:

#!/bin/bash
# Copyright © 2016-2017 Martin Ueding <dev@martin-ueding.de>

#SBATCH --partition=devel
#SBATCH --time=02:00:00

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12
#SBATCH --output=timings.slurm-%j.out.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=ueding@hiskp.uni-bonn.de

module load Intel
module load IntelMPI
module list

set -e
set -u
set -x

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export KMP_AFFINITY=compact,0

pushd ~/qphix-twisted-test/build-icc/qphix/tests

for runner in time_clov_noqdp; do
    srun "$runner" \
        -x 16 -y 16 -z 16 -t 16 -i 100 -prec d \
        -dslash \
        -by 4 -bz 4 -pxy 1 -pxyz 0 -minct 1 \
        -c ${SLURM_CPUS_PER_TASK} -sy 1 -sz 1 -geom 1 1 1 2
done

Do you want more tests? Shall I implement the template variant suggested above just to make sure?

I have done objdump -d testClovDslashFull.o and searched for Dyz and the kernel that it calls, clov_dslash_plus_vec. I could find a couple of blocks for each of those, probably one per template specialization. The strange thing is that Dyz does not call the kernel anywhere by its name. Since the kernel functions are marked inline, the compiler might have inlined them. But then it surprises me that they still show up in the object file explicitly‽

Perhaps one needs to compile with -g in order to get more sensible names from the object file.

kostrzewa commented 7 years ago

Do you want more tests? Shall I implement the template variant suggested above just to make sure?

Sounds good so far; I would probably test on Marconi as well, just to make sure...

kostrzewa commented 7 years ago

While doing so, could you run the whole timing suite? (-mmat -cg -bicgstab)