For the Wilson dslash, I have removed 800 lines of code which were plainly redundant. I have started with the DPsiPlus and DPsiMinus functions, which are pretty much the same:
Then the same can be done for DyzPlus and DyzMinus:
Then similarly I did the "A chi - b D psi" functions (operations of the form A·chi − b·D·psi, as the name AChiMinusBDPsi spells out). The same could easily be done for Wilson clover, TM, and TM clover as well. In total that would get rid of 3200 lines of code.
But one does not have to stop there. The functions DPsi and DPsiAChiMinusBDPsi show a significant similarity:
There are some caveats, like the number of arguments passed. Also the Dyz and DyzAChiMinusBDPsi functions share a lot of code, though they also differ:
There one would have to refactor the common code and isolate the differing parts, perhaps into a function object. That would make it possible to reuse more of the common code, even across the various Dslash variants; see the sketch below.
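To make the function-object idea concrete, here is a rough, self-contained sketch (all names are hypothetical and the signatures heavily simplified; this is not QPhiX code). The loop structure is written once, and each variant supplies its per-site operation as a callable that the compiler can inline:

#include <vector>

// Shared traversal, written once. The variant-specific per-site work is
// injected as a callable; each instantiation can inline its kernel, so
// there is no indirect call in the inner loop.
template <typename SiteKernel>
void generic_dyz_loop(int n_sites, SiteKernel &&site_kernel)
{
  for (int site = 0; site < n_sites; ++site) {
    site_kernel(site);
  }
}

void example(std::vector<float> &res, const std::vector<float> &chi)
{
  const int n = static_cast<int>(res.size());

  // Plain dslash-like variant: only touches res.
  generic_dyz_loop(n, [&](int i) { res[i] *= 2.0f; });

  // "A chi - b D psi"-like variant: the same loop, but the callable also
  // reads the additional spinor argument chi.
  generic_dyz_loop(n, [&](int i) { res[i] = chi[i] - 0.5f * res[i]; });
}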
I have removed the duplicated code from all four variants, removing over 3000 lines of code. The clover tests fail, so I need to fix that. The other three variants seem to be just fine.
I just made a copy-paste mistake; all non-MPI tests work now. I'll have to test this on JURECA and Marconi eventually to see whether MPI works correctly in a realistic environment.
The functions DPsiPlus and DPsiMinus have been unified into a single DPsi function where the kernel to use is chosen depending on isign:
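For illustration, the selection boils down to the pattern that is also visible in the TM diff quoted below. A simplified, self-contained sketch (not the verbatim QPhiX code; the real kernels take many more arguments):

#include <cstdio>

// Stand-ins for the real vectorized kernels. Both signs share one
// signature, so a single function-pointer type covers both.
template <int veclen>
void dslash_plus_vec(float *res) { (void)res; std::puts("plus kernel"); }

template <int veclen>
void dslash_minus_vec(float *res) { (void)res; std::puts("minus kernel"); }

template <int veclen>
void Dyz(float *res, bool is_plus)
{
  // The sign is resolved once, outside the per-site work; the chosen
  // kernel is then called with the argument list both signs share.
  auto kernel = (is_plus ? dslash_plus_vec<veclen> : dslash_minus_vec<veclen>);
  kernel(res);
}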
Can you make sure that this really doesn't affect performance in a solver? It seems like there will be a lot of instantiation going on, or am I wrong?
There is still a lot of common code between each dslash and dslashAChiMinusBDPsi implementation. Those differ only slightly; the most important difference is the additional spinor argument.
There are a few differences in the DPsi functions of Wilson and TM Wilson. They seem to be somewhat similar, except that one uses a queue (polling the receives and re-queueing directions that are not done yet) while the other has a static communication scheme (finishing sends and receives in a fixed direction order):
@@ -1,10 +1,11 @@
template <typename FT, int veclen, int soalen, bool compress12>
-void Dslash<FT, veclen, soalen, compress12>::DPsi(const SU3MatrixBlock *u,
- const FourSpinorBlock *psi_in,
- FourSpinorBlock *res_out,
- bool is_plus,
- int cb)
+void TMDslash<FT, veclen, soalen, compress12>::TMDPsi(
+ const SU3MatrixBlock *u,
+ const FourSpinorBlock *psi_in,
+ FourSpinorBlock *res_out,
+ bool const is_plus,
+ int cb)
{
double beta_s = -aniso_coeff_S;
double beta_t_f = -aniso_coeff_T;
@@ -32,14 +33,12 @@
comms->startSendDir(2 * d + 1);
comms->startSendDir(2 * d + 0);
- comms->recv_queue.push(2 * d + 1);
- comms->recv_queue.push(2 * d + 0);
}
}
#endif // QPHIX_DO_COMMS
@@ -48,24 +47,40 @@
#pragma omp parallel
{
int tid = omp_get_thread_num();
- Dyz(tid, psi_in, res_out, u, is_plus, cb);
+ TMDyz(tid, psi_in, res_out, u, is_plus, cb);
}
#ifdef QPHIX_DO_COMMS
- while (!comms->recv_queue.empty()) {
- int d = comms->recv_queue.front();
- comms->recv_queue.pop();
- if (comms->testSendToDir(d) && comms->testRecvFromDir(d)) {
+ for (int d = 3; d >= 0; d--) {
+ if (!comms->localDir(d)) {
+ comms->finishSendDir(2 * d + 1);
+ comms->finishRecvFromDir(2 * d + 0);
+ comms->finishSendDir(2 * d + 0);
+ comms->finishRecvFromDir(2 * d + 1);
+
#pragma omp parallel
{
int tid = omp_get_thread_num();
-
- double bet = (d / 2 == 3 ? (d % 2 == 0 ? beta_t_b : beta_t_f) : beta_s);
- completeFaceDir(
- tid, comms->recvFromDir[d], res_out, u, bet, cb, d / 2, d % 2, 1);
+ completeTMFaceDir(tid,
+ comms->recvFromDir[2 * d + 0],
+ res_out,
+ u,
+ (d == 3 ? beta_t_b : beta_s),
+ cb,
+ d,
+ 0,
+ is_plus);
+ completeTMFaceDir(tid,
+ comms->recvFromDir[2 * d + 1],
+ res_out,
+ u,
+ (d == 3 ? beta_t_f : beta_s),
+ cb,
+ d,
+ 1,
+ is_plus);
}
- } else
- comms->recv_queue.push(d);
+ }
}
#endif // QPHIX_DO_COMMS
}
The Dyz functions are very similar; one just passes the twisted mass argument to the kernel:
@@ -1,11 +1,11 @@
template <typename FT, int veclen, int soalen, bool compress12>
-void Dslash<FT, veclen, soalen, compress12>::Dyz(int tid,
- const FourSpinorBlock *psi,
- FourSpinorBlock *res,
- const SU3MatrixBlock *u,
- bool is_plus,
- int cb)
+void TMDslash<FT, veclen, soalen, compress12>::TMDyz(int tid,
+ const FourSpinorBlock *psi,
+ FourSpinorBlock *res,
+ const SU3MatrixBlock *u,
+ bool const is_plus,
+ int cb)
{
const int Nxh = s->Nxh();
const int Nx = s->Nx();
@@ -28,7 +28,9 @@
int smtid_z = smtid / Sy;
int smtid_y = smtid - Sy * smtid_z;
+ // unsigned int accumulate[8] = { -1, -1, -1, -1, -1, -1, -1, -1 };
unsigned int accumulate[8] = {~0U, ~0U, ~0U, ~0U, ~0U, ~0U, ~0U, ~0U};
+
int nvecs = s->nVecs();
const int gauge_line_in_floats =
@@ -56,11 +58,9 @@
int soprefdist = 0;
#if defined(__GNUG__) && !defined(__INTEL_COMPILER)
- int *tmpspc __attribute__((aligned(QPHIX_LLC_CACHE_ALIGN))) =
- &(tmpspc_all[veclen * 16 * tid]);
+ int *tmpspc __attribute__((aligned(64))) = &(tmpspc_all[veclen * 16 * tid]);
#else
- __declspec(align(QPHIX_LLC_CACHE_ALIGN)) int *tmpspc =
- &(tmpspc_all[veclen * 16 * tid]);
+ __declspec(align(64)) int *tmpspc = &(tmpspc_all[veclen * 16 * tid]);
#endif
int *offs, *xbOffs, *xfOffs, *ybOffs, *yfOffs, *gOffs, *pfyOffs;
@@ -241,6 +241,7 @@
accumulate[3] = (yi == Ny - nyg ? yfmask_yn : -1);
#endif
+// using Cilk array notations:
#ifdef QPHIX_USE_CEAN
pfyOffs [0:(veclen + 1) / 2] = ybOffs [0:(veclen + 1) / 2];
pfyOffs [(veclen + 1) / 2:veclen / 2] =
@@ -264,8 +265,8 @@
FT back_t_coeff_T = rep<FT, double>(back_t_coeff);
auto kernel =
- (is_plus ? dslash_plus_vec<FT, veclen, soalen, compress12>
- : dslash_minus_vec<FT, veclen, soalen, compress12>);
+ (is_plus ? tm_dslash_plus_vec<FT, veclen, soalen, compress12>
+ : tm_dslash_minus_vec<FT, veclen, soalen, compress12>);
kernel(xyBase + X,
zbBase + X,
@@ -292,7 +293,9 @@
accumulate,
aniso_coeff_s_T,
forw_t_coeff_T,
- back_t_coeff_T);
+ back_t_coeff_T,
+ derived_mu,
+ derived_mu_inv);
}
} // End for over scanlines y
} // End for over scanlines z
A lot of common code is in the ctor:
@@ -1,16 +1,21 @@
template <typename FT, int veclen, int soalen, bool compress12>
-Dslash<FT, veclen, soalen, compress12>::Dslash(
+TMDslash<FT, veclen, soalen, compress12>::TMDslash(
Geometry<FT, veclen, soalen, compress12> *geom_,
double t_boundary_,
double aniso_coeff_S_,
- double aniso_coeff_T_)
+ double aniso_coeff_T_,
+ double Mass_,
+ double TwistedMass_)
: s(geom_), comms(new Comms<FT, veclen, soalen, compress12>(geom_)),
By(geom_->getBy()), Bz(geom_->getBz()), NCores(geom_->getNumCores()),
Sy(geom_->getSy()), Sz(geom_->getSz()), PadXY(geom_->getPadXY()),
PadXYZ(geom_->getPadXYZ()), MinCt(geom_->getMinCt()),
n_threads_per_core(geom_->getSy() * geom_->getSz()), t_boundary(t_boundary_),
- aniso_coeff_S(aniso_coeff_S_), aniso_coeff_T(aniso_coeff_T_)
+ aniso_coeff_S(aniso_coeff_S_), aniso_coeff_T(aniso_coeff_T_), Mass(Mass_),
+ TwistedMass(TwistedMass_), derived_mu(TwistedMass / (4.0 + Mass)),
+ derived_mu_inv((4.0 + Mass) /
+ (TwistedMass * TwistedMass + (4.0 + Mass) * (4.0 + Mass)))
{
// OK we need to set up log of veclen
log2veclen = 0;
@@ -104,8 +109,7 @@
// Set up block info array. These are thread local
// Indices run as (phases fastest and threads slowest)
int num_blockinfo = num_phases * s->getNumCores() * n_threads_per_core;
- block_info = (BlockPhase *)ALIGNED_MALLOC(num_blockinfo * sizeof(BlockPhase),
- QPHIX_LLC_CACHE_ALIGN);
+ block_info = (BlockPhase *)ALIGNED_MALLOC(num_blockinfo * sizeof(BlockPhase), 64);
if (block_info == 0x0) {
fprintf(stderr, "Could not allocate Block Info array\n");
abort();
@@ -146,7 +150,7 @@
// Alloc tmpspc. It is thread local so we need one for
// every thread
size_t tmpspc_size = num_cores * n_threads_per_core * veclen * 16 * sizeof(int);
- tmpspc_all = (int *)ALIGNED_MALLOC(tmpspc_size, QPHIX_LLC_CACHE_ALIGN);
+ tmpspc_all = (int *)ALIGNED_MALLOC(tmpspc_size, 64);
if (tmpspc_all == 0x0) {
printf("Failed to allocate xy offset tmpspc\n");
abort();
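As a cross-check of the new initializer-list expressions, assuming the usual twisted-mass convention that the flavour-diagonal term is $(4+m) + i\mu\gamma_5$ with $\gamma_5^2 = 1$:

$$\bigl((4+m) + i\mu\gamma_5\bigr)^{-1} = \frac{(4+m) - i\mu\gamma_5}{(4+m)^2 + \mu^2},$$

which matches derived_mu $= \mu/(4+m)$ as the relative twist and derived_mu_inv $= (4+m)/\bigl(\mu^2 + (4+m)^2\bigr)$ as the real prefactor of the inverse.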
The dtor is exactly the same:
@@ -1,6 +1,6 @@
template <typename FT, int veclen, int soalen, bool compress12>
-Dslash<FT, veclen, soalen, compress12>::~Dslash()
+TMDslash<FT, veclen, soalen, compress12>::~TMDslash()
{
masterPrintf("Freeing BlockInfo\n");
ALIGNED_FREE(block_info);
It looks like a sensible option would be to extract a common base class and use non-virtual inheritance (in order to incur no runtime penalty).
Also there is a lot of shared code between Wilson and Clover (and presumably between TM Wilson and TM Clover).
What would you think of pulling stuff into a common base class and using non-virtual inheritance? That would mean that there is no polymorphism, but we could reuse code across the four Dslash implementations; a rough sketch follows below.
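A minimal sketch of the proposed layout (everything except the class names is hypothetical and heavily simplified; this is not the actual QPhiX class layout). The base class is deliberately non-virtual: the Dslash variants are never used through a base pointer, so no vtable or indirect calls are introduced.

// Common, non-virtual base holding the shared members and setup.
template <typename FT, int veclen, int soalen, bool compress12>
class DslashBase
{
 public:
  DslashBase(double t_boundary_, double aniso_coeff_S_, double aniso_coeff_T_)
      : t_boundary(t_boundary_), aniso_coeff_S(aniso_coeff_S_),
        aniso_coeff_T(aniso_coeff_T_)
  {
    // Shared setup (block info, tmpspc, ...) would live here, once.
  }

 protected:
  double t_boundary, aniso_coeff_S, aniso_coeff_T;
};

// The twisted-mass variant adds only its own state on top.
template <typename FT, int veclen, int soalen, bool compress12>
class TMDslash : public DslashBase<FT, veclen, soalen, compress12>
{
 public:
  TMDslash(double t_boundary_, double aniso_coeff_S_, double aniso_coeff_T_,
           double Mass_, double TwistedMass_)
      : DslashBase<FT, veclen, soalen, compress12>(
            t_boundary_, aniso_coeff_S_, aniso_coeff_T_),
        Mass(Mass_), TwistedMass(TwistedMass_)
  {
  }

 private:
  double Mass, TwistedMass;
};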
One surprising thing I noticed is that QPhiX is almost stupidly efficient when N_tasks >> N_Cores. I've run as many as 64 MPI processes on my dual-core laptop, with an overhead of only around 30% compared to a single MPI process with two threads... It seems like all the MPI testing can really be done easily on a development machine.
@bjoo As you have noticed, with these changes we might be making some significant architectural modifications in QPhiX, some of which may affect client code, depending on how QPhiX is being used. I think there's no question that what Martin is doing is improving legibility and maintainability of the codebase, but it would be good to ensure that we don't step on anybody's toes too much...
Can you make sure that this really doesn't affect performance in a solver?
I can definitely test the performance. If it is worse than before, I'll try to fix that.
It seems like there will be a lot of instantiation going on, or am I wrong?
The control flow has not changed that much:
The auto kernel should just be a function pointer, so it might not really hurt in the inner loop. And since is_plus is marked const, I hope that the compiler will only evaluate it once. If that is not the case, I could imagine changing it like this:
template <typename FT, int veclen, int soalen, bool compress12>
template <bool is_plus>
void Dslash<FT, veclen, soalen, compress12>::Dyz(int tid,
                                                 const FourSpinorBlock *psi,
                                                 FourSpinorBlock *res,
                                                 const SU3MatrixBlock *u,
                                                 // run-time bool is_plus dropped:
                                                 // it is the template parameter now
                                                 int cb)
Then I would also pull the auto kernel = … line out to the top-level function scope. With is_plus being a template parameter, it will force the compiler to generate two copies of the function, one plus and one minus version, just like there were before. Also the function call could be done without the function pointer, just listing the arguments twice in an if (is_plus). That would still remove the bulk of the duplicated code, but I cannot imagine how performance could be any worse.
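A minimal, self-contained sketch of that dispatch (names and signatures are simplified stand-ins, not the actual QPhiX interfaces): the runtime flag is branched on exactly once, and inside the instantiated loop body the sign is a compile-time constant.

// The sign is a template parameter, so the compiler emits two separate
// copies of the loop, one per sign, with no branch in the inner loop.
template <bool is_plus>
void Dyz(float *res, int n_sites)
{
  for (int i = 0; i < n_sites; ++i) {
    // Compile-time constant: the dead branch is eliminated in each copy.
    res[i] += is_plus ? 1.0f : -1.0f;
  }
}

// Public entry point: the only runtime branch on the sign happens here.
void DPsi(float *res, int n_sites, bool is_plus)
{
  if (is_plus) {
    Dyz<true>(res, n_sites);
  } else {
    Dyz<false>(res, n_sites);
  }
}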
It seems like all the MPI testing can really be done easily on a development machine.
Good to know, I'll then try to compile all that with GCC and MPICH on my laptop.
[…] with these changes we might be making some significant architectural modifications in QPhiX […]
Everything I have done here was just in private functions. No other code should be affected.
[…] but it would be good to ensure that we don't step on anybody's toes too much
Definitely! Chroma should always link against QPhiX; tmLQCD should be able to do so as well. Then there is MILC which uses QPhiX. Do we know of other uses that we should keep in mind?
Everything I have done here was just in private functions. No other code should be affected.
I mean cases where client code relies heavily on calling DPsiPlus/Minus directly, or something like that. We might also have to think about the MILC people who [are, will be?] implementing the staggered Dirac operator in QPhiX and how they interface their code with QPhiX.
The branching for dslash_[minus,plus]_vec on the inside of the loop might really hurt, but I don't know what compilers actually make of this, so I'm just repeating what one reads about branching inside threaded loops.
Everything I have done here was just in private functions. No other code should be affected.
Oh, I see, indeed even the two DPsi are private, I hadn't realised that.
Performance on JURECA (Haswell) with the devel branch is 128.85 ± 1.70 GF/s; the redundance-reduction branch gives 128.71 ± 1.45 GF/s. Both on a single node with two MPI ranks. Lattice is 16^4, double precision:
#!/bin/bash
# Copyright © 2016-2017 Martin Ueding <dev@martin-ueding.de>
#SBATCH --partition=devel
#SBATCH --time=02:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=12
#SBATCH --output=timings.slurm-%j.out.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=ueding@hiskp.uni-bonn.de
module load Intel
module load IntelMPI
module list
set -e
set -u
set -x
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export KMP_AFFINITY=compact,0
pushd ~/qphix-twisted-test/build-icc/qphix/tests
for runner in time_clov_noqdp; do
srun "$runner" \
-x 16 -y 16 -z 16 -t 16 -i 100 -prec d \
-dslash \
-by 4 -bz 4 -pxy 1 -pxyz 0 -minct 1 \
-c ${SLURM_CPUS_PER_TASK} -sy 1 -sz 1 -geom 1 1 1 2
done
Do you want more tests? Shall I implement the template variant suggested above just to make sure?
I have run objdump -d testClovDslashFull.o and searched for Dyz and the kernel that it calls, clov_dslash_plus_vec. I could find a couple of paragraphs for those, probably one for each template specialization. The strange thing is that Dyz does not call the kernel anywhere by its name. Since the kernel functions are marked inline, the compiler might have inlined them. But then it surprises me that they still show up in the object file explicitly‽ Perhaps one needs to compile with -g in order to get more sensible names from the object file.
Do you want more tests? Shall I implement the template variant suggested above just to make sure?
Sounds good so far; I would probably test on Marconi, just to make sure.
While doing so, could you run the whole timing suite? (-mmat -cg -bicgstab)
In the Dslash classes, there are pairs of functions for the plus and minus case. This sign is the isign which states whether the Dslash or its Hermitian conjugate is to be applied. These pairs look like this:

Looking through both implementations with diff, I found that they only differ on a single line, namely in the kernel they call. In the NDTM case, the caller looked like this:

Now it just passes the isign parameter onward and chooses the right part of the clover term:

The functions DPsiPlus and DPsiMinus have been unified into a single DPsi function where the kernel to use is chosen depending on isign:

This way several thousand lines of replicated code can be removed. There should not really be any change in performance because the branching has to be done at some point anyway. The user calls the dslash function, which then calls all the dependent functions.

Further, there seems to be a large overlap between the following: