JeffersonLab / qphix

QCD for Intel Xeon Phi and Xeon processors
http://jeffersonlab.github.io/qphix/

Residuals are very large for certain number of nodes #23

Closed martin-ueding closed 7 years ago

martin-ueding commented 7 years ago

I run Chroma with the QPhiX clover solvers on an Intel Xeon Haswell (AVX2) architecture. Each node has two Xeons with 12 physical cores each (24 virtual cores). I do not use SMT, and I use a single MPI rank per node, so that is 24 threads per node.

The 16³×32 lattice works just fine on 1, 2, 4, 8, and 32 nodes. A 32³×96 lattice works fine on 8, 64, or 128 nodes. The 24³×96 lattice, however, fails on 64 nodes:

QPHIX_RESIDUUM_REPORT:
         Red_Final[0]=2.35420099376685e-310
         Red_Final[1]=2.35420099376685e-310
         Red_Final[2]=0
         Red_Final[3]=0
         Red_Final[4]=2.21341409336878e-321
         Red_Final[5]=6.32404026676796e-322
         Red_Final[6]=0
         Red_Final[7]=1.48219693752374e-323
         Red_Final[8]=1.06527781423771e-316
         Red_Final[9]=0
         Red_Final[10]=0
         Red_Final[11]=3.44237511497523e-316
         Red_Final[12]=0
         Red_Final[13]=0
         Red_Final[14]=3.44235930487457e-316
         Red_Final[15]=7.83598624667517e-12
QPHIX_CLOVER_MULTI_SHIFT_CG_MDAGM_SOLVER: Residua Check: 
         shift[0]  Actual || r || / || b || = 72301.7437396402
QMP m7,n64@jrc0384 error: abort: 1
SOLVE FAILED: rel_resid=72301.7437396402 target=1e-09 tolerance_factor=10 max tolerated=1e-08

On 8 nodes it works; on 128 I believe it fails. I now have it running on 81 nodes. The problem is probably caused by all the factors of 3 in the lattice volume.

It would be nice if there were some error message when the number of nodes does not make sense for QPhiX. At least it fails early with the residuals, but it still took me a while to figure out a working number of nodes, especially since the queueing time for jobs with more than 8 nodes can be several days for my account.

What is this condition? What has to be divisible by what? With that information I would attempt to implement such a warning and suggest a number of nodes for the user to try instead.

azrael417 commented 7 years ago

Hello Martin,

Can you post your executable arguments here, especially the QPhiX-specific ones, i.e. by, bz, sy, minCt, etc.? I have seen these issues when the block layout does not make sense.

Best Thorsten


martin-ueding commented 7 years ago

The QPhiX arguments are the following:

-by 8 -bz 8 -c 24 -sy 1 -sz 1 -pxy 1 -pxyz 0 -minct 2

Below is the beginning of the standard output from the job, the lines starting with + are the shell commands executed (I ran bash -x).

+ export OMP_NUM_THREADS=24
+ OMP_NUM_THREADS=24
+ export KMP_AFFINITY=compact,0
+ KMP_AFFINITY=compact,0
+ mkdir -p cfg
+ mkdir -p hmc-out
+ srun ./hmc -i hmc.ini.xml -o hmc-out/hmc.slurm-2831834.out.xml -l hmc-out/hmc.slurm-2831834.log.xml -by 8 -bz 8 -c 24 -sy 1 -sz 1 -pxy 1 -pxyz 0 -minct 2
QDP use OpenMP threading. We have 24 threads
Affinity reporting not implemented for this architecture
Initialize done
Initializing QPhiX CLI Args
QPhiX CLI Args Initialized
 QPhiX: By=8
 QPhiX: Bz=8
 QPhiX: Pxy=1
 QPhiX: Pxyz=0
 QPhiX: NCores=24
 QPhiX: Sy=1
 QPhiX: Sz=1
 QPhiX: MinCt=2
---%<---
Lattice initialized:
  problem size = 24 24 24 96
  layout size = 12 24 24 96
  logical machine size = 1 4 2 8
  subgrid size = 24 6 12 12
  total number of nodes = 64
  total volume = 1327104
  subgrid volume = 20736
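
For reference, the subgrid size reported here is just the problem size divided elementwise by the logical machine size; a trivial check:

// The subgrid in the log is the problem size divided elementwise by the
// logical machine size: (24,24,24,96) / (1,4,2,8) = (24,6,12,12).
#include <cstdio>

int main()
{
    const int problem[4] = {24, 24, 24, 96}; // x, y, z, t
    const int machine[4] = {1, 4, 2, 8};     // logical machine size
    for (int d = 0; d < 4; ++d)
        std::printf("subgrid[%d] = %d\n", d, problem[d] / machine[d]);
    return 0;
}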

ddkalamk commented 7 years ago

Hi Martin,

For blocking to work, we need the local (i.e. per-rank) Ny and Nz to be divisible by By and Bz, respectively. By and Bz of 8 are good for large volumes, but in the multinode case you may try values of 4 or 6 that divide Ny and Nz. In your case, can you please try -by 6 -bz 6 (or -by 6 -bz 4)? Either of these blockings may work. I thought we had these checks somewhere but maybe we lost those. We certainly need to add these sanity checks at the time of lattice setup in QPhiX.
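
A minimal sketch of what such a check could look like (hypothetical function and variable names, not the actual QPhiX interface):

// Hypothetical setup-time sanity check: the local (per-rank) Y and Z
// extents must be whole multiples of the block sizes By and Bz.
#include <cstdio>
#include <cstdlib>

void checkBlocking(int localNy, int localNz, int By, int Bz)
{
    if (localNy % By != 0) {
        std::fprintf(stderr, "QPhiX: local Ny=%d is not divisible by By=%d\n",
                     localNy, By);
        std::abort();
    }
    if (localNz % Bz != 0) {
        std::fprintf(stderr, "QPhiX: local Nz=%d is not divisible by Bz=%d\n",
                     localNz, Bz);
        std::abort();
    }
}

// For the failing 64-node run (subgrid 24 6 12 12, By=Bz=8),
// checkBlocking(6, 12, 8, 8) would abort: 6 is not divisible by 8.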

Thanks, Dhiraj


bjoo commented 7 years ago

Hi Martin, your local volume is 24x6x12x12; after checkerboarding (CB) this is 12x6x12x12. I am presuming this is AVX/AVX2 on a 2-socket x 12-core system? Your block sizes of 8x8 do not divide this well, which may be the source of the trouble.

I would: a) Run 2 MPI ranks per node (bind each to a socket) and run with minct=1. With Intel MPI you can use I_MPI_PIN=1 I_MPI_PIN_DOMAIN=socket. With MinCt=2 the face buffers may get communicated over the inter-socket link (QPI?) and this can be a drag.

b) Assuming that after dividing out another factor of 2 (because you go to 1 MPI rank per socket) your volume becomes 12x6x6x12 after checkerboarding, you should be able to run -by 6 -bz 6. Xeon has a huge L3 cache, so this should hopefully be OK. This gives you 1 block per thread/core since you are not using SMT. NB: at this point you will probably be affected by strong-scaling issues rather than node-level issues... We will see.

c) For these kinds of dimensions you don't need padding; set -pxy 0 -pxyz 0.
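
Putting a) through c) together, the launch could look something like this (an untested sketch based on your posted command line, assuming 2 MPI ranks per node with 12 cores each; whether I_MPI_PIN is honored depends on whether Intel MPI or Slurm does the pinning):

export OMP_NUM_THREADS=12
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=socket
srun --ntasks-per-node=2 ./hmc -i hmc.ini.xml -o hmc.out.xml -l hmc.log.xml -by 6 -bz 6 -c 12 -sy 1 -sz 1 -pxy 0 -pxyz 0 -minct 1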

Let me know if this helps. Best, B


martin-ueding commented 7 years ago

The system I run on is JURECA in Jülich, which is a dual-socket Xeon machine with two Intel Xeon E5-2680 v3 Haswell CPUs per node:

  • 2 x 12 cores, 2.5 GHz
  • Intel Hyper-Threading Technology (Simultaneous Multithreading)
  • AVX 2.0 ISA extension

Divisibility

So the X and Y components of the subgrid size have to be divisible by Bx and By after checkerboarding? This would certainly explain why it works with 81 nodes but not with 64:

  • 64 nodes: subgrid size = 24 6 12 12, 24/8 works, but 6/8 is not an integer.
  • 81 nodes: subgrid size = 8 8 8 32, 8/8 is an integer.

But it must be before checkerboarding, right? Otherwise it would not have worked on 81 nodes, since then 4/8 would have become a problem.

Two MPI processes per node

Are I_MPI_PIN=1 and I_MPI_PIN_DOMAIN=socket environment variables to set?

I had tried two MPI processes but went from a few minutes per trajectory to 4 hours. Perhaps I did something wrong with the OpenMP thread binding or SMT. I will have to look into that again, since a performance test on 8 nodes showed that although the solver performance is a bit lower, the time to solution is improved.

So far I have not tuned the performance extensively; most of my time has been spent on getting the Delta H under control. I will have to look into performance again and run a couple of tests with the larger lattices.

bjoo commented 7 years ago

Hi Martin,

So the X and Y components of the subgrid size have to be divisible by Bx and By after checkerboarding? This would certainly explain why it works with 81 but not with 64.

There is currently no Bx, only By and Bz. So:

  • The X subgrid dimension has to be divisible by the SOALEN after checkerboarding.
  • The Y and Z dimensions need to be divisible by By and Bz, respectively (and are not affected by checkerboarding).
  • By needs to be divisible by VECLEN/SOALEN.

Here VECLEN is the hardware vector length and SOALEN is something you choose at compile time.

E.g. AVX and AVX2 allow SOALEN=4 and SOALEN=8 in single precision (VECLEN=8), and SOALEN=2 and SOALEN=4 in double precision (VECLEN=4).

Suppose you are in double precision (vector length 4) and choose SOALEN=4. Then each SOALEN load is a full vector load using one Y coordinate, so the only constraint on By is that it must divide Y.

Suppose you are in single precision (vector length 8) and choose SOALEN=4. Then each vector load consists of 2 half-vector loads of length 4. These come from two Y coordinates, y and y+1, so you will want By divisible by 2 AND By must divide Y.

On KNL, the single-precision vector length is 16. If you have SOALEN=4, this will load from 4 successive Y coordinates; in that case By needs to be divisible by 4 as well as divide the Y dimension.

Bz is not affected by VECLEN and SOALEN as it is not involved in the vectorization.

Having an X dimension of only SOALEN (i.e. one SOALEN-length block) after checkerboarding will hurt your strong scaling, since the face reconstructions from the +/- X neighbors will hit the same vector and will need to be serialized to avoid conflicts. If more than one SOALEN block is in X, the forward and backward faces may be able to update their blocks simultaneously. For best results it may be worth having at least 2 SOALENs in X. With SOALEN=4 this would mean local checkerboarded X lengths of 8, 12, etc., so local uncheckerboarded lengths of 16, 24, etc. (this last part would apply if you use the nesap_hacklatt_strongscale branch; I can't remember whether I merged that into devel yet).
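
As a concrete check of these rules against your failing 64-node case, here is a small worked sketch (the VECLEN=4/SOALEN=4 values are my assumption for double precision on AVX2):

// Check the divisibility rules for the 64-node subgrid 24 x 6 x 12 x 12
// (x, y, z, t) with By=Bz=8, assuming double precision AVX2.
#include <cstdio>

int main()
{
    const int veclen = 4, soalen = 4;   // assumed: double precision, AVX2
    const int lx = 24, ly = 6, lz = 12; // local subgrid extents
    const int By = 8, Bz = 8;           // block sizes from the job

    const int lx_cb = lx / 2;           // checkerboarding halves X

    std::printf("X after CB divisible by SOALEN: %s\n",
                lx_cb % soalen == 0 ? "yes" : "NO");         // 12 % 4: yes
    std::printf("Y divisible by By:              %s\n",
                ly % By == 0 ? "yes" : "NO");                // 6 % 8: NO
    std::printf("Z divisible by Bz:              %s\n",
                lz % Bz == 0 ? "yes" : "NO");                // 12 % 8: NO
    std::printf("By divisible by VECLEN/SOALEN:  %s\n",
                By % (veclen / soalen) == 0 ? "yes" : "NO"); // 8 % 1: yes
    return 0;
}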

Best, B


martin-ueding commented 7 years ago

That should be fixed now.