Closed kostrzewa closed 3 months ago
Most strikingly, when I take the same exact input file, starting configuration and nstore_counter, I can reproduce the non-convergence of the MG on Juwels Booster:
$ grep GCR log_iwa_1.745-csw1.7112-L48_mu0.00077-kappa0.140008_mg_divergence_7436929.out
GCR: Convergence at 39 iterations, L2 relative residual: iterated = 2.279286e-11, true = 2.279286e-11 (requested = 3.162278e-11)
GCR: Convergence at 67 iterations, L2 relative residual: iterated = 9.960547e-11, true = 9.960547e-11 (requested = 1.000000e-10)
GCR: Convergence at 400 iterations, L2 relative residual: iterated = 1.918760e-05, true = 1.918760e-05 (requested = 1.000000e-10)
## forced setup refresh happens here
GCR: Convergence at 89 iterations, L2 relative residual: iterated = 7.680191e-11, true = 7.680191e-11 (requested = 1.000000e-10)
GCR: Convergence at 400 iterations, L2 relative residual: iterated = 8.970629e-01, true = 8.970629e-01 (requested = 1.000000e-10)
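Note that the solver prints "Convergence at 400 iterations" even when it has only hit the iteration limit, so grepping for "GCR" alone does not separate good solves from bad ones. A minimal sketch of a filter that compares the true residual against the requested one is given here; the function names are made up for illustration and the pattern only assumes the line format visible in the log above.

```python
import re

# Pattern for the GCR summary lines shown in the logs above.
GCR_LINE = re.compile(
    r"GCR: Convergence at (\d+) iterations, L2 relative residual: "
    r"iterated = ([0-9.eE+-]+), true = ([0-9.eE+-]+) "
    r"\(requested = ([0-9.eE+-]+)\)"
)


def parse_gcr(lines):
    """Return (iterations, true_residual, requested) for each GCR line."""
    solves = []
    for line in lines:
        match = GCR_LINE.search(line)
        if match:
            iters, _iterated, true_res, requested = match.groups()
            solves.append((int(iters), float(true_res), float(requested)))
    return solves


def failed_solves(solves):
    """Keep only solves whose true residual exceeds the requested one."""
    return [s for s in solves if s[1] > s[2]]


# Three lines copied from the Juwels Booster log above.
sample = [
    "GCR: Convergence at 39 iterations, L2 relative residual: "
    "iterated = 2.279286e-11, true = 2.279286e-11 (requested = 3.162278e-11)",
    "GCR: Convergence at 67 iterations, L2 relative residual: "
    "iterated = 9.960547e-11, true = 9.960547e-11 (requested = 1.000000e-10)",
    "GCR: Convergence at 400 iterations, L2 relative residual: "
    "iterated = 1.918760e-05, true = 1.918760e-05 (requested = 1.000000e-10)",
]

for iters, true_res, requested in failed_solves(parse_gcr(sample)):
    print(f"not converged: {iters} iterations, "
          f"true = {true_res:.6e}, requested = {requested:.6e}")
```

Comparing residuals rather than checking for the value 400 also catches solves that stop at the limit with a residual only slightly above the target.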
When I do the same exact thing on Meluxina (same input file, same starting config, same nstore_counter), I cannot reproduce the non-convergence:
$ grep GCR log_mg_diverge_test_b1.745-csw1.7112-mu0.00077-kappa0.1400080-L48_300460.out
GCR: Convergence at 36 iterations, L2 relative residual: iterated = 1.954907e-11, true = 1.954907e-11 (requested = 3.162278e-11)
GCR: Convergence at 43 iterations, L2 relative residual: iterated = 6.908226e-11, true = 6.908226e-11 (requested = 1.000000e-10)
GCR: Convergence at 50 iterations, L2 relative residual: iterated = 9.780117e-11, true = 9.780117e-11 (requested = 1.000000e-10)
GCR: Convergence at 32 iterations, L2 relative residual: iterated = 1.935594e-10, true = 1.935594e-10 (requested = 3.162278e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 2.025196e-10, true = 2.025196e-10 (requested = 3.162278e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 2.676342e-10, true = 2.676342e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.010051e-10, true = 2.010051e-10 (requested = 3.162278e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 2.875690e-10, true = 2.875690e-10 (requested = 3.162278e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 3.160003e-10, true = 3.160003e-10 (requested = 3.162278e-10)
GCR: Convergence at 32 iterations, L2 relative residual: iterated = 2.435307e-10, true = 2.435307e-10 (requested = 3.162278e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 2.229918e-10, true = 2.229918e-10 (requested = 3.162278e-10)
GCR: Convergence at 43 iterations, L2 relative residual: iterated = 8.962000e-11, true = 8.962000e-11 (requested = 1.000000e-10)
GCR: Convergence at 50 iterations, L2 relative residual: iterated = 9.654534e-11, true = 9.654534e-11 (requested = 1.000000e-10)
GCR: Convergence at 43 iterations, L2 relative residual: iterated = 7.278812e-11, true = 7.278812e-11 (requested = 1.000000e-10)
GCR: Convergence at 51 iterations, L2 relative residual: iterated = 8.926256e-11, true = 8.926256e-11 (requested = 1.000000e-10)
GCR: Convergence at 35 iterations, L2 relative residual: iterated = 2.029152e-10, true = 2.029152e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.447315e-10, true = 2.447315e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 3.143243e-10, true = 3.143243e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.400427e-10, true = 2.400427e-10 (requested = 3.162278e-10)
GCR: Convergence at 32 iterations, L2 relative residual: iterated = 2.236407e-10, true = 2.236407e-10 (requested = 3.162278e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 2.201591e-10, true = 2.201591e-10 (requested = 3.162278e-10)
GCR: Convergence at 43 iterations, L2 relative residual: iterated = 6.808581e-11, true = 6.808581e-11 (requested = 1.000000e-10)
GCR: Convergence at 51 iterations, L2 relative residual: iterated = 8.203881e-11, true = 8.203881e-11 (requested = 1.000000e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.510451e-10, true = 2.510451e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.738852e-10, true = 2.738852e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.516883e-10, true = 2.516883e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.610188e-10, true = 2.610188e-10 (requested = 3.162278e-10)
GCR: Convergence at 32 iterations, L2 relative residual: iterated = 2.183872e-10, true = 2.183872e-10 (requested = 3.162278e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 1.946104e-10, true = 1.946104e-10 (requested = 3.162278e-10)
GCR: Convergence at 42 iterations, L2 relative residual: iterated = 7.442478e-11, true = 7.442478e-11 (requested = 1.000000e-10)
GCR: Convergence at 50 iterations, L2 relative residual: iterated = 9.576162e-11, true = 9.576162e-11 (requested = 1.000000e-10)
GCR: Convergence at 42 iterations, L2 relative residual: iterated = 6.225627e-11, true = 6.225627e-11 (requested = 1.000000e-10)
GCR: Convergence at 50 iterations, L2 relative residual: iterated = 7.362596e-11, true = 7.362596e-11 (requested = 1.000000e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 3.082374e-10, true = 3.082374e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.513237e-10, true = 2.513237e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.211495e-10, true = 2.211495e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.755124e-10, true = 2.755124e-10 (requested = 3.162278e-10)
GCR: Convergence at 32 iterations, L2 relative residual: iterated = 2.428948e-10, true = 2.428948e-10 (requested = 3.162278e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 2.437832e-10, true = 2.437832e-10 (requested = 3.162278e-10)
GCR: Convergence at 41 iterations, L2 relative residual: iterated = 9.501070e-11, true = 9.501070e-11 (requested = 1.000000e-10)
GCR: Convergence at 49 iterations, L2 relative residual: iterated = 9.064841e-11, true = 9.064841e-11 (requested = 1.000000e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 2.875966e-10, true = 2.875966e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.474461e-10, true = 2.474461e-10 (requested = 3.162278e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 2.849282e-10, true = 2.849282e-10 (requested = 3.162278e-10)
GCR: Convergence at 34 iterations, L2 relative residual: iterated = 2.451119e-10, true = 2.451119e-10 (requested = 3.162278e-10)
GCR: Convergence at 32 iterations, L2 relative residual: iterated = 1.986401e-10, true = 1.986401e-10 (requested = 3.162278e-10)
GCR: Convergence at 33 iterations, L2 relative residual: iterated = 2.683489e-10, true = 2.683489e-10 (requested = 3.162278e-10)
This behaviour adds to the suspicion that there is something wrong on Juwels Booster, either with the network or with the way that different parts of our software stack interact with the machine / the remainder of the software stack.
The stack used on Meluxina:
module load \
env/release/2021.3 \
OpenMPI/4.1.1-GCC-10.3.0 \
CUDA/11.3.1 \
CMake/3.20.4 \
UCX-CUDA/1.10.0-GCCcore-10.3.0-CUDA-11.3.1 \
flex/2.6.4-GCCcore-10.3.0 \
OpenBLAS/0.3.15-GCC-10.3.0 \
GCC/10.3.0 \
Automake/1.16.3-GCCcore-10.3.0 \
numactl
The stack used on Juwels Booster:
module load Stages/2022
module load GCC/11.2.0 \
OpenMPI/4.1.2 \
CUDA/11.5 \
CMake/3.21.1 \
imkl/2021.4.0 \
HDF5/1.12.1 \
Boost/1.78.0
I will run another test with the following stack on Juwels Booster (to be better able to compare):
module load Stages/2020
module load GCC/10.3.0 \
OpenMPI/4.1.1 \
CUDA/11.3 \
CMake/3.18.0 \
imkl/2021.2.0 \
HDF5/1.10.6 \
Boost/1.74.0
Also with this stack the solver ends up diverging:
$ grep GCR log_mg_divergence_gcc_10_3_iwa_1.745-csw1.7112-L48_mu0.00077-kappa0.140008_7438245.out
GCR: Convergence at 41 iterations, L2 relative residual: iterated = 2.311290e-11, true = 2.311290e-11 (requested = 3.162278e-11)
GCR: Convergence at 90 iterations, L2 relative residual: iterated = 9.769683e-11, true = 9.769683e-11 (requested = 1.000000e-10)
GCR: Convergence at 87 iterations, L2 relative residual: iterated = 9.254241e-11, true = 9.254241e-11 (requested = 1.000000e-10)
GCR: Convergence at 40 iterations, L2 relative residual: iterated = 2.582855e-10, true = 2.582855e-10 (requested = 3.162278e-10)
GCR: Convergence at 39 iterations, L2 relative residual: iterated = 2.299842e-10, true = 2.299842e-10 (requested = 3.162278e-10)
GCR: Convergence at 41 iterations, L2 relative residual: iterated = 2.127263e-10, true = 2.127263e-10 (requested = 3.162278e-10)
GCR: Convergence at 43 iterations, L2 relative residual: iterated = 2.253587e-10, true = 2.253587e-10 (requested = 3.162278e-10)
GCR: Convergence at 41 iterations, L2 relative residual: iterated = 2.028372e-10, true = 2.028372e-10 (requested = 3.162278e-10)
GCR: Convergence at 44 iterations, L2 relative residual: iterated = 3.118630e-10, true = 3.118630e-10 (requested = 3.162278e-10)
GCR: Convergence at 47 iterations, L2 relative residual: iterated = 2.777156e-10, true = 2.777156e-10 (requested = 3.162278e-10)
GCR: Convergence at 47 iterations, L2 relative residual: iterated = 2.486038e-10, true = 2.486038e-10 (requested = 3.162278e-10)
GCR: Convergence at 400 iterations, L2 relative residual: iterated = 2.516956e-06, true = 2.516956e-06 (requested = 1.000000e-10)
GCR: Convergence at 400 iterations, L2 relative residual: iterated = 8.146808e-10, true = 8.146808e-10 (requested = 1.000000e-10)
To be specific, I give the configuration scripts for QUDA and tmLQCD below, first for Meluxina and then for Juwels Booster.
#!/bin/bash
source ../load_modules.sh
CXXFLAGS="-O2 -march=znver2 -mtune=znver2 -mavx2 -mfma" \
CFLAGS="-O2 -march=znver2 -mtune=znver2 -mavx2 -mfma" \
cmake \
-DCMAKE_CXX_COMPILER=mpicxx \
-DCMAKE_C_COMPILER=mpicc \
-DCMAKE_INSTALL_PREFIX="$(pwd)/install_dir" \
-DCMAKE_BUILD_TYPE=RELEASE \
-DQUDA_BUILD_ALL_TESTS=OFF \
-DQUDA_GPU_ARCH=sm_80 \
-DQUDA_INTERFACE_QDP=ON \
-DQUDA_INTERFACE_MILC=OFF \
-DQUDA_MPI=ON \
-DQUDA_DIRAC_WILSON=ON \
-DQUDA_DIRAC_TWISTED_MASS=ON \
-DQUDA_DIRAC_TWISTED_CLOVER=ON \
-DQUDA_DIRAC_NDEG_TWISTED_CLOVER=ON \
-DQUDA_DIRAC_NDEG_TWISTED_MASS=ON \
-DQUDA_DIRAC_CLOVER=ON \
-DQUDA_DIRAC_STAGGERED=OFF \
-DQUDA_MULTIGRID=ON \
-DQUDA_QMP=OFF \
-DQUDA_QIO=OFF \
-DQUDA_DOWNLOAD_USQCD=ON \
${HOME}/code/quda-develop-0a31b227
quda_dir=$(pwd)/../quda-develop-0a31b227/install_dir
source ../load_modules.sh
CC=mpicc CXX=mpicxx F77=f77 \
CFLAGS="-mtune=znver2 -march=znver2 -O3 -mavx2 -mfma -fopenmp -m64" \
CXXFLAGS="-mtune=znver2 -march=znver2 -O3 -mavx2 -mfma -fopenmp -m64" \
LDFLAGS="-fopenmp" \
~/code/tmLQCD-quda_work_HB_solver-4f02cc56/configure \
--enable-quda_experimental \
--enable-mpi \
--enable-omp \
--with-mpidimension=4 \
--disable-sse2 --disable-sse3 \
--with-qudadir=${quda_dir} \
--with-cudadir=${CUDA_ROOT} \
--with-limedir=$(pwd)/../lime/install_dir \
--with-lemondir=$(pwd)/../lemon/install_dir \
--with-lapack="-lopenblas"
#!/bin/bash
source ../load_modules.sh
CXXFLAGS="-O2 -march=znver2 -mtune=znver2 -mavx2 -mfma" \
CFLAGS="-O2 -march=znver2 -mtune=znver2 -mavx2 -mfma" \
cmake \
-DCMAKE_INSTALL_PREFIX="$(pwd)/install_dir" \
-DCMAKE_BUILD_TYPE=RELEASE \
-DQUDA_BUILD_ALL_TESTS=OFF \
-DQUDA_GPU_ARCH=sm_80 \
-DQUDA_INTERFACE_QDP=ON \
-DQUDA_INTERFACE_MILC=OFF \
-DQUDA_MPI=ON \
-DQUDA_DIRAC_WILSON=ON \
-DQUDA_DIRAC_TWISTED_MASS=ON \
-DQUDA_DIRAC_TWISTED_CLOVER=ON \
-DQUDA_DIRAC_NDEG_TWISTED_CLOVER=ON \
-DQUDA_DIRAC_NDEG_TWISTED_MASS=ON \
-DQUDA_DIRAC_CLOVER=ON \
-DQUDA_DIRAC_STAGGERED=OFF \
-DQUDA_MULTIGRID=ON \
-DQUDA_QMP=OFF \
-DQUDA_QIO=OFF \
-DQUDA_DOWNLOAD_USQCD=ON \
${HOME}/code/quda-develop-0a31b227
quda_dir=$(pwd)/../quda-develop-0a31b227/install_dir
source ../load_modules.sh
CC=mpicc CXX=mpicxx F77=f77 \
CFLAGS="-mtune=znver2 -march=znver2 -O3 -mavx2 -mfma -fopenmp -m64 -I${MKLROOT}/include" \
CXXFLAGS="-mtune=znver2 -march=znver2 -O3 -mavx2 -mfma -fopenmp -m64 -I${MKLROOT}/include" \
LDFLAGS="-fopenmp" \
~/code/tmLQCD-quda_work_HB_solver-4f02cc56/configure \
--enable-quda_experimental \
--enable-mpi \
--enable-omp \
--with-mpidimension=4 \
--disable-sse2 --disable-sse3 \
--with-cudadir=$EBROOTCUDA \
--with-qudadir=${quda_dir} \
--with-limedir=$(pwd)/../lime/install_dir \
--with-lemondir=$(pwd)/../lemon/install_dir \
--with-lapack="-Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_lp64.a ${MKLROOT}/lib/intel64/libmkl_gnu_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a -Wl,--end-group -lgomp -lpthread -lm -ldl"
To me these look very similar, apart from the fact that I link against MKL on Juwels Booster (out of convenience, really: LAPACK is not used in production, it's just required for compilation).
On Juwels Booster I don't specify
-DCMAKE_CXX_COMPILER=mpicxx \
-DCMAKE_C_COMPILER=mpicc \
during QUDA compilation. I will also test this difference.
Update: this makes no difference, as expected.
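For what it's worth, the comparison of the two QUDA cmake invocations can also be done mechanically instead of by eye. The sketch below is only an illustration: the helper names are hypothetical, and the two strings are shortened excerpts of the scripts quoted above rather than the full files.

```python
import re


def cmake_defines(script_text):
    """Collect -DNAME=VALUE options from a configure script as a dict."""
    return dict(re.findall(r"-D(\w+)=(\S+)", script_text))


def diff_defines(a, b):
    """Return {name: (value_a, value_b)} for options that differ.

    None marks an option that is absent from one of the scripts.
    """
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


# Shortened excerpts of the two QUDA cmake invocations quoted above.
meluxina = r"""
cmake \
-DCMAKE_CXX_COMPILER=mpicxx \
-DCMAKE_C_COMPILER=mpicc \
-DCMAKE_BUILD_TYPE=RELEASE \
-DQUDA_GPU_ARCH=sm_80 \
-DQUDA_MPI=ON \
-DQUDA_MULTIGRID=ON
"""
juwels_booster = r"""
cmake \
-DCMAKE_BUILD_TYPE=RELEASE \
-DQUDA_GPU_ARCH=sm_80 \
-DQUDA_MPI=ON \
-DQUDA_MULTIGRID=ON
"""

differences = diff_defines(cmake_defines(meluxina), cmake_defines(juwels_booster))
for name, (mel, juw) in differences.items():
    print(f"{name}: Meluxina={mel}, Juwels Booster={juw}")
```

On these excerpts only the two compiler options show up, which matches the difference described above.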
@simone-romiti I did some more testing. To be precise, I took a conf.save from your kappa=0.1400080 run corresponding to trajectory nr = 1157 and the nstore_counter (to make sure that the RNG is initialised the same on both Meluxina and Juwels Booster) and then I tried running from there for a number of trajectories. I used the solver configuration noted above in https://github.com/etmc/tmLQCD/issues/557#issue-1629526412
On Juwels Booster, with all the software stacks listed above, the run immediately hit the MG divergence either at the first inversion or shortly thereafter.
Instead, on Meluxina, it ran through fine for tens of trajectories. Looking at the iteration counts, I see a spectrum of between … (kappa=0.1400080) and … (kappa=0.1400086) on Meluxina, and on Juwels Booster, extracted from your log files, … (kappa=0.1400080), where 400 or 500 of course corresponds to having hit the diverging solver.
Illustrating this (resampling to have the same total count of solver calls for all three data sets) and plotting a count histogram with a logarithmic y scale:
It becomes quite clear that kappa=0.1400080 and kappa=0.1400086 running on Meluxina behave similarly (although the statistics for the kappa=0.1400080 run are low), while the Juwels Booster run of kappa=0.1400080 behaves completely differently, with a much larger spread in iteration counts but, strangely, also lower iteration counts for a large number of solves.
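In case someone wants to redo this comparison, here is a rough sketch of the resample-then-histogram step. It assumes sampling with replacement to a common size and a bin width of 10 iterations, neither of which is necessarily what was used for the plots; the function names are made up, and the two short lists are just the iteration counts from the first few GCR lines of the logs quoted above.

```python
import random
from collections import Counter


def resample_to(counts, n, seed=0):
    """Draw n iteration counts with replacement from one data set."""
    rng = random.Random(seed)
    return rng.choices(counts, k=n)


def count_histogram(counts, bin_width=10):
    """Map the lower edge of each bin to the number of solves in it."""
    return Counter((c // bin_width) * bin_width for c in counts)


# Iteration counts taken from the GCR lines quoted earlier in this thread.
meluxina = [36, 43, 50, 32, 33, 33, 34]
juwels_booster = [41, 90, 87, 40, 39, 41, 43, 41, 44, 47, 47, 400, 400]

# Bring both data sets to the same total number of solver calls.
n = 1000
for label, data in (("Meluxina", meluxina), ("Juwels Booster", juwels_booster)):
    hist = count_histogram(resample_to(data, n))
    print(label, sorted(hist.items()))
```

The counts per bin can then be plotted with a logarithmic y axis so that the rare solves at the iteration limit remain visible next to the bulk.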
I have a strong feeling that there is something wrong on Juwels Booster, but I don't know what. Maybe QUDA's correctness tests (dslash_test & co.) can shed some light on what's going on.
For completeness the same plot on a non-log scale:
@simone-romiti it seems that moving to commit bb6aac0 of QUDA's develop branch resolves the problems on Juwels Booster. I don't understand why, and I haven't had the time to check the differences between what we currently use (0a31b227) and bb6aac0.
As an added bonus you can try the following MG setup:
MGCoarseMuFactor = 2.25, 2.5, 105.0
MGCoarseMaxSolverIterations = 10, 10, 15
MGCoarseSolverTolerance = 0.15, 0.35, 0.25
MGSmootherPostIterations = 1, 5, 1
MGSmootherPreIterations = 1, 0, 0
MGSmootherTolerance = 0.15, 0.10, 0.10
MGOverUnderRelaxationFactor = 0.85, 0.85, 1.00
which is what the auto-tuner produced tuning on conf.0040 to conf.0110 in steps of 10 configs, which, given Nsave = 10, should correspond to 700 trajectories and thus should result in a stable solver. Not sure if the MGCoarseMuFactor = 2.25 on the fine grid makes any sense but oh well.
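To make the bookkeeping explicit, here is a small sketch that splits the comma-separated lists into one value per MG level (first entry taken as the fine grid, as assumed in the remark about MGCoarseMuFactor) and redoes the trajectory count. The helper names are hypothetical and this is not how tmLQCD itself reads its input file.

```python
def parse_mg_setup(text):
    """Parse 'Key = v0, v1, v2' lines into {key: [v0, v1, v2]}."""
    params = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, values = line.split("=", 1)
        params[key.strip()] = [float(v) for v in values.split(",")]
    return params


def trajectories_spanned(first_conf, last_conf, nsave):
    """Trajectories between two stored configurations, given Nsave."""
    return (last_conf - first_conf) * nsave


# The MG setup quoted above.
setup = """
MGCoarseMuFactor = 2.25, 2.5, 105.0
MGCoarseMaxSolverIterations = 10, 10, 15
MGCoarseSolverTolerance = 0.15, 0.35, 0.25
MGSmootherPostIterations = 1, 5, 1
MGSmootherPreIterations = 1, 0, 0
MGSmootherTolerance = 0.15, 0.10, 0.10
MGOverUnderRelaxationFactor = 0.85, 0.85, 1.00
"""

params = parse_mg_setup(setup)
for level in range(3):
    print(f"level {level}: mu factor = {params['MGCoarseMuFactor'][level]}")

# conf.0040 to conf.0110 with Nsave = 10
print(trajectories_spanned(40, 110, 10))
```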
It's unlikely that we will be able to further investigate this.
@simone-romiti observed in his run on 6 Juwels Booster nodes on a 48c96 lattice at the physical point (amu=0.00077) at kappa=0.1400080 that the MG diverges rather frequently after being refreshed. We've observed this behaviour in other runs and were often able to suppress it by employing a slightly different setup or more aggressive MG refreshes (doing more setup iterations during refresh), but the situation was never fully clear.
By contrast, a similar run at kappa=0.1400086 has been running correctly on Meluxina for hundreds of trajectories with very stable iteration counts.
The setup is the same in both cases, with the only difference being the value of MGSetup2KappaMu (of course).
For the plot below I've extracted the GCR iterations from all log files of the run on Juwels Booster and the one on Meluxina and then sampled a subset of 30000 each (due to tikzDevice running into problems with more data points):
As one can see, the setup is much less stable running on Juwels Booster, but of course the kappa value is different so it might have to do with this.