Issues with double-half solver on Meluxina when running a 32c64 lattice on 2 nodes

The multi-shift solve fails to converge when refining shift 0:

MultiShiftCG: Converged after 23 iterations
MultiShiftCG:  shift=0, 23 iterations, relative residual: iterated = 2.278744e-05
MultiShiftCG:  shift=1, 23 iterations, relative residual: iterated = 1.219643e-09
MultiShiftCG:  shift=2, 12 iterations, relative residual: iterated = 3.585718e-10
MultiShiftCG:  shift=3, 5 iterations, relative residual: iterated = 1.435186e-09
# QUDA: Refining shift 0: L2 residual inf / 3.162278e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: WARNING: Exceeded maximum iterations 5000
# QUDA: CG: Convergence at 5000 iterations, L2 relative residual: iterated = 8.998338e-07, true = 8.998358e-07 (requested = 3.162278e-11)
# QUDA: Refining shift 1: L2 residual inf / 3.162278e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 11 iterations, L2 relative residual: iterated = 2.945487e-11, true = 2.945487e-11 (requested = 3.162278e-11)
# QUDA: Refining shift 2: L2 residual inf / 3.162278e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 5 iterations, L2 relative residual: iterated = 1.110539e-11, true = 1.110539e-11 (requested = 3.162278e-11)
# QUDA: Refining shift 3: L2 residual inf / 3.162278e-11, heavy quark 0.000000e+00 / 0.000000e+00 (actual / requested)
# QUDA: CG: Convergence at 2 iterations, L2 relative residual: iterated = 6.935647e-12, true = 6.935647e-12 (requested = 3.162278e-11)

and, in the cloverdetlight derivative solve:

# TM_QUDA: mu = 0.001500000000, kappa = 0.140064000000, csw = 1.740000000000
# TM_QUDA: Time for reorder_spinor_eo_toQuda 2.832787e-03 s level: 4 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/solve_degenerate/invert_eo_degenerate_quda/reorder_spinor_eo_toQuda
WARNING: Exceeded maximum iterations 1500
CG: Convergence at 1500 iterations, L2 relative residual: iterated = 4.685995e-08, true = 4.686000e-08 (requested = 1.000000e-08)

Neither failure occurs in other runs.
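
For longer job outputs, a small filter can list every solve that stopped short of its target. The following is only a minimal sketch: it assumes the final CG summary line has exactly the format shown in the logs above, and the name `find_unconverged` is hypothetical (it is not part of QUDA or tmLQCD).

```python
import re

# Pattern for the final CG summary line, assuming the format seen in the logs above.
CG_LINE = re.compile(
    r"CG: Convergence at (?P<iters>\d+) iterations, "
    r"L2 relative residual: iterated = (?P<iterated>[^,\s]+), "
    r"true = (?P<true>\S+) \(requested = (?P<requested>[^)]+)\)"
)


def find_unconverged(lines):
    """Return (line_no, iterations, true_residual, requested) for each solve
    whose true residual is larger than the requested one."""
    failures = []
    for line_no, line in enumerate(lines, start=1):
        match = CG_LINE.search(line)
        if match is None:
            continue  # not a CG summary line
        true_res = float(match.group("true"))
        requested = float(match.group("requested"))
        if true_res > requested:
            failures.append((line_no, int(match.group("iters")), true_res, requested))
    return failures
```

Passing an open log file (or any iterable of lines) to `find_unconverged` returns one tuple per failed solve. On the excerpts above it would flag the shift 0 refinement (5000 iterations) and the 1500-iteration CG solve, and skip the refinements of shifts 1 to 3.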

The build uses the following set of modules:

module load \
  OpenMPI/4.1.1-GCC-10.3.0 \
  CUDA/11.3.1 \
  CMake/3.20.4 \
  UCX-CUDA/1.10.0-GCCcore-10.3.0-CUDA-11.3.1 \
  flex/2.6.4-GCCcore-10.3.0 \
  OpenBLAS/0.3.15-GCC-10.3.0 \
  GCC/10.3.0 \
  Automake/1.16.3-GCCcore-10.3.0 \
  numactl

and with QUDA configured as:

CXXFLAGS="-O2 -march=znver2 -mtune=znver2 -mavx2 -mfma" \
CFLAGS="-O2 -march=znver2 -mtune=znver2 -mavx2 -mfma" \
cmake \
-DCMAKE_CXX_COMPILER=mpicxx \
-DCMAKE_C_COMPILER=mpicc \
-DCMAKE_INSTALL_PREFIX="$(pwd)/install_dir" \
-DCMAKE_BUILD_TYPE=RELEASE \
-DQUDA_BUILD_ALL_TESTS=OFF \
-DQUDA_GPU_ARCH=sm_80 \
-DQUDA_INTERFACE_QDP=ON \
-DQUDA_INTERFACE_MILC=OFF \
-DQUDA_MPI=ON \
-DQUDA_DIRAC_WILSON=ON \
-DQUDA_DIRAC_TWISTED_MASS=ON \
-DQUDA_DIRAC_TWISTED_CLOVER=ON \
-DQUDA_DIRAC_NDEG_TWISTED_CLOVER=ON \
-DQUDA_DIRAC_NDEG_TWISTED_MASS=ON \
-DQUDA_DIRAC_CLOVER=ON \
-DQUDA_DIRAC_STAGGERED=OFF \
-DQUDA_MULTIGRID=ON \
-DQUDA_QMP=OFF \
-DQUDA_QIO=OFF \
-DQUDA_DOWNLOAD_USQCD=ON \
${HOME}/code/quda-develop-0a31b227

and tmLQCD configured as:

CC=mpicc CXX=mpicxx F77=f77 \
CFLAGS="-mtune=znver2 -march=znver2 -O3 -mavx2 -mfma -fopenmp  -m64" \
CXXFLAGS="-mtune=znver2 -march=znver2 -O3 -mavx2 -mfma -fopenmp  -m64" \
LDFLAGS="-fopenmp" \
~/code/tmLQCD-quda_work-fbd6808c/configure \
  --enable-quda_experimental \
  --enable-mpi \
  --enable-omp \
  --with-mpidimension=4 \
  --disable-sse2 --disable-sse3 \
  --with-qudadir=${quda_dir} \
  --with-cudadir=${CUDA_ROOT} \
  --with-limedir=$(pwd)/../lime/install_dir \
  --with-lemondir=$(pwd)/../lemon/install_dir \
  --with-lapack="-lopenblas"

I've seen issues like this previously on Marconi 100.

etmc/tmLQCD issue #552