etmc / tmLQCD

tmLQCD is a freely available software suite providing a set of tools to be used in lattice QCD simulations. It is mainly an HMC implementation (including PHMC and RHMC) for Wilson, Wilson Clover and Wilson twisted mass fermions, together with inverters for different versions of the Dirac operator. The code is fully parallelised and ships with optimisations for various modern architectures, such as commodity PC clusters and the Blue Gene family.
http://www.itkp.uni-bonn.de/~urbach/software.html
GNU General Public License v3.0

issues with coarse-grid deflated QUDA-MG which have appeared "recently" #553

Open kostrzewa opened 1 year ago

kostrzewa commented 1 year ago

As discussed this morning, I ran some tests with the coarse-grid deflated QUDA-MG and definitely found some issues that will require more investigation.

To be precise, I tested with a stack which works for the "coarse-mu-scaled" MG setup (which we use in the HMC and which I used to produce the pion/kaon/eta 2pt functions for Konstantin a few weeks ago).

For this setup, I load:

module load Stages/2022
module load GCC/11.2.0 \
  OpenMPI/4.1.2 \
  CUDA/11.5 \
  CMake/3.21.1 \
  imkl/2021.4.0 \
  HDF5/1.12.1 \
  Boost/1.78.0

and I employ QUDA develop at commit 0a31b227.

I get various kinds of failures when I attempt to use the coarse-grid deflated solver, ranging from

MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 24 (rank 2, host jwb0033.juwels, reduce_quda.cu:76 in void quda::blas::Reduce<Reducer, store_t, y_store_t, nSpin, coeff_t>::apply(const quda::qudaStream_t&) [with Reducer = quda::blas::Norm2; store_t = short int; y_store_t = short int; int nSpin = 4; coeff_t = double]())

when I disable all setup verification:

  MGRunVerify = no
  MGRunLowModeCheck = no
  MGRunObliqueProjectionCheck = no

to errors like

MG level 1 (GPU): Null vector Oblique Projections : ERROR: Orders 2 8 do not match  (/p/home/jusers/kostrzewa2/juwels/code/quda-develop-0a31b227/lib/restrictor.cu:263 in Restrict())
 (rank 2, host jwb0053.juwels, color_spinor_field.h:1005 in QudaFieldOrder quda::Order_(const char*, const char*, int, const quda::ColorSpinorField&, const quda::ColorSpinorField&)())
MG level 1 (GPU): Null vector Oblique Projections :        last kernel called was (name=N4quda4blas5Norm2IdfEE,volume=32x32x32x16,aux=GPU-offline,nParity=2,vol=524288,parity=2,precision=4,order=4,Ns=4,Nc=3,TwistFlavor=1)

when the Oblique Projection check is enabled.

Resolving this will take some time: I suspect that changes in QUDA now require additional MG parameters to be set in tmLQCD's QUDA interface.

kostrzewa commented 1 year ago

Baseline

As a baseline, let's take a setup which works and is reasonably efficient (but of course slower than the coarse-grid deflated one):

BeginExternalInverter QUDA
  MGCoarseMuFactor = 1.0, 1.0, 70.0
  MGNumberOfLevels = 3
  MGNumberOfVectors = 24, 32
  MGSetupSolver = cg
  MGSetupSolverTolerance = 5e-7, 5e-7
  MGSetupMaxSolverIterations = 1500, 1500
  MGSmootherType = cagcr, cagcr, cagcr
  MGSmootherTolerance = 0.2, 0.2, 0.2
  MGSmootherPreIterations = 0, 0, 0
  MGSmootherPostIterations = 4, 4, 4
  MGCoarseSolverType = gcr, gcr, cagcr
  MgCoarseSolverTolerance = 0.3, 0.2, 0.15
  MGCoarseMaxSolverIterations = 30, 30, 30
  MGBlockSizesX = 4,2
  MGBlockSizesY = 4,2
  MGBlockSizesZ = 4,2
  MGBlockSizesT = 4,2
  MGOverUnderRelaxationFactor = 0.95, 0.95, 0.95
  MGVerbosity = silent, silent, silent
  # MGVerbosity = summarize, summarize, summarize
  MGRunVerify = yes
  MGRunLowModeCheck = no
  MGRunObliqueProjectionCheck = no
EndExternalInverter

kostrzewa commented 1 year ago

Baseline with LowModeCheck (errors out)

adding

  MGRunLowModeCheck = yes

leads to

MG level 0 (GPU): Computing Y field......
MG level 0 (GPU): ....done computing Y field
MG level 0 (GPU): Computing Yhat field......
MG level 0 (GPU): ....done computing Yhat field
MG level 0 (GPU): Checking 0 = (1 - P P^\dagger) v_k for 24 vectors
MG level 0 (GPU): Checking 0 = (1 - P^\dagger P) eta_c
MG level 0 (GPU): Checking 0 = (D_c - P^\dagger D P) (native coarse operator to emulated operator)
MG level 0 (GPU): Checking Deo of preconditioned operator 0 = \hat{D}_c - A^{-1} D_c
MG level 0 (GPU): Checking Doe of preconditioned operator 0 = \hat{D}_c - A^{-1} D_c
MG level 0 (GPU): Checking normality of preconditioned operator
MG level 0 (GPU): Checking normality of residual operator
[...]
[jwb0001:24461:0:24461] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x44)
[jwb0001:24460:0:24460] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x44)
[jwb0001:24463:0:24463] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x44)
==== backtrace (tid:  24460) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000004f01c32 quda::MG::generateEigenVectors()  ???:0
 2 0x0000000004f02be6 quda::MG::verify()  multigrid.cpp:0
 3 0x0000000004f06d3a quda::MG::reset()  multigrid.cpp:0
 4 0x0000000004f090ad quda::MG::MG()  ???:0
 5 0x0000000004fcf28f quda::multigrid_solver::multigrid_solver()  ???:0
 6 0x0000000004fcfea8 newMultigridQuda()  ???:0
 7 0x0000000000577ebe _updateQudaMultigridPreconditioner()  ???:0
 8 0x0000000000578fe1 _setOneFlavourSolverParam()  ???:0
 9 0x000000000057916d invert_quda_direct()  ???:0

kostrzewa commented 1 year ago

Baseline with ObliqueProjectionCheck (errors out)

adding

  MGRunObliqueProjectionCheck = yes

leads to

MG level 0 (GPU): CG: Convergence at 352 iterations, L2 relative residual: iterated = 4.980119e-07, true = 4.980119e-07 (requested = 5.000000e-07)
MG level 0 (GPU): Computing Y field......
MG level 0 (GPU): ....done computing Y field
MG level 0 (GPU): Computing Yhat field......
MG level 0 (GPU): ....done computing Yhat field
MG level 0 (GPU): Checking 0 = (1 - P P^\dagger) v_k for 24 vectors
MG level 1 (GPU): Null vector Oblique Projections : Checking 1 > || (1 - DP(P^dagDP)P^dag) v_k || / || v_k || for 24 vectors
MG level 1 (GPU): Null vector Oblique Projections : ERROR: Orders 2 8 do not match  (/p/home/jusers/kostrzewa2/juwels/code/quda-develop-0a31b227/lib/restrictor.cu:263 in Restrict())
 (rank 0, host jwb0001.juwels, color_spinor_field.h:1005 in QudaFieldOrder quda::Order_(const char*, const char*, int, const quda::ColorSpinorField&, const quda::ColorSpinorField&)())
MG level 1 (GPU): Null vector Oblique Projections :        last kernel called was (name=N4quda4blas5Norm2IdfEE,volume=32x32x32x16,aux=GPU-offline,nParity=2,vol=524288,parity=2,precision=4,order=4,Ns=4,Nc=3,TwistFlavor=1)
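
The "last kernel called was (...)" lines are dense, so a small helper can make them easier to compare between runs. The following is a minimal sketch, not part of tmLQCD or QUDA: the function names `parse_last_kernel` and `nested_name` are hypothetical, and the name decoding only handles the leading nested-name part of an Itanium-mangled symbol (it deliberately ignores template arguments).

```python
import re

# Log line copied from the output above.
LINE = ("MG level 1 (GPU): Null vector Oblique Projections :        "
        "last kernel called was (name=N4quda4blas5Norm2IdfEE,"
        "volume=32x32x32x16,aux=GPU-offline,nParity=2,vol=524288,parity=2,"
        "precision=4,order=4,Ns=4,Nc=3,TwistFlavor=1)")


def parse_last_kernel(line):
    """Return the key=value fields of a 'last kernel called was (...)' line."""
    match = re.search(r"last kernel called was \((.*)\)\s*$", line)
    if match is None:
        return {}
    fields = {}
    for item in match.group(1).split(","):
        key, _, value = item.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def nested_name(mangled):
    """Decode only the leading N<len><id><len><id>... part of a mangled name."""
    if not mangled.startswith("N"):
        return mangled
    pos, parts = 1, []
    while pos < len(mangled) and mangled[pos].isdigit():
        end = pos
        while end < len(mangled) and mangled[end].isdigit():
            end += 1
        length = int(mangled[pos:end])
        parts.append(mangled[end:end + length])
        pos = end + length
    return "::".join(parts)


if __name__ == "__main__":
    fields = parse_last_kernel(LINE)
    print(nested_name(fields["name"]), fields["volume"], fields["Ns"], fields["Nc"])
```

For the line above, the kernel name decodes to quda::blas::Norm2, and the reported field has volume 32x32x32x16 with Ns=4 and Nc=3.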

kostrzewa commented 1 year ago

Coarse-grid deflation basic (errors out)

BeginExternalInverter QUDA
  MGCoarseMuFactor = 1.0, 1.0, 70.0
  MGNumberOfLevels = 3
  MGNumberOfVectors = 24, 32, 256
  MGSetupSolver = cg
  MGSetupSolverTolerance = 5e-7, 5e-7
  MGSetupMaxSolverIterations = 1500, 1500
  MGSmootherType = cagcr, cagcr, cagcr
  MGSmootherTolerance = 0.2, 0.2, 0.2
  MGSmootherPreIterations = 0, 0, 0
  MGSmootherPostIterations = 4, 4, 4
  MGCoarseSolverType = gcr, gcr, cagcr
  MgCoarseSolverTolerance = 0.3, 0.2, 0.15
  MGCoarseMaxSolverIterations = 30, 30, 30
  MGBlockSizesX = 4,2
  MGBlockSizesY = 4,2
  MGBlockSizesZ = 4,2
  MGBlockSizesT = 4,2
  MGOverUnderRelaxationFactor = 0.95, 0.95, 0.95
  MGVerbosity = summarize, summarize, summarize

  MGRunVerify = no
  MGRunLowModeCheck = no
  MGRunObliqueProjectionCheck = no

  MGUseEigSolver = no, no, yes
  MGEigSolverRequireConvergence = no, no, yes

  MGEigSolverType = tr_lanczos, tr_lanczos, tr_lanczos
  MGEigSolverSpectrum = smallest_real, smallest_real, smallest_real
  MGEigPreserveDeflationSubspace = yes
  MGEigSolverNumberOfVectors = 24, 32, 256
  MGEigSolverKrylovSubspaceSize = 24, 32, 384
  MGEigSolverMaxRestarts = 100, 100, 100
  MGEigSolverTolerance = 1e-4, 1e-4, 1e-4
  MGEigSolverUseNormOp  = no, no, no
  MGEigSolverUseDagger = no, no, no

  MGEigSolverUsePolynomialAcceleration = no, no, yes
  MGEigSolverPolynomialDegree = 100, 100, 200
  MGEigSolverPolyMin = 0.06, 0.06, 0.06
  MGEigSolverPolyMax = 8.0, 8.0, 8.0
  MGSetupCABasisType = chebyshev, chebyshev, chebyshev
  MGSetupCABasisSize = 8, 8, 8
  # MGSetupCABasisLambdaMin  ## default is okay
  # MGSetupCABasisLambdaMax  ## default is okay
  MGCoarseSolverCABasisType = chebyshev, chebyshev, chebyshev
  MGCoarseSolverCABasisSize = 8, 8, 8
  # MGCoarseSolverCABasisLambdaMin  ## default is okay
  # MGCoarseSolverCABasisLambdaMax  ## default is okay
EndExternalInverter

leads to

MG level 1 (GPU): CG: Convergence at 757 iterations, L2 relative residual: iterated = 4.966781e-07, true = 4.966781e-07 (requested = 5.000000e-07)
MG level 1 (GPU): Computing Y field......
MG level 1 (GPU): ....done computing Y field
MG level 1 (GPU): Computing Yhat field......
MG level 1 (GPU): ....done computing Yhat field
MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 0, host jwb0001.juwels, reduce_quda.cu:76 in void quda::blas::Reduce<Reducer, store_t, y_store_t, nSpin, coeff_t>::apply(const quda::qudaStream_t&) [with Reducer = quda::blas::Norm2; store_t = short int; y_store_t = short int; int nSpin = 4; coeff_t = double]())
MG level 2 (GPU):        last kernel called was (name=N4quda7RNGInitE,volume=2x4x4x2,aux=GPU-offline,vol=64,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1)

which is really quite weird.
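
When comparing input files like the two above, it can help to tabulate how many comma-separated entries each key carries (for example, MGNumberOfVectors has two entries in the baseline and three in the deflated setup). The following is a minimal sketch with hypothetical helper names `entry_counts` and `group_by_count`. It assumes that `#` starts a comment and that keys sit between the BeginExternalInverter and EndExternalInverter lines, and it makes no claim about how many entries tmLQCD or QUDA actually expect for any key.

```python
# Excerpt of the deflated setup above; the full block can be pasted instead.
SAMPLE = """\
BeginExternalInverter QUDA
  MGNumberOfLevels = 3
  MGNumberOfVectors = 24, 32, 256
  MGSetupSolverTolerance = 5e-7, 5e-7
  MGSmootherType = cagcr, cagcr, cagcr
  MGBlockSizesX = 4,2
  # MGSetupCABasisLambdaMin  ## default is okay
  MGEigPreserveDeflationSubspace = yes
EndExternalInverter
"""


def entry_counts(text):
    """Map each key inside the ExternalInverter block to its number of entries."""
    counts = {}
    inside = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if line.startswith("BeginExternalInverter"):
            inside = True
        elif line.startswith("EndExternalInverter"):
            inside = False
        elif inside and "=" in line:
            key, _, value = line.partition("=")
            counts[key.strip()] = len([v for v in value.split(",") if v.strip()])
    return counts


def group_by_count(counts):
    """Group keys by their number of entries, preserving file order."""
    groups = {}
    for key, n in counts.items():
        groups.setdefault(n, []).append(key)
    return groups


if __name__ == "__main__":
    for n, keys in sorted(group_by_count(entry_counts(SAMPLE)).items()):
        print(n, ", ".join(keys))
```

Running the same tabulation on a known-working input file and on a failing one gives a quick way to spot keys whose list lengths differ between the two.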

pittlerf commented 1 year ago

Hi Bartek,

I tried this QUDA commit in PLEGMA with the following modules loaded on the booster:

module load Stages/2022
module load GCC/11.2.0 \
  OpenMPI/4.1.2 \
  CUDA/11.5 \
  CMake/3.21.1 \
  imkl/2021.4.0 \
  HDF5/1.12.1 Python
# we don't actually use BLAS / LAPACK for anything productive and link against MKL because that's what's easily available at JSC. OpenBLAS or plain BLAS / LAPACK would work fine too.
export LD_LIBRARY_PATH=/p/project/pines/fpittler/code/OpenBLAS/OpenBLAS/build/lib/libopenblas.a:$LD_LIBRARY_PATH

I am doing a two-level MG:

MG_LEVELS=2
MG_N_VEC="0 24"
MG_BLK_SZE_0="0 4 4 4 4 "
MG_MU_FACTOR_0="1 1.0"

MG_COARSE_SOLVER_1='1 gcr'
MG_COARSE_TOL_1='1 0.22'
MG_COARSE_MAXITER_1='1 10'

MG_SMOOTHER_TOL='0 0.25 1 0.25'
--Q-mg-eig-nKr 1 384 \
--Q-mg-eig-nEv 1 256 \
--Q-mg-eig-nConv 1 256 \
--Q-mg-eig 0 false 1 true \
--Q-mg-preserve-deflation true

on a 24**3 test lattice. For me the first few UP inversions actually work, but when I do an updateMultiGridSetup I get the following error:

MG level 1 (GPU): ERROR: Requesting 256 eigenvalues with only storage allocated for 0 (rank 0, host jwb0129.juwels, eigensolve_quda.cpp:580 in void quda::EigenSolver::computeEvals(std::vector<quda::ColorSpinorField>&, std::vector<std::complex<double> >&, int)())

which is not the case when I use a previous QUDA commit such as e2415ef161f06f439d5aa46bd1385ad14ea65105 from the develop branch.

kostrzewa commented 1 year ago

Hi Ferenc,

thanks for this update. When you say "this commit", do you mean commit 0a31b227 (QUDA develop) that I mentioned above?

As for the problem that you see, this is something that I encountered previously (https://github.com/lattice/quda/issues/929), and I thought it had been fixed at the time via https://github.com/lattice/quda/pull/930.

pittlerf commented 1 year ago

Hi Bartek,

Thank you very much. Yes, I checked with the same commit as you, 0a31b227 (QUDA develop). I will check lattice/quda#930.

Cheers,
Ferenc