kostrzewa opened this issue 1 year ago
As a baseline, let's take a setup which works and is reasonably efficient (but of course slower than the coarse-grid deflated one):
BeginExternalInverter QUDA
MGCoarseMuFactor = 1.0, 1.0, 70.0
MGNumberOfLevels = 3
MGNumberOfVectors = 24, 32
MGSetupSolver = cg
MGSetupSolverTolerance = 5e-7, 5e-7
MGSetupMaxSolverIterations = 1500, 1500
MGSmootherType = cagcr, cagcr, cagcr
MGSmootherTolerance = 0.2, 0.2, 0.2
MGSmootherPreIterations = 0, 0, 0
MGSmootherPostIterations = 4, 4, 4
MGCoarseSolverType = gcr, gcr, cagcr
MGCoarseSolverTolerance = 0.3, 0.2, 0.15
MGCoarseMaxSolverIterations = 30, 30, 30
MGBlockSizesX = 4,2
MGBlockSizesY = 4,2
MGBlockSizesZ = 4,2
MGBlockSizesT = 4,2
MGOverUnderRelaxationFactor = 0.95, 0.95, 0.95
MGVerbosity = silent, silent, silent
# MGVerbosity = summarize, summarize, summarize
MGRunVerify = yes
MGRunLowModeCheck = no
MGRunObliqueProjectionCheck = no
EndExternalInverter
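For reference, these input parameters are ultimately copied into QUDA's `QudaMultigridParam`. Below is a rough sketch of the mapping for a few of the entries above, assuming the field names from `quda.h`; it is an illustration only, not tmLQCD's actual `quda_interface.c` code.

```c
#include <quda.h>

/* Rough sketch only: how a few of the input-file entries above map onto
 * QUDA's multigrid parameter struct (field names from quda.h). tmLQCD's
 * real quda_interface.c does considerably more than this. */
void sketch_fill_mg_param(QudaMultigridParam *mg_param)
{
  *mg_param = newQudaMultigridParam();

  mg_param->n_level = 3;                 /* MGNumberOfLevels */

  /* per-level arrays run from the finest level (0) to the coarsest */
  mg_param->n_vec[0] = 24;               /* MGNumberOfVectors */
  mg_param->n_vec[1] = 32;

  mg_param->mu_factor[0] = 1.0;          /* MGCoarseMuFactor: the twisted */
  mg_param->mu_factor[1] = 1.0;          /* mass term is scaled by 70 on  */
  mg_param->mu_factor[2] = 70.0;         /* the coarsest level            */

  for (int l = 0; l < 3; l++) {
    mg_param->smoother[l] = QUDA_CA_GCR_INVERTER; /* MGSmootherType      */
    mg_param->smoother_tol[l] = 0.2;              /* MGSmootherTolerance */
    mg_param->nu_pre[l] = 0;                      /* MGSmootherPreIterations  */
    mg_param->nu_post[l] = 4;                     /* MGSmootherPostIterations */
    mg_param->omega[l] = 0.95;                    /* MGOverUnderRelaxationFactor */
  }

  /* MGBlockSizes{X,Y,Z,T}: 4^4 blocking on level 0, 2^4 on level 1 */
  for (int d = 0; d < 4; d++) {
    mg_param->geo_block_size[0][d] = 4;
    mg_param->geo_block_size[1][d] = 2;
  }

  mg_param->run_verify = QUDA_BOOLEAN_TRUE;       /* MGRunVerify */
}
```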
Adding
MGRunLowModeCheck = yes
to the baseline above leads to
MG level 0 (GPU): Computing Y field......
MG level 0 (GPU): ....done computing Y field
MG level 0 (GPU): Computing Yhat field......
MG level 0 (GPU): ....done computing Yhat field
MG level 0 (GPU): Checking 0 = (1 - P P^\dagger) v_k for 24 vectors
MG level 0 (GPU): Checking 0 = (1 - P^\dagger P) eta_c
MG level 0 (GPU): Checking 0 = (D_c - P^\dagger D P) (native coarse operator to emulated operator)
MG level 0 (GPU): Checking Deo of preconditioned operator 0 = \hat{D}_c - A^{-1} D_c
MG level 0 (GPU): Checking Doe of preconditioned operator 0 = \hat{D}_c - A^{-1} D_c
MG level 0 (GPU): Checking normality of preconditioned operator
MG level 0 (GPU): Checking normality of residual operator
[...]
[jwb0001:24461:0:24461] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x44)
[jwb0001:24460:0:24460] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x44)
[jwb0001:24463:0:24463] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x44)
==== backtrace (tid: 24460) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000004f01c32 quda::MG::generateEigenVectors() ???:0
2 0x0000000004f02be6 quda::MG::verify() multigrid.cpp:0
3 0x0000000004f06d3a quda::MG::reset() multigrid.cpp:0
4 0x0000000004f090ad quda::MG::MG() ???:0
5 0x0000000004fcf28f quda::multigrid_solver::multigrid_solver() ???:0
6 0x0000000004fcfea8 newMultigridQuda() ???:0
7 0x0000000000577ebe _updateQudaMultigridPreconditioner() ???:0
8 0x0000000000578fe1 _setOneFlavourSolverParam() ???:0
9 0x000000000057916d invert_quda_direct() ???:0
The fault address 0x44 suggests a near-null dereference inside quda::MG::generateEigenVectors(), i.e. eigensolver state that is never allocated when only the low-mode check is requested. Adding
MGRunObliqueProjectionCheck = yes
instead leads to
MG level 0 (GPU): CG: Convergence at 352 iterations, L2 relative residual: iterated = 4.980119e-07, true = 4.980119e-07 (requested = 5.000000e-07)
MG level 0 (GPU): Computing Y field......
MG level 0 (GPU): ....done computing Y field
MG level 0 (GPU): Computing Yhat field......
MG level 0 (GPU): ....done computing Yhat field
MG level 0 (GPU): Checking 0 = (1 - P P^\dagger) v_k for 24 vectors
MG level 1 (GPU): Null vector Oblique Projections : Checking 1 > || (1 - DP(P^dagDP)P^dag) v_k || / || v_k || for 24 vectors
MG level 1 (GPU): Null vector Oblique Projections : ERROR: Orders 2 8 do not match (/p/home/jusers/kostrzewa2/juwels/code/quda-develop-0a31b227/lib/restrictor.cu:263 in Restrict())
(rank 0, host jwb0001.juwels, color_spinor_field.h:1005 in QudaFieldOrder quda::Order_(const char*, const char*, int, const quda::ColorSpinorField&, const quda::ColorSpinorField&)())
MG level 1 (GPU): Null vector Oblique Projections : last kernel called was (name=N4quda4blas5Norm2IdfEE,volume=32x32x32x16,aux=GPU-offline,nParity=2,vol=524288,parity=2,precision=4,order=4,Ns=4,Nc=3,TwistFlavor=1)
Finally, using the full coarse-grid deflated setup
BeginExternalInverter QUDA
MGCoarseMuFactor = 1.0, 1.0, 70.0
MGNumberOfLevels = 3
MGNumberOfVectors = 24, 32, 256
MGSetupSolver = cg
MGSetupSolverTolerance = 5e-7, 5e-7
MGSetupMaxSolverIterations = 1500, 1500
MGSmootherType = cagcr, cagcr, cagcr
MGSmootherTolerance = 0.2, 0.2, 0.2
MGSmootherPreIterations = 0, 0, 0
MGSmootherPostIterations = 4, 4, 4
MGCoarseSolverType = gcr, gcr, cagcr
MGCoarseSolverTolerance = 0.3, 0.2, 0.15
MGCoarseMaxSolverIterations = 30, 30, 30
MGBlockSizesX = 4,2
MGBlockSizesY = 4,2
MGBlockSizesZ = 4,2
MGBlockSizesT = 4,2
MGOverUnderRelaxationFactor = 0.95, 0.95, 0.95
MGVerbosity = summarize, summarize, summarize
MGRunVerify = no
MGRunLowModeCheck = no
MGRunObliqueProjectionCheck = no
MGUseEigSolver = no, no, yes
MGEigSolverRequireConvergence = no, no, yes
MGEigSolverType = tr_lanczos, tr_lanczos, tr_lanczos
MGEigSolverSpectrum = smallest_real, smallest_real, smallest_real
MGEigPreserveDeflationSubspace = yes
MGEigSolverNumberOfVectors = 24, 32, 256
MGEigSolverKrylovSubspaceSize = 24, 32, 384
MGEigSolverMaxRestarts = 100, 100, 100
MGEigSolverTolerance = 1e-4, 1e-4, 1e-4
MGEigSolverUseNormOp = no, no, no
MGEigSolverUseDagger = no, no, no
MGEigSolverUsePolynomialAcceleration = no, no, yes
MGEigSolverPolynomialDegree = 100, 100, 200
MGEigSolverPolyMin = 0.06, 0.06, 0.06
MGEigSolverPolyMax = 8.0, 8.0, 8.0
MGSetupCABasisType = chebyshev, chebyshev, chebyshev
MGSetupCABasisSize = 8, 8, 8
# MGSetupCABasisLambdaMin ## default is okay
# MGSetupCABasisLambdaMax ## default is okay
MGCoarseSolverCABasisType = chebyshev, chebyshev, chebyshev
MGCoarseSolverCABasisSize = 8, 8, 8
# MGCoarseSolverCABasisLambdaMin ## default is okay
# MGCoarseSolverCABasisLambdaMax ## default is okay
EndExternalInverter
leads to
MG level 1 (GPU): CG: Convergence at 757 iterations, L2 relative residual: iterated = 4.966781e-07, true = 4.966781e-07 (requested = 5.000000e-07)
MG level 1 (GPU): Computing Y field......
MG level 1 (GPU): ....done computing Y field
MG level 1 (GPU): Computing Yhat field......
MG level 1 (GPU): ....done computing Yhat field
MG level 2 (GPU): ERROR: site unroll not supported for nSpin = 2 nColor = 32 (rank 0, host jwb0001.juwels, reduce_quda.cu:76 in void quda::blas::Reduce<Reducer, store_t, y_store_t, nSpin, coeff_t>::apply(const quda::qudaStream_t&) [with Reducer = quda::blas::Norm2; store_t = short int; y_store_t = short int; int nSpin = 4; coeff_t = double]())
MG level 2 (GPU): last kernel called was (name=N4quda7RNGInitE,volume=2x4x4x2,aux=GPU-offline,vol=64,parity=1,precision=2,order=2,Ns=2,Nc=32,TwistFlavor=1)
which is really quite weird: the failing blas::Norm2 is a half-precision kernel instantiated for nSpin = 4, but it is being applied, right after RNG initialization, to the level-2 coarse field, which has nSpin = 2 and nColor = 32.
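A brief aside on the polynomial-acceleration parameters in the block above (MGEigSolverPolyMin/PolyMax and MGEigSolverPolynomialDegree), sketched in the standard Chebyshev construction, which QUDA's a_min/a_max parameters correspond to up to conventions: the Lanczos solver is applied not to the operator $A$ itself but to

$$
p(A) = T_k\!\left(\frac{2A - (a_{\max} + a_{\min})\,\mathbb{1}}{a_{\max} - a_{\min}}\right),
\qquad a_{\min} = 0.06,\; a_{\max} = 8.0,\; k = 200 .
$$

Eigenvalues inside $[a_{\min}, a_{\max}]$ are mapped into $[-1, 1]$, where $|T_k| \le 1$, while the wanted low modes below $a_{\min}$ are amplified exponentially in $k$, which is what makes computing 256 coarse-grid eigenvectors affordable.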
Hi Bartek,
I tried this QUDA commit in PLEGMA with the following modules loaded on the JUWELS Booster:
module load Stages/2022
module load GCC/11.2.0 \
  OpenMPI/4.1.2 \
  CUDA/11.5 \
  CMake/3.21.1 \
  imkl/2021.4.0 \
  HDF5/1.12.1 Python
# we don't actually use BLAS / LAPACK for anything productive and link against MKL because that's what's easily available at JSC. OpenBLAS or plain BLAS / LAPACK would work fine too.
export LD_LIBRARY_PATH=/p/project/pines/fpittler/code/OpenBLAS/OpenBLAS/build/lib/libopenblas.a:$LD_LIBRARY_PATH
I am doing a two-level MG
MG_LEVELS=2
MG_N_VEC="0 24"
MG_BLK_SZE_0="0 4 4 4 4"
MG_MU_FACTOR_0="1 1.0"
MG_COARSE_SOLVER_1='1 gcr'
MG_COARSE_TOL_1='1 0.22'
MG_COARSE_MAXITER_1='1 10'
MG_SMOOTHER_TOL='0 0.25 1 0.25'
--Q-mg-eig-nKr 1 384 \
--Q-mg-eig-nEv 1 256 \
--Q-mg-eig-nConv 1 256 \
--Q-mg-eig 0 false 1 true \
--Q-mg-preserve-deflation true
on a 24^3 test lattice (the flags nKr = 384 and nEv = nConv = 256 correspond to MGEigSolverKrylovSubspaceSize and MGEigSolverNumberOfVectors in the input-file syntax above). For me the first few up-quark inversions actually work, but when I do an updateMultiGridSetup I get the following error:
MG level 1 (GPU): ERROR: Requesting 256 eigenvalues with only storage allocated for 0 (rank 0, host jwb0129.juwels, eigensolve_quda.cpp:580 in void quda::EigenSolver::computeEvals(std::vector<quda::ColorSpinorField>&, std::vector<std::complex<double> >&, int)())
This is not the case when I use a previous QUDA commit like e2415ef161f06f439d5aa46bd1385ad14ea65105 from the develop branch.
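For context, here is the lifecycle around this failure, sketched against QUDA's C interface: newMultigridQuda, updateMultigridQuda and destroyMultigridQuda are the actual quda.h entry points, while the surrounding driver code is illustrative only and stands in for what PLEGMA/tmLQCD do.

```c
#include <quda.h>

/* Sketch of the multigrid lifecycle around the failure above.
 * newMultigridQuda / updateMultigridQuda / destroyMultigridQuda are the
 * real quda.h entry points; the surrounding code is illustrative only. */
void sketch_mg_lifecycle(QudaInvertParam *inv_param, QudaMultigridParam *mg_param)
{
  /* initial setup: runs the setup solves and builds the coarse operators */
  void *mg_preconditioner = newMultigridQuda(mg_param);
  inv_param->inv_type_precondition = QUDA_MG_INVERTER;
  inv_param->preconditioner = mg_preconditioner;

  /* the first few inversions work fine, e.g.
   *   invertQuda(spinor_out, spinor_in, inv_param); */

  /* after the gauge field changes, the setup is refreshed: this is where
   * "Requesting 256 eigenvalues with only storage allocated for 0" hits */
  updateMultigridQuda(mg_preconditioner, mg_param);

  destroyMultigridQuda(mg_preconditioner);
}
```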
Hi Ferenc,
thanks for this update. When you say "this commit", do you mean the 0a31b227 (QUDA develop) that I mentioned above?
As for the problem that you see, this is something that I encountered previously (https://github.com/lattice/quda/issues/929) and that I thought had been fixed at the time via https://github.com/lattice/quda/pull/930
Hi Bartek,
Thank you very much. Yes, I have checked with the same commit as you, 0a31b227 (QUDA develop). I will check lattice/quda#930. Cheers, Ferenc
As discussed this morning, I did some tests with the coarse-grid deflated QUDA-MG and definitely found some issues that will require more investigation.
To be precise, I tested with a stack which works for the "coarse-mu-scaled" MG setup (which we use in the HMC and which I used to produce the pion/kaon/eta 2pt functions for Konstantin a few weeks ago).
For this setup, I load that stack and employ the baseline solver parameters quoted at the top of this issue.
When I attempt to use the coarse-grid deflated solver, I get various kinds of failures, ranging from the "site unroll" error quoted above when I disable all setup verification, to segfaults and restrictor errors when the low-mode or oblique-projection checks are enabled.
Resolving this will require some time, as I think there may have been changes in QUDA that require additional MG parameters to be set on the side of tmLQCD's QUDA interface.
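To make that last point concrete, here is a sketch of the kind of per-level eigensolver wiring involved, written against the entry points and field names of QUDA's quda.h; it is an illustration, not tmLQCD's actual code, and whether this is the complete set of parameters a current QUDA develop expects is exactly what needs to be investigated. Any parameter that a newer QUDA expects but the interface never sets would silently keep its default, which is one plausible source of the failures above.

```c
#include <quda.h>

/* Illustration of per-level eigensolver wiring for coarse-grid deflation,
 * using the values from the input file above. Field names are from quda.h;
 * whether this is the complete set a current QUDA develop expects is
 * exactly what needs investigating. */
static QudaEigParam mg_eig_param[QUDA_MAX_MG_LEVEL];

void sketch_wire_eig_params(QudaMultigridParam *mg_param)
{
  const int coarsest = mg_param->n_level - 1;

  for (int l = 0; l < mg_param->n_level; l++) {
    mg_eig_param[l] = newQudaEigParam();
    mg_param->eig_param[l] = &mg_eig_param[l];
    /* MGUseEigSolver = no, no, yes: deflate only the coarsest level */
    mg_param->use_eig_solver[l] = (l == coarsest) ? QUDA_BOOLEAN_TRUE : QUDA_BOOLEAN_FALSE;
  }

  mg_eig_param[coarsest].eig_type = QUDA_EIG_TR_LANCZOS;  /* MGEigSolverType */
  mg_eig_param[coarsest].spectrum = QUDA_SPECTRUM_SR_EIG; /* smallest_real   */
  mg_eig_param[coarsest].n_conv = 256;  /* MGEigSolverNumberOfVectors; if this
                                         * ends up inconsistent with the storage
                                         * QUDA allocates, errors like the
                                         * "storage allocated for 0" one above
                                         * could result */
  mg_eig_param[coarsest].n_ev = 256;
  mg_eig_param[coarsest].n_kr = 384;    /* MGEigSolverKrylovSubspaceSize */
  mg_eig_param[coarsest].tol = 1e-4;    /* MGEigSolverTolerance */
  mg_eig_param[coarsest].max_restarts = 100;
  mg_eig_param[coarsest].require_convergence = QUDA_BOOLEAN_TRUE;
  mg_eig_param[coarsest].use_poly_acc = QUDA_BOOLEAN_TRUE;
  mg_eig_param[coarsest].poly_deg = 200;
  mg_eig_param[coarsest].a_min = 0.06;  /* MGEigSolverPolyMin */
  mg_eig_param[coarsest].a_max = 8.0;   /* MGEigSolverPolyMax */
  mg_eig_param[coarsest].preserve_deflation = QUDA_BOOLEAN_TRUE;
}
```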