heatherkellyucl opened this issue 1 year ago
Kai and I have been helping a user on Young [IN06562363] get a working GPU build of the latest Quantum ESPRESSO, and we also have a Myriad user wanting it [IN06570525]. As we have had to build it ourselves to work out how to make it work, it makes sense to make this a central install on both clusters.
The current latest version is 7.3.1, so I've updated the title.
My build, done on Young, is currently running part of the test suite in an interactive session on a Young GPU node with 1 GPU and 4 MPI processes. I'm running the tests using:
module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9
# To allow the test suite to run
module load python3/recommended
cd /qe-7.3.1-GitHub/test-suite
make run-tests-pw NPROCS=4 2>&1 | tee ../../run-tests-pw.log
I'm currently using the following to build Quantum ESPRESSO 7.3.1 on Young. Note that the build must be done on a GPU node, not on the login nodes:
module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9
cd ./qe-7.3.1-GitHub
./configure --prefix=XXX/quantum-espresso/7.3.1 --with-cuda=/shared/ucl/apps/nvhpc/2022_221/Linux_x86_64/22.1/cuda --with-cuda-runtime=11.7 --with-cuda-cc=80 --enable-openmp --with-cuda-mpi=yes
make all
make install
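As a quick sanity check that configure actually enabled GPU support (my suggestion, not part of the original build notes), the generated make.inc can be inspected before running make:
# Sanity check (suggestion): confirm configure enabled the CUDA build flags
grep -i cuda make.inc    # expect -D__CUDA among the DFLAGS and the CUDA paths set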
The test subset I was running has finally finished:
All done. ERROR: only 244 out of 246 tests passed (1 skipped).
Failed tests in:
/lustre/scratch/ccaabaa/Software/QuantumEspresso/qe-7.3.1-GitHub/test-suite/pw_workflow_exx_nscf/
Skipped test in:
/lustre/scratch/ccaabaa/Software/QuantumEspresso/qe-7.3.1-GitHub/test-suite/pw_metaGGA/
make: *** [run-tests-pw] Error 1
One failed test which needs to be investigated.
I now have a build script for installing into
/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/
and am running it on Young.
Build finished without errors so I have a job running as ccspapp to run the test suite on a GPU node using:
#$ -pe mpi 4
#$ -l gpu=1
module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9
# To allow the test suite to run
module load python3/recommended
export PATH=${ESPRESSO_ROOT}/bin:$PATH
cd $ESPRESSO_ROOT/test-suite
make run-tests NPROCS=$NSLOTS 2>&1 | tee /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
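(ESPRESSO_ROOT isn't defined in the excerpt above; judging by the install prefix and the test-suite paths quoted later in this thread, it presumably points at the q-e source tree, e.g.:)
# Presumed value, inferred from the install prefix used above
export ESPRESSO_ROOT=/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e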
Job is:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1343927 3.50000 QE-7.3.1_G ccspapp r 04/17/2024 17:03:51 Bran@node-x12t-006.ib.young.uc 4
Job running the test suite finished overnight. The following tests failed:
pw_workflow_exx_nscf - uspp-k-restart-1.in (arg(s): 1): **FAILED**.
Different sets of data extracted from benchmark and test.
Data only in benchmark: ef1, n1, band, e1.
pw_workflow_exx_nscf - uspp-k-restart-2.in (arg(s): 2): **FAILED**.
Different sets of data extracted from benchmark and test.
Data only in benchmark: ef1, n1, band, e1.
All done. ERROR: only 244 out of 246 tests passed (1 skipped).
Failed tests in:
/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/test-suite/pw_workflow_exx_nscf/
Skipped test in:
/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/test-suite/pw_metaGGA/
make: *** [run-tests-pw] Error 1
Unfortunately the failing pw tests stopped the other tests from starting. Investigating...
I'm getting:
GPU acceleration is ACTIVE. 1 visible GPUs per MPI rank
GPU-aware MPI enabled
Message from routine print_cuda_info:
High GPU oversubscription detected. Are you sure this is what you want?
for the failed tests.
I successfully ran the failed tests with 2 GPUs, so I modified the full test job to use 2 GPUs and resubmitted it.
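The amended resource request was presumably along these lines (a sketch: only the GPU count change is stated above, the slot count is assumed unchanged):
# Request 2 GPUs instead of 1 to avoid the oversubscription warning
#$ -pe mpi 4
#$ -l gpu=2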
So the test-suite job now runs the following test sets:
cd $ESPRESSO_ROOT/test-suite
# Run all the default set of tests - pw, cp, ph, epw, hp, tddfpt, kcw
make run-tests-pw NPROCS=$NSLOTS 2>&1 | tee /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-cp NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-ph NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-epw NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-hp NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-tddfpt NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-kcw NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
The job took just over two hours to run. All the pw tests passed, but some of the other tests failed and will need to be investigated. The log of the tests has been copied here:
/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/run-tests.log-18042024
I have submitted a longer example job running the pw.x command:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1345043 0.00000 QE-7.3.1_G ccaabaa qw 04/19/2024 10:12:43 8
which requests 4 GPUs and 8 MPI processes.
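The job script for that example was broadly of the following shape (a sketch only: the wallclock request, input file name and output redirection are placeholders, not taken from the actual job):
#!/bin/bash -l
# Hypothetical example job: 4 GPUs, 8 MPI processes running pw.x
#$ -l h_rt=12:0:0              # placeholder wallclock request
#$ -pe mpi 8
#$ -l gpu=4
#$ -cwd
module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9
# Put the new build on the PATH, as in the test-suite job above
export PATH=/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/bin:$PATH
# "my_calc.in" is a placeholder input file
mpirun -np $NSLOTS pw.x -in my_calc.in > my_calc.out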
My example job has run successfully so I'm going to make a module for this version and make it available on Young.
The module file is done and I've submitted a job to test that the module is correctly set up. Will check on Monday.
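For reference, a modulefile for a build like this mostly just needs to set the search path and record its compiler prerequisites; a minimal Tcl sketch along those lines (the real modulefile's contents aren't shown in this thread) would be:
#%Module1.0
## Hypothetical sketch of a modulefile for the GPU build, not the actual file
module-whatis "Quantum ESPRESSO 7.3.1, GPU build (NVIDIA HPC SDK 22.9)"
prereq gcc-libs/10.2.0
prereq compilers/nvidia/hpc-sdk/22.9
set root /shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e
prepend-path PATH $root/bin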
Test job worked with the module file so I've emailed the Young user (IN06562363) wanting this version.
Will now build the GPU version on Myriad.
Running:
module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9
cd /shared/ucl/apps/build_scripts
./quantum-espresso-7.3.1+git+GPU_install 2>&1 | tee ~/Scratch/Software/QuantumEspresso/quantum-espresso-7.3.1+git+GPU_install.log
on a Myriad A100 GPU node as ccspapp.
Myriad build failed with:
make[1]: Leaving directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/UtilXlib'
cd install ; make -f extlibs_makefile libcuda
make[1]: Entering directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/install'
initializing external/devxlib submodule ...
usage: git submodule [--quiet] add [-b <branch>] [-f|--force] [--name <name>] [--reference <repository>] [--] <repository> [<path>]
or: git submodule [--quiet] status [--cached] [--recursive] [--] [<path>...]
or: git submodule [--quiet] init [--] [<path>...]
or: git submodule [--quiet] deinit [-f|--force] [--] <path>...
or: git submodule [--quiet] update [--init] [--remote] [-N|--no-fetch] [-f|--force] [--rebase] [--reference <repository>] [--merge] [--recursive] [--] [<path>...]
or: git submodule [--quiet] summary [--cached|--files] [--summary-limit <n>] [commit] [--] [<path>...]
or: git submodule [--quiet] foreach [--recursive] <command>
or: git submodule [--quiet] sync [--recursive] [--] [<path>...]
make[1]: *** [libcuda_devxlib] Error 1
make[1]: Leaving directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/install'
make: *** [libcuda] Error 2
I may have fixed this problem. Re-running the build to see if it works.
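(The actual fix isn't recorded here; one plausible workaround for this kind of failure, where the old system git doesn't understand the options QE's makefile passes to git submodule, is to initialise the devxlib submodule by hand with plain submodule commands and then re-run the build:)
# Possible workaround (assumption, not the confirmed fix): fetch the devxlib
# submodule manually with options the old system git understands
cd /shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e
git submodule init external/devxlib
git submodule update external/devxlib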
The build now runs without errors on Myriad.
I'm now going to submit a job to run the test suite on Myriad:
qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
470426 3.21148 QE-7.3.1_G ccspapp qw 04/23/2024 14:01:16 2
I've done a test build of the CPU MPI variant in my Scratch on Kathleen and run the pw test on 4 cores:
export NSLOTS=4
make run-tests-pw NPROCS=$NSLOTS 2>&1 | tee ~/Scratch/Software/QuantumEspresso/run-tests.log
and got:
All done. 246 out of 246 tests passed (1 skipped).
Now sorting out the build script.
Build script for the CPU/MPI variant is done and running from ccspapp on Kathleen:
./quantum-espresso-7.3.1+git_install 2>&1 | tee ~/Software/QuantumESPRESSO/quantum-espresso-7.3.1+git_install.log
Submitted a longer GPU example job on Myriad: 4 A100 GPUs and 8 MPI processes.
My 4 A100 GPU, 8 MPI process example works.
I have informed the User who wanted the GPU version on Myriad.
The request for the CPU-only variant was from IN06568900, also for Young.
The default test-suite run of this variant on Kathleen has finished. I will now run the build script on Young.
Build of the CPU variant on Young has completed. Will run the tests tomorrow.
CPU variant job to run default test suite submitted on Young:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1356143 0.00000 QE-7.3.1_C ccspapp qw 04/26/2024 09:47:48 8
CPU tests ran successfully, so I'm producing the module file.
Module file done and pulled to Young and Kathleen. The user wanting the CPU variant has been informed.
Build of the CPU variant finished on Myriad late yesterday. Will now run the test suite.
CPU/MPI variant test suite job submitted on Myriad:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
571215 0.00000 QE-7.3.1_C ccspapp qw 04/30/2024 10:18:43 4
Made a mistake in my job script, so the test suite job is now:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
571256 0.00000 QE-7.3.1_C ccspapp qw 04/30/2024 10:27:57 4
The test suite job ran successfully on Myriad.
CPU variant now installed on Michael. Running an example on the AVX512 nodes.
Documentation page at:
https://www.rc.ucl.ac.uk/docs/Software_Guides/Other_Software/#quantum-espresso
updated.
Test job on Michael AVX512 nodes works.
Now installed on all clusters.
IN:06165073
Spack 0.20 has 7.1 with cuda variant available. (Might be a straightforward update to get it to build 7.2, might not).
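For reference, the Spack route would be something along the lines of the following (a sketch; whether additional variants or architecture settings are needed hasn't been checked):
# Sketch only: install the 7.1 release with the cuda variant from Spack 0.20
spack install quantum-espresso@7.1 +cuda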