UCL-RITS / rcps-buildscripts

Scripts to automate package builds on RC Platforms
MIT License

Install Request: Quantum Espresso 7.3 GPU and CPU variants #551

Open heatherkellyucl opened 1 year ago

heatherkellyucl commented 1 year ago

IN:06165073

Recently on the Quantum Espresso mailing list a group posted impressive performance with the GPU version of the software.

They used the exact same GPUs that are available on the Young cluster. Would it be possible for you to compile the GPU-enabled 7.2 version of the software and make it available via module load?

Spack 0.20 has 7.1 with cuda variant available. (Might be a straightforward update to get it to build 7.2, might not).

balston commented 7 months ago

Kai and I have been helping a user on Young [IN06562363] get a working GPU build of the latest Quantum ESPRESSO, and we also have a Myriad user wanting it [IN06570525]. As we have had to build it ourselves to work out how to make it work, it makes sense to make this a central install on both clusters.

balston commented 7 months ago

Current latest version is 7.3.1 so updated the title.

balston commented 7 months ago

My build, done on Young, is currently running part of the test suite in an interactive session on a Young GPU node with 1 GPU and 4 MPI processes. I'm running the tests using:

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

# To allow the test suite to run
module load python3/recommended

cd /qe-7.3.1-GitHub/test-suite
make run-tests-pw NPROCS=4 2>&1  | tee ../../run-tests-pw.log
balston commented 7 months ago

I'm currently using the following to build Quantum ESPRESSO 7.3.1 on Young. Note that the build must be done on a GPU node, not on the login nodes:

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

cd ./qe-7.3.1-GitHub
./configure --prefix=XXX/quantum-espresso/7.3.1  --with-cuda=/shared/ucl/apps/nvhpc/2022_221/Linux_x86_64/22.1/cuda  --with-cuda-runtime=11.7 --with-cuda-cc=80 --enable-openmp --with-cuda-mpi=yes
make all
make install
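Since the build can't run on the login nodes, one way to drive it is a batch job on a GPU node. A minimal SGE job-script sketch wrapping the commands above (the `h_rt` value and `-pe` size are my assumptions, not from the thread):

```shell
#!/bin/bash -l
# Sketch of a build job for a Young GPU node.
# Resource values below are assumptions; adjust to site limits.
#$ -l gpu=1
#$ -l h_rt=4:00:00
#$ -pe mpi 4
#$ -cwd

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

cd ./qe-7.3.1-GitHub
./configure --prefix=XXX/quantum-espresso/7.3.1 \
    --with-cuda=/shared/ucl/apps/nvhpc/2022_221/Linux_x86_64/22.1/cuda \
    --with-cuda-runtime=11.7 --with-cuda-cc=80 \
    --enable-openmp --with-cuda-mpi=yes
make all && make install
```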
balston commented 7 months ago

The test subset I was running has finally finished:

All done. ERROR: only 244 out of 246 tests passed (1 skipped).
Failed tests in:
        /lustre/scratch/ccaabaa/Software/QuantumEspresso/qe-7.3.1-GitHub/test-suite/pw_workflow_exx_nscf/
Skipped test in:
        /lustre/scratch/ccaabaa/Software/QuantumEspresso/qe-7.3.1-GitHub/test-suite/pw_metaGGA/
make: *** [run-tests-pw] Error 1

One test failed and needs to be investigated.

balston commented 7 months ago

I now have a build script installing into

/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/

and am running it on Young now.

balston commented 7 months ago

Build finished without errors, so I have a job running as ccspapp to run the test suite on a GPU node using:

#$ -pe mpi 4
#$ -l gpu=1

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

# To allow the test suite to run
module load python3/recommended

export PATH=${ESPRESSO_ROOT}/bin:$PATH
cd $ESPRESSO_ROOT/test-suite
make run-tests NPROCS=$NSLOTS 2>&1 | tee /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log

Job is:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1343927 3.50000 QE-7.3.1_G ccspapp      r     04/17/2024 17:03:51 Bran@node-x12t-006.ib.young.uc     4
balston commented 7 months ago

Job running the test suite finished overnight. The following tests failed:

pw_workflow_exx_nscf - uspp-k-restart-1.in (arg(s): 1): **FAILED**.
Different sets of data extracted from benchmark and test.
    Data only in benchmark: ef1, n1, band, e1.

pw_workflow_exx_nscf - uspp-k-restart-2.in (arg(s): 2): **FAILED**.
Different sets of data extracted from benchmark and test.
    Data only in benchmark: ef1, n1, band, e1.

All done. ERROR: only 244 out of 246 tests passed (1 skipped).
Failed tests in:
        /lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/test-suite/pw_workflow_exx_nscf/
Skipped test in:
        /lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/test-suite/pw_metaGGA/
make: *** [run-tests-pw] Error 1

Unfortunately, the failing pw tests stopped the other tests from starting. Investigating...
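For scanning several of these logs, the headline counts can be pulled out of the summary line mechanically. A small sketch (the sed patterns are mine, not part of the QE tooling; the sample line is copied from the log above):

```shell
# Extract pass/total counts from a QE test-suite summary line.
summary='All done. ERROR: only 244 out of 246 tests passed (1 skipped).'
passed=$(echo "$summary" | sed -E 's/.*only ([0-9]+) out of.*/\1/')
total=$(echo "$summary" | sed -E 's/.* out of ([0-9]+) tests passed.*/\1/')
echo "${passed}/${total}"   # prints 244/246
```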

balston commented 7 months ago

I'm getting:

     GPU acceleration is ACTIVE.  1 visible GPUs per MPI rank
     GPU-aware MPI enabled

     Message from routine print_cuda_info:
     High GPU oversubscription detected. Are you sure this is what you want?

for the failed tests.

balston commented 7 months ago

I successfully ran the failed tests with 2 GPUs, so I modified the full test job to use 2 GPUs and resubmitted it.
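In job-script terms the fix is just the resource request; a sketch of the changed lines (everything else unchanged):

```shell
# Request 2 GPUs for 4 MPI ranks so the ranks aren't heavily
# oversubscribed onto a single device.
#$ -pe mpi 4
#$ -l gpu=2
```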

balston commented 7 months ago

So the job to run the test suite runs the following tests:

cd $ESPRESSO_ROOT/test-suite

# Run all the default set of tests - pw, cp, ph, epw, hp, tddfpt, kcw

make run-tests-pw NPROCS=$NSLOTS 2>&1 | tee /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-cp NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-ph NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-epw NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-hp NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-tddfpt NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
make run-tests-kcw NPROCS=$NSLOTS 2>&1 | tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
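Since the seven invocations differ only in the suite name, the same job body can be written as a loop; a sketch (resource requests and log path as above):

```shell
#!/bin/bash -l
# Sketch: iterate over the default QE test families, appending to one log.
#$ -pe mpi 4
#$ -l gpu=2

cd $ESPRESSO_ROOT/test-suite
for suite in pw cp ph epw hp tddfpt kcw; do
    make "run-tests-${suite}" NPROCS=$NSLOTS 2>&1 | \
        tee -a /home/ccspapp/Scratch/QE-7.3.1/GPU_tests/run-tests.log
done
```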

The job took just over two hours to run. All the pw tests passed, but some of the other tests failed and will need to be investigated. The log of the tests has been copied here:

/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/run-tests.log-18042024

I have submitted a longer example job running the pw.x command:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1345043 0.00000 QE-7.3.1_G ccaabaa      qw    04/19/2024 10:12:43                                    8

which requests 4 GPUs and 8 MPI processes.

balston commented 7 months ago

My example job has run successfully so I'm going to make a module for this version and make it available on Young.

balston commented 7 months ago

The module file is done and I've submitted a job to test that the module is correctly set up. Will check on Monday.

balston commented 7 months ago

Test job worked with the module file so I've emailed the Young user (IN06562363) wanting this version.

Will now build the GPU version on Myriad.

balston commented 7 months ago

Running:

module unload compilers mpi gcc-libs
module load gcc-libs/10.2.0
module load compilers/nvidia/hpc-sdk/22.9

cd /shared/ucl/apps/build_scripts

./quantum-espresso-7.3.1+git+GPU_install 2>&1 | tee ~/Scratch/Software/QuantumEspresso/quantum-espresso-7.3.1+git+GPU_install.log

on a Myriad A100 GPU node as ccspapp.

balston commented 7 months ago

Myriad build failed with:

make[1]: Leaving directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/UtilXlib'
cd install ; make -f extlibs_makefile libcuda
make[1]: Entering directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/install'
initializing external/devxlib submodule ...
usage: git submodule [--quiet] add [-b <branch>] [-f|--force] [--name <name>] [--reference <repository>] [--] <repository> [<path>]
   or: git submodule [--quiet] status [--cached] [--recursive] [--] [<path>...]
   or: git submodule [--quiet] init [--] [<path>...]
   or: git submodule [--quiet] deinit [-f|--force] [--] <path>...
   or: git submodule [--quiet] update [--init] [--remote] [-N|--no-fetch] [-f|--force] [--rebase] [--reference <repository>] [--merge] [--recursive] [--] [<path>...]
   or: git submodule [--quiet] summary [--cached|--files] [--summary-limit <n>] [commit] [--] [<path>...]
   or: git submodule [--quiet] foreach [--recursive] <command>
   or: git submodule [--quiet] sync [--recursive] [--] [<path>...]
make[1]: *** [libcuda_devxlib] Error 1
make[1]: Leaving directory `/lustre/shared/ucl/apps/quantum-espresso/7.3.1-GPU/nvidia-2022-22.9/q-e/install'
make: *** [libcuda] Error 2
balston commented 7 months ago

I might have been able to fix this problem. Re-running the build to see if it works.

balston commented 7 months ago

The build now runs without errors on Myriad.

I'm now going to submit a job to run the test suite on Myriad:

qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 470426 3.21148 QE-7.3.1_G ccspapp      qw    04/23/2024 14:01:16                                    2
balston commented 7 months ago

I've done a test build of the CPU MPI variant in my Scratch on Kathleen and run the pw tests on 4 cores:

export NSLOTS=4
make run-tests-pw NPROCS=$NSLOTS 2>&1 | tee ~/Scratch/Software/QuantumEspresso/run-tests.log 

and got:

All done. 246 out of 246 tests passed (1 skipped).

Now sorting out the build script.

balston commented 7 months ago

Build script for the CPU/MPI variant is done and running from ccspapp on Kathleen:

./quantum-espresso-7.3.1+git_install 2>&1 | tee ~/Software/QuantumESPRESSO/quantum-espresso-7.3.1+git_install.log
balston commented 7 months ago

Submitted a longer GPU example job on Myriad: 4 A100 GPUs and 8 MPI processes.

balston commented 7 months ago

My example with 4 A100 GPUs and 8 MPI processes works.

balston commented 7 months ago

I have informed the user who wanted the GPU version on Myriad.

balston commented 7 months ago

The request for the CPU-only variant (IN06568900) was also for Young.

The default test-suite run of this variant on Kathleen has finished. I will now run the build script on Young.

balston commented 7 months ago

Build of the CPU variant on Young has completed. Will run the tests tomorrow.

balston commented 7 months ago

CPU variant job to run default test suite submitted on Young:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1356143 0.00000 QE-7.3.1_C ccspapp      qw    04/26/2024 09:47:48                                    8
balston commented 7 months ago

CPU tests ran successfully, so I'm producing the module file.

balston commented 6 months ago

Module file done and pulled to Young and Kathleen. The user wanting the CPU variant has been informed.

balston commented 6 months ago

Build of the CPU variant finished on Myriad late yesterday. Will now run the test suite.

balston commented 6 months ago

CPU/MPI variant test suite job submitted on Myriad:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 571215 0.00000 QE-7.3.1_C ccspapp      qw    04/30/2024 10:18:43                                    4
balston commented 6 months ago

Made a mistake in my job script, so the test-suite job is now:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 571256 0.00000 QE-7.3.1_C ccspapp      qw    04/30/2024 10:27:57                                    4
balston commented 6 months ago

Test-suite job ran successfully on Myriad.

balston commented 6 months ago

CPU variant now installed on Michael. Running an example on the AVX512 nodes.

balston commented 6 months ago

Documentation page updated:

https://www.rc.ucl.ac.uk/docs/Software_Guides/Other_Software/#quantum-espresso

balston commented 6 months ago

Test job on Michael AVX512 nodes works.

Now installed on all clusters.