cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library
https://cp2k.github.io/dbcsr/
GNU General Public License v2.0

CI for GPU #147

Open alazzaro opened 5 years ago

alazzaro commented 5 years ago

This is a placeholder for discussion: we really need to implement CI for GPU.

shoshijak commented 5 years ago

I very much agree. Are there any current obstacles that need to be removed to get CI for GPU, and what needs to be done?

dev-zero commented 5 years ago

my plan is/was:

shoshijak commented 5 years ago

Great. If there's something I can help with, do let me know. Good luck with the approaching deadlines :)

dev-zero commented 5 years ago

Progress report: I have a PoC running with the following Jenkins (@CSCS) pipeline configuration:

node {
    stage('checkout') {
        checkout([$class: 'GitSCM',
            userRemoteConfigs: [[url: 'https://github.com/cp2k/dbcsr.git']],
            branches: [[name: '*/develop']],
            browser: [$class: 'GithubWeb', repoUrl: 'https://github.com/cp2k/dbcsr'],
            doGenerateSubmoduleConfigurations: false,
            extensions: [[$class: 'SubmoduleOption',
                disableSubmodules: false,
                parentCredentials: false,
                recursiveSubmodules: true,
                reference: '',
                trackingSubmodules: false]],
            submoduleCfg: []
        ])
    }

    stage('build&test') {
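        // the Slurm account is derived from the Jenkins job name:
        // "${JOB_NAME%%/*}" keeps everything before the first "/"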
        sh 'sbatch --account="${JOB_NAME%%/*}" --job-name="${JOB_BASE_NAME}" --wait /users/timuel/job.sh'
    }
}

and the job script it submits (/users/timuel/job.sh):

#!/bin/bash -l

#SBATCH --export=ALL
#SBATCH --exclusive
#SBATCH --constraint="gpu"
#SBATCH --partition="cscsci"
#SBATCH --time="1:00:00"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-core=1 # 1=no HT, 2=HT

set -o errexit
set -o nounset
set -o pipefail

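# switch from the Cray to the GNU compiler environment and load the GPU
# toolchain; cray-libsci_acc is unloaded, presumably to keep its
# GPU-accelerated BLAS from interfering with the build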
module swap PrgEnv-cray PrgEnv-gnu
module load daint-gpu cudatoolkit CMake/3.12.0
module unload cray-libsci_acc

set -o xtrace

umask 0002  # make sure group members can access the data

mkdir --mode=0775 -p "${SCRATCH}/${BUILD_TAG}"
cd "${SCRATCH}/${BUILD_TAG}"

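# configure for CUDA + cuBLAS on the P100 GPUs, running the MPI tests
# through srun with the rank count allocated by Slurm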
cmake \
    -DUSE_CUDA=ON \
    -DUSE_CUBLAS=ON \
    -DWITH_GPU=P100 \
    -DMPIEXEC_EXECUTABLE="$(command -v srun)" \
    -DTEST_MPI_RANKS=${SLURM_NTASKS} \
    "${WORKSPACE}" |& tee cmake.out

make VERBOSE=1 -j |& tee make.out

export CRAY_CUDA_MPS=1 # enable the CUDA proxy for MPI+CUDA
export OMP_PROC_BIND=TRUE # set thread affinity
# OMP_NUM_THREADS is set by cmake

# document the current environment
env |& tee env.out

env CTEST_OUTPUT_ON_FAILURE=1 make test |& tee make-test.out

What's left:

Note wrt the handling of Slurm errors: sbatch --wait returns non-zero if the script exited with a non-zero status, and it should likewise return non-zero if there was a problem on the scheduler side itself, so in both cases we should see a failure in that step. However, in one of my tests a termination due to the time limit being reached did not result in a step failure.
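One possible mitigation (a minimal sketch, untested on our setup; it assumes sacct and job accounting are available on the cluster and that a clean run ends in the COMPLETED state): capture the job id from sbatch and fail the step explicitly whenever the final recorded state of the job is not COMPLETED, which would also catch a time-limit termination:

#!/bin/bash
# Sketch: submit the job, then verify its final state via sacct so that
# e.g. a TIMEOUT termination also fails the CI step.
set -o errexit -o nounset -o pipefail

# --parsable makes sbatch print only the job id; --wait blocks until the
# job ends. sbatch's own exit code is deliberately ignored here: the
# sacct check below is what decides pass/fail.
jobid="$(sbatch --parsable --wait --account="${JOB_NAME%%/*}" \
    --job-name="${JOB_BASE_NAME}" /users/timuel/job.sh)" || true
[ -n "${jobid}" ] || { echo "job submission failed" >&2; exit 1; }

# query the final state of the job (first line = the parent job record)
state="$(sacct --jobs="${jobid}" --format=State --parsable2 --noheader | head -n 1)"

if [ "${state}" != "COMPLETED" ]; then
    echo "Slurm job ${jobid} finished in state '${state}'" >&2
    exit 1
fi

The sh step in the build&test stage could then call a wrapper like this instead of invoking sbatch directly.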

dev-zero commented 5 years ago

Keeping this open for the CI on our infrastructure.