alazzaro opened this issue 5 years ago
I very much agree. Are there any current obstacles that need to be lifted in order to get a CI for GPU and what needs to be done?
my plan is/was:
Great. If there's something I can help with, do let me know. Good luck with the approaching deadlines :)
Progress report: I have a PoC running with the following Jenkins (@CSCS) pipeline configuration:
node {
    stage('checkout') {
        checkout([$class: 'GitSCM',
                  userRemoteConfigs: [[url: 'https://github.com/cp2k/dbcsr.git']],
                  branches: [[name: '*/develop']],
                  browser: [$class: 'GithubWeb', repoUrl: 'https://github.com/cp2k/dbcsr'],
                  doGenerateSubmoduleConfigurations: false,
                  extensions: [[$class: 'SubmoduleOption',
                                disableSubmodules: false,
                                parentCredentials: false,
                                recursiveSubmodules: true,
                                reference: '',
                                trackingSubmodules: false]],
                  submoduleCfg: []
        ])
    }
    stage('build&test') {
        sh 'sbatch --account="${JOB_NAME%%/*}" --job-name="${JOB_BASE_NAME}" --wait /users/timuel/job.sh'
    }
}
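As a side note, a minimal sketch (my assumption, not part of the PoC above) of how the body of the 'build&test' sh step could also surface the job output directly in the Jenkins console: after sbatch --wait returns, dump the *.out files that the job script writes into "${SCRATCH}/${BUILD_TAG}", while preserving the job script's exit code as the step result. This assumes ${SCRATCH} is also defined in the environment of the Jenkins agent on the login node.

# Hypothetical 'build&test' step body (sketch only): submit, wait, echo the
# collected output files, and propagate the job script's exit code.
rc=0
sbatch --account="${JOB_NAME%%/*}" --job-name="${JOB_BASE_NAME}" --wait \
    /users/timuel/job.sh || rc=$?
# job.sh below writes cmake.out, make.out, env.out and make-test.out here:
cat "${SCRATCH}/${BUILD_TAG}"/*.out || true
exit ${rc}

The Slurm job script itself, /users/timuel/job.sh: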
#!/bin/bash -l
#SBATCH --export=ALL
#SBATCH --exclusive
#SBATCH --constraint="gpu"
#SBATCH --partition="cscsci"
#SBATCH --time="1:00:00"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=2
#SBATCH --ntasks-per-core=1 # 1=no HT, 2=HT
set -o errexit
set -o nounset
set -o pipefail
module swap PrgEnv-cray PrgEnv-gnu
module load daint-gpu cudatoolkit CMake/3.12.0
module unload cray-libsci_acc
set -o xtrace
umask 0002 # make sure group members can access the data
mkdir --mode=0775 -p "${SCRATCH}/${BUILD_TAG}"
cd "${SCRATCH}/${BUILD_TAG}"
cmake \
    -DUSE_CUDA=ON \
    -DUSE_CUBLAS=ON \
    -DWITH_GPU=P100 \
    -DMPIEXEC_EXECUTABLE="$(command -v srun)" \
    -DTEST_MPI_RANKS=${SLURM_NTASKS} \
    "${WORKSPACE}" |& tee cmake.out
make VERBOSE=1 -j |& tee make.out
export CRAY_CUDA_MPS=1 # enable the CUDA proxy for MPI+CUDA
export OMP_PROC_BIND=TRUE # set thread affinity
# OMP_NUM_THREADS is set by cmake
# document the current environment
env |& tee env.out
env CTEST_OUTPUT_ON_FAILURE=1 make test |& tee make-test.out
What's left:
- move job.sh into the repository (it currently lives in my home directory)
- surface the job output in the Jenkins log (readFile + echo)
- handle jobs terminated by the scheduler (sbatch still returns an error code of 0, probably also for other sorts of errors in the scheduler)

Note wrt the handling of Slurm errors: sbatch --wait returns non-0 if the job script exited with non-0. Likewise, it should return non-0 if there was a problem on the scheduler side itself, so in both cases we should see a failure in that step. However, in my tests a job was once terminated for reaching the timelimit, and this did not result in a step failure.
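A possible way to also catch such scheduler-side terminations (a sketch under the assumption that sacct can be run from where Jenkins executes the sh step; not tested): capture the job id via sbatch --parsable and check the final job state in the accounting database in addition to the exit code of sbatch --wait.

# Sketch: fail the step if either the job script exited non-0 or the job did
# not end in the COMPLETED state (e.g. TIMEOUT, NODE_FAIL, CANCELLED).
rc=0
jobid=$(sbatch --parsable --wait \
            --account="${JOB_NAME%%/*}" --job-name="${JOB_BASE_NAME}" \
            /users/timuel/job.sh) || rc=$?
jobid=${jobid%%;*}   # --parsable may append ";clustername"
[ -n "${jobid}" ] || { echo "sbatch submission failed" >&2; exit 1; }
# Once --wait has returned, the accounting record of the allocation (-X)
# should be available.
state=$(sacct -j "${jobid}" -X --noheader --format=State | head -n1 | tr -d '[:space:]')
if [[ ${rc} -ne 0 || "${state}" != "COMPLETED" ]]; then
    echo "Slurm job ${jobid} failed: exit code ${rc}, final state '${state}'" >&2
    exit 1
fi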
keeping this open for the CI on our infra
This is a placeholder for discussion; we really need to implement a CI for GPU.