clusterinthecloud / support

If you need help with Cluster in the Cloud, this is the right place

MPI parallel code not running via slurm on BM, but running via login #2

Open jazzquezz opened 5 years ago

jazzquezz commented 5 years ago

Hi guys, I have a problem as a user of Oracle Cloud Infrastructure; let's see if anyone can help. I have a binary compiled on the login node, a parallel code that uses MPI heavily, and a Slurm script that first loads some modules and then launches the job. The strange thing is that if I sbatch the script specifying a BM (bare-metal) instance that is already up and running, I get an error at MPI init, i.e. at the very beginning. If I submit the same script to a VM, everything works fine. Everything also works fine if I log in directly to the BM, load the same modules, and run the binary using "mpirun -np ...".

It seems there is a problem with MPI through Slurm on the BM... any hints?

I attach the Slurm script here.

Thanks!


#!/bin/bash
#SBATCH --job-name="combo"
#SBATCH --time=02:00:00
#SBATCH --ntasks=64
#SBATCH --threads-per-core=1
#SBATCH --output=/mnt/shared/ELEM/data/scars-darrel-test/slurm_outputs/output_%J.out
#SBATCH --error=/mnt/shared/ELEM/data/scars-darrel-test/slurm_outputs/output_%J.err

#### I use this only to test the case when starting the BM beforehand

#SBATCH --nodelist=bm-standard-e2-64-ad1-0003

module purge
module load hwloc
module load pmix
module load prun/1.3
module load gnu8/8.3.0
module load openmpi3/3.1.4
module load ohpc
module load Python/3.6.6-foss-2018b

set -eo pipefail -o nounset
source /etc/profile.d/lmod.sh

export folderdata=/mnt/shared/ELEM/data/scars-darrel-test

export foldertemplate=${folderdata}/data_in
export folderin=${folderdata}/data_in_${SLURM_JOB_ID}
export foldergeom=${folderdata}/geom_in
export folderout=${folderdata}/resu_${SLURM_JOB_ID}
export foldervtkgeom=${folderdata}/vtk-geom-definition

export probname=wedge_scars

export binalya=/mnt/shared/ELEM/bm-standard-e2-64-ad1-0001-cosas/mariano-exmedi-ohara-alya2/Executables/unix/Alya.g

mkdir -p ${folderin}
cp -r ${foldertemplate}/*  ${folderin}/.

echo '--|JOB STARTING AT: ' `date`
echo '--|   ALYA: STARTING AT: ' `date`
cd ${folderin}

#### I get the error after this:

time -p srun --mpi=pmix ${binalya} ${probname}

echo '--|   ALYA: FINISHED AT: ' `date`
echo '--|JOB FINISHED: ' `date`
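
Since the same binary works from an interactive login on the BM but fails under Slurm on the same node, one way to narrow this down is to compare the two environments. The following is only a sketch under stated assumptions: the function name `env_diff` and the file names `login_env.txt` and `job_env.txt` are hypothetical, and it assumes each dump was produced with `env` (once in the interactive shell after loading the modules, once inside the batch script just before the `srun` line). It does not diagnose the MPI init error itself; it only shows which variables differ.

```shell
#!/bin/sh
# Hypothetical helper: print the lines that differ between two `env` dumps.
#   $1: dump taken in the interactive shell on the BM node
#   $2: dump taken inside the Slurm job on the same node
# Lines only in $1 are printed flush left; lines only in $2 are tab-indented.
# Caveat: variables with multi-line values are compared line by line.
env_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    sort "$1" > "$a"
    sort "$2" > "$b"
    # comm -3 suppresses the lines common to both sorted files.
    comm -3 "$a" "$b"
    rm -f "$a" "$b"
}

# Example usage (file names are placeholders):
#   env > login_env.txt    # interactive shell on the BM, after "module load ..."
#   env > job_env.txt      # inside the batch script, just before the srun line
#   env_diff login_env.txt job_env.txt
```

Differences in `PATH`, `LD_LIBRARY_PATH`, or the module-related variables would be the first things to look at in the output.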