Hi all,
I have a problem as a user of Oracle Cloud Infrastructure; let's see if anyone can help.
I have a binary compiled on the login node, a parallel code that uses MPI heavily, and a Slurm script that first loads some modules and then launches the job. What is strange is that if I sbatch the script to a BM (bare metal) instance that is already up and running, I get an error at MPI init, i.e. at the very beginning. If I do the same on a VM, everything works fine. Everything also works fine if I log in directly to the BM, load the same modules, and run the binary with "mpirun -np ...".
It seems there is a problem with MPI through Slurm on the BM... any hint?
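For reference, a few commands that should show whether Slurm's MPI plugin and Open MPI's PMIx support line up on the node (a sketch, to run on both the BM and the VM; exact output depends on the local Slurm/Open MPI build):

srun --mpi=list                      # MPI plugin types this Slurm build supports (pmix should be listed)
scontrol show config | grep -i mpi   # MpiDefault / MpiParams for the cluster
ompi_info | grep -i pmix             # PMIx components compiled into Open MPI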
I attach the Slurm script below.
Thanks!
#!/bin/bash
#SBATCH --job-name="combo"
#SBATCH --time=02:00:00
#SBATCH --ntasks=64
#SBATCH --threads-per-core=1
#SBATCH --output=/mnt/shared/ELEM/data/scars-darrel-test/slurm_outputs/output_%J.out
#SBATCH --error=/mnt/shared/ELEM/data/scars-darrel-test/slurm_outputs/output_%J.err
#### I use this only to test the case where the BM instance is started beforehand
#SBATCH --nodelist=bm-standard-e2-64-ad1-0003
#### Initialize Lmod before using any module commands
source /etc/profile.d/lmod.sh
module purge
module load hwloc
module load pmix
module load prun/1.3
module load gnu8/8.3.0
module load openmpi3/3.1.4
module load ohpc
module load Python/3.6.6-foss-2018b
set -euo pipefail
export folderdata=/mnt/shared/ELEM/data/scars-darrel-test
export foldertemplate=${folderdata}/data_in
export folderin=${folderdata}/data_in_${SLURM_JOB_ID}
export foldergeom=${folderdata}/geom_in
export folderout=${folderdata}/resu_${SLURM_JOB_ID}
export foldervtkgeom=${folderdata}/vtk-geom-definition
export probname=wedge_scars
export binalya=/mnt/shared/ELEM/bm-standard-e2-64-ad1-0001-cosas/mariano-exmedi-ohara-alya2/Executables/unix/Alya.g
mkdir -p ${folderin}
cp -r ${foldertemplate}/* ${folderin}/.
echo '--|JOB STARTING AT: ' $(date)
echo '--| ALYA: STARTING AT: ' $(date)
cd ${folderin}
#### I get the error after this:
time -p srun --mpi=pmix ${binalya} ${probname}
echo '--| ALYA: FINISHED AT: ' $(date)
echo '--|JOB FINISHED: ' $(date)
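For completeness: since plain mpirun works when I log in to the BM directly, swapping the srun line for Open MPI's own launcher inside the same batch script should isolate whether the failure is specific to srun's PMIx path (a sketch; SLURM_NTASKS is set by Slurm from --ntasks):

#### Workaround test: bypass srun and use Open MPI's launcher directly
time -p mpirun -np ${SLURM_NTASKS} ${binalya} ${probname}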