ArjunaCluster / ArjunaUsers

Arjuna Public Documentation for Users
https://arjunacluster.github.io/ArjunaUsers/

submitting slurm job #223

Closed fujah06 closed 1 year ago

fujah06 commented 1 year ago

Your Name

Olabode

Andrew ID

oajenifu

Where it Happened

JobID : 287720 Node: f[014-015]

What Happened?

When I submit a job, I get this message:

This version of Spack (openmpi ~legacylaunchers schedulers=slurm) is installed without the mpiexec/mpirun commands to prevent unintended performance issues. See https://github.com/spack/spack/pull/10340 for more details. If you understand the potential consequences of a misconfigured mpirun, you can use spack to install 'openmpi+legacylaunchers' to restore the executables. Otherwise, use srun to launch your MPI executables.

Steps to reproduce

No response

Job Submission Script

#!/bin/sh

#SBATCH -J bode # Job name

#SBATCH -n 54 # Number of total cores

#SBATCH -N 2 # Number of nodes

#SBATCH -A barati

#SBATCH -p cpu

#SBATCH --mem-per-cpu=4000 # Memory pool for all cores in MB

#SBATCH -e /home/oajenifu/OpenFOAM/root-v2012/run/200W_1ms_Powder_constant_NS_Tempdemp/check.err
#SBATCH -o /home/oajenifu/OpenFOAM/root-v2012/run/200W_1ms_Powder_constant_NS_Tempdemp/check.out
#SBATCH --mail-type=END,BEGIN # Type of email notification- BEGIN,END,FAIL,ALL

#SBATCH --mail-user=oajenifu@andrew.cmu.edu # Email to which notifications will be sent

#SBATCH -t 7-0:00

echo "Job started on `hostname` at `date`"

module load gcc/11.2.0

blockMesh
setFields
mpirun -np 54 newIcoReactingMultiphaseInterFoam -parallel> log.out_$bode

echo " "
echo "Job Ended at `date`"

What I've tried

No response

awadell1 commented 1 year ago

Use srun --mpi=pmix instead of mpirun

See the linked spack issue for more info
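
A minimal sketch of the change, assuming the rest of the script above stays the same (whether `--mpi=pmix` is required depends on how Slurm was built on the cluster):

```sh
# Before (fails on this Spack-built OpenMPI):
#   mpirun -np 54 newIcoReactingMultiphaseInterFoam -parallel > log.out_$bode
# After: srun picks up the task count from "#SBATCH -n 54" automatically.
srun --mpi=pmix newIcoReactingMultiphaseInterFoam -parallel > log.out_$bode
```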

fujah06 commented 1 year ago

I attempted that but this came out:

/var/spool/slurmd/job287748/slurm_script: line 28: srun: command not found
/var/spool/slurmd/job287748/slurm_script: line 29: srun: command not found

awadell1 commented 1 year ago

Let's see that script and the entire log. Which node was that on?

awadell1 commented 1 year ago

srun is located at /usr/local/sbin/srun, which should be on your path unless you're resetting it in your ~/.bashrc
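
One quick way to check from inside a job script (a sketch; `locate_cmd` is just an illustrative helper name):

```shell
#!/bin/sh
# Print where the shell resolves a command, or "not found" if it is
# missing from PATH. Add e.g. `locate_cmd srun` to the job script.
locate_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    command -v "$1"
  else
    echo "not found"
  fi
}
```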

fujah06 commented 1 year ago

Here is the current script I'm using for submission

#!/bin/sh

#SBATCH -J bode # Job name

#SBATCH -n 54 # Number of total cores

#SBATCH -N 2 # Number of nodes

#SBATCH -A barati

#SBATCH -p cpu

#SBATCH --mem-per-cpu=4000 # Memory pool for all cores in MB

-p cpu -n 1 -N 1 --mem=1000M -A barati --pty /bin/bash

#SBATCH -e /home/oajenifu/OpenFOAM/root-v2012/run/200W_1ms_Powder_constant_NS_Tempdemp/check.err

#SBATCH -o /home/oajenifu/OpenFOAM/root-v2012/run/200W_1ms_Powder_constant_NS_Tempdemp/check.out

#SBATCH --mail-type=END,BEGIN # Type of email notification- BEGIN,END,FAIL,ALL

#SBATCH --mail-user=oajenifu@andrew.cmu.edu # Email to which notifications will be sent

#SBATCH -t 7-0:00

echo "Job started on `hostname` at `date`"

module load gcc/11.2.0

blockMesh
srun --mpi=pmix -n 2 setFields -parallel > log.out_$bode
srun --mpi=pmix -n 2 newIcoReactingMultiphaseInterFoam -parallel > log.out_$bode

echo " "
echo "Job Ended at `date`"

fujah06 commented 1 year ago

[oajenifu@coe 200W_1ms_Powder_constant_NS_Tempdemp]$ squeue -u oajenifu
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
287705 cpu bode oajenifu R 0:06 2 f[014-015]

fujah06 commented 1 year ago

This is what I have in the path you mentioned

[oajenifu@coe 200W_1ms_Powder_constant_NS_Tempdemp]$ cd /usr/local/sbin
[oajenifu@coe sbin]$ ls
gpart slurmctld slurmd slurmdbd slurmstepd

awadell1 commented 1 year ago

I would echo your PATH in the job and confirm /usr/local/sbin is on it. If it's not, you're probably modifying PATH in ~/.bashrc:

PATH="$HOME/.local/bin"           # Don't do this, it clears your PATH
PATH="$HOME/.local/bin:$PATH"     # Do this, it prepends to your PATH
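
The difference is easy to demonstrate in any POSIX shell (directory names here are just placeholders):

```shell
#!/bin/sh
saved="$PATH"

# Overwriting: PATH is now a single directory; srun and everything
# else outside $HOME/.local/bin disappears from command lookup.
PATH="$HOME/.local/bin"
overwritten="$PATH"

# Prepending: the new directory is searched first, but every old
# entry survives.
PATH="$HOME/.local/bin:$saved"
prepended="$PATH"

PATH="$saved"   # restore so the rest of the shell keeps working
```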

This is what I have in the path you mentioned ....

Yeah, that's on the headnode, which is not identical to the worker nodes.

fujah06 commented 1 year ago

Just to be clear, I have my script attached. Is this what you meant in the instruction you wrote? I added the path echo after blockMesh below.

#!/bin/sh

#SBATCH -J bode # Job name

#SBATCH -n 54 # Number of total cores

#SBATCH -N 2 # Number of nodes

#SBATCH -A barati

#SBATCH -p cpu

#SBATCH --mem-per-cpu=4000 # Memory pool for all cores in MB

-p cpu -n 1 -N 1 --mem=1000M -A barati --pty /bin/bash

#SBATCH -e /home/oajenifu/OpenFOAM/root-v2012/run/200W_1ms_Powder_constant_NS_Tempdemp/check.err

#SBATCH -o /home/oajenifu/OpenFOAM/root-v2012/run/200W_1ms_Powder_constant_NS_Tempdemp/check.out

#SBATCH --mail-type=END,BEGIN # Type of email notification- BEGIN,END,FAIL,ALL

#SBATCH --mail-user=oajenifu@andrew.cmu.edu # Email to which notifications will be sent

#SBATCH -t 7-0:00

echo "Job started on `hostname` at `date`"

module load gcc/11.2.0

blockMesh
echo "PATH="~/.local/bin:$PATH""
srun --mpi=pmix -n 2 setFields -parallel > log.out_$bode
srun --mpi=pmix -n 2 newIcoReactingMultiphaseInterFoam -parallel > log.out_$bode

echo " "
echo "Job Ended at `date`"

awadell1 commented 1 year ago

No that was an example of how to set PATH

Please confirm that you don't overwrite the default PATH

fujah06 commented 1 year ago

Hi Alex, for OpenFOAM I had to source the bashrc in its folder to activate it; maybe that's what you are referring to. Other than that, I don't remember changing the path. For example, source "/home/oajenifu/OpenFOAM/OpenFOAM-v2012/etc/bashrc" was required to build the software. It has been installed; using slurm or srun to submit the job is the issue.

If the path you wrote was an example, are you referring to the srun path you wrote earlier (/usr/local/sbin/srun)? That would be echo "PATH="~/.local/sbin/srun:$PATH"" in the script file, as I showed before.

awadell1 commented 1 year ago

Hi Alex, so for openfoam I had to source the bashrc present in its folder to have it active maybe thats what you are referring that, other than that, I dont remember changing the path

I suspect that bashrc removed srun from your path. Again it would be good to confirm this by putting echo $PATH in the submission script to verify that /usr/local/sbin is listed.
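
For example, a small helper like this near the top of the submission script would make the check explicit (`path_has` is a hypothetical name; the `case` pattern is just a portable substring test on the colon-separated PATH):

```shell
#!/bin/sh
# Report whether a directory appears in a colon-separated PATH string.
path_has() {
  case ":$2:" in
    *":$1:"*) echo yes ;;
    *)        echo no ;;
  esac
}

# In the job script: path_has /usr/local/sbin "$PATH"
```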

which will be echo "PATH="~/.local/sbin/srun:$PATH" ...

No. ~ is your home directory (/home/oajenifu). You want PATH="/usr/local/sbin:$PATH". Note, we're adding a directory not a file to the path.

If there is any confusion, we could meet in person or over Zoom for 5 minutes; that should resolve it.

I encourage you to contact your group members for more extensive support. The Admin Team are all grad students, and unsurprisingly we don't have a ton of bandwidth. Please open issues as you encounter them, but please understand that we are not a support team.