Hi @sp946, could you attach the hosts.conf? It looks like your shell is not properly set in the script. Also, what is in /var/spool/slurm/slurmd/job03639/slurm_script?
Hi Gregory, here is the hosts.conf:
[localhost]
PARALLELCOMMAND = mpirun -np %(JOBNODES)d -bynode %(COMMAND)s
NAME = SLURM
MANDATORY = 2
SUBMITCOMMAND = sbatch %(JOB_SCRIPT)s
CANCELCOMMAND = scancel %(JOB_ID)s
CHECKCOMMAND = squeue -j %(JOB_ID)s
SUBMIT_TEMPLATE = #!/bin/bash
### Job name
#SBATCH -J %_(JOB_NAME)s
### Outputs (we need to escape the job id as %%j)
#SBATCH -o job%%j.out
#SBATCH -e job%%j.err
### Partition (queue) name
### if the system has only 1 queue, it can be omitted
### if you want to specify the queue, ensure the name in the Scipion dialog matches
### a slurm partition, then leave only 1 # sign in the next line
#SBATCH -p %_(JOB_QUEUE)s
### Specify time, number of nodes (tasks), cores and memory(MB) for your job
#SBATCH --ntasks=%_(JOB_NODES)d --cpus-per-task=%_(JOB_THREADS)d
# Use as working dir the path where sbatch was launched
WORKDIR=$SLURM_JOB_SUBMIT_DIR
#################################
### Set environment variable to know running mode is non-interactive
export XMIPP_IN_QUEUE=1
cd $WORKDIR
# Make a copy of node file
cp $SLURM_JOB_NODELIST %_(JOB_NODEFILE)s
# Calculate the number of processors allocated to this run.
NPROCS=`wc -l < $SLURM_JOB_NODELIST`
# Calculate the number of nodes allocated.
NNODES=`uniq $SLURM_JOB_NODELIST | wc -l`
### Display the job context
echo Running on host `hostname`
echo Time is `date`
echo Working directory is `pwd`
echo Using ${NPROCS} processors across ${NNODES} nodes
echo NODE LIST:
cat $SLURM_JOB_NODELIST
#################################
%_(JOB_COMMAND)s
QUEUES = { "tesla": [[]], "geforce": [[]], "quadro": [[]] }
Regarding the other file in /var/... I have no idea. Might it be something related to the installation? Many thanks, Simone
Hi Simone, it looks like the default SLURM config has errors. Try to replace your hosts.conf with the following:
[localhost]
PARALLEL_COMMAND = mpirun -np %_(JOB_NODES)d -bynode %_(COMMAND)s
NAME = SLURM
MANDATORY = 2
SUBMIT_COMMAND = sbatch %_(JOB_SCRIPT)s
CANCEL_COMMAND = scancel %_(JOB_ID)s
CHECK_COMMAND = squeue -j %_(JOB_ID)s
SUBMIT_TEMPLATE = #!/bin/bash
### Job name
#SBATCH -J %_(JOB_NAME)s
### Outputs (we need to escape the job id as %%j)
#SBATCH -o %_(JOB_SCRIPT)s.out
#SBATCH -e %_(JOB_SCRIPT)s.err
### Partition (queue) name
### if the system has only 1 queue, it can be omitted
### if you want to specify the queue, ensure the name in the Scipion dialog matches
### a slurm partition, then leave only 1 # sign in the next line
##### SBATCH -p %_(JOB_QUEUE)s
### Specify time, number of nodes (tasks), cores and memory(MB) for your job
#SBATCH --ntasks=%_(JOB_NODES)d --cpus-per-task=%_(JOB_THREADS)d
# Use as working dir the path where sbatch was launched
WORKDIR=$SLURM_SUBMIT_DIR
#################################
### Set environment variable to know running mode is non-interactive
export XMIPP_IN_QUEUE=1
cd $WORKDIR
# Make a copy of node file
echo $SLURM_JOB_NODELIST > %_(JOB_NODEFILE)s
### Display the job context
echo Running on host `hostname`
echo Time is `date`
echo Working directory is `pwd`
echo "Using $SLURM_NTASKS tasks ($SLURM_CPUS_PER_TASK CPUs each) across $SLURM_JOB_NUM_NODES nodes"
echo NODE LIST: $SLURM_JOB_NODELIST
#################################
%_(JOB_COMMAND)s
QUEUES = {
"tesla": [[]],
"geforce": [[]],
"quadro": [[]]
}
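For orientation, this is roughly what the header of the generated job script looks like once Scipion fills in the %_(...) placeholders and hands the file to sbatch; the job name, paths and counts below are hypothetical values made up for illustration, not output from this issue:
#!/bin/bash
### Job name
#SBATCH -J myProtocolRun                          # from %_(JOB_NAME)s (hypothetical name)
### Outputs
#SBATCH -o /home/user/project/run0001.job.out     # from %_(JOB_SCRIPT)s.out (hypothetical path)
#SBATCH -e /home/user/project/run0001.job.err
#SBATCH --ntasks=4 --cpus-per-task=2              # from %_(JOB_NODES)d and %_(JOB_THREADS)d
The doubled %% in the template (as in job%%j.out in the original file) escapes a literal %, so that SLURM itself can later expand %j to the job id.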
The issue was resolved. @pconesa, could you update the docs for the host configuration for SLURM? $SLURM_JOB_NODELIST is not a file but an environment variable in SLURM.
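To illustrate the point (a sketch, assuming an allocation like the triton-5 node from the original error): $SLURM_JOB_NODELIST holds a compressed hostlist string, so it can be echoed into a file or expanded, but not copied as if it were a path:
echo "$SLURM_JOB_NODELIST"                                   # prints e.g. triton-5 or triton-[5-8]
echo "$SLURM_JOB_NODELIST" > nodes.txt                       # what the fixed template does; nodes.txt stands in for %_(JOB_NODEFILE)s
scontrol show hostnames "$SLURM_JOB_NODELIST" > nodes.txt    # alternative: one hostname per line
cp "$SLURM_JOB_NODELIST" nodes.txt                           # fails: cp: cannot stat 'triton-5': No such file or directory
That is why the old template's cp/wc/uniq/cat calls on $SLURM_JOB_NODELIST all ended with "No such file or directory".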
Thanks @azazellochg, I'll do it now.
Dear all,
I have a problem when trying to set up the hosts.conf file for running through SLURM on our cluster. We use the template that is provided, and when I try to run on the cluster it gives the following error:
cp: cannot stat ‘triton-5’: No such file or directory
/var/spool/slurm/slurmd/job03639/slurm_script: line 23: triton-5: No such file or directory
uniq: triton-5: No such file or directory
cat: triton-5: No such file or directory
Traceback (most recent call last):
  File "/usr/local/em/scipion-2.0.0/pyworkflow/apps/pw_protocol_run.py", line 31, in <module>
    from pyworkflow.em import *
  File "/usr/local/em/scipion-2.0.0/pyworkflow/em/__init__.py", line 31, in <module>
    from data import *
  File "/usr/local/em/scipion-2.0.0/pyworkflow/em/data.py", line 36, in <module>
    from convert import ImageHandler
  File "/usr/local/em/scipion-2.0.0/pyworkflow/em/convert/__init__.py", line 28, in <module>
    from .image_handler import ImageHandler, DT_FLOAT
  File "/usr/local/em/scipion-2.0.0/pyworkflow/em/convert/image_handler.py", line 38, in <module>
    sys.exit(-1)
NameError: name 'sys' is not defined
Could you please advise on that? Many thanks in advance. Best regards, Simone