I2PC / scipion

Scipion is an image processing framework to obtain 3D models of macromolecular complexes using Electron Microscopy (3DEM)
http://scipion.i2pc.es

Problem with Slurm #2057

Closed: sp946 closed this issue 4 years ago

sp946 commented 4 years ago

Dear all,

I have a problem when trying to set up the hosts.conf file for running through Slurm on our cluster. We use the template that is provided, and when I try to run on the cluster it gives the following error:

    cp: cannot stat ‘triton-5’: No such file or directory
    /var/spool/slurm/slurmd/job03639/slurm_script: line 23: triton-5: No such file or directory
    uniq: triton-5: No such file or directory
    cat: triton-5: No such file or directory
    Traceback (most recent call last):
      File "/usr/local/em/scipion-2.0.0/pyworkflow/apps/pw_protocol_run.py", line 31, in <module>
        from pyworkflow.em import
      File "/usr/local/em/scipion-2.0.0/pyworkflow/em/__init__.py", line 31, in <module>
        from data import
      File "/usr/local/em/scipion-2.0.0/pyworkflow/em/data.py", line 36, in <module>
        from convert import ImageHandler
      File "/usr/local/em/scipion-2.0.0/pyworkflow/em/convert/__init__.py", line 28, in <module>
        from .image_handler import ImageHandler, DT_FLOAT
      File "/usr/local/em/scipion-2.0.0/pyworkflow/em/convert/image_handler.py", line 38, in <module>
        sys.exit(-1)
    NameError: name 'sys' is not defined

Could you please advise on that? Many thanks in advance. Best regards, Simone

azazellochg commented 4 years ago

Hi @sp946, could you attach the hosts.conf? It looks like your shell is not properly set in the script. Also, what is in /var/spool/slurm/slurmd/job03639/slurm_script?
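
For reference, one way to gather both pieces of information; the job id (3639) and the spool path are taken from the error message above, and the slurmd spool directory lives on the compute node that ran the job, so it may already have been cleaned up:

    #!/bin/bash
    # Inspect the job script that Slurm actually executed (path from the error above).
    SCRIPT=/var/spool/slurm/slurmd/job03639/slurm_script

    head -n 1 "$SCRIPT"          # the shebang, i.e. which shell interprets the template
    sed -n '20,26p' "$SCRIPT"    # the region around the failing line 23

    # Job details from Slurm itself (works only while the job is still known,
    # or via accounting if sacct is configured).
    scontrol show job 3639 2>/dev/null || sacct -j 3639 --format=JobName,State,NodeList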

sp946 commented 4 years ago

Hi Gregory, here is the hosts.conf:

[localhost]
PARALLEL_COMMAND = mpirun -np %_(JOB_NODES)d -bynode %_(COMMAND)s
NAME = SLURM
MANDATORY = 2
SUBMIT_COMMAND = sbatch %_(JOB_SCRIPT)s
CANCEL_COMMAND = scancel %_(JOB_ID)s
CHECK_COMMAND = squeue -j %_(JOB_ID)s
SUBMIT_TEMPLATE = #!/bin/bash
    ### Job name
    #SBATCH -J %_(JOB_NAME)s
    ### Outputs (we need to escape the job id as %%j)
    #SBATCH -o job%%j.out
    #SBATCH -e job%%j.err
    ### Partition (queue) name
    ### if the system has only 1 queue, it can be omitted
    ### if you want to specify the queue, ensure the name in the scipion dialog matches
    ### a slurm partition, then leave only 1 # sign in the next line
    #SBATCH -p %_(JOB_QUEUE)s

    ### Specify time, number of nodes (tasks), cores and memory(MB) for your job
    #SBATCH --ntasks=%_(JOB_NODES)d --cpus-per-task=%_(JOB_THREADS)d
    # Use as working dir the path where sbatch was launched
    WORKDIR=$SLURM_JOB_SUBMIT_DIR

    #################################
    ### Set environment variable to indicate that the running mode is non-interactive
    export XMIPP_IN_QUEUE=1

    cd $WORKDIR
    # Make a copy of node file
    cp $SLURM_JOB_NODELIST %_(JOB_NODEFILE)s
    # Calculate the number of processors allocated to this run.
    NPROCS=`wc -l < $SLURM_JOB_NODELIST`
    # Calculate the number of nodes allocated.
    NNODES=`uniq $SLURM_JOB_NODELIST | wc -l`

    ### Display the job context
    echo Running on host `hostname`
    echo Time is `date`
    echo Working directory is `pwd`
    echo Using ${NPROCS} processors across ${NNODES} nodes
    echo NODE LIST:
    cat $SLURM_JOB_NODELIST
    #################################
    %_(JOB_COMMAND)s

QUEUES = { "tesla": [[]], "geforce": [[]], "quadro": [[]] }

Regarding the other file in /var/..., I have no idea. Might it be something related to the installation? Many thanks, Simone
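
The part of this template that produces the cp/uniq/cat errors is the node-file section: $SLURM_JOB_NODELIST is an environment variable holding the allocated node names (here apparently just triton-5), not the path of a node file, so every command that treats it as a file fails. Roughly what the generated script ends up running, with the variable expanded to the value suggested by the log:

    #!/bin/bash
    # Sketch of the failing section with SLURM_JOB_NODELIST expanded;
    # "triton-5" is the cluster node name from the log, not a file.
    SLURM_JOB_NODELIST=triton-5

    cp "$SLURM_JOB_NODELIST" nodefile             # cp: cannot stat 'triton-5'
    NPROCS=$(wc -l < "$SLURM_JOB_NODELIST")       # the "line 23" error: redirect from a non-existent file
    NNODES=$(uniq "$SLURM_JOB_NODELIST" | wc -l)  # uniq: triton-5: No such file or directory
    cat "$SLURM_JOB_NODELIST"                     # cat: triton-5: No such file or directory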

azazellochg commented 4 years ago

Hi Simone, it looks like the default Slurm config has errors. Try replacing your hosts.conf with the following:

[localhost]
PARALLEL_COMMAND = mpirun -np %_(JOB_NODES)d -bynode %_(COMMAND)s
NAME = SLURM
MANDATORY = 2
SUBMIT_COMMAND = sbatch %_(JOB_SCRIPT)s
CANCEL_COMMAND = scancel %_(JOB_ID)s
CHECK_COMMAND = squeue -j %_(JOB_ID)s
SUBMIT_TEMPLATE = #!/bin/bash
        ### Job name
        #SBATCH -J %_(JOB_NAME)s
        ### Outputs (we need to escape the job id as %%j)
        #SBATCH -o %_(JOB_SCRIPT)s.out
        #SBATCH -e %_(JOB_SCRIPT)s.err
        ### Partition (queue) name
        ### if the system has only 1 queue, it can be omitted
        ### if you want to specify the queue, ensure the name in the scipion dialog matches
        ### a slurm partition, then leave only 1 # sign in the next line
        ##### SBATCH -p %_(JOB_QUEUE)s

        ### Specify time, number of nodes (tasks), cores and memory(MB) for your job
        #SBATCH --ntasks=%_(JOB_NODES)d --cpus-per-task=%_(JOB_THREADS)d
        # Use as working dir the path where sbatch was launched
        WORKDIR=$SLURM_JOB_SUBMIT_DIR

        #################################
        ### Set environment variable to indicate that the running mode is non-interactive
        export XMIPP_IN_QUEUE=1

        cd $WORKDIR
        # Make a copy of node file
        echo $SLURM_JOB_NODELIST > %_(JOB_NODEFILE)s
        ### Display the job context
        echo Running on host `hostname`
        echo Time is `date`
        echo Working directory is `pwd`
        echo "Using $SLURM_NTASKS tasks ($SLURM_CPUS_PER_TASK CPUs each) across $SLURM_JOB_NUM_NODES nodes"
        echo NODE LIST: $SLURM_JOB_NODELIST
        #################################
        %_(JOB_COMMAND)s
QUEUES = {
"tesla": [[]],
"geforce": [[]],
"quadro": [[]]
}
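
If it helps to separate Slurm problems from Scipion problems, a template like the one above can also be filled in by hand and submitted directly. Everything concrete in the sketch below (job name, task counts, output file names, and the touch command standing in for %_(JOB_COMMAND)s) is a dummy value chosen purely for illustration:

    #!/bin/bash
    # Hand-filled stand-in for the submit template above, for testing the Slurm
    # side independently of Scipion. All values are illustrative placeholders.
    #SBATCH -J scipion_test
    #SBATCH -o test_job.sh.out
    #SBATCH -e test_job.sh.err
    #SBATCH --ntasks=2 --cpus-per-task=1

    export XMIPP_IN_QUEUE=1
    cd "$SLURM_SUBMIT_DIR"                      # standard Slurm variable for the directory sbatch was run from
    echo "$SLURM_JOB_NODELIST" > test_nodefile  # same approach as the template above
    echo "Running on host $(hostname), nodes: $SLURM_JOB_NODELIST"
    touch scipion_slurm_test_ok                 # stand-in for %_(JOB_COMMAND)s

Save it as test_job.sh and submit it with sbatch test_job.sh, then check the .out/.err files; if this runs but the Scipion-generated job does not, the problem is in the template substitution rather than in Slurm itself.
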
azazellochg commented 4 years ago

The issue was resolved. @pconesa, could you update the docs for the Slurm host configuration? $SLURM_JOB_NODELIST is not a file but an env var in Slurm.
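
For anyone who does need an actual machinefile (one hostname per line) rather than the raw $SLURM_JOB_NODELIST string, Slurm can expand the list itself; a minimal sketch that could go inside the submit template, with an illustrative filename:

    # SLURM_JOB_NODELIST is a compact string such as "triton-[5-6]", not a file.
    # scontrol can expand it into one hostname per line when a node file is
    # really needed (e.g. for an mpirun machinefile).
    scontrol show hostnames "$SLURM_JOB_NODELIST" > nodefile.txt
    NNODES=$(wc -l < nodefile.txt)
    echo "Allocated $NNODES node(s):"
    cat nodefile.txt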

pconesa commented 4 years ago

Thanks @azazellochg. I'll do it now.