TACC / launcher

A simple utility for executing multiple sequential or multi-threaded applications in a single multi-node batch job
MIT License

WARNING: No response from dynamic task server. Retrying AND the SLURM resource allocator expects to find the following environment variables #50

Closed luisgithub269 closed 5 years ago

luisgithub269 commented 5 years ago

Hi, everyone

I have the following problem.

----------------------------

While trying to determine what resources are available, the SLURM resource allocator expects to find the following environment variables:

SLURM_NODELIST
SLURM_TASKS_PER_NODE

However, it was unable to find the following environment variable:

SLURM_NODELIST

[file orca_main/gtoint.cpp, line 137]: ORCA finished by error termination in ORCA_GTOInt

Launcher: Setup complete.

------------- SUMMARY ---------------
Number of hosts: 2
Working directory: /scratch/luis_sc3orca/ligandos1/ligandos
Processes per host: 2
Total processes: 4
Total jobs: 3
Scheduling method: dynamic


Launcher: Starting parallel tasks...
Launcher: Task 0 running job 1 on guane14 (srun -N 1 -n 12 ./ejecutar1.sh)
Launcher: Task 1 running job 2 on guane14 (# ./ejecutar2.sh)
Launcher: Job 2 completed in 0 seconds.
Launcher: Task 1 running job 3 on guane14 (# ./ejecutar3.sh)
srun: error: Unable to create step for job 114446: More processors requested than permitted
Launcher: Job 3 completed in 0 seconds.
Launcher: Job 1 completed in 0 seconds.
Launcher: Task 1 done. Exiting.
Launcher: Task 0 done. Exiting.
localhost [127.0.0.1] 9471 (?) : Connection refused
/home/luis_sc3/plantatrabajo/launcher/launcher: line 82: [: -gt: unary operator expected
localhost [127.0.0.1] 9471 (?) : Connection refused
/home/luis_sc3/plantatrabajo/launcher/launcher: line 82: [: -gt: unary operator expected
localhost [127.0.0.1] 9471 (?) : Connection refused
localhost [127.0.0.1] 9471 (?) : Connection refused

the master script:

#!/bin/bash

#--------SCHEDULER OPTIONS-------
#SBATCH -J parametric
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -p manycores24
#SBATCH -w=guane[09-11]
#SBATCH -o parametric.o%j

#--------GENERAL OPTIONS---------
export LAUNCHER_DIR=/home/luis_sc3/plantatrabajo/launcher   # path to the launcher
export PATH=$LAUNCHER_DIR:$PATH
export PATH=/usr/bin/python2.7:$PATH
export PATH=/usr/lib/python2.7:$PATH
export LAUNCHER_PLUGIN_DIR=$LAUNCHER_DIR/plugins   # resource manager plugins
export LAUNCHER_PPN=2                              # jobs per node
export LAUNCHER_PLUGIN_DIR=$LAUNCHER_DIR/plugins   # resource manager plugins
export LAUNCHER_RMI=SLURM
export EXECUTABLE=$LAUNCHER_DIR/init_laucher
export LAUNCHER_WORKDIR=$PWD                       # directory of the jobfile
export LAUNCHER_JOB_FILE=ejecucion                 # jobfile

#--------TASKS SCHEDULING OPTIONS
export LAUNCHER_SCHED=dynamic

#--------EXECUTING---------------
$LAUNCHER_DIR/paramrun

the jobfile:

./ejecutar1.sh
./ejecutar2.sh
./ejecutar3.sh

the script ejecutar1.sh:

#!/bin/bash

# Node group (partition) to use
#SBATCH --partition=manycores24
# Job name
#SBATCH -J l1
# Output file name
#SBATCH -o l1.%j.out
# Number of nodes
#SBATCH --nodes=1
# Number of tasks
#SBATCH --ntasks=12
# Tasks per node
#SBATCH --tasks-per-node=12
# CPUs per task
#SBATCH --cpus-per-task=1
# Wall-clock limit for the run
#SBATCH --time=120:00:00
# Memory per CPU
#SBATCH --mem-per-cpu=8G

# Remove the locked-memory limit
ulimit -l unlimited

# openmpi-2.0.2 environment
export PATH=/usr/local/openmpi-2.0.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi-2.0.2/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/openmpi-2.0.2/openmpi:$LD_LIBRARY_PATH

# ORCA location
export ORCA_PATH=/home/luis_sc3/plantatrabajo/orca4012
export PATH=$ORCA_PATH:$PATH

# Location of the input/output files
export FILE_PATH=/scratch/luis_sc3orca/ligandos1/ligandos/l1

$ORCA_PATH/orca $FILE_PATH/l1.inp > $FILE_PATH/l1.out
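As far as I understand the launcher, each jobfile line (for example ./ejecutar1.sh) is executed as an ordinary shell command on the node the launcher assigns, inside the already-granted allocation, so the #SBATCH lines above act only as comments there. A minimal check, just a sketch (the echo lines are mine, not part of the original script), that could go near the top of ejecutar1.sh to see whether the variables ORCA complains about are visible to the task:

# Diagnostic sketch: print the variables that the SLURM resource allocator looks for.
echo "Running on host: $(hostname)"
echo "SLURM_JOB_ID=${SLURM_JOB_ID:-<not set>}"
echo "SLURM_NODELIST=${SLURM_NODELIST:-<not set>}"
echo "SLURM_TASKS_PER_NODE=${SLURM_TASKS_PER_NODE:-<not set>}"

If these print "<not set>" in the task's output, the task is not seeing the batch job's SLURM environment, which would match the error above.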

Any ideas how to fix the problem?


luisgithub269 commented 5 years ago

Hi, everyone

I need your help, please.

The main reason I am writing this message is that I need to run a simulation over 500 files, each taking about 5 hours, and the resources assigned to me are 4 nodes. Each node has 24 cores, which lets me run 8 tasks simultaneously (two jobs per node, each job assigned 12 ntasks), reducing the execution time to about 6 days.

The second reason is that next week I have to deliver my thesis in order to graduate in chemical engineering at the Universidad Industrial de Santander, Colombia.

Please excuse my writing; I know little English.

The programs used are: orca_4_0_1_2_linux_x86-64_openmpi202.tar.xz, openmpi-2.0.2.tar.gz, and the TACC launcher (GitHub version).

I have the following problem with the launcher.

The main problem is how to assign resources so that the launcher (SCRIPTLAUNCHER.SH) passes them on to the task script (EJECUTAR1.SH), which runs the ORCA input file. To do this, the SLURM resource allocator expects to find the following environment variables:

SLURM_NODELIST
SLURM_TASKS_PER_NODE

However, it was unable to find the following environment variable:

SLURM_NODELIST

When I run the program, it shows the following message:

-----------------------------------------------------------------------------------

the output of the ORCA (quantum chemistry) simulation:

While trying to determine what resources are available, the SLURM resource allocator expects to find the following environment variables:

SLURM_NODELIST
SLURM_TASKS_PER_NODE

However, it was unable to find the following environment variable:

SLURM_NODELIST

[file orca_main/gtoint.cpp, line 137]: ORCA finished by error termination in OR$

----------------------------------------------------------------------------------

------------------------------------------------------------------------------

the ejecutar1.sh script:

#!/bin/bash

# Node group (partition) to use
#SBATCH --partition=manycores24
# Job name
#SBATCH -J l1
# Output file name
#SBATCH -o l1.%j.out
# Number of nodes
#SBATCH --nodes=1
# Number of tasks
#SBATCH --ntasks=12
# Tasks per node
#SBATCH --tasks-per-node=12
# CPUs per task
#SBATCH --cpus-per-task=1
# Wall-clock limit for the run
#SBATCH --time=120:00:00
# Memory per CPU
#SBATCH --mem-per-cpu=8G

# Remove the locked-memory limit
ulimit -l unlimited

# openmpi-2.0.2 environment
export PATH=/usr/local/openmpi-2.0.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi-2.0.2/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/openmpi-2.0.2/openmpi:$LD_LIBRARY_PATH

# ORCA location
export ORCA_PATH=/home/luis_sc3/plantatrabajo/orca4012
export PATH=$ORCA_PATH:$PATH

# Location of the input/output files
export FILE_PATH=/scratch/luis_sc3orca/ligandos1/ligandos/l1

$ORCA_PATH/orca $FILE_PATH/l1.inp > $FILE_PATH/l1.out

--------------------------------------------------------------------------------

------------------------------------------------

the jobfile (ejecucion):

./ejecutar1.sh
./ejecutar2.sh
./ejecutar3.sh
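Since each jobfile line is run as its own shell command, per-job output redirection can be added directly on the line. A sketch (the log file names are made up) that would make it easier to see which task produces the SLURM_NODELIST error:

./ejecutar1.sh > task1.log 2>&1
./ejecutar2.sh > task2.log 2>&1
./ejecutar3.sh > task3.log 2>&1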

-----------------------------------------------

-----------------------------------------------

the ORCA (quantum chemistry) input file:

================================================================

l1-Zn Opt

================================================================

! Opt B3LYP def2-SV(P) KDIIS SOSCF

%pal nprocs 12 end

%MaxCore 8000
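If I read the error correctly, the %pal nprocs 12 block is what makes ORCA start its sub-programs (for example ORCA_GTOInt) through OpenMPI, and it is OpenMPI's SLURM support that prints the SLURM_NODELIST / SLURM_TASKS_PER_NODE message above. One way to separate ORCA/MPI problems from launcher problems is a single-core test; a sketch, with hypothetical file names:

# Sanity-test sketch: run one ligand with parallelism disabled (nprocs 1).
# l1.inp and l1_serial.inp are illustrative names.
sed 's/nprocs 12/nprocs 1/' l1.inp > l1_serial.inp
$ORCA_PATH/orca l1_serial.inp > l1_serial.out

If the serial run finishes, the problem is in how the parallel environment reaches the launcher tasks rather than in ORCA itself.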

--------------------------------------------------

the launcher output:

Launcher: Setup complete.

------------- SUMMARY ---------------
Number of hosts: 2
Working directory: /scratch/luis_sc3orca/ligandos1/ligandos
Processes per host: 2
Total processes: 4
Total jobs: 3
Scheduling method: dynamic


Launcher: Starting parallel tasks...
Launcher: Task 0 running job 1 on guane14 (./ejecutar1.sh)
Launcher: Task 1 running job 2 on guane14 (#./ejecutar2.sh)
Launcher: Job 2 completed in 0 seconds.
Launcher: Task 1 running job 3 on guane14 (#./ejecutar3.sh)
Launcher: Job 3 completed in 0 seconds.
Launcher: Task 1 done. Exiting.
[guane14:21063] [[16989,0],0] ORTE_ERROR_LOG: Not found in file base/ras_base_a$
[file orca_main/gtoint.cpp, line 137]: ORCA finished by error termination in OR$

Launcher: Job 1 completed in 0 seconds.
Launcher: Task 0 done. Exiting.
localhost [127.0.0.1] 9471 (?) : Connection refused
localhost [127.0.0.1] 9471 (?) : Connection refused
WARNING: No response from dynamic task server. Retrying...
localhost [127.0.0.1] 9471 (?) : Connection refused
localhost [127.0.0.1] 9471 (?) : Connection refused
WARNING: No response from dynamic task server. Retrying...
...

-----------------------------------------------------------------------------

-----------------------------------------------------------------------------

the launcher script (scriptlauncher.sh):

#!/bin/bash

echo "Press CTRL+C to proceed."
trap "pkill -f 'sleep 1h'" INT
trap "set +x ; sleep 1h ; set -x" DEBUG

#--------SCHEDULER OPTIONS-------
#SBATCH -J parametric
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -p manycores24
#SBATCH -w "guane03"
#SBATCH -o parametric.o%j

#--------GENERAL OPTIONS---------
export LAUNCHER_DIR=/home/luis_sc3/plantatrabajo/launcher   # path to the launcher
export PATH=$LAUNCHER_DIR:$PATH
export PATH=/usr/bin/python2.7:$PATH
export PATH=/usr/lib/python2.7:$PATH
export LAUNCHER_PLUGIN_DIR=$LAUNCHER_DIR/plugins   # resource manager plugins
export LAUNCHER_RMI=SLURM
export LAUNCHER_NHOSTS=1
export LAUNCHER_NPROCS=24
export LAUNCHER_PPN=2                              # jobs per node
export LAUNCHER_PLUGIN_DIR=$LAUNCHER_DIR/plugins   # resource manager plugins
export LAUNCHER_RMI=SLURM
export EXECUTABLE=$LAUNCHER_DIR/init_laucher
export LAUNCHER_WORKDIR=$PWD                       # directory of the jobfile
export LAUNCHER_JOB_FILE=ejecucion                 # jobfile
export CONTROL_FILE=                               # jobfile

#--------TASKS SCHEDULING OPTIONS
export LAUNCHER_SCHED=dynamic

#--------EXECUTING---------------
$LAUNCHER_DIR/paramrun
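To confirm what the batch job itself receives from SLURM (as opposed to what the launcher tasks see), the SLURM environment could be logged right before starting paramrun. A small debugging sketch (the echo/env lines are mine, not part of the original script):

# Debugging sketch: dump the SLURM environment the batch script received,
# then start the launcher as before.
echo "=== SLURM environment seen by the batch script ==="
env | grep '^SLURM_' | sort
$LAUNCHER_DIR/paramrun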

---------------------------------------------------------------------------------

lwilson commented 5 years ago

I'm going to close this one since it's a duplicate of #51. @luisgithub269 if I'm mistaken please feel free to let me know and I'll re-open it.