TACC / launcher

A simple utility for executing multiple sequential or multi-threaded applications in a single multi-node batch job
MIT License

WARNING: No response from dynamic task server. Retrying... #14

Closed. oesteban closed this issue 7 years ago.

oesteban commented 7 years ago

I'm seeing this warning repeated endlessly on lonestar5.

I requested a 12-hour job, and it kept iterating on this warning for 11 h 40 min, which left barely 20 minutes for the actual job to run.

What am I doing wrong? Any ideas on how to debug this?

Thanks a lot.

lwilson commented 7 years ago

The limit is set at 10 retries (paramrun line 144), so it should fail at that point (total time: 100 seconds). If you are using the current release or master branch, I can go ahead and test to see what happened.
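
For reference, here is a minimal bash sketch of the retry pattern described above (10 attempts, roughly 10 seconds apart, ~100 seconds total). This is only an illustration, not the actual paramrun code, and check_task_server is a hypothetical placeholder:

max_retries=10
ok=0
for attempt in $(seq 1 "$max_retries"); do
    # check_task_server stands in for whatever test paramrun performs
    # against the dynamic task server (hypothetical placeholder)
    if check_task_server; then
        ok=1
        break
    fi
    echo "WARNING: No response from dynamic task server. Retrying..."
    sleep 10
done
if [ "$ok" -ne 1 ]; then
    echo "ERROR: no response after $max_retries retries." >&2
    exit 1
fi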

Did you have the python module loaded when the job was submitted?

Until we can get this sorted out, switch to interleaved (export LAUNCHER_SCHED=interleaved) to avoid this.
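
For example, a minimal sketch of that workaround in a Slurm batch script (module name, resource sizes, and job file name are placeholders; adjust to your site):

#!/bin/bash
#SBATCH -J launcher_test
#SBATCH -N 2
#SBATCH -n 48
#SBATCH -t 12:00:00

module load launcher               # module name is site-specific

export LAUNCHER_SCHED=interleaved  # avoid the dynamic task server
export LAUNCHER_JOB_FILE=jobfile   # one command per line

$LAUNCHER_DIR/paramrun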

oesteban commented 7 years ago
  1. Sorry, you are right. This is not a big deal.

  2. No, but our launcher module checks for python: https://github.com/poldracklab/lmod_modules/blob/master/launcher/3.1.0.lua#L17

  3. I would suggest adding timestamps to the launcher's messages, so we know exactly when things happen.

I'm going to close this issue since there is nothing to fix, but feel free to reopen it. I'll be happy to keep looking into anything related to this issue.

lwilson commented 7 years ago

Hi @oesteban:

We can definitely add timestamps to some of the warning messages (there are already timestamps added to the start and end of every job) to help with debugging.
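
As a rough illustration, prefixing a warning with a timestamp in a bash script could look like this (just a sketch of the idea, not the launcher's actual logging code):

echo "$(date '+%Y-%m-%d %H:%M:%S') WARNING: No response from dynamic task server. Retrying..."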

I've started a new issue (#15) to discuss implementation options.

luisgithub269 commented 5 years ago

Hi, everyone

I need your help, please.

The main reason I'm writing this message is that I need to run a simulation over 500 input files, each one taking about 5 hours. The resources assigned are 4 nodes, each containing 24 cores, which allows me to execute 8 tasks simultaneously (two jobs per node, each job assigned 12 tasks), reducing the execution time to about 6 days.

The second reason is that next week I have to deliver my thesis in order to graduate in chemical engineering at the Universidad Industrial de Santander, Colombia.

Please excuse my writing; I know little English.

The programs used are: orca_4_0_1_2_linux_x86-64_openmpi202.tar.xz, openmpi-2.0.2.tar.gz, and the TACC launcher (GitHub version).

I have the following problem with the launcher.

The main problem is how to assign the resources so that the launcher script (SCRIPTLAUNCHER.SH) can pass them on to the task script (EJECUTAR1.SH), which runs the ORCA input file. When it fails, the SLURM resource allocator reports that it expects to find the following environment variables:

SLURM_NODELIST
SLURM_TASKS_PER_NODE

However, it was unable to find the following environment variable:

SLURM_NODELIST
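
As a quick check, a small sketch like the one below (plain bash, variable names taken from the error message) prints whether those variables are actually visible in the environment where ORCA is launched:

# Report which of the expected SLURM variables are set
for var in SLURM_NODELIST SLURM_TASKS_PER_NODE; do
    if [ -z "${!var}" ]; then
        echo "MISSING: $var"
    else
        echo "$var=${!var}"
    fi
done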

When I run the program, it shows the following message:

-----------------------------------------------------------------------------------

The output of the ORCA (quantum chemistry) simulation:

While trying to determine what resources are available, the SLURM resource allocator expects to find the following environment variables:

SLURM_NODELIST
SLURM_TASKS_PER_NODE

However, it was unable to find the following environment variable:

SLURM_NODELIST

[file orca_main/gtoint.cpp, line 137]: ORCA finished by error termination in OR$

----------------------------------------------------------------------------------

------------------------------------------------------------------------------

The contents of ejecutar1.sh:

#!/bin/bash

# Group of nodes (partition) to use
#SBATCH --partition=manycores24

# Job name
#SBATCH -J l1

# Output file name
#SBATCH -o l1.%j.out

# Number of nodes
#SBATCH --nodes=1

# Number of tasks
#SBATCH --ntasks=12

# Tasks per node
#SBATCH --tasks-per-node=12

# CPUs per task
#SBATCH --cpus-per-task=1

# Time allowed for the run
#SBATCH --time=120:00:00

# Memory assigned to each CPU
#SBATCH --mem-per-cpu=8G

# Request unlimited locked memory
ulimit -l unlimited

# openmpi-2.0.2 environment variables
export PATH=/usr/local/openmpi-2.0.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi-2.0.2/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/openmpi-2.0.2/openmpi:$LD_LIBRARY_PATH

# Variable used to call ORCA
export ORCA_PATH=/home/luis_sc3/plantatrabajo/orca4012
export PATH=$ORCA_PATH:$PATH

# Variable pointing to the input file location
export FILE_PATH=/scratch/luis_sc3orca/ligandos1/ligandos/l1

$ORCA_PATH/orca $FILE_PATH/l1.inp > $FILE_PATH/l1.out
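
One way to see whether the SLURM variables reach this script when it is started by the launcher (rather than submitted directly with sbatch) is to log the environment just before the ORCA call. A small sketch, assuming it is added above the last line of ejecutar1.sh:

# Debug aid: record which SLURM_* variables this task actually sees
{
    echo "=== $(date) on $(hostname) ==="
    env | grep '^SLURM_' || echo "No SLURM_* variables set"
} >> $FILE_PATH/slurm_env.log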

--------------------------------------------------------------------------------

------------------------------------------------

The contents of the job file (ejecucion):

./ejecutar1.sh
./ejecutar2.sh
./ejecutar3.sh

-----------------------------------------------

-----------------------------------------------

The ORCA (quantum chemistry) input file for the simulation:

================================================================

l1-Zn Opt

================================================================

! Opt B3LYP def2-SV(P) KDIIS SOSCF

%pal nprocs 12 end

%MaxCore 8000

--------------------------------------------------

The output from the launcher:

Launcher: Setup complete.

------------- SUMMARY ---------------
Number of hosts: 2
Working directory: /scratch/luis_sc3orca/ligandos1/ligandos
Processes per host: 2
Total processes: 4
Total jobs: 3
Scheduling method: dynamic

Launcher: Starting parallel tasks...
Launcher: Task 0 running job 1 on guane14 (./ejecutar1.sh)
Launcher: Task 1 running job 2 on guane14 (#./ejecutar2.sh)
Launcher: Job 2 completed in 0 seconds.
Launcher: Task 1 running job 3 on guane14 (#./ejecutar3.sh)
Launcher: Job 3 completed in 0 seconds.
Launcher: Task 1 done. Exiting.
[guane14:21063] [[16989,0],0] ORTE_ERROR_LOG: Not found in file base/ras_base_a$
[file orca_main/gtoint.cpp, line 137]: ORCA finished by error termination in OR$

Launcher: Job 1 completed in 0 seconds.
Launcher: Task 0 done. Exiting.
localhost [127.0.0.1] 9471 (?) : Connection refused
localhost [127.0.0.1] 9471 (?) : Connection refused
WARNING: No response from dynamic task server. Retrying...
localhost [127.0.0.1] 9471 (?) : Connection refused
localhost [127.0.0.1] 9471 (?) : Connection refused
WARNING: No response from dynamic task server. Retrying...
...
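
Given the repeated "Connection refused" on port 9471 above, one simple check is whether anything is actually listening on that port on the host running paramrun. A small sketch (port number taken from the log; nc and ss assumed to be available):

# Probe the dynamic task server port reported in the log
nc -z -v localhost 9471 || echo "nothing listening on 9471"
ss -ltn | grep 9471 || echo "port 9471 not in LISTEN state"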

-----------------------------------------------------------------------------

-----------------------------------------------------------------------------

The contents of the launcher submission script (SCRIPTLAUNCHER.SH):

#!/bin/bash

echo "Press CTRL+C to proceed."

trap "pkill -f 'sleep 1h'" INT
trap "set +x ; sleep 1h ; set -x" DEBUG

# --------SCHEDULER OPTIONS-------
#SBATCH -J parametric
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -p manycores24
#SBATCH -w "guane03"
#SBATCH -o parametric.o%j

# --------GENERAL OPTIONS---------
export LAUNCHER_DIR=/home/luis_sc3/plantatrabajo/launcher   # path to the launcher directory
export PATH=$LAUNCHER_DIR:$PATH
export PATH=/usr/bin/python2.7:$PATH
export PATH=/usr/lib/python2.7:$PATH

export LAUNCHER_PLUGIN_DIR=$LAUNCHER_DIR/plugins   # resource manager plugins
export LAUNCHER_RMI=SLURM
export LAUNCHER_NHOSTS=1
export LAUNCHER_NPROCS=24
export LAUNCHER_PPN=2   # jobs per node
export LAUNCHER_PLUGIN_DIR=$LAUNCHER_DIR/plugins   # resource manager plugins
export LAUNCHER_RMI=SLURM

export EXECUTABLE=$LAUNCHER_DIR/init_laucher

export LAUNCHER_WORKDIR=$PWD         # directory containing the job file
export LAUNCHER_JOB_FILE=ejecucion   # job file

export CONTROL_FILE=   # job file

# --------TASKS SCHEDULING OPTIONS
export LAUNCHER_SCHED=dynamic

# --------EXECUTING---------------
$LAUNCHER_DIR/paramrun

---------------------------------------------------------------------------------