bioinfologics / satsuma2

FFT cross-correlation based synteny aligner, (re)designed to make full use of parallel computing

Infinite loop: "MAIN: nothing changed, skipping cycle and waiting 3 seconds" #6

Closed by arendsee 7 years ago

arendsee commented 7 years ago

We ran the following command on our cluster:

    $SATSUMA2_PATH/SatsumaSynteny2 \
         -q $query                 \
         -t $target                \
         -o $outdir                \
         -slaves 4                 \
         -threads 4                \
         -dups 1                   \
         -dump_cycle_matches

With the following contents in satsuma_run.sh:

    # Script for starting Satsuma jobs on different job submission environments
    # Comment out the lines not required
    # Usage: satsuma_run.sh <current_path> <kmatch_cmd> <ncpus> <mem> <job_id> <run_synchronously>
    # mem should be in Gb, ie. 100Gb = 100

    # # no submission system, run process locally either synchronously or asynchronously
    # if [ "$6" -eq 1 ]; then
    #   eval "$2"
    # else
    #   eval "$2" &
    # fi

    # SLURM systems
    echo "#!/bin/sh" > slurm_tmp.sh
    echo srun --time 96:00:00 $2 >> slurm_tmp.sh
    sbatch -t 96:00:00 -p tgac-long -c $3 -J $5 -o ${5}.log -N 1 --mem ${4}G slurm_tmp.sh

This run produced this output.

Everything starts fine, but then seems to fall into an infinite loop of:

    MAIN: nothing changed, skipping cycle and waiting 3 seconds
    MAIN: starting iteration 2
    WORKQUEUE: matches collected. FW: 0   REV: 0
    MAIN: 0 new matches collected
    MAIN: nothing changed, skipping cycle and waiting 3 seconds
    MAIN: starting iteration 2
    WORKQUEUE: matches collected. FW: 0   REV: 0
    MAIN: 0 new matches collected
    MAIN: nothing changed, skipping cycle and waiting 3 seconds

The output directory contains only these files:

    cycle_1.matches
    kmatch_results.k11
    kmatch_results.k13
    kmatch_results.k15
    kmatch_results.k17
    kmatch_results.k19
    kmatch_results.k21
    kmatch_results.k23
    kmatch_results.k25
    kmatch_results.k27
    kmatch_results.k29
    kmatch_results.k31
    satsuma.log

satsuma.log is empty, and nothing is printed to stderr.

bjclavijo commented 7 years ago

I see a '-p tgac-long' in the SLURM command. Could that be the reason for the problem? Have you had a chance to re-try without it?

What is happening there is that the master is waiting for the slaves to connect while nothing is happening on the slave end, which may well be because the master failed to launch them.
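One way to test this hypothesis is to ask SLURM whether the partition named after `-p` exists at all. The snippet below is only a sketch: `partition_exists` is a hypothetical helper, not part of Satsuma2, and it assumes the standard `sinfo` client is on the PATH.

```shell
#!/bin/sh
# Sketch: check that a SLURM partition exists before pointing sbatch at it.
# partition_exists is a hypothetical helper, not part of Satsuma2.

# partition_exists NAME
# Succeeds if NAME is among the partitions reported by sinfo.
# sinfo marks the default partition with a trailing "*", which is stripped here.
partition_exists() {
    sinfo -h -o '%P' | tr -d '*' | grep -qxF -- "$1"
}

# Example (requires a SLURM cluster):
#   partition_exists tgac-long || echo "partition tgac-long not found" >&2
#   squeue -u "$USER"    # are any slave jobs queued or running?
```

If the partition is missing, or `squeue` shows no slave jobs while the master keeps printing "nothing changed", the submission step is the likely culprit rather than the alignment itself.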

arendsee commented 7 years ago

Thanks, that fixes the infinite loop issue.

The -p tgac-long is present in the satsuma_run.sh file, which I uncommented and ran directly (without much understanding).

# SLURM systems
#echo "#!/bin/sh" > slurm_tmp.sh
#echo srun $2 >> slurm_tmp.sh
#sbatch -p tgac-long -c $3 -J $5 -o ${5}.log --mem ${4}G slurm_tmp.sh

Would it be better to leave the option out or perhaps add a note in the comments above the SLURM code?

jonwright99 commented 7 years ago

The satsuma_run.sh script allows you to specify what sort of submission system (if any) you are using on your HPC. The default is none, i.e. running jobs on the local machine. If you use SLURM, LSF or PBS you should have the relevant section uncommented in the script. I'll add a note in the README to also change the queue name (-p in SLURM) to a queue name that exists on your system.
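As an illustration of that advice, the SLURM section could take the queue name from the environment instead of hard-coding it. This is a minimal sketch, not the shipped script: `submit_slurm` and `SATSUMA_PARTITION` are hypothetical names, and the function takes its arguments in its own order rather than the `$2`..`$5` positions used by satsuma_run.sh. When `-p` is omitted, sbatch submits to the cluster's default partition.

```shell
#!/bin/sh
# Sketch of a SLURM submission step with a configurable queue.
# submit_slurm and SATSUMA_PARTITION are hypothetical, not part of Satsuma2.

# submit_slurm <kmatch_cmd> <ncpus> <mem_gb> <job_id>
submit_slurm() {
    # Write a two-line wrapper script that runs the command under srun.
    printf '%s\n' '#!/bin/sh' > slurm_tmp.sh
    printf '%s\n' "srun $1" >> slurm_tmp.sh

    if [ -n "${SATSUMA_PARTITION:-}" ]; then
        # Queue name supplied by the user; it must exist on this cluster.
        sbatch -p "$SATSUMA_PARTITION" -c "$2" -J "$4" -o "$4.log" --mem "${3}G" slurm_tmp.sh
    else
        # No -p: let SLURM pick the default partition.
        sbatch -c "$2" -J "$4" -o "$4.log" --mem "${3}G" slurm_tmp.sh
    fi
}
```

With this layout a user on another cluster would set `SATSUMA_PARTITION` to one of their own queue names, or leave it unset, rather than editing the sbatch line.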