arendsee closed 7 years ago
I see a '-p tgac-long' on the SLURM command. Could that be the reason for the problem? Have you had a chance to re-try that?
The master is waiting for the slaves to connect and nothing is happening on the slave end, which may well be because the master failed to launch them.
Thanks, that fixes the infinite loop issue.
The -p tgac-long option is present in the satsuma_run.sh file, which I uncommented and ran directly (without much understanding):
# SLURM systems
#echo "#!/bin/sh" > slurm_tmp.sh
#echo srun $2 >> slurm_tmp.sh
#sbatch -p tgac-long -c $3 -J $5 -o ${5}.log --mem ${4}G slurm_tmp.sh
Would it be better to leave the option out or perhaps add a note in the comments above the SLURM code?
The satsuma_run.sh script allows you to specify which submission system (if any) you are using on your HPC. The default is none, i.e. running jobs on the local machine. If you use SLURM, LSF or PBS, you should uncomment the relevant section in the script. I'll add a note in the README to also change the queue name (-p in SLURM) to a queue name that exists on your system.
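As a rough sketch of that advice, the snippet below checks whether a partition name exists and builds the sbatch line with the partition as a parameter instead of the hard-coded tgac-long. The function names partition_exists and build_sbatch_cmd are hypothetical helpers for illustration and are not part of Satsuma. The sbatch options mirror the ones in the SLURM section quoted above. When the partition argument is empty, -p is omitted, so SLURM falls back to the cluster's default partition.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Satsuma: a sketch of making the
# SLURM partition configurable instead of hard-coding "-p tgac-long".

# partition_exists NAME
# Reads partition names on stdin, one per line (for example the output
# of `sinfo -h -o %P`), and succeeds if NAME is among them.
# sinfo marks the default partition with a trailing '*', so strip it.
partition_exists() {
    sed 's/\*$//' | grep -qxF -- "$1"
}

# build_sbatch_cmd PARTITION CORES MEM_GB JOBNAME
# Prints the sbatch command line; an empty PARTITION leaves out -p.
build_sbatch_cmd() {
    if [ -n "$1" ]; then
        echo "sbatch -p $1 -c $2 -J $4 -o ${4}.log --mem ${3}G slurm_tmp.sh"
    else
        echo "sbatch -c $2 -J $4 -o ${4}.log --mem ${3}G slurm_tmp.sh"
    fi
}

# Example usage on a cluster with SLURM installed (left commented out
# so the snippet does not require sinfo to be present):
#   if sinfo -h -o %P | partition_exists tgac-long; then
#       build_sbatch_cmd tgac-long 4 8 myjob
#   else
#       build_sbatch_cmd "" 4 8 myjob
#   fi
```

Printing the command rather than running it makes it easy to inspect what would be submitted before editing satsuma_run.sh itself.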
We ran the following command on our cluster:
With the following contents in satsuma_run.sh:
This run produced this output:
Everything starts fine, but then seems to fall into an infinite loop of:
The output directory contains only these files:
Where satsuma.log is empty. Also, nothing is printed to stderr.