bioinfologics / satsuma2

FFT cross-correlation based synteny aligner, (re)designed to make full use of parallel computing

Segmentation fault // Connecting to master: Connection refused #37

Open fuesseler opened 2 years ago

fuesseler commented 2 years ago

Hello! I am trying to run SatsumaSynteny2 on SLURM and I keep running into problems. My target is an assembly (size 2GB) and I am trying to identify sex-linked scaffolds in it by aligning to a query of size 17MB. I hope you can help me figure out how to make it work!

The command:

/hpc-cloud/.conda/envs/environment-satsuma2/bin/SatsumaSynteny2 -q SceUnd_NC_056531.fasta -t ref_normalized.fasta -o workdir-satsuma

The resources I supplied:

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=200GB

Error message:

/var/spool/slurmd.spool/job149759/slurm_script: line 45: 128713 Segmentation fault

The Kmer log files get created without a problem, as far as I can see.

The slave gets launched, but it looks like the master and slave cannot connect. The SL1.log file shows this message:

Loading query sequence:  SceUnd_NC_056531.fasta
 - Creating query chunks...
select=0        chunks=4116
chunks: 4116
DONE
Loading target sequence:  ref_normalized.fasta
 - Creating target chunks...
select=0        chunks=666370
chunks: 666370
DONE
TIME SPENT ON LOADING: 30
== launching workers ==
== Entering communication loop ==
comm loop for CompNode02.hpc-cloud 3491
worker created, now to work!!!
ERROR connecting to master: Connection refused
ERROR connecting to master: Connection refused

The slurm_tmp.sh file's contents before the job fails show this command, which leads me to believe the problem occurs in HomologyByXCorrSlave:

srun /hpc-cloud/.conda/envs/environment-satsuma2/bin/HomologyByXCorrSlave -master CompNode02.hpc-cloud -port 3491 -sid 1 -p 1 -q SceUnd_NC_056531.fasta -t ref_normalized.fasta -l 0 -q_chunk 4096 -t_chunk 4096 -min_prob 0.99999 -cutoff 1.8
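From the command above, the worker appears to connect back to the master over plain TCP using the -master and -port values. A minimal sketch of how one might check, from a compute node, whether the master's hostname resolves and the port is reachable (the host CompNode02.hpc-cloud and port 3491 are taken from the log above; bash, getent and GNU timeout are assumed to be available):

```shell
#!/bin/bash
# Connectivity probe (sketch). Host and port come from the slurm_tmp.sh
# command above; everything else here is an assumption.
host=CompNode02.hpc-cloud
port=3491

# Does the master's hostname resolve from this node?
getent hosts "$host" || echo "cannot resolve $host"

# Can we open a TCP connection to the master's port?
# Uses bash's /dev/tcp redirection; 'timeout' avoids hanging on
# silently filtered ports.
if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
  status="reachable"
else
  status="not reachable"
fi
echo "$host:$port is $status"
```

If this reports "not reachable" from a worker node while the master job is still running, inter-node traffic on that port is probably blocked, or the master is listening on a different interface.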

I am not sure whether I configured satsuma_run.sh correctly: for "QueueName" I set the name of the partition that I was planning to run everything on. Is there something else I should have configured so that master and slave can communicate? I have never used a program before that uses master and slave processes.

#!/bin/sh

# Script for starting Satsuma jobs on different job submission environments
# One section only should be active, ie. not commented out

# Usage: satsuma_run.sh <current_path> <kmatch_cmd> <ncpus> <mem> <job_id> <run_synchronously>
# mem should be in Gb, ie. 100Gb = 100

# no submission system, processes are run locally either synchronously or asynchronously
#if [ "$6" -eq 1 ]; then
#  eval "$2"
#else
#  eval "$2" &
#fi

##############################################################################################################
## For the sections below you will need to change the queue name (QueueName) to one existing on your system ##
##############################################################################################################

# qsub (PBS systems)
#echo "cd $1; $2" | qsub -V -qQueueName -l ncpus=$3,mem=$4G -N $5

# bsub (LSF systems)
#mem=`expr $4 + 1000`
#bsub -o ${5}.log -J $5 -n $3 -q QueueName -R "rusage[mem=$mem]" "$2"

# SLURM systems
echo "#!/bin/sh" > slurm_tmp.sh
echo srun $2 >> slurm_tmp.sh
sbatch -p Spc -c $3 -J $5 -o ${5}.log --mem ${4}G slurm_tmp.sh
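
Since the active SLURM section submits each worker as its own sbatch job, I suppose workers could land on nodes that cannot reach the master's port. A hedged sketch of an alternative SLURM section that pins worker jobs onto the master's node, so master/worker TCP traffic stays node-local (this is not from the Satsuma docs; `-w`/`--nodelist` and `SLURMD_NODENAME` are standard SLURM, and "Spc" is kept from my script):

```shell
# SLURM systems -- sketch: pin workers to the node running the master so
# their TCP traffic never crosses a node boundary.
# Assumes this script is invoked from inside the master's SLURM job,
# where SLURMD_NODENAME names the node the master runs on.
echo "#!/bin/sh" > slurm_tmp.sh
echo srun $2 >> slurm_tmp.sh
sbatch -p Spc -w "${SLURMD_NODENAME}" -c $3 -J $5 -o ${5}.log --mem ${4}G slurm_tmp.sh
```

Whether the master's node has enough free CPUs and memory for all workers is another question, so this would only be a workaround.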

The test dataset behaves the same way. I would appreciate your help! Thanks,