biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

Sambamba seems not to work with GNU parallel #412

Closed Rohit-Satyam closed 4 years ago

Rohit-Satyam commented 4 years ago

Sambamba is indeed very fast and has made life easier. However, when I try to combine its speed with parallelization using GNU Parallel, it seems to fail.

The command line or bash script used:

#BSUB -J sort.sh
#BSUB -o sort.o
#BSUB -e sort.e
#BSUB -m node16
#BSUB -n 8
#BSUB -q regularq
##Location of
tst='/home/parashar/scratch/bwa'

##put the sample name in a file
##The sample file will be stored in current directory

ls -1 $tst/*_filtered.bam | xargs -n 1 basename | uniq > temp2.txt

while read p
do
basename ${p} _filtered.bam  >> testtemp.txt
done < temp2.txt
sort testtemp.txt | uniq > test.txt

cat test.txt | parallel "sambamba sort -o {}_sorted.bam {}_filtered.bam 2> {}.stderr"

When I use the above script with samtools, it works well and sorted BAM files are produced. However, when I use sambamba sort with the same script structure, it doesn't seem to start multiple jobs, and no sorted BAM files are produced. The problem is specific to sambamba sort; sambamba view works fine with GNU parallel.

Can you explain this behavior? I have multiple BAM files to sort.
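For reference, the sample-list step of the script (the ls/basename/uniq chain that writes test.txt) can be condensed. The sketch below is a minimal equivalent; `list_samples` is a hypothetical helper name, not part of the original script. Like the original, it prints names with the directory stripped, so the later `{}_filtered.bam` in the parallel command is resolved relative to the directory the job runs in.

```shell
#!/usr/bin/env bash
# list_samples DIR (hypothetical helper): print one sample name per line
# for every DIR/*_filtered.bam, sorted and without duplicates.
list_samples() {
    local dir=$1 f
    for f in "$dir"/*_filtered.bam; do
        [ -e "$f" ] || continue        # the glob matched nothing
        basename "$f" _filtered.bam    # strip directory and suffix
    done | sort -u
}

# Usage, with the directory from the script above:
#   list_samples /home/parashar/scratch/bwa > test.txt
```

Globbing directly avoids parsing `ls` output and the two intermediate files (temp2.txt, testtemp.txt).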

mschilli87 commented 4 years ago

@Rohit-Satyam:

version of sambamba using: sambamba 0.6.6

This version is over two years old. Is there any chance you could test the latest (or at least a more recent) version? Maybe somebody has solved this in the meantime. If this is indeed an issue with sambamba, you'd have to update anyway to get any possible fix.


edit: Also, which version of GNU parallel are you using?

pjotrp commented 4 years ago

Looks like a scripting issue rather than a sambamba issue.

Rohit-Satyam commented 4 years ago

Hi! The HPC I was working on is down at the moment, so I can't tell you the version of GNU parallel. However, after trying multiple approaches, one worked for me. Previously I was trying to run 8 jobs in parallel over 8 processors. But when I limited the number of jobs running at a time to 2-3 using parallel -j 2, it began to work.

Also, when it started 8 jobs at once, sambamba briefly began creating multiple temporary files (more than 50) in the same folder. I think running sambamba in parallel is computationally expensive and requires more memory.

However, I will try once again with the newest version.
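If the shared temporary files and memory use are what goes wrong with 8 jobs, one thing to try is giving each sort its own scratch directory and an explicit memory cap. The sketch below is only an illustration: `sort_one` is a hypothetical wrapper, the 2GB limit is a placeholder rather than a measured value, and it relies on the `-m`/`--memory-limit` and `--tmpdir` options of sambamba sort.

```shell
#!/usr/bin/env bash
# sort_one SAMPLE (hypothetical wrapper): sort SAMPLE_filtered.bam using a
# private temporary directory and an explicit memory limit.
sort_one() {
    local sample=$1
    local tmp rc
    # private scratch directory, so concurrent sorts do not share temp files
    tmp=$(mktemp -d "${TMPDIR:-/tmp}/sambamba_${sample}.XXXXXX") || return 1
    # 2GB is a placeholder; pick a value that fits (node memory / jobs)
    sambamba sort -m 2GB --tmpdir="$tmp" \
        -o "${sample}_sorted.bam" "${sample}_filtered.bam" 2> "${sample}.stderr"
    rc=$?
    rm -rf "$tmp"
    return "$rc"
}

# Usage with GNU parallel, two jobs at a time as reported to work above
# (assumes bash is the shell parallel starts):
#   export -f sort_one
#   parallel -j 2 sort_one < test.txt
```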

mschilli87 commented 4 years ago

@Rohit-Satyam: I think it would be easier to help if you could make it fail consistently with a minimal example that we can run locally. Once your HPC is back, you could use qrsh (or equivalent) to get an interactive shell and run things there. Also, you might want to explicitly tell sambamba sort how many threads to use to ensure you don't exceed the resources you requested for your job.

In general, this would be much more appropriate on a more general platform like Stack Overflow, as it's likely unrelated to sambamba. Also, you will have many more people reading your post there and potentially contributing to a solution.
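A minimal sketch of the thread suggestion above, assuming the 8 slots requested with BSUB -n 8 are split evenly between concurrent jobs. `threads_per_job` and `sort_all` are hypothetical helper names; `-t` is the thread-count option of sambamba sort and `-j` is GNU parallel's job limit.

```shell
#!/usr/bin/env bash
# threads_per_job SLOTS JOBS: integer share of SLOTS per job, at least 1.
threads_per_job() {
    local slots=$1 jobs=$2 t
    t=$(( slots / jobs ))
    if [ "$t" -lt 1 ]; then t=1; fi
    echo "$t"
}

# sort_all LIST SLOTS JOBS: run JOBS sorts at a time over the sample names in
# LIST, each with an explicit thread count, so that JOBS * threads <= SLOTS.
sort_all() {
    local list=$1 slots=$2 jobs=$3 threads
    threads=$(threads_per_job "$slots" "$jobs")
    parallel -j "$jobs" \
        "sambamba sort -t $threads -o {}_sorted.bam {}_filtered.bam 2> {}.stderr" \
        < "$list"
}

# Usage: 8 slots, 2 concurrent jobs, so 4 threads per sort
#   sort_all test.txt 8 2
```

Without `-j`, GNU parallel starts one job per CPU core, and each sort may start several threads of its own, so the total can exceed what the job requested.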

Rohit-Satyam commented 4 years ago

Hi!

On your recommendation, I tried to install the latest version of sambamba from https://anaconda.org/bioconda/sambamba. The page lists version 0.7, with the last upload 4 months and 13 days ago. However, when I install it, it shows:

sambamba 0.6.6

This version was built with: LDC 0.17.1 using DMD v2.068.2 using LLVM 3.8.0 bootstrapped with version not available

What should I do?

mschilli87 commented 4 years ago

@Rohit-Satyam: Looks like an issue in anaconda then. I have 0.6.8 installed via GNU Guix, and sambamba --version reports 0.6.8.
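One possible explanation (an assumption, not confirmed in this thread) is that an older sambamba binary comes earlier on PATH than the newly installed one. The sketch below lists every sambamba executable found on PATH together with the version it reports. `parse_version` and `report_sambamba` are hypothetical helper names, and the parsing assumes the banner starts with a line like "sambamba 0.6.6", as shown above.

```shell
#!/usr/bin/env bash
# parse_version: read a version banner on stdin and print the number that
# follows "sambamba" on the first matching line.
parse_version() {
    sed -n 's/^sambamba \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# report_sambamba: print "path<TAB>version" for every sambamba on PATH.
# "type -aP" is a bash builtin that lists all matching executables.
report_sambamba() {
    type -aP sambamba | while IFS= read -r bin; do
        # the banner may go to stderr, so capture both streams
        printf '%s\t%s\n' "$bin" "$("$bin" --version 2>&1 </dev/null | parse_version)"
    done
}

# Usage:
#   report_sambamba
```

If more than one path is printed, the first one is the binary the shell actually runs.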

pjotrp commented 4 years ago

Try the latest release: https://github.com/biod/sambamba/releases