biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

Subsampling with sambamba produces wrong number of reads #392

Closed: nFursova closed this issue 5 years ago

nFursova commented 5 years ago

Hi,

I am trying to subsample BAM files using sambamba and I get an unexpected number of reads in the output file. I am using sambamba 0.6.7 and the following command for subsampling:

sambamba view -h -t 56 -f bam --subsampling-seed=123 -s $fraction $bam -o $name_$varname.bam

My starting file has 97541577 reads. I am trying to subsample a fraction of 0.8137260791, which should result in ~79372125 reads. However, the sambamba output contains 89908779 reads. I don't think there is anything wrong with the original BAM file: when I use samtools view -s, I get the expected number of reads, very close to 79372125. I would really appreciate any advice on this issue. Is this a known bug, or am I doing something wrong?
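For reference, the arithmetic behind these numbers can be checked with a short Python sketch. It only uses the three figures quoted above. The binomial standard deviation is an added assumption (each read kept independently with probability equal to the fraction), included only to give a rough sense of how much random variation one might expect; it is not a statement about how sambamba or samtools actually sample.

```python
import math

total_reads = 97_541_577      # reads in the input BAM, as reported above
fraction = 0.8137260791       # value passed to -s
observed_reads = 89_908_779   # reads in the sambamba output, as reported above

# Expected number of reads for the requested fraction.
expected_reads = round(total_reads * fraction)

# Relative excess of the observed count over the expected count.
relative_excess = observed_reads / expected_reads - 1

# Fraction of the input that was actually kept.
effective_fraction = observed_reads / total_reads

# ASSUMPTION: if each read were kept independently with probability `fraction`,
# the output count would be binomial with this standard deviation.
binomial_sd = math.sqrt(total_reads * fraction * (1 - fraction))

print(f"expected reads:     {expected_reads}")
print(f"relative excess:    {relative_excess:.1%}")
print(f"effective fraction: {effective_fraction:.4f}")
print(f"binomial sd:        {binomial_sd:.0f} reads")
```

With these inputs the output is roughly 13% larger than expected, which corresponds to an effective fraction of about 0.92 rather than 0.81. Under the independence assumption, that gap is far larger than random variation alone would explain.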

Best wishes,

Nadya

nFursova commented 5 years ago

I have tried running with -t 1, thinking that something might be going wrong during parallelization, but got exactly the same number of reads, 89908779, which is much larger than I expect.

nFursova commented 5 years ago

I have been running exactly the same command in the past for smaller BAM files, and haven't noticed any deviation from the expected number of reads (not more than 0.5%). I am wondering if the size of the input file could be causing issues...

isthisthat commented 4 years ago

I've noticed that when sub-sampling the same bam twice, the second run will not throw out the expected number of reads. Are you sub-sampling multiple times?
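One possible mechanism for this, offered as an assumption and not confirmed for sambamba anywhere in this thread, is a subsampler that decides deterministically from a hash of the read name and the seed. In that case a second pass with the same seed would keep every read that survived the first pass. The Python sketch below illustrates the idea; `keep_read` and `subsample` are hypothetical helpers, and the SHA-256 hash is chosen only for illustration.

```python
import hashlib

def keep_read(name: str, seed: int, fraction: float) -> bool:
    """Hypothetical filter: keep a read if its seeded name hash maps below `fraction`."""
    digest = hashlib.sha256(f"{seed}:{name}".encode()).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1].
    value = int.from_bytes(digest[:8], "big") / 2**64
    return value < fraction

def subsample(names, seed, fraction):
    """Return the read names that pass the hypothetical hash-based filter."""
    return [n for n in names if keep_read(n, seed, fraction)]

reads = [f"read{i}" for i in range(100_000)]  # made-up read names

first = subsample(reads, seed=123, fraction=0.8)
second_same_seed = subsample(first, seed=123, fraction=0.8)
second_new_seed = subsample(first, seed=456, fraction=0.8)

print(len(reads), len(first), len(second_same_seed), len(second_new_seed))
```

In this sketch the second pass with the same seed removes nothing, while a different seed removes about 20% again. Whether this applies here depends on how the input BAM was produced and on the tool's actual implementation.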

pjotrp commented 4 years ago

Also discussed in #428