biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0
563 stars 105 forks source link

bug when downsampling twice with the same seed #428

Closed isthisthat closed 3 years ago

isthisthat commented 4 years ago

This is the culmination of weeks of hunting this bug down.. In the following sequence, the second command will not sub-sample, instead it will output the input bam unchanged (except for an additional header tag):

sambamba -q view -s .01 --subsampling-seed 123 -f bam -o ss1.bam a.bam
sambamba -q view -s .5 --subsampling-seed 123 -f bam -o ss2.bam ss1.bam

The second command will work if the seed is changed, e.g.

sambamba -q view -s .5 --subsampling-seed 124 -f bam -o ss2.bam ss1.bam

The exact same bug exists in samtools as well. So not sure where this is coming from!?

samtools view -s .01 -b -o ss1.bam a.bam
samtools view -s .5 -b -o ss2.bam ss1.bam # doesn't work
samtools view -s 123.5 -b -o ss2.bam ss1.bam # works
isthisthat commented 4 years ago

This is the latest sambamba (0.7.1) and latest samtools (1.10)

isthisthat commented 4 years ago

This is apparently a behaviour documented in the samtools view manual: https://www.htslib.org/doc/samtools-view.html#OPTIONS

pjotrp commented 4 years ago

Yeah. Sambamba mirrors samtools

-s FLOAT

    Output only a proportion of the input alignments. This subsampling acts in the same way on all of the alignment records in the same template or read pair, so it never keeps a read but not its mate.

    The integer and fractional parts of the -s INT.FRAC option are used separately: the part after the decimal point sets the fraction of templates/pairs to be kept, while the integer part is used as a seed that influences which subset of reads is kept.

    When subsampling data that has previously been subsampled, be sure to use a different seed 
value from those used previously; otherwise more reads will be retained than expected.