biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0
569 stars 106 forks source link

[feature request] : Subsampling a fixed number of records #262

Closed sklages closed 7 years ago

sklages commented 8 years ago

Hi,

I had a few large BAM files and wanted to write out a fixed number of alignments. But sambamba itself just offers to use a fraction.

Would it be possible to add an option to subsample a fixed number of records?

I am aware that I can workaround this by combination of sambamba with some unix tools. But I think it would be nicer to have this integrated in sambamba.

I wrote this feature request on samtools github issues as well.

best, Sven

lomereiter commented 8 years ago

Hi Sven, You are referring to reservoir sampling, and I agree it is more intuitive. However, there are a few issues with it, compared to the current probabilistic approach:

So it could be added with restrictions, such as the strict requirement for an indexed .bam and some limit on the requested number of reads, above which it would issue a warning at the start and fall back to the inexact sampling.

That said, I still don't completely understand the motivation behind it, other than love for round numbers and intuitiveness. Are there any more profound reasons?

sklages commented 8 years ago

Hi,

if I have a few hundreds of WGS BAM files and i want to do some testing. In this case I'd simply like to subsample a fixed number of records, indepedendently of the overall coverage. That's all. Requiring an index is no problem at all, because it will (in most cases) already be there.

Using 'samtools view' in combination with 'shuf' or similar stuff is also quite memory hungry :-)

And if I want to subsample 100,000 records from a 100GB BAM file I probably need a lot of memory in any case ...

So in any way, this is more a "nice to have" than a "must have". But people are doing such fixed number subsampling, using workarounds. I think it is nicer to have it directly in the "BAM file reading software" ..

best, Sven

mschilli87 commented 8 years ago

Just to make sure I understand, are we talking about something like samtools view -s $(bc <<< "scale=10;<NREADS> / $(samtools view -c <BAM>)") <BAM> without the rounding errors introduced by my bc call?

sklages commented 8 years ago

yep. I'd say so.