kternus opened 6 years ago
On Sat, Jun 09, 2018 at 07:34:49AM -0700, Krista Ternus wrote:
This would be used in cases where non-complex metagenomes do not require a high amount of sequencing coverage to capture all of the necessary information.
It would be an option for dahak users to implement after read filtering and before any downstream analysis takes place.
Question: instead of downsampling, what about a streaming identification that reports back when high confidence has been reached? This is something we already have implemented in sourmash (`sourmash watch`). But maybe this is too limited and we want to enable more than "just" taxonomy.
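This is not how sourmash implements it, but the "report back once confidence is reached" idea can be illustrated with a toy stopping rule: stream reads, and stop once a window of reads contributes almost no new k-mers (k-mer saturation standing in for classification confidence). All names here are made up for illustration:

```python
import itertools

def kmers(seq, k=21):
    """Yield all k-mers in a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def stream_until_saturated(reads, k=21, window=100, min_new_frac=0.05):
    """Consume reads one at a time; stop once a window of reads
    contributes almost no new k-mers (a crude 'high confidence' proxy)."""
    seen = set()
    new_in_window = 0
    total_in_window = 0
    n = 0
    for n, read in enumerate(reads, 1):
        for km in kmers(read, k):
            total_in_window += 1
            if km not in seen:
                seen.add(km)
                new_in_window += 1
        if n % window == 0:
            if total_in_window and new_in_window / total_in_window < min_new_frac:
                return n, seen  # saturated: stop early
            new_in_window = total_in_window = 0
    return n, seen

# toy usage: a low-complexity "sample" saturates within the first window
reads = itertools.islice(itertools.cycle(["ACGTACGTACGTACGTACGTACGTAC"]), 10000)
n_used, kmer_set = stream_until_saturated(reads, k=5, window=100)
```

A real streaming classifier would update a signature/sketch per read and stop when the classification stabilizes, but the control flow is the same: consume, update, test a stopping criterion.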
So, from the basic "let's downsample" POV, the simplest approach by far is to just take the first 1% of a read file and do whatever you want with that. The only tricky bit is figuring out what 1% is for a large, gzipped reads file.
This subsampling approach works well in theory, since shotgun sequencing is (in theory) random. The problem is that in practice that randomness assumption isn't entirely true, so the first 1% of a file is not a truly random 1%.
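One way to sketch the "figure out what 1% is" problem: since you can't know the record count of a gzipped FASTQ without decompressing it, make one cheap streaming pass to count records, then a second pass to emit the first 1%. This is an illustrative sketch (function names are invented), not dahak code:

```python
import gzip
import io
import os
import tempfile

def count_fastq_records(path):
    """First pass: stream through the gzip file and count records.
    (Each FASTQ record is exactly 4 lines.)"""
    with gzip.open(path, "rt") as fh:
        return sum(1 for _ in fh) // 4

def first_fraction(path, out, fraction=0.01):
    """Second pass: write the first `fraction` of records to `out`."""
    n_keep = max(1, int(count_fastq_records(path) * fraction))
    with gzip.open(path, "rt") as fh:
        for _ in range(n_keep * 4):
            line = fh.readline()
            if not line:
                break
            out.write(line)
    return n_keep

# demo: 400 tiny records -> the first 1% is 4 records
rec = "@r\nACGT\n+\nIIII\n"
tmp = tempfile.NamedTemporaryFile(suffix=".fastq.gz", delete=False)
tmp.close()
with gzip.open(tmp.name, "wt") as fh:
    fh.write(rec * 400)
buf = io.StringIO()
kept = first_fraction(tmp.name, buf, fraction=0.01)
os.unlink(tmp.name)
```

The double pass costs a full decompression up front; that's the price of knowing exactly what "1%" means for a gzipped file.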
We have several interesting downsampling / subsampling approaches already implemented in khmer.
`sample-reads-randomly.py` is probably the best place to start, as it does a true random sample (via reservoir sampling) and has lots of options, e.g. for randomly sampling M reads from just the first N reads.
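The core of that script is reservoir sampling, which can be sketched in a few lines: keep a reservoir of M reads, and replace entries with decreasing probability as the stream goes by, optionally capping how far into the stream you look. This is an illustrative reimplementation, not khmer's code:

```python
import random

def sample_reads(reads, num_sample, max_reads=None, seed=42):
    """Reservoir-sample `num_sample` reads, considering at most the
    first `max_reads` reads of the stream (i.e. sample M reads from
    just the first N)."""
    rng = random.Random(seed)
    reservoir = []
    for i, read in enumerate(reads):
        if max_reads is not None and i >= max_reads:
            break
        if len(reservoir) < num_sample:
            reservoir.append(read)
        else:
            # read i survives with probability num_sample / (i + 1)
            j = rng.randrange(i + 1)
            if j < num_sample:
                reservoir[j] = read
    return reservoir

# sample 100 reads from the first 10,000 of a 100,000-read stream
sample = sample_reads((f"read{i}" for i in range(100000)), 100, max_reads=10000)
```

Reservoir sampling is single-pass and needs memory only for the M sampled reads, which is why it works on arbitrarily large (including gzipped, streamed) read files.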
digital normalization is another subsampling approach that preserves k-mer content/graph structure. We have some reasonably fast implementations of that in khmer. The advantage over random subsampling is that you don't have to specify a number like "1%" up front.
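The diginorm idea itself is compact enough to sketch: keep a read only if the median count of its k-mers (among reads kept so far) is below a coverage cutoff, then count its k-mers. khmer's actual implementation (`normalize-by-median.py`) uses a probabilistic counting structure rather than the exact dictionary used here for clarity; this is a toy version:

```python
from collections import Counter
from statistics import median

def diginorm(reads, k=21, cutoff=20):
    """Toy digital normalization: keep a read only if the median count
    of its k-mers (so far) is below `cutoff`. This discards reads from
    already-well-covered regions while preserving rare k-mer content."""
    counts = Counter()
    kept = []
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kms:
            continue
        if median(counts[km] for km in kms) < cutoff:
            kept.append(read)
            for km in kms:
                counts[km] += 1
    return kept

# 100 identical reads collapse to the handful needed to hit the cutoff
kept = diginorm(["ACGTACGTAC"] * 100, k=5, cutoff=20)
```

This is what makes diginorm different from "pick 1%": the amount of data kept adapts to the sample's complexity, so no up-front fraction needs to be chosen.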
there's another approach, using diginorm-like methods, that could be adapted; we implemented it here in a package called 'syrah'.
Taking a step back, then, I would say:

- if the goal is only taxonomic identification, simplest might be to use `sourmash watch`.
- if we want an OK-but-not-great 1%, then taking the first 1% of a file is straightforward.
- if we want a pretty-good 1% to feed into lots of downstream stuff, then sampling 1% of the reads from the first 10% of the file (`sample-reads-randomly.py`) might work well.
- if we want a small-but-informative subset of the data, we can bring out the khmer heavy guns and use either of the last two approaches in the list above.
Happy to discuss this on a call.
Thanks for sharing all of those options! It would be good to talk through these options more on a call.
I believe we would want to run multiple dahak workflows beyond taxonomic classification, which would exclude sourmash watch. That said, I didn't realize sourmash watch existed until this moment, and it sounds really interesting; I'll file away all of those questions for later too.
`sample-reads-randomly.py` sounds like it's ready to go and better than taking the first 1% of a file.
Solutions to create a small-but-informative subset of the data sound compelling, but I'm not sure how challenging it would be to integrate khmer heavy guns into dahak?