dahak-metagenomics / dahak

Benchmarking and containerization of tools for the analysis of complex non-clinical metagenomes.
https://dahak-metagenomics.github.io/dahak
BSD 3-Clause "New" or "Revised" License

Could an optional workflow/script be provided to downsample datasets to 1%? #85

Open kternus opened 6 years ago

kternus commented 6 years ago

This would be used in cases where non-complex metagenomes do not require a high amount of sequencing coverage to capture all of the necessary information.

It would be an option for dahak users to implement after read filtering and before any downstream analysis takes place.

ctb commented 6 years ago

On Sat, Jun 09, 2018 at 07:34:49AM -0700, Krista Ternus wrote:

This would be used in cases where non-complex metagenomes do not require a high amount of sequencing coverage to capture all of the necessary information.

It would be an option for dahak users to implement after read filtering and before any downstream analysis takes place.

Question: instead of downsampling, what about a streaming identification that reports back when high confidence has been reached? This is something we already have implemented in sourmash (sourmash watch). But maybe this is too limited and we want to enable more than "just" taxonomy.

So,

From the basic "let's downsample" POV, the simplest approach by far is to just take the first 1% of a read file and do whatever you want with that. The only tricky bit is figuring out what 1% is for a large, gzipped reads file.
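To make the "tricky bit" concrete: a gzipped FASTQ file doesn't tell you how many reads it contains, so knowing what 1% is requires one full decompression pass just to count records, then a second pass to copy the leading slice. A minimal sketch (the function name `head_fraction` is hypothetical, not part of dahak or khmer):

```python
import gzip
import itertools

def head_fraction(path, out_path, fraction=0.01):
    """Write the first `fraction` of FASTQ records from a gzipped file.

    Pass 1 counts records (4 lines per FASTQ record); pass 2 copies the
    leading slice. The counting pass is the cost of knowing what 1%
    of a compressed file actually is.
    """
    with gzip.open(path, "rt") as fh:
        n_records = sum(1 for _ in fh) // 4
    keep = int(n_records * fraction)
    with gzip.open(path, "rt") as fh, gzip.open(out_path, "wt") as out:
        for line in itertools.islice(fh, keep * 4):
            out.write(line)
    return keep
```

Note that this takes the *first* reads in file order, which is exactly the approach whose assumptions are questioned below.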

This subsampling approach works well in theory, since shotgun sequencing is (in theory) random with respect to position in the file. The problem is that this assumption isn't entirely true in practice, so the first 1% of a file isn't necessarily a representative 1%.

We have several interesting downsampling / subsampling approaches already implemented in khmer.
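One standard way to subsample uniformly without the representativeness problem above is reservoir sampling, which (as I understand it) is the strategy behind khmer's sample-reads-randomly.py. A self-contained sketch of the technique, independent of khmer:

```python
import random

def reservoir_sample(records, k, seed=None):
    """Uniformly sample k records from a stream of unknown length.

    Classic reservoir sampling: one pass, O(k) memory, and every record
    in the stream ends up in the output with equal probability, so it
    doesn't matter whether the file's read order is biased.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir
```

For FASTQ input, `records` would be an iterator over 4-line read records rather than raw lines.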

Taking a step back, then, I would say:

Happy to discuss this on a call.

kternus commented 6 years ago

Thanks for sharing all of those options! It would be good to talk through them more on a call.

I believe we would want to run multiple dahak workflows beyond taxonomic classification, which would exclude sourmash watch. That said, I didn't realize sourmash watch existed until just now, and it sounds really interesting; I will file away all of those questions for later too.

sample-reads-randomly.py sounds like it's ready to go, and better than taking the first 1% of a file.

Solutions that create a small-but-informative subset of the data sound compelling, but I'm not sure how challenging it would be to integrate khmer's heavy guns into dahak.