bioforensics / yeat

YEAT: Your Everyday Assembly Tool

Adding custom downsample flag #21

Closed danejo3 closed 1 year ago

danejo3 commented 1 year ago

The purpose of this PR is to resolve #17.

Originally, YEAT automatically downsampled reads to 150x coverage. This PR adds an option that lets users supply their own downsampling number with the -d flag.
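For reference, the arithmetic behind a coverage-based downsampling target can be sketched as follows. This is a minimal illustration, not YEAT's actual implementation: the function name `reads_to_keep` and its parameters are hypothetical, and it assumes the usual relation coverage = (number of reads x mean read length) / genome size.

```python
import math


def reads_to_keep(genome_size, mean_read_length, coverage=150, paired=True):
    """Estimate how many reads (or read pairs) give the target coverage.

    Hypothetical helper for illustration only; YEAT's own calculation
    may differ.
    """
    if genome_size <= 0 or mean_read_length <= 0 or coverage <= 0:
        raise ValueError("genome_size, mean_read_length and coverage must be positive")
    # coverage = total_reads * mean_read_length / genome_size
    total_reads = coverage * genome_size / mean_read_length
    if paired:
        # Each pair contributes two reads, so keep half as many pairs.
        total_reads /= 2
    return math.ceil(total_reads)
```

A custom value passed by the user would simply replace the number this kind of estimate produces, which is useful when the genome size estimate itself is unreliable.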

From my understanding, a single estimated downsampling number does not suit every situation, because you might need more or fewer total reads.

danejo3 commented 1 year ago

Question: Is there ever a time when you never want to downsample?

I just talked with Diana and she said that, yes, when you are doing metagenomic analysis, you never want to downsample because you need all the reads.

danejo3 commented 1 year ago

> From my understanding, a single estimated downsampling number does not suit every situation, because you might need more or fewer total reads.

From Sean: It comes up the most when we sequence a plasmid. We wind up with 100,000x+ coverage and the way mash calculates genome size breaks down.

standage commented 1 year ago

> ...when you are doing metagenomic analysis, you never want to downsample because you need all the reads.

Yep. One of the problems with shotgun metagenomics is that taxa are present in the sample at uneven abundances. Generating more reads can only help so much, because sequencing more deeply is going to give you piles of reads from the high abundance taxa that are already well covered, and only a few reads of interest from the low abundance taxa.

I will note that the complexity and scale of some metagenomics samples make assembly difficult or impossible without subsampling. But in this case, you wouldn't just perform random uniform sampling: you'd apply a digital normalization approach that reduces the coverage of highly abundant sequences but keeps all sequences of low abundance. The result is a drastic reduction in data (often up to 90%) with very little impact on information content (e.g. which k-mers are present to populate the assembly graph). But this is only true for metagenomic assembly: for metagenomic profiling, diginorm would indeed be inappropriate.
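To make the contrast with uniform random sampling concrete, here is a toy sketch of the digital normalization idea: a read is kept only if the median count of its k-mers, among reads kept so far, is still below a cutoff. The function names, the default `k` and `cutoff` values, and the exact in-memory `Counter` are assumptions made for illustration; real tools use memory-efficient probabilistic counting and handle reverse complements, which this sketch does not.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every k-mer of seq, in order (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is below the cutoff.

    Coverage is estimated as the median count of the read's k-mers among
    the reads kept so far. Toy illustration, not a production tool.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept
```

With this rule, many copies of a highly abundant sequence collapse to roughly `cutoff` copies, while a sequence seen only once is always retained.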

danejo3 commented 1 year ago

I think this PR is ready for a review! Adding @lovettse if you have any comments or questions. Thanks!