bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Define better the coverage definitions, from low to super-high #322

Closed: lbeltrame closed this issue 10 years ago

lbeltrame commented 10 years ago

The reasoning is simple: depending on the application, the concepts of "high" and "low" coverage differ. For example, which category would exome sequencing and targeted sequencing fall under? At the moment this is not very clear.

chapmanb commented 10 years ago

Luca; I agree. I don't like the implementation here at all, since the categories are so subjective. Practically, it only affects two components: adjusting GATK for low depth to allow more calls:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/genotype.py#L28

and allowing calling in super-high depth regions, which we try to avoid because they are repetitive and cause huge memory usage:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/callable.py#L52

For GATK calling, I'll revisit this with their new best practices when 3.0 comes out. For the memory usage issues, I'm hoping to implement maximum and minimum depth parameters and subsample to the maximum depth instead of excluding those regions. For this I'm still looking for a subsampler that can subsample to a maximum coverage (as opposed to subsampling to a percentage of reads).
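To make the distinction concrete, here is a minimal Python sketch of turning a maximum-coverage target into a per-region keep fraction, rather than applying one fixed percentage everywhere. It is only an illustration of the idea, not bcbio code: `keep_fraction` and `downsample_reads` are hypothetical names, the reads are plain list items rather than BAM records, and the region's observed depth is assumed to be known up front.

```python
import random


def keep_fraction(observed_depth, max_depth):
    """Fraction of reads to keep so the expected depth is at most max_depth.

    Hypothetical helper; assumes max_depth is positive.
    """
    if observed_depth <= max_depth:
        return 1.0  # Already at or below the cap: keep everything.
    return max_depth / observed_depth


def downsample_reads(reads, observed_depth, max_depth, seed=0):
    """Randomly keep each read with the probability from keep_fraction.

    The result is approximate: the expected depth equals the cap,
    but the exact number of reads kept varies with the random draw.
    """
    rng = random.Random(seed)
    frac = keep_fraction(observed_depth, max_depth)
    return [read for read in reads if rng.random() < frac]


if __name__ == "__main__":
    reads = list(range(1000))  # Stand-ins for reads in one region.
    kept = downsample_reads(reads, observed_depth=400, max_depth=100)
    print(len(kept), "of", len(reads), "reads kept")
```

A fixed-percentage subsampler would apply the same fraction to every region. Here the fraction depends on each region's depth, so regions under the cap are left untouched and only the very deep ones are thinned.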

Thanks for initiating this discussion.

chapmanb commented 10 years ago

Luca; Thanks again for the thoughts. I pushed a new approach where we explicitly define minimum and maximum depth, so it's clear both the depth below which we don't call and the point at which we start downsampling:

https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#experimental-information

Hope this makes everything more obvious and easier to tweak as needed. Thanks again.
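As a rough sketch of how explicit bounds replace the subjective "low"/"high" labels, the logic described above amounts to a three-way decision per region. The function name `depth_action` and the threshold values below are hypothetical placeholders for illustration, not bcbio's actual option names or defaults; see the linked configuration documentation for the real settings.

```python
def depth_action(depth, min_depth=4, max_depth=10000):
    """Decide how to treat a region given its coverage depth.

    The min_depth and max_depth values are illustrative placeholders only.
    """
    if depth < min_depth:
        return "skip"        # Too shallow: do not call variants here.
    if depth > max_depth:
        return "downsample"  # Too deep: reduce to max_depth before calling.
    return "call"            # Within bounds: call as usual.


if __name__ == "__main__":
    for depth in (2, 30, 250000):
        print(depth, depth_action(depth))
```

Because the bounds are plain numbers, an exome or targeted panel can simply pass different values instead of having to pick a named category.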