lbeltrame closed this 10 years ago
Luca; I agree. I don't like the implementation here at all since the categories are so subjective. Practically, it only affects two components: adjusting GATK for low depth to allow more calls:
https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/genotype.py#L28
and allowing calling in very high depth regions, which we try to avoid because they are repetitive and cause huge memory usage:
https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/bam/callable.py#L52
For GATK calling, I'll revisit with their new best practices when 3.0 comes out. For the memory usage issues, I'm hoping to implement maximum and minimum depth parameters and subsample to the maximum depth instead of excluding those regions. For this I'm still looking for a subsampler that can subsample to a maximum coverage (as opposed to subsampling to a percentage of reads).
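To make the "maximum coverage" idea concrete, here is a minimal sketch of capping depth rather than keeping a fixed fraction of reads. It is not an existing tool and not what bcbio-nextgen does; the function name `cap_coverage` and the `(start, end)` tuple representation of reads are assumptions for illustration. It is also a greedy, deterministic cap (it keeps the first reads it sees), not a random subsample.

```python
import heapq

def cap_coverage(reads, max_depth):
    """Keep reads so that no position is covered by more than max_depth kept reads.

    reads: iterable of (start, end) half-open intervals, sorted by start.
    Hypothetical helper for illustration only.
    """
    active_ends = []  # min-heap of end coordinates of kept reads that still overlap
    kept = []
    for start, end in reads:
        # Drop kept reads that finish at or before this read's start.
        while active_ends and active_ends[0] <= start:
            heapq.heappop(active_ends)
        # Keep the read only if the depth at its start is still below the cap.
        if len(active_ends) < max_depth:
            heapq.heappush(active_ends, end)
            kept.append((start, end))
    return kept

reads = [(0, 10), (0, 10), (0, 10), (5, 15), (10, 20)]
capped = cap_coverage(reads, max_depth=2)
```

Unlike percentage-based subsampling, low-depth regions are left untouched here and only the reads piling up beyond the cap are dropped.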
Thanks for initiating this discussion.
Luca; Thanks again for the thoughts. I pushed a new approach where we explicitly define minimum and maximum depth, so it's clear below which depth we don't call and above which depth we use downsampling:
https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#experimental-information
Hope this makes everything more obvious and easy to tweak as needed. Thanks again.
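As a rough illustration of how explicit thresholds replace the subjective "high"/"low" categories, the logic amounts to something like the sketch below. The function name, the return labels, and the example threshold values are all hypothetical and are not taken from the bcbio-nextgen code or configuration.

```python
def depth_action(depth, min_depth, max_depth):
    """Decide how to treat a region given explicit depth thresholds.

    Hypothetical helper: names and labels are assumptions for illustration.
    """
    if depth < min_depth:
        return "skip"        # too shallow: do not call here
    if depth > max_depth:
        return "downsample"  # too deep: reduce to max_depth before calling
    return "call"            # within range: call as-is

# Example thresholds are made up; pick values suited to the application.
actions = [depth_action(d, min_depth=4, max_depth=2000) for d in (2, 30, 5000)]
```

The same two numbers can then be tuned for exome, targeted, or whole genome data without arguing about which category each one belongs to.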
The reasoning is simple: depending on the application, the concepts of "high" and "low" differ. For example, under which category would exome sequencing and targeted sequencing fall? At the moment it is not very clear.