dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects
Apache License 2.0
145 stars 37 forks source link

Bug Fix - min_AQ1 is ignored when min_AQ2 is set to zero #164

Closed tedyun closed 5 years ago

tedyun commented 5 years ago

My name is Ted Yun and I'm in Genomics team at Google. My coworkers and I have found that when GLnexus is run with the two different set of parameters: min_AQ1 = 30, min_AQ2 = 0 and min_AQ1 = 0, min_AQ2 = 0, it produces identical results.

I believe the issue is coming from this part of the code where check_AQ always returns true when cfg.min_AQ2 = 0 (regardless of the cfg.min_AQ1 value), because al.second.topAQ.V[1] value is initialized to zero. The suggested changes here check the second inequality only if the copy number of the allele is at least two, which I believe is the definition of the min_AQ2 config.

Please take a look. Thank you!

Regards, Ted

mlin commented 5 years ago

Sorry about the travis failure, I need to fix the way the build works with docker (#149)

Just pulled/pushed your branch to launch https://travis-ci.org/dnanexus-rnd/GLnexus/builds/521600348

mlin commented 5 years ago

Thanks for catching this! Indeed none of our tests covered a case where min_AQ1>0, min_AQ2=0.

What do you think about just ignoring the min_AQ2 threshold if it's set to zero || (cfg.min_AQ2 && al.second.topAQ.V[1] >= cfg.min_AQ2)? This would differ from your solution only in cases where we have two or more allele carriers all with AQ identically zero (doesn't worry me much).

I'm slightly nervous about using copy_number which is estimated in a different way since it depends on the genotype zygosity (1 or 2 copies in each carrier). We can contrive an example such as a single 1/1 genotype leads to copy_number=2 but the topAQ vector still has only one entry.

tedyun commented 5 years ago

Hi Mike, thank you for your quick response! Ah I see - I didn't know about the complexity about copy_number variable.

I'm not sure if I completely understand your suggestion though.. For example, I'm thinking about the case when min_AQ1=30 and min_AQ2=0. If there is an allele that has multiple copies and if its AQs are identically 20, wouldn't that be filtered out by that logic? (because al.second.topAQ.V[0] will be less than 30 and the clause after || will always be false.) Maybe I'm misunderstanding what al.second.topAQ.V vector keep track of.

tedyun commented 5 years ago

Btw, my coworker had another one-liner solution which is changing the default values of the vector here to -1 instead of 0 here https://github.com/dnanexus-rnd/GLnexus/blob/e25a6be613fbd1879477d62c7258a2dcfe8dd896/include/types.h#L291 I felt nervous about changing the default because the vector is being used a lot, but please let me know if you think that'll work.

mlin commented 5 years ago

Yes, I was not thinking clearly with my last suggestion. I like the -1 idea. Let me noodle a little bit if there may be unintended consequences.

tedyun commented 5 years ago

Yep thank you. I updated my branch with the -1 solution.

mlin commented 5 years ago

Continues on #167. Thanks!

coveralls commented 3 years ago

Pull Request Test Coverage Report for Build 1957


Totals Coverage Status
Change from base Build 1953: 6.5%
Covered Lines: 80
Relevant Lines: 80

💛 - Coveralls