arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

Inheritance model tools should test for every gene using variant_impacts #469

Open arq5x opened 9 years ago

arq5x commented 9 years ago

Currently, the auto_*, de novo, and comp_het tools search for variants by gene using solely the variants table. As such, there is a small but concerning risk that causal variants (especially compound heterozygous) could be missed in a given gene when the variant is annotated based on another overlapping gene. Using the variant_impacts table to test every gene that a variant overlaps will address this issue. The challenge then becomes how to choose a representative impact for a given gene/variant combination in the variant_impacts table. We could use an impact ranking system (e.g., LOW, MED, HIGH, plus tiebreakers) to do this.
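A minimal sketch of the impact ranking idea, for illustration only: it keeps one representative impact per gene/variant combination by ranking severity (HIGH > MED > LOW) and breaking ties in favor of protein-coding transcripts. The `Impact` record, its field names, and the tiebreaker rule are assumptions made for this example, not GEMINI's actual schema or implementation.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a row of the variant_impacts table.
Impact = namedtuple("Impact", "variant_id gene transcript biotype impact_severity")

# Assumed ranking: a higher number means a more severe impact.
SEVERITY_RANK = {"HIGH": 3, "MED": 2, "LOW": 1}


def rank_key(imp):
    """Sort key: severity first, then prefer protein-coding transcripts."""
    return (
        SEVERITY_RANK.get(imp.impact_severity, 0),
        1 if imp.biotype == "protein_coding" else 0,
    )


def representative_impacts(impacts):
    """Keep the best-ranked impact for each (variant_id, gene) pair.

    On an exact tie, the first impact encountered is kept.
    """
    best = {}
    for imp in impacts:
        key = (imp.variant_id, imp.gene)
        current = best.get(key)
        if current is None or rank_key(imp) > rank_key(current):
            best[key] = imp
    return best


# Toy data: one variant overlapping two genes, two transcripts each.
rows = [
    Impact(1, "GENE_A", "tx1", "protein_coding", "LOW"),
    Impact(1, "GENE_A", "tx2", "protein_coding", "HIGH"),
    Impact(1, "GENE_B", "tx3", "lincRNA", "MED"),
    Impact(1, "GENE_B", "tx4", "protein_coding", "MED"),
]
chosen = representative_impacts(rows)
```

Here `chosen` holds one entry per gene for variant 1, so an inheritance tool could test the variant under both GENE_A and GENE_B rather than only under the single gene recorded in the variants table.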

jxchong commented 9 years ago

Agreed, this is a significant issue, particularly when using ENCODE transcripts, because there are so many non-coding RNA and other transcripts.

dgaston commented 9 years ago

Yes, this has been a concern of mine as well, particularly because I have many currently unsolved exome projects that may very well involve compound het variants. I haven't determined a completely satisfactory way to deal with it yet.

arq5x commented 9 years ago

I am less concerned about non-protein-coding transcripts from Ensembl, as the code already prioritizes "protein_coding". That said, when a variant overlaps multiple protein-coding genes, this risk still exists. I think it is doable - @brentp and I need to work on the proper SQL query to drive this.
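As a rough illustration of what such a query might look like (not the query that was eventually written), the sketch below builds two toy tables in an in-memory SQLite database and collects variants per gene from variant_impacts, restricted to protein-coding transcripts. The table layouts are heavily simplified assumptions, and the gene and transcript names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed, simplified schemas; the real tables have many more columns.
conn.executescript("""
CREATE TABLE variants (variant_id INTEGER PRIMARY KEY, gene TEXT);
CREATE TABLE variant_impacts (
    variant_id INTEGER, gene TEXT, transcript TEXT,
    biotype TEXT, impact_severity TEXT
);
INSERT INTO variants VALUES (1, 'GENE_A'), (2, 'GENE_B');
INSERT INTO variant_impacts VALUES
    (1, 'GENE_A', 'tx1', 'protein_coding', 'HIGH'),
    (1, 'GENE_B', 'tx2', 'protein_coding', 'MED'),
    (2, 'GENE_B', 'tx2', 'protein_coding', 'HIGH'),
    (2, 'GENE_C', 'tx3', 'lincRNA', 'LOW');
""")

# One row per (gene, variant) with the most severe protein-coding impact.
QUERY = """
SELECT gene, variant_id,
       MAX(CASE impact_severity
               WHEN 'HIGH' THEN 3 WHEN 'MED' THEN 2 WHEN 'LOW' THEN 1
               ELSE 0 END) AS severity_rank
FROM variant_impacts
WHERE biotype = 'protein_coding'
GROUP BY gene, variant_id
ORDER BY gene, variant_id
"""

by_gene = {}
for gene, variant_id, rank in conn.execute(QUERY):
    by_gene.setdefault(gene, []).append((variant_id, rank))

# For comparison: what a lookup on the variants table alone would return.
variants_only = [
    row[0]
    for row in conn.execute(
        "SELECT variant_id FROM variants WHERE gene = 'GENE_B' ORDER BY variant_id"
    )
]
```

In this toy data, the variants table alone yields only variant 2 for GENE_B, while the variant_impacts query yields both variants 1 and 2, which is the pair a compound het search would need to see.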