Open arq5x opened 9 years ago
Agree this is a significant issue, particularly when using ENCODE transcripts, because there are so many non-coding RNA and other non-protein-coding transcripts.
Yes this has been a concern of mine as well. Particularly because I have many currently unsolved exome projects that may very well be compound het variants. I haven't determined a completely satisfactory way to deal with it yet.
I am less concerned about non-protein-coding transcripts from Ensembl, as the code already prioritizes "protein_coding". That said, when a variant overlaps multiple protein-coding genes, this risk still exists. I think it is doable - @brentp and I need to work on the proper SQL query to drive this.
Currently, the auto_*, de novo, and comp_het tools search for variants by gene using solely the variants table. As such, there is a small but concerning risk that causal variants (especially compound heterozygous) could be missed in a given gene when the variant is annotated based on another overlapping gene. Using the variant_impacts table to test every gene that a variant overlaps will address this issue. The challenge then becomes how to choose a representative impact for a given gene/variant combination in the variant_impacts table. We could use an impact ranking system (e.g., LOW, MED, HIGH, plus tiebreakers) to do this.
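To make the ranking idea concrete, here is a minimal sketch of choosing one representative impact per gene/variant pair. It is not GEMINI's implementation: the column names (`variant_id`, `gene`, `transcript`, `biotype`, `impact_severity`), the toy rows, and the helper names `representative_impacts` and `build_toy_db` are assumptions for illustration only. The tiebreaker shown (prefer `protein_coding`, then the first transcript in sort order) is just one possible choice.

```python
import sqlite3

# Assumed ranking: a higher number means a more severe impact.
SEVERITY_RANK = {"LOW": 1, "MED": 2, "HIGH": 3}


def representative_impacts(conn):
    """Pick one impact per (variant_id, gene) from a variant_impacts-like table.

    Ranking: highest severity first, then protein_coding over other biotypes.
    Remaining ties keep the first transcript in sorted order.
    """
    query = (
        "SELECT variant_id, gene, transcript, biotype, impact_severity "
        "FROM variant_impacts ORDER BY variant_id, gene, transcript"
    )
    best = {}
    for variant_id, gene, transcript, biotype, severity in conn.execute(query):
        key = (variant_id, gene)
        score = (SEVERITY_RANK.get(severity, 0), biotype == "protein_coding")
        # Strict '>' so that an equal score never displaces the earlier row.
        if key not in best or score > best[key][0]:
            best[key] = (score, transcript, severity)
    return {key: (tx, sev) for key, (_, tx, sev) in best.items()}


def build_toy_db():
    """Create an in-memory table with hypothetical rows (not real annotations)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variant_impacts (variant_id INTEGER, gene TEXT, "
        "transcript TEXT, biotype TEXT, impact_severity TEXT)"
    )
    conn.executemany(
        "INSERT INTO variant_impacts VALUES (?, ?, ?, ?, ?)",
        [
            (1, "GENE_A", "tx1", "protein_coding", "LOW"),
            (1, "GENE_A", "tx2", "protein_coding", "HIGH"),
            (1, "GENE_B", "tx3", "lincRNA", "MED"),
            (1, "GENE_B", "tx4", "protein_coding", "MED"),
        ],
    )
    return conn


if __name__ == "__main__":
    for key, value in sorted(representative_impacts(build_toy_db()).items()):
        print(key, value)
```

With one row per gene/variant pair, a gene-level tool could then test variant 1 under both GENE_A and GENE_B rather than only under the single gene recorded in the variants table.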