arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

allow using canonical transcript instead of highest impact #663

Open brentp opened 8 years ago

brentp commented 8 years ago

see: https://groups.google.com/forum/#!topic/gemini-variation/KKCO05-RNYo

roryk commented 6 years ago

Hey y'all,

What do you think we'd have to do to implement this? I've had some folks we work with asking about it.

udp3f commented 6 years ago

I like Brent's idea of '--use-canonical' flag for those who want ranking based on canonical transcripts. The default can still remain the same.

roryk commented 6 years ago

Is the canonical transcript flagged in the SnpEff/VEP output and folded into the INFO field somehow?

udp3f commented 6 years ago

Yes there's a '--canonical' flag in VEP and that should be available in INFO. Not sure about snpEff though.
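For illustration, here is a minimal sketch of reading that flag from a VEP `CSQ` value. It assumes the usual VEP layout (comma-separated transcript entries, pipe-separated fields, `CANONICAL` set to `YES` on the canonical transcript when VEP is run with `--canonical`). The field list, transcript IDs, and helper names below are made up for the example; a real VCF declares its own field order in the header.

```python
def parse_csq(csq_value, csq_format):
    """Split a CSQ value into one dict per transcript annotation."""
    keys = csq_format.split("|")
    return [dict(zip(keys, entry.split("|"))) for entry in csq_value.split(",")]


def canonical_entries(entries):
    """Keep only annotations that VEP marked as canonical."""
    return [e for e in entries if e.get("CANONICAL") == "YES"]


# Hypothetical, shortened field list and annotation string.
CSQ_FORMAT = "Allele|Consequence|IMPACT|SYMBOL|Feature|CANONICAL"
csq = (
    "T|missense_variant|MODERATE|GENE1|ENST0001|YES,"
    "T|stop_gained|HIGH|GENE1|ENST0002|"
)

entries = parse_csq(csq, CSQ_FORMAT)
canonical = canonical_entries(entries)
print([e["Feature"] for e in canonical])
```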

brentp commented 6 years ago

As long as there is a way to know which transcript is the canonical one (is it put first?), this should be implemented by adding a flag to gemini and then passing it to the geneimpacts module, which does the variant prioritization.
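A rough sketch of what such a switch could look like is below. This is not the geneimpacts API: `pick_impact`, the `use_canonical` argument, and the dict layout are all hypothetical, and the severity order is only the coarse HIGH/MODERATE/LOW/MODIFIER ranking. The default keeps the current behaviour (most severe impact); with the flag set, the choice is restricted to canonical transcripts and falls back to all transcripts if none is flagged.

```python
# Lower rank means more severe.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}


def pick_impact(impacts, use_canonical=False):
    """Return the top impact, optionally restricted to canonical transcripts."""
    if not impacts:
        return None
    candidates = impacts
    if use_canonical:
        canonical = [i for i in impacts if i.get("canonical")]
        if canonical:  # fall back to every transcript if none is flagged
            candidates = canonical
    # Unknown impact labels sort after all known ones.
    return min(
        candidates,
        key=lambda i: IMPACT_RANK.get(i["impact"], len(IMPACT_RANK)),
    )


# Made-up annotations for one variant.
impacts = [
    {"transcript": "ENST0001", "impact": "LOW", "canonical": True},
    {"transcript": "ENST0002", "impact": "HIGH", "canonical": False},
]

default_pick = pick_impact(impacts)
canonical_pick = pick_impact(impacts, use_canonical=True)
print(default_pick["transcript"], canonical_pick["transcript"])
```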

jxchong commented 6 years ago

++, particularly on being able to make HGVS use the canonical transcript.

davemcg commented 6 years ago

Is there any kind of workaround we could implement before this feature gets added? Perhaps some way to get the VEP --pick (https://useast.ensembl.org/info/docs/tools/vep/script/vep_options.html) information added to variant_impact?

oleraj commented 5 years ago

@brentp I've run into this issue as well and agree this would be nice to fix. As mentioned, VEP has a "canonical" field which is in the INFO column (and gets imported into gemini as vep_canonical). For SnpEff, they have a -canon option which only annotates the canonical transcript (which seems similar to the --pick option in VEP), but I'm not sure that's what users will want to do.

Instead, I wonder how difficult it would be to determine which is the canonical transcript on the fly in the geneimpacts module. Here is the simple rule that SnpEff uses to determine canonical:

"Canonical transcripts are defined as the longest CDS of amongst the protein coding transcripts in a gene. If none of the transcripts in a gene is protein coding, then it is the longest cDNA. "

Here is the rule that Ensembl uses. It is spelled out in a more complicated way, but I believe it is essentially the same:

"The canonical transcript is used in the gene tree analysis in Ensembl and does not necessarily reflect the most biologically relevant transcript of a gene. For human, the canonical transcript for a gene is set according to the following hierarchy: 1. Longest CCDS translation with no stop codons. 2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons. 3. If no (2), choose the longest translation with no stop codons. 4. If no translation, choose the longest non-protein-coding transcript."

Basically, you could look at the Protein_position column from the CSQ field (VEP) or the Amino_Acid_length column from the EFF field (SnpEff) to get the length of the CDS for each transcript and sort from highest to lowest. I suppose you would need a tie-breaker -- maybe highest number of exons, then longest transcript, then pick at random?
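The ranking described above can be sketched as follows. This is only an illustration under stated assumptions: it assumes `Protein_position` carries a total length in the form `position/length` (which I believe VEP only emits when run with `--total_length`), and `n_exons`, `transcript_length`, and `guess_canonical` are hypothetical names, not fields or functions from VEP, SnpEff, or geneimpacts.

```python
import random


def protein_length(protein_position):
    """Return the total length from 'pos/len' (or 'start-end/len'), else 0."""
    _, sep, total = protein_position.partition("/")
    return int(total) if sep and total.isdigit() else 0


def guess_canonical(transcripts, rng=None):
    """Pick the longest protein; break ties by exon count, then transcript length."""
    if not transcripts:
        return None
    rng = rng or random.Random(0)  # seeded so the last-resort pick is repeatable

    def key(t):
        return (
            protein_length(t.get("Protein_position", "")),
            t.get("n_exons", 0),
            t.get("transcript_length", 0),
        )

    best = max(key(t) for t in transcripts)
    tied = [t for t in transcripts if key(t) == best]
    return tied[0] if len(tied) == 1 else rng.choice(tied)


# Made-up transcripts: two share the longest protein, exon count decides.
transcripts = [
    {"Feature": "ENST0001", "Protein_position": "45/393", "n_exons": 11},
    {"Feature": "ENST0002", "Protein_position": "45/393", "n_exons": 10},
    {"Feature": "ENST0003", "Protein_position": "45/341", "n_exons": 12},
]

chosen = guess_canonical(transcripts)
print(chosen["Feature"])
```

Note that this only approximates the SnpEff rule quoted above; the Ensembl hierarchy also depends on CCDS and Havana status, which are not captured here.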

brentp commented 5 years ago

Hi @oleraj I'd gladly accept a PR for this.

oleraj commented 5 years ago

After doing some more digging, it looks like this may already have been fixed by @roryk in geneimpacts (at least for VEP-annotated VCFs), though I haven't tested it yet: https://github.com/brentp/geneimpacts/commit/c1fd841c65a65e83be7fe1d1304785bb6db0642d#diff-ef46603b09e1d94334dfde203c2a72db. The fix is in at least version 0.3.4 of geneimpacts; can we update the requirements file for gemini accordingly? Maybe geneimpacts>=0.3.6? Currently it's using 0.1.3.

jxchong commented 5 years ago

@brentp Once geneimpacts is updated in the newest release, does that mean GEMINI will default to loading the canonical impact and not the most severe? cc @oleraj

brentp commented 5 years ago

@jxchong No. Someone would have to implement an option (in gemini) to allow it to choose canonical transcripts.

arq5x commented 5 years ago

Hmm, I am not sure using the canonical impact is the best approach in the context of rare disease. I would rather manually refute candidates than miss them because they are on a different transcript.

jxchong commented 5 years ago

@arq5x Yes, it's a mixed bag (which is why I asked). For discovery, I'd want the most severe impact, but for reporting out, I might prefer the canonical transcript (it's not uncommon for us to pull out a severe impact in an incomplete/suspect/unsupported transcript and a synonymous change in all other transcripts).

roryk commented 5 years ago

Agreed, I don't really like using the canonical approach because you miss impactful variants. We recently missed a P53 variant because it was on a non-canonical transcript. But the canonical setting is useful when talking to other folks. I've found that clinicians know mutations by the amino acid change in the canonical transcript, so to be on the same page, we have to be talking about the same thing.

brentp commented 5 years ago

Just a reminder that the canonical transcript should be available in the variant_impacts table.
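As a sketch of what that lookup might look like, the snippet below builds a toy in-memory SQLite table and filters it on the canonical flag. The table name `variant_impacts` and the `vep_canonical` column come from this thread; the other columns, the rows, and the `YES` value are assumptions for the example, so check the schema of your own GEMINI database before relying on them.

```python
import sqlite3

# Toy stand-in for a GEMINI database; a real one is a SQLite file on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variant_impacts ("
    "variant_id INTEGER, transcript TEXT, impact TEXT, "
    "aa_change TEXT, vep_canonical TEXT)"
)
conn.executemany(
    "INSERT INTO variant_impacts VALUES (?, ?, ?, ?, ?)",
    [
        (1, "ENST0001", "synonymous_variant", "P72P", "YES"),
        (1, "ENST0002", "stop_gained", "R45*", ""),
    ],
)

# Report the annotation on the canonical transcript for each variant.
rows = conn.execute(
    "SELECT variant_id, transcript, impact, aa_change "
    "FROM variant_impacts WHERE vep_canonical = 'YES'"
).fetchall()
print(rows)
conn.close()
```

This keeps the most severe impact in the variants table for discovery while still letting you pull the canonical-transcript change for reporting.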