arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

support for additional VEP terms #926

Open jxchong opened 5 years ago

jxchong commented 5 years ago

Based on the findings of the DDD paper, we would like to be able to filter for the following variant annotations created by the VEP SpliceRegion plugin:

splice_donor_5th_base_variant
splice_donor_region_variant
splice_polypyrimidine_tract_variant
extended_intronic_splice_region_variant_5prime
extended_intronic_splice_region_variant_3prime

Info here: http://www.ensembl.info/2018/10/26/cool-stuff-the-vep-can-do-splice-site-variant-annotation/
Plugin here: https://github.com/Ensembl/VEP_plugins/blob/release/94/SpliceRegion.pm

None of these annotations are currently listed in GEMINI's impacts column. How could we access them, given that they don't have their own custom vep_xxx column? (My understanding is that VEP simply provides them as the annotation.) Right now we just filter on impact_severity<>'LOW' in GEMINI, so I imagine we would have to use impact_severity<>'LOW' or xxxxx='yyy' or ...

arq5x commented 5 years ago

I honestly think this is the realm of the new gemini workflow based upon vcfanno and vcf2db. Our goal is to switch over to this entirely this year.

jxchong commented 5 years ago

Thanks Aaron. If we switch to vcfanno/vcf2db right now, would these be accessible to us in queries?

arq5x commented 5 years ago

If they are in the VCF via vcfanno or VEP, they make it into the database. @brentp - can you corroborate?

brentp commented 5 years ago

I think these would be impacts in the CSQ string, right? e.g. instead of splice_variant it would now be splice_donor_5th_base_variant so we'd have to update the geneimpacts module.

An example VCF with a few variants would be helpful.

jxchong commented 5 years ago

Ok, we finally got this working in VEP and these show up in the CSQ string, but not in the Consequence field. They are instead in the SpliceRegionOutput field.

Here's an example. More examples in the VCF available here: https://www.dropbox.com/s/mg7u3nkxil7p4h5/spliceregionexamples.vcf.gz?dl=0

1    38272660    rs2291297    G    A    42583.1    PASS    AC=1;AF=0.224;AN=2;BaseQRankSum=-1.622;ClippingRankSum=0.271;DB;DP=3988;ExcessHet=0.4621;FS=0.528;InbreedingCoeff=0.1309;MLEAC=43;MLEAF=0.224;MQ=9.49;MQ0=0;MQRankSum=0;QD=19.89;ReadPosRankSum=0.463;SOR=0.637;CSQ=A|downstream_gene_variant|MODIFIER|MTF1|ENSG00000188786|Transcript|ENST00000373036|protein_coding||||||||||rs2291297|2579|-1||HGNC|7428|YES|CCDS30676.1|1|C1orf122||||||||||||,A|upstream_gene_variant|MODIFIER|C1orf122|ENSG00000197982|Transcript|ENST00000373042|protein_coding||||||||||rs2291297|1158|1||HGNC|24789|YES|CCDS427.2||C1orf122||||||||||||,A|5_prime_UTR_variant|MODIFIER|C1orf122|ENSG00000197982|Transcript|ENST00000373043|protein_coding|1/2||ENST00000373043.1:c.-1697G>A||10/2229|||||rs2291297||1||HGNC|24789||CCDS44112.1||C1orf122||||||||||||,A|intron_variant|MODIFIER|YRDC|ENSG00000196449|Transcript|ENST00000373044|protein_coding||2/4|ENST00000373044.2:c.505-12C>T|||||||rs2291297||-1||HGNC|28905|YES|CCDS30675.1||C1orf122||||||||||||splice_polypyrimidine_tract_variant,A|upstream_gene_variant|MODIFIER|C1orf122|ENSG00000197982|Transcript|ENST00000419397|processed_transcript||||||||||rs2291297|672|1||HGNC|24789||||C1orf122||||||||||||,A|upstream_gene_variant|MODIFIER|C1orf122|ENSG00000197982|Transcript|ENST00000446260|protein_coding||||||||||rs2291297|1422|1||HGNC|24789||||C1orf122||||||||||||,A|upstream_gene_variant|MODIFIER|C1orf122|ENSG00000197982|Transcript|ENST00000468084|protein_coding||||||||||rs2291297|759|1||HGNC|24789||CCDS44112.1||C1orf122||||||||||||,A|regulatory_region_variant|MODIFIER|||RegulatoryFeature|ENSR00000004891|promoter||||||||||rs2291297|||||||||C1orf122||||||||||||    GT:AD:DP:GQ:PL    0/1:37,27:.:99:771,0,945
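
To make the record above easier to read, here is a minimal Python sketch of how a CSQ value can be split into one entry per transcript and inspected for a non-empty SpliceRegionOutput field. The field list and the shortened CSQ string are assumptions for illustration only: the real field order comes from the `Format:` part of the CSQ header line in the VCF, which is not shown in this thread, and `parse_csq` is a hypothetical helper, not part of geneimpacts or vcf2db.

```python
def parse_csq(csq_value, field_names):
    """Split a CSQ INFO value into one dict per annotated feature.

    Entries are comma-separated; fields within an entry are pipe-separated.
    """
    entries = []
    for raw in csq_value.split(","):
        values = raw.split("|")
        # zip pairs each field name with the value in the same position.
        entries.append(dict(zip(field_names, values)))
    return entries


# Hypothetical, shortened field list (assumption). In practice, read the
# names from the "Format:" text of the CSQ header line in the VCF.
FIELDS = ["Allele", "Consequence", "IMPACT", "SYMBOL", "Feature", "SpliceRegionOutput"]

# Two entries trimmed down from the example record to match FIELDS.
csq = (
    "A|downstream_gene_variant|MODIFIER|MTF1|ENST00000373036|,"
    "A|intron_variant|MODIFIER|YRDC|ENST00000373044|splice_polypyrimidine_tract_variant"
)

for entry in parse_csq(csq, FIELDS):
    # Only the YRDC transcript carries a SpliceRegionOutput value here.
    if entry["SpliceRegionOutput"]:
        print(entry["Feature"], entry["Consequence"], entry["SpliceRegionOutput"])
```

With these two trimmed entries, only ENST00000373044 is printed, which matches the full record: the Consequence field still reads intron_variant, and the plugin term sits in its own field.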

arq5x commented 5 years ago

Gotcha, looks like we would need to update the logic in geneimpacts and in vcf2db to support this.
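
As a rough illustration of the kind of logic discussed in this thread, the Python sketch below combines the terms from the Consequence and SpliceRegionOutput fields and applies the "not LOW, or one of the requested splice terms" filter from the original question. Everything here is an assumption for discussion: `all_terms` and `keep` are hypothetical names, not functions in geneimpacts or vcf2db; the `&` separator for multiple terms in one field is assumed; and the severity strings passed in are illustrative inputs, not values computed by GEMINI.

```python
# The five SpliceRegion plugin terms requested at the top of this issue.
SPLICE_REGION_TERMS = {
    "splice_donor_5th_base_variant",
    "splice_donor_region_variant",
    "splice_polypyrimidine_tract_variant",
    "extended_intronic_splice_region_variant_5prime",
    "extended_intronic_splice_region_variant_3prime",
}


def all_terms(consequence, splice_region_output):
    """Return the union of terms from both fields.

    Assumes multiple terms within a field are joined with '&'.
    Empty fields contribute nothing.
    """
    terms = set()
    for field in (consequence, splice_region_output):
        terms.update(term for term in field.split("&") if term)
    return terms


def keep(impact_severity, consequence, splice_region_output):
    """Hypothetical filter: keep anything not LOW, plus requested splice terms."""
    terms = all_terms(consequence, splice_region_output)
    return impact_severity != "LOW" or bool(terms & SPLICE_REGION_TERMS)


# Illustrative inputs only; the severity values are assumed, not looked up.
print(keep("LOW", "intron_variant", "splice_polypyrimidine_tract_variant"))
print(keep("LOW", "intron_variant", ""))
```

The first call is kept because of the plugin term even though the assumed severity is LOW; the second is dropped. Whether the plugin terms should instead raise the severity itself is a design question this sketch leaves open.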