arq5x / gemini

a lightweight db framework for exploring genetic variation.
http://gemini.readthedocs.org
MIT License

Adding custom scores to gemini #220

Open drmjc opened 10 years ago

drmjc commented 10 years ago

Hi Aaron, I've closely read https://github.com/arq5x/gemini/issues/161 and thought I'd start a related but separate thread. It's also worth pointing out that both VEP's pp2/sift and dbNSFP scores cover both known and unknown variants.

There's a dbNSFP plugin to VEP (https://raw.github.com/ensembl-variation/VEP_plugins/master/dbNSFP.pm). My plan is to add more functional prediction and conservation scores into my VEP-annotated VCF file. It might be worth adding the genic intolerance score as mentioned in another thread, and of course any number of other scores in the long term.

I'd love for gemini to be able to parse these new VCF fields and allow filtering on them. This would require new db fields, like hrt_pred, hrt_score, mutationtaster_pred, mutationtaster_score, etc. I've had a look at the source code and I can see how you're parsing PP2 and SIFT scores from the VEP output, but doing this extensibly may be challenging (though not impossible).

This is potentially extensible by parsing this field:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Consequence|Codons|Amino_acids|Gene|HGNC|Feature|EXON|PolyPhen|SIFT|HRT|MutationTaster|GERP++">

and then modifying the EffectDetails function accordingly.
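To make the idea concrete, here is a minimal Python sketch of reading the field order from the CSQ header and labelling each pipe-delimited entry with it. This is not gemini's actual code: the function names `parse_csq_format` and `parse_csq_entry` are hypothetical, and the example CSQ entry is made up purely for illustration.

```python
import re

# Header line as quoted above.
HEADER = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as '
    "predicted by VEP. Format: Consequence|Codons|Amino_acids|Gene|HGNC|"
    'Feature|EXON|PolyPhen|SIFT|HRT|MutationTaster|GERP++">'
)

# Invented example entry (12 values, one per field); not real annotation data.
EXAMPLE_ENTRY = (
    "missense_variant|aCg/aTg|T/M|ENSG00000000001|GENE1|ENST00000000001|"
    "2/5|benign(0.1)|tolerated(0.3)|D|N|4.2"
)


def parse_csq_format(header_line):
    """Return the ordered list of CSQ sub-field names from the header line."""
    match = re.search(r'Format: ([^">]+)', header_line)
    if match is None:
        raise ValueError("no 'Format: ...' list found in the CSQ header")
    return match.group(1).strip().split("|")


def parse_csq_entry(entry, field_names):
    """Map one pipe-delimited CSQ entry onto the field names.

    Empty values are returned as None so that missing scores are explicit.
    """
    values = entry.split("|")
    if len(values) != len(field_names):
        raise ValueError(
            "expected %d values, got %d" % (len(field_names), len(values))
        )
    return {name: (value or None) for name, value in zip(field_names, values)}


field_names = parse_csq_format(HEADER)
record = parse_csq_entry(EXAMPLE_ENTRY, field_names)
print(record["MutationTaster"], record["GERP++"])
```

Because the field names come from the header rather than from a hard-coded list, any score added to the VEP output would show up in the parsed record without further changes to the parser itself; how those values would then be stored and typed in the database is a separate question.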

What are your thoughts on the feasibility of doing this? Is growing the repertoire of functional annotations in gemini a priority?

cheers, Mark

arq5x commented 10 years ago

Hi Mark,

Thanks for the suggestions. Yes, expanding upon functional annotations is a core focus. We are currently working through other changes to infrastructure and outstanding issues, but we will work on incorporating these scores asap. Our current challenge is standardizing the functionality differences b/w VEP and snpEff. Once we get a handle on that, we will start integrating more functional annotations, as these are clearly critical to interpretation.

drmjc commented 10 years ago

that's excellent.

I made some progress in the annotation space: the VEP dbNSFP plugin works very nicely. This is the command I ran:

cd ~/apps/variant_effect_predictor_2.7
perl ~/apps/variant_effect_predictor_2.7/variant_effect_predictor.pl \
 -i example.vcf \
 -o example_VEP_dbNSFPall.vcf \
 --vcf \
 --force_overwrite \
 --cache \
 --plugin \
 dbNSFP,/share/ClusterShare/biodata/contrib/gi/dbNSFP_VEP/2.0/dbNSFP.gz,SIFT_score,Polyphen2_HDIV_score,Polyphen2_HDIV_pred,Polyphen2_HVAR_score,Polyphen2_HVAR_pred,LRT_score,LRT_pred,MutationTaster_score,MutationTaster_pred,MutationAssessor_score,MutationAssessor_pred,FATHMM_score,GERP++_NR,GERP++_RS,phyloP

Here's the resulting INFO header field:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|CELL_TYPE|GERP++_NR|Polyphen2_HDIV_score|MutationAssessor_pred|Polyphen2_HVAR_score|MutationTaster_score|LRT_score|LRT_pred|MutationAssessor_score|phyloP|SIFT_score|FATHMM_score|Polyphen2_HVAR_pred|MutationTaster_pred|Polyphen2_HDIV_pred|GERP++_RS">
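As a rough illustration of how the extra fields in this header could be turned into candidate db column names, here is a small Python sketch. It is an assumption-laden example, not gemini code: the helper names `to_column_name` and `extra_columns` are hypothetical, and the set of "core" VEP fields is simply the first 13 names in the header above.

```python
import re

# Format list copied from the header above.
CSQ_FORMAT = (
    "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|"
    "Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|CELL_TYPE|"
    "GERP++_NR|Polyphen2_HDIV_score|MutationAssessor_pred|Polyphen2_HVAR_score|"
    "MutationTaster_score|LRT_score|LRT_pred|MutationAssessor_score|phyloP|"
    "SIFT_score|FATHMM_score|Polyphen2_HVAR_pred|MutationTaster_pred|"
    "Polyphen2_HDIV_pred|GERP++_RS"
)

# Assumption: these are the standard VEP fields; everything else is a plugin field.
CORE_VEP_FIELDS = {
    "Allele", "Gene", "Feature", "Feature_type", "Consequence",
    "cDNA_position", "CDS_position", "Protein_position", "Amino_acids",
    "Codons", "Existing_variation", "DISTANCE", "CELL_TYPE",
}


def to_column_name(field):
    """Lowercase a field name and collapse non-alphanumeric runs to '_'."""
    return re.sub(r"[^0-9a-z]+", "_", field.lower()).strip("_")


def extra_columns(csq_format):
    """Return {column_name: original_field} for every non-core CSQ field."""
    columns = {}
    for field in csq_format.split("|"):
        if field in CORE_VEP_FIELDS:
            continue
        column = to_column_name(field)
        if column in columns:
            # Two different fields collapsing to one name would need manual handling.
            raise ValueError("column name collision: %s" % column)
        columns[column] = field
    return columns


for column, field in extra_columns(CSQ_FORMAT).items():
    print(column, "<-", field)
```

Names such as GERP++_NR need sanitising because `+` is not valid in an unquoted SQL column name; the sketch maps it to gerp_nr and keeps the original field name alongside so values can still be looked up in the CSQ entry.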

I'll run the VEP query with the gemini-recommended flags + dbSNP, and report back.