Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Customized Sift/Polyphen scores with VEP #772

Closed: eyeamnice closed this issue 4 years ago

eyeamnice commented 4 years ago

I downloaded cache files from Ensembl Genomes, and they do not contain SIFT/PolyPhen scores. The info.txt file in the cache directory does not contain sift b or polyphen b entries. How do I generate those scores and make sure VEP can use them for annotation? Specifically, what steps do I need to follow to customize VEP so that it annotates my variants with those scores?

at7 commented 4 years ago

Hello,

we calculate SIFT scores for a selection of species listed here. For which species would you like to obtain SIFT scores? Maybe it would be possible to include them in a future release. If you have annotations available outside of the cache files you could make use of VEP's custom annotation option which is described here.

Best wishes, Anja

eyeamnice commented 4 years ago

Actually, I am working with plants. I need to calculate scores for Glycine max. I have looked at the link you provided, but nothing there describes how I can use scores I calculated myself. How do I ensure that VEP uses any scores I may have calculated? I also thought VEP annotates with a score only if the info.txt file contains sift b and polyphen b entries. If possible, could you provide an example command line that runs VEP for a non-human species and uses a user-calculated SIFT score?

Regards, Peter

at7 commented 4 years ago

Dear Peter, we run an internal pipeline to calculate SIFT scores for all possible missense variants for a set of species. The scores are stored in our databases and where available dumped into VEP cache files. I can check if there are any plans to create a variation database for Glycine max in the future.

I'm also sorry for providing the wrong link to our custom annotation documentation. The correct link is here. If you have your own scores it would be worth investigating if you could store them in VCF format for example. Then you can make use of VEP's custom annotation option.

Here is an example for running VEP for non-vertebrate species. Just go to the info box: Running VEP on non-vertebrate species.

Best wishes, Anja

eyeamnice commented 4 years ago

Hi Anja, I am wondering whether you could publish information on how VEP finds and uses the calculated SIFT scores stored in your database. Since I am working on an organism that you do not have scores for, I have generated a SIFT database and calculated the scores. Right now, VEP still does not use these scores. In what format are the scores stored or dumped in the cache files for VEP to recognize them? Here is a sample line from my VEP run:

#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM REF_ALLELE IMPACT DISTANCE STRAND FLAGS VARIANT_CLASS CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC SOURCE SIFT PolyPhen DOMAINS AF CLIN_SIG SOMATIC PHENO Glycine_max.gff3.gz
ss3563636428 15:11691 A GLYMA_15G000100 KRH09582 Transcript upstream_gene_variant - - - - - - 1 C MODIFIER 3823 -1 - SNV - - KRH09582 - - - Glycine_max.gff3.gz - - - - - - - -

The documentation does not go into detail on how to use your own SIFT scores after creating a SIFT database and generating scores. For example, VEP performs SIFT/PolyPhen annotation only if the info.txt file in the cache directory contains sift b or polyphen b. How is that file generated? In essence, what is the process of generating a cache directory that VEP recognizes?

I am just trying to understand the whole process. That would make it easier to use VEP for custom annotation when I need consequences predicted.

eyeamnice commented 4 years ago

I also noticed that issue 765 raised a question about creating cache files. If more people are asking such questions, it might be helpful to write some documentation on this. That way, users who have all the required data could create their own cache files and run a custom VEP, especially for non-model organisms.

at7 commented 4 years ago

Hi Peter, you can read about how we store the SIFT scores here. When we generate our VEP cache files we dump the prediction matrix for each translation into the cache files.

I think for your use case it would be worth trying to store your SIFT scores in a VCF file and then use our custom annotation from VCF file option.

We will definitely discuss how we can make our VEP cache dumping pipeline more accessible to our users. At the moment we only have our internal pipeline which depends on the ensembl-hive job scheduling API.

Best wishes, Anja

at7 commented 4 years ago

Dear Peter,

I was reminded of a VEP plugin which retrieves PolyPhen and SIFT predictions from a locally constructed SQLite database. Here is a module from our internal pipeline which is used to populate a database with SIFT results; it could point you to the code you need to populate your own database. In which format are your SIFT results currently?

Best wishes, Anja

eyeamnice commented 4 years ago

Hi Anja,

My SIFT results are currently in tab-delimited text format. I can also generate SIFT scores in VCF format. I would be fine with code that helps me populate the database from either format (text or VCF).

at7 commented 4 years ago

Dear Peter,

did you get your SIFT scores from https://sift.bii.a-star.edu.sg/sift4g/public//Glycine_max/? If that is the case have you tried using the VCF annotation option: https://sift.bii.a-star.edu.sg/sift4g/AnnotateVariants.html? You could run VEP first and then use the output, which you can specify to be VCF, as input for the SIFT annotation step.

Best wishes, Anja

eyeamnice commented 4 years ago

Thanks @at7 for that information and yes I used the SIFT score from the link you showed. I think that is doable. I will give it a try. If that works, how do I generate the summary report in html format? Will that still be possible? You know when I run VEP first, it generates the html summary. By the time I run sift on the output of VEP, I will not have an updated summary report. Do you know if I can generate that in a standalone method?

eyeamnice commented 4 years ago

Hi Anja, I used the output of vep as input to sift4g but it did not generate the desired result. The results from vep were not annotated in the SIFT result. At this point, I am not sure what to try next.

at7 commented 4 years ago

I'm not familiar with sift4G. However, there are two things you could check: Is your VEP output file a VCF file? Did you run VEP with --vcf? And secondly, it could be related to an assembly version difference. The latest version in https://sift.bii.a-star.edu.sg/sift4g/public//Glycine_max/ is V1.0.28. But the latest assembly version in Ensembl is already V2.1. I wonder if you can create SIFT files for the latest assembly if you haven't already or if you can use an earlier version of your GFF file which matches V1.0.28? Please let me know with which version of SIFT files and with which assembly version you are working.

eyeamnice commented 4 years ago

Hi Anja, my VEP output and SIFT output are both VCF files, and both now use the same assembly version, V2.1, after I generated SIFT results for V2.1. Assuming I want to use a BED file for the annotation, VEP expects the input format shown below:

1    10000    11000    Feature1
3    25000    26000    Feature2
X    99000    99001    Feature3

I tried this format with my SIFT results, where column 4 holds the SIFT score:

1    10000    10000    0.01
1    25000    25000    0.2
1    99000    99000    0 
2    11000    11000    0.11
2    25000    25000    0.32
3    22000    22000    0.10

and used

--custom mysiftscore.bed.gz,bed,,SIFT, 0

in my VEP command. The title SIFT was added to the header of the output, but the scores from the BED file were not added to the SIFT column for any of the transcripts. Do you know why? I verified that the positions exist in the tab-format output of VEP, where they are prefixed with the chromosome number, e.g. 1:10000, 1:25000.

at7 commented 4 years ago

Have you tried --custom mysiftscore.bed.gz,SIFT,bed,exact,0? You also need to be careful how you provide the coordinates in your BED file. The start is 0-based and the end is not included in the interval. For the first row in your example BED file this means: 1 9999 10000. With that custom option you should then see SIFT=0.01 in your output.
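To illustrate the coordinate shift described above, here is a minimal Python sketch that turns 1-based single-base positions into 0-based, half-open BED intervals. The function names and the (chrom, position, score) input layout are hypothetical and chosen only for this illustration.

```python
def to_bed_interval(pos_1based: int) -> tuple[int, int]:
    """Convert a 1-based single-base position to a 0-based, half-open BED interval."""
    return pos_1based - 1, pos_1based


def bed_lines(records):
    """Yield tab-separated BED lines from (chrom, 1-based position, score) tuples.

    The input tuple layout is an assumption made for this sketch.
    """
    for chrom, pos, score in records:
        start, end = to_bed_interval(pos)
        yield f"{chrom}\t{start}\t{end}\t{score}"


# Example rows taken from the BED attempt earlier in this thread.
records = [("1", 10000, "0.01"), ("1", 25000, "0.2")]
for line in bed_lines(records):
    print(line)
```

As far as I know, VEP expects custom annotation files to be sorted, bgzip-compressed and tabix-indexed, so the resulting file would still need those steps before it is passed to --custom.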

However, a better option is to use custom annotation from a VCF file. In a VCF file you can also specify the base change and link it to the score as provided by the sift file coming from the sift4g tool.

For example, here is what I did for human. My custom annotation file, sift_scores.vcf.gz:

##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO
1 230710048 . A G . . SIFT=0.5
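If your scores are in a tab-delimited text file, a file like the one above could be produced along these lines. This is only a sketch: the five input columns (chrom, pos, ref, alt, score) are an assumed layout, not the actual SIFT4G output format, and the ##INFO header line is something I added for completeness rather than part of the example above.

```python
import csv
import io

VCF_HEADER = [
    "##fileformat=VCFv4.1",
    # Assumption: an INFO declaration for the SIFT key; not part of the original example file.
    '##INFO=<ID=SIFT,Number=1,Type=Float,Description="SIFT score">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]


def scores_to_vcf(tsv_handle, vcf_handle):
    """Write one VCF record per row of a tab-delimited score table.

    Assumed (hypothetical) input columns: chrom, pos, ref, alt, score.
    """
    vcf_handle.write("\n".join(VCF_HEADER) + "\n")
    for chrom, pos, ref, alt, score in csv.reader(tsv_handle, delimiter="\t"):
        vcf_handle.write(f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\tSIFT={score}\n")


# In-memory example using the human record shown above.
example = io.StringIO("1\t230710048\tA\tG\t0.5\n")
out = io.StringIO()
scores_to_vcf(example, out)
print(out.getvalue(), end="")
```

The output would then need to be sorted by chromosome and position, compressed with bgzip and indexed with tabix before use.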

Run VEP:

./vep --input_file rs699.vcf --output_file rs699.out \
    --fasta Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
    --gtf Homo_sapiens.GRCh38.100.chr.gtf.gz \
    --custom sift_scores.vcf.gz,sift,vcf,exact,0,SIFT

My output looks like:

rs699 1:230710048 G ENSG00000135744 ENST00000366667 Transcript missense_variant 843 803 268 M/T aTg/aCg - IMPACT=MODERATE;STRAND=-1;SOURCE=Homo_sapiens.GRCh38.100.chr.gtf.gz;sift=1:230710048-230710048;sift_FILTER=.;sift_SIFT=0.5
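If you want to pull the score back out of the last column for downstream use, a small Python sketch like the following should work. The helper name parse_extra is hypothetical; the input string is the Extra column from the output line above.

```python
def parse_extra(extra: str) -> dict[str, str]:
    """Split a VEP 'Extra' column (KEY=VALUE pairs joined by ';') into a dict."""
    fields = {}
    for item in extra.split(";"):
        if "=" in item:
            # Split on the first '=' only, so values may themselves contain '='.
            key, value = item.split("=", 1)
            fields[key] = value
    return fields


extra = (
    "IMPACT=MODERATE;STRAND=-1;SOURCE=Homo_sapiens.GRCh38.100.chr.gtf.gz;"
    "sift=1:230710048-230710048;sift_FILTER=.;sift_SIFT=0.5"
)
fields = parse_extra(extra)
print(fields["sift_SIFT"])  # prints 0.5
```

The key sift_SIFT combines the short name given in --custom (sift) with the INFO field name (SIFT), so it changes if you choose a different short name.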

eyeamnice commented 4 years ago

I wrote a custom script to integrate the SIFT scores into the VEP results. I will mark this as resolved based on the output. I wish I could still generate an HTML summary report after adding the SIFT scores. I may have to write another custom script for that.
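For anyone who finds this later, here is a rough sketch of what such an integration step could look like in Python. It is not the script I actually used. It assumes VEP's default tab-delimited output, where column 2 is Location (chrom:pos) and column 3 is Allele, and the function name and the appended column name custom_SIFT are made up for this illustration.

```python
def annotate_rows(vep_lines, scores):
    """Append a score column to tab-delimited VEP output lines.

    scores maps (chrom, position, allele) string tuples to a score string.
    Rows without a match get '-'. Only single-position Locations ('chrom:pos')
    will match; ranges such as '1:100-105' are left unannotated.
    """
    for line in vep_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta lines pass through unchanged
        elif line.startswith("#"):
            yield line + "\tcustom_SIFT"  # extend the column header
        else:
            cols = line.split("\t")
            chrom, _, pos = cols[1].partition(":")
            yield line + "\t" + scores.get((chrom, pos, cols[2]), "-")


# Toy input; the score 0.03 is invented purely for this example.
vep_output = [
    "#Uploaded_variation\tLocation\tAllele",
    "ss3563636428\t15:11691\tA",
]
scores = {("15", "11691", "A"): "0.03"}
for row in annotate_rows(vep_output, scores):
    print(row)
```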

at7 commented 4 years ago

I'm happy to hear that you successfully added the SIFT scores to your VEP results. We don't have any plans to customise the html summary report script for including scores that have been added after VEP was run. I did check for any available tools that could help you with this and found vcfstats. I haven't worked with vcfstats before but from reading the documentation I think that it might be able to help you to generate the summary report for your SIFT score annotations.