Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
437 stars 149 forks source link

Gnomad custom annotation vcf-format #1638

Open karoliinas opened 3 months ago

karoliinas commented 3 months ago

Dear ensembl-vep -team,

I'm using VEP for annotating variants aligned to chm13 with resources downloaded from https://github.com/marbl/CHM13.

I can add gene information from the chm13v2.0_GENCODEv35_CAT_Liftoff.vep.gff3.gz -file. When I try to add Gnomad allele frequencies using the file Homo_sapiens-GCA_009914755.4-2022_10-gnomad.vcf.gz, they are not added to the variants. VEP runs (for a long time) with no errors, but the output is identical to input when using only the gnomad -resource except for the gnomad_AF -descriptions in the header. I also tested removing VEP annotations from the provided gnomad -file and followed the instructions for annotating at https://www.ensembl.org/info/docs/tools/vep/script/vep_example.html.

Here is my command (just the gnomad -annotation): docker exec VEP vep -i $in_vcf -o $out_vcf --fasta /mnt/references/chm13/chm13v2.0_maskedY_rCRS.fa.gz --vcf --custom file=/mnt/references/chm13/Homo_sapiens-GCA_009914755.4-2022_10-gnomad_chr_noVEP.vcf.gz,short_name=gnomad,format=vcf,type=exact,coords=0,fields=AC%AN%AF%AC_fin%AN_fin%AF_fin

In the header it adds:

VEP="v110" time="2024-03-15 15:34:18" ensembl-variation=110.d34d25e ensembl=110.584a8f3 ensembl-funcgen=110.24e6da6 ensembl-io=110.b1a0d57

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

It also adds the CSQ -field to the variants (denoting all variants as "intergenic_variant|MODIFIER") but there are no gnomad AF -fields. If I instead use as input the original gnomad -file annotated with VEP it includes the gene annotations, but still no allele frequencies.

I'm at my wits end, and would greatly appreciate any help on the matter. Granted I'm not that familiar with the files at https://github.com/marbl/CHM13, but understand they are lifted versions from GRCh38. Any suggestions regarding their use with VEP are welcome!

Thanks!

System

-VEP version 110 -Docker version 20.10.6 -Perl v5.30.0 -OS Ubuntu

Full VEP command line

docker exec VEP vep -i $in_vcf -o $out_vcf --fasta /mnt/references/chm13/chm13v2.0_maskedY_rCRS.fa.gz --vcf --custom file=/mnt/references/chm13/Homo_sapiens-GCA_009914755.4-2022_10-gnomad_chr_noVEP.vcf.gz,short_name=gnomad,format=vcf,type=exact,coords=0,fields=AC%AN%AF%AC_fin%AN_fin%AF_fin

Full error message

Smartmatch is experimental at /opt/vep/src/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSource/File.pm line 472.

jamie-m-a commented 3 months ago

Hi @karoliinas

Sorry to hear you've been having trouble getting gnomAD info added to your variant data. Your command looks fine, and I was able to successfully run it and get frequency / count data added to small set of of test variants.

Would you be able to provide an example of you input VCF so I can investigate further? If your files a large snippet would be fine.

karoliinas commented 3 months ago

Hi @jamie-m-a, thanks for getting back so soon! And good to hear the script works with your data, I was afraid there might be a problem with the gnomad -file. Unfortunately I can't share my data, or at least will have to get some permissions first.

I've ran some tests with a subset of the data (small region and 1 sample instead of 57). I do wonder why it adds the CSQ field and also that doesn't have the gnomad_ prefix. Perhaps this is a default field with vep? Thanks again, I'll get back when I have some more information (on sharing the data).

jamie-m-a commented 3 months ago

Hi @karoliinas

I understand, we can try some things to test without your data:

If you save this as a VCF and try running again with it as the input:

##fileformat=VCFv4.1 #CHROM POS ID REF ALT QUAL FILTER INFO 1 5647 . G C . . . 1 5689 . C T . . . 1 5703 . TG T . . .

These are gnomAD variants so if these have their scores added correctly for you (as they are for me), we can say it's likely that there's some issue with your input VCF.

karoliinas commented 3 months ago

Works like a charm:

##fileformat=VCFv4.1
##VEP="v110" time="2024-03-18 15:44:55" ensembl-funcgen=110.24e6da6 ensembl-io=110.b1a0d57 ensembl=110.584a8f3 ensembl-variation=110.d34d25e
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|SOURCE|gnomad|gnomad_AC|gnomad_AN|gnomad_AF|gnomad_AC_fin|gnomad_AN_fin|gnomad_AF_fin">
##INFO=<ID=gnomad,Number=.,Type=String,Description="[PATH]/Homo_sapiens-GCA_009914755.4-2022_10-gnomad_chr.vcf.gz">
##INFO=<ID=gnomad_AC,Number=.,Type=String,Description="AC field from [PATH]/Homo_sapiens-GCA_009914755.4-2022_10-gnomad_chr.vcf.gz">
##INFO=<ID=gnomad_AN,Number=.,Type=String,Description="AN field from [PATH]/Homo_sapiens-GCA_009914755.4-2022_10-gnomad_chr.vcf.gz">
##INFO=<ID=gnomad_AF,Number=.,Type=String,Description="AF field from [PATH]/Homo_sapiens-GCA_009914755.4-2022_10-gnomad_chr.vcf.gz">
##INFO=<ID=gnomad_AC_fin,Number=.,Type=String,Description="AC_fin field from [PATH]/Homo_sapiens-GCA_009914755.4-2022_10-gnomad_chr.vcf.gz">
##INFO=<ID=gnomad_AN_fin,Number=.,Type=String,Description="AN_fin field from [PATH]/Homo_sapiens-GCA_009914755.4-2022_10-gnomad_chr.vcf.gz">
##INFO=<ID=gnomad_AF_fin,Number=.,Type=String,Description="AF_fin field from [PATH]/Homo_sapiens-GCA_009914755.4-2022_10-gnomad_chr.vcf.gz">
##VEP-command-line='vep --custom [PATH]/Homo_sapiens-GCA_009914755.4-2022_10-gnomad_chr.vcf.gz,short_name=gnomad,format=vcf,type=exact,coords=0,fields=AC%AN%AF%AC_fin%AN_fin%AF_fin --database 0 --dir_plugins /plugins --fasta [PATH]/chm13v2.0_maskedY_rCRS.fasta --input_file [PATH]/test2.vcf --output_file [PATH]/test2_gnomad.vcf --vcf'
#CHROM POS ID REF ALT QUAL FILTER INFO
1   5647    .   G   C   .   .   CSQ=C|intergenic_variant|MODIFIER||||||||||||||||||||||rs1366377162|6|44590|0.000134559|0|2850|0
1   5689    .   C   T   .   .   CSQ=T|intergenic_variant|MODIFIER||||||||||||||||||||||rs1443549584|1|44040|2.27066e-05|0|2766|0
1   5703    .   TG  T   .   .   CSQ=-|intergenic_variant|MODIFIER||||||||||||||||||||||rs1353644827|0|39590|0|0|2494|0

So it must be something with my data. I'll continue with the puzzle, thank you so much for your help! I'll get back when I figure something out in case someone else has similar issues.