Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

No INFO header added to .vcf with the extra info flags used #1322

Closed ESDeutekom closed 8 months ago

ESDeutekom commented 1 year ago

Describe the issue

I want to annotate my .vcf (called with DeepVariant) with additional AF values. Extra values are indeed added to the INFO fields (see the example output file), but no header line explains what the extra fields contain.

Additional information

System

Full VEP command line

vep --dir_cache /variants/cache_vep --vcf -i /results/variant_calling/RM8398_MF_S8.vcf -o /results/variant_annotation/RM8398_MF_S8.annotated.vcf.gz --af --af_1kg --af_gnomade --sift b --polyphen b --gene_phenotype --fork 20 --offline --compress_output gzip

Data files (if applicable)

I think the files might be too big, but here are the important parts (I think) with one example annotation.

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##FILTER=<ID=NoCall,Description="Site has depth=0 resulting in no call.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##FORMAT=<ID=MED_DP,Number=1,Type=Integer,Description="Median DP observed within the GVCF block rounded to the nearest integer.">
##DeepVariant_version=1.4.0
...
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE_RM
chrM    16023   .   G   A   45.4    PASS    CSQ=A|downstream_gene_variant|MODIFIER|MT-ND4|ENSG00000198886|Transcript|ENST00000361381|protein_coding||||||||||rs55934780|3886|1||HGNC|HGNC:7459|1||||||||||||||||||||,A|downstream_gene_variant|MODIFIER|MT-ND5|ENSG00000198786|Transcript|ENST00000361567|protein_coding||||||||||rs55934780|1875|1||HGNC|HGNC:7461|1||||||||||||||||||||,A|upstream_gene_variant|MODIFIER|MT-ND6|ENSG00000198695|Transcript|ENST00000361681|protein_coding||||||||||rs55934780|1350|-1||HGNC|HGNC:7462|1||||||||||||||||||||,A|downstream_gene_variant|MODIFIER|MT-CYB|ENSG00000198727|Transcript|ENST00000361789|protein_coding||||||||||rs55934780|136|1||HGNC|HGNC:7427|1||||||||||||||||||||,A|downstream_gene_variant|MODIFIER|MT-TH|ENSG00000210176|Transcript|ENST00000387441|Mt_tRNA||||||||||rs55934780|3817|1||HGNC|HGNC:7487|1||||||||||||||||||||,A|downstream_gene_variant|MODIFIER|MT-TS2|ENSG00000210184|Transcript|ENST00000387449|Mt_tRNA||||||||||rs55934780|3758|1||HGNC|HGNC:7498|1||||||||||||||||||||,A|downstream_gene_variant|MODIFIER|MT-TL2|ENSG00000210191|Transcript|ENST00000387456|Mt_tRNA||||||||||rs55934780|3687|1||HGNC|HGNC:7491|1||||||||||||||||||||,A|upstream_gene_variant|MODIFIER|MT-TE|ENSG00000210194|Transcript|ENST00000387459|Mt_tRNA||||||||||rs55934780|1281|-1||HGNC|HGNC:7479|1||||||||||||||||||||,A|downstream_gene_variant|MODIFIER|MT-TT|ENSG00000210195|Transcript|ENST00000387460|Mt_tRNA||||||||||rs55934780|70|1||HGNC|HGNC:7499|1||||||||||||||||||||,A|non_coding_transcript_exon_variant|MODIFIER|MT-TP|ENSG00000210196|Transcript|ENST00000387461|Mt_tRNA|1/1||||1|||||rs55934780||-1||HGNC|HGNC:7494|1||||||||||||||||||||  GT:GQ:DP:AD:VAF:PL  0/1:43:131:67,63:0.480916:45,0,47
diegomscoelho commented 1 year ago

Hi @ESDeutekom,

Unfortunately, our cache doesn't have frequency data for chrM. Since your VCF does not contain a variant with frequency data, no description of the frequency fields was included in your header. VEP does include these fields in CSQ but doesn't include their descriptions when the output is a VCF. You can check the descriptions here (https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#vcfout), or run VEP without --vcf to get a printed version of all descriptions.

Regards, @diegomscoelho

ESDeutekom commented 1 year ago

Hi @diegomscoelho, thank you for your reply. My data does actually contain all the other chromosomes; I apparently just picked a bad example.

So a better example:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Sample_X
chr1    13613   .   T   A   11  PASS    CSQ=A|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|6/6||||575|||||rs879980801||1||HGNC

So, I would expect then, if I read your first comment correctly, that there should be a description for the INFO header? I have called variants over the whole human genome, so there must be some frequency data. And I do see additional data in the INFO columns. I just do not know what it is. But according to your second comment, there is no description at all in a vcf if I use parameters like --af etc.?

diegomscoelho commented 1 year ago

Hi @ESDeutekom,

No. With the --vcf flag, VEP adds a single INFO header line, ID=CSQ, that lists the names of all columns in your output, without a separate description for each column, i.e.:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|CLIN_SIG|SOMATIC|PHENO">

If you run VEP without --vcf, all descriptions are printed in the header of the output file.
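As an illustration of how the `Format:` part of that CSQ header line can be read, here is a minimal Python sketch. It is not part of VEP; the function name `csq_fields` is hypothetical, and the header string is shortened for the example.

```python
import re

def csq_fields(header_line):
    """Return the field names listed after 'Format: ' in a ##INFO=<ID=CSQ,...> header line."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' section found in header line")
    return match.group(1).split("|")

# Shortened header for illustration; a real VEP header lists many more fields.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
          'from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|AF|gnomADe_AF">')
fields = csq_fields(header)
print(fields)
```

The position of a name in this list is the position of its value in every pipe-separated CSQ entry of the INFO column.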

Best, @diegomscoelho

ESDeutekom commented 1 year ago

Dear @diegomscoelho,

thank you for your reply.

[...] --vcf flag will include one INFO line INFO=CSQ containing all columns names related to your output

Great! This is the point I was trying to make originally. So what is going on with my INFO header line, then? It does not contain what you say it should. That is why I want to know what is up: am I missing a flag? (See the full VEP command in the original post.)

My line using --vcf --af [etc] (see original post): ##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">

Your line using --vcf [etc]: ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|CLIN_SIG|SOMATIC|PHENO">

Kind regards, Eva

diegomscoelho commented 1 year ago

Hi @ESDeutekom,

When you use the --vcf flag in your vep command, VEP will only add these lines:

##fileformat=VCFv4.1
##VEP="v108" ...
##INFO=<ID=CSQ,...
##VEP-command-line=...

So following your last input as example:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE_X
chr1    13613   .   T   A   .   .   .

I get this:

##fileformat=VCFv4.1
##VEP="v108" ...
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|...
##VEP-command-line='vep --i input.vcf.gz --vcf --no_stats --af ...
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE_X
chr1    13613   .       T       A       .       .       CSQ=A|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|...

Looking at your first output, ##INFO=<ID=END should already have been part of your input before running VEP. Besides that, you omitted the part of your output after ##DeepVariant_version=1.4.0, so I'm not sure whether the VEP header lines were left out. Can you send me a small VCF input plus the VEP output so I can be sure what you mean by no INFO header?
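One way to check whether the VEP header lines are present without scrolling through a long header is to scan everything before the `#CHROM` line. The following is a minimal Python sketch under stated assumptions: the helper names `open_vcf` and `vep_header_lines` are hypothetical, and the example lines are shortened stand-ins for a real file.

```python
import gzip

def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def vep_header_lines(lines):
    """Collect header lines that VEP adds, stopping at the #CHROM column line."""
    found = []
    for line in lines:
        if line.startswith("#CHROM"):
            break
        # Matches ##VEP=..., ##VEP-command-line=... and the CSQ INFO line.
        if line.startswith(("##VEP", "##INFO=<ID=CSQ")):
            found.append(line.rstrip("\n"))
    return found

# In-memory example; for a real file, pass open_vcf("annotated.vcf.gz") instead.
example = [
    '##fileformat=VCFv4.2\n',
    '##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">\n',
    '##VEP="v108"\n',
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence">\n',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n',
]
for line in vep_header_lines(example):
    print(line)
```

If the returned list is empty, the file has no VEP header lines; if it contains the CSQ line, the header is there but may sit far below the first `##INFO` line from the variant caller.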

Regards, @diegomscoelho

ESDeutekom commented 1 year ago

Dear @diegomscoelho,

Thank you. Here are two snippets (first 3000 lines): the .vcf input, snippet_input.txt, and the VEP-annotated .vcf output, snippet_output.txt. I had to convert them to .txt because .vcf is not supported. I hope that is okay.

What I meant was that my input and output do contain an INFO header, but with no explanation of what is in INFO (and thus the INFO column).

So the input file has

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##FILTER=<ID=NoCall,Description="Site has depth=0 resulting in no call.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
^^^^^^^

And

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  RM8398_EF_S16
chr1    13613   .   T   A   14.3    PASS    .   GT:GQ:DP:AD:VAF:PL  0/1:10:3:1,2:0.666667:13,0,10

The output file has:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##FILTER=<ID=NoCall,Description="Site has depth=0 resulting in no call.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
^^^^^^^

And

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  RM8398_EF_S16
chr1    13613   .   T   A   14.3    PASS    CSQ=A|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|6/6||||575|||||rs879980801||1||HGNC|HGNC:37102|||||||||||||||||||||,A|non_coding_transcript_exon_variant|MODIFIER|DDX11L2|ENSG00000290825|Transcript|ENST00000456328|lncRNA|3/3||||861|||||rs879980801||1||EntrezGene||||||||||||||||||||||,A|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene||||||||||rs879980801|791|-1||HGNC|HGNC:38034|||||||||||||||||||||,A|downstream_gene_variant|MODIFIER|MIR6859-1|ENSG00000278267|Transcript|ENST00000619216|miRNA||||||||||rs879980801|3756|-1||HGNC|HGNC:50039|||||||||||||||||||||  GT:GQ:DP:AD:VAF:PL  0/1:10:3:1,2:0.666667:13,0,10

So your example has a more elaborate INFO header, which I copied again below:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|CLIN_SIG|SOMATIC|PHENO">

^^^^ this header explains what is shown in the INFO column

Thanks again.

diegomscoelho commented 1 year ago

Hi @ESDeutekom,

Sorry for the late response. Your snippet_output.txt file has all the header lines that VEP should add:

##VEP="v108" time="2022-12-22 12:32:50" cache="/mnt/scratch_dir/deutekoe/projects/HP/WESPipe/workflow/resources/variants/cache_vep/homo_sapiens/108_GRCh38" ensembl-io=108.58d13c1 ensembl-funcgen=108.56bb136 ensembl-variation=108.a885ada ensembl=108.d8a9c80 1000genomes="phase3" COSMIC="96" ClinVar="202205" HGMD-PUBLIC="20204" assembly="GRCh38.p13" dbSNP="154" gencode="GENCODE 42" genebuild="2014-07" gnomADe="r2.1.1" gnomADg="v3.1.2" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|GENE_PHENO|SIFT|PolyPhen|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|CLIN_SIG|SOMATIC|PHENO">
##VEP-command-line='vep --dir_cache /mnt/scratch_dir/deutekoe/projects/HP/WESPipe/workflow/resources/variants/cache_vep --vcf -i WESP_run/results/variant_calling/RM8398_EF_S16.vcf -o WESP_run/results/variant_annotation/RM8398_EF_S16.annotated.vcf.gz --af --af_1kg --af_gnomade --sift b --polyphen b --gene_phenotype --fork 24 --offline --compress_output gzip'
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  RM8398_EF_S16

Descriptions of those columns as separate ##INFO=<ID=... lines in the header are not included when using the --vcf output flag. You can find information about those columns here.
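To read a record's INFO column against the field names in the CSQ header, each comma-separated CSQ entry can be split on `|` and paired with those names. This is a minimal Python sketch, not VEP code: the function name `decode_csq` is hypothetical, and both the field list and the CSQ entries are truncated to the first five fields of the output above.

```python
def decode_csq(info, fields):
    """Turn the CSQ entry of a VCF INFO string into one dict per annotated feature."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            entries = item[len("CSQ="):].split(",")
            # Pair each pipe-separated value with its field name; drop empty values.
            return [
                {name: value for name, value in zip(fields, entry.split("|")) if value}
                for entry in entries
            ]
    return []

# First five field names from the CSQ header, and two entries truncated to match.
fields = ["Allele", "Consequence", "IMPACT", "SYMBOL", "Gene"]
info = ("CSQ=A|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|ENSG00000223972,"
        "A|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232")
records = decode_csq(info, fields)
for record in records:
    print(record["SYMBOL"], record["Consequence"])
```

With the full field list from the header, the same pairing would show which positions hold AF, gnomADe_AF and the other frequency columns, and which of them are empty for a given variant.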

Hope this can help you, @diegomscoelho

jamie-m-a commented 8 months ago

I think this issue is now resolved so I'm going to close it off, but please reach out to us if you have any more questions.