Closed ESDeutekom closed 8 months ago
Hi @ESDeutekom,
Unfortunately, our cache doesn't have frequency data for chrM. Since your VCF does not have a variant with frequency INFO, the description of those INFO fields wasn't included in your header.
VEP does include INFO fields in CSQ but doesn't include their descriptions when the output is a VCF. You can check the descriptions here (https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#vcfout), or run VEP without --vcf to get a printed version of all descriptions.
Regards, @diegomscoelho
Hi @diegomscoelho, Thank you for your reply. My data does actually contain all the other chromosomes. But I apparently just picked a bad example.
So a better example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample_X
chr1 13613 . T A 11 PASS CSQ=A|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|6/6||||575|||||rs879980801||1||HGNC
So, I would expect then, if I read your first comment correctly, that there should be a description for the INFO header? I have called variants over the whole human genome, so there must be some frequency data. And I do see additional data in the INFO columns. I just do not know what it is. But according to your second comment, there is no description at all in a vcf if I use parameters like --af etc.?
Hi @ESDeutekom,
No. With the --vcf flag, VEP will include one INFO line, INFO=CSQ, containing all column names related to your output, without a description added for each column, i.e.:
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|CLIN_SIG|SOMATIC|PHENO">
If you use VEP without --vcf output, you will have all descriptions posted in the header of the file.
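The Format string inside the CSQ line is what gives the column order for every annotation. As a rough sketch (not part of VEP itself), the column names can be pulled out of that header line with a few lines of Python. The function name csq_field_names and the shortened header below are made up for illustration; a real header lists many more columns.

```python
import re

def csq_field_names(header_line):
    """Return the column names listed after 'Format:' in a CSQ INFO header line."""
    # Capture everything between 'Format: ' and the closing double quote.
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' section found in header line")
    return match.group(1).strip().split("|")

# Shortened, illustrative header (an assumption, not a real VEP header).
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
          'annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|AF">')
names = csq_field_names(header)
print(names)
```

The position of a name in this list is the position of its value in each pipe-separated CSQ entry, so AF here would be the fifth value.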
Best, @diegomscoelho
Dear @diegomscoelho,
thank you for your reply.
[...] --vcf flag will include one INFO line INFO=CSQ containing all columns names related to your output
Great! This is the point I was trying to make originally. So, what is going on with my INFO header line then? It does not contain what you say it should. And that is indeed why I want to know what is up, am I missing a flag? (see full VEP command in original post)
My line using --vcf --af [etc] (see original post):
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
Your line using --vcf [etc]:
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|CLIN_SIG|SOMATIC|PHENO">
Kind regards, Eva
Hi @ESDeutekom,
When you use the --vcf flag in your vep command, VEP will only include these lines:
##fileformat=VCFv4.1
##VEP="v108" ...
##INFO=<ID=CSQ,...
##VEP-command-line=...
So, using your last input as an example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE_X
chr1 13613 . T A . . .
I get this:
##fileformat=VCFv4.1
##VEP="v108" ...
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|...
##VEP-command-line='vep --i input.vcf.gz --vcf --no_stats --af ...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE_X
chr1 13613 . T A . . CSQ=A|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|...
Looking at your first output, ##INFO=<ID=END should already have been part of your input before running VEP. Besides that, you omitted the part of your output after ##DeepVariant_version=1.4.0, so I can't tell whether the VEP output was included. Can you send me a small VCF input + VEP output so I can be sure what you mean by "no INFO header"?
Regards, @diegomscoelho
Dear @diegomscoelho,
Thank you. Here are two snippets (first 3000 lines) of the .vcf input snippet_input.txt and vep .vcf annotated output snippet_output.txt. I had to convert them to .txt because .vcf is not supported. Hope that is okay.
And what I meant was: my input and output do contain an INFO header, but with no explanation of what is in the INFO (and thus the INFO column).
So the input file has
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##FILTER=<ID=NoCall,Description="Site has depth=0 resulting in no call.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
^^^^^^^
And
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RM8398_EF_S16
chr1 13613 . T A 14.3 PASS . GT:GQ:DP:AD:VAF:PL 0/1:10:3:1,2:0.666667:13,0,10
The output file has:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##FILTER=<ID=NoCall,Description="Site has depth=0 resulting in no call.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
^^^^^^^
And
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RM8398_EF_S16
chr1 13613 . T A 14.3 PASS CSQ=A|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|ENSG00000223972|Transcript|ENST00000450305|transcribed_unprocessed_pseudogene|6/6||||575|||||rs879980801||1||HGNC|HGNC:37102|||||||||||||||||||||,A|non_coding_transcript_exon_variant|MODIFIER|DDX11L2|ENSG00000290825|Transcript|ENST00000456328|lncRNA|3/3||||861|||||rs879980801||1||EntrezGene||||||||||||||||||||||,A|downstream_gene_variant|MODIFIER|WASH7P|ENSG00000227232|Transcript|ENST00000488147|unprocessed_pseudogene||||||||||rs879980801|791|-1||HGNC|HGNC:38034|||||||||||||||||||||,A|downstream_gene_variant|MODIFIER|MIR6859-1|ENSG00000278267|Transcript|ENST00000619216|miRNA||||||||||rs879980801|3756|-1||HGNC|HGNC:50039||||||||||||||||||||| GT:GQ:DP:AD:VAF:PL 0/1:10:3:1,2:0.666667:13,0,10
So your example has a more elaborate INFO header, which I copied again below:
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|CLIN_SIG|SOMATIC|PHENO">
^^^^ this header explains what is shown in the INFO column
Thanks again.
Hi @ESDeutekom,
Sorry for the late response. Your snippet_output.txt file has all the header lines that should be added by VEP:
##VEP="v108" time="2022-12-22 12:32:50" cache="/mnt/scratch_dir/deutekoe/projects/HP/WESPipe/workflow/resources/variants/cache_vep/homo_sapiens/108_GRCh38" ensembl-io=108.58d13c1 ensembl-funcgen=108.56bb136 ensembl-variation=108.a885ada ensembl=108.d8a9c80 1000genomes="phase3" COSMIC="96" ClinVar="202205" HGMD-PUBLIC="20204" assembly="GRCh38.p13" dbSNP="154" gencode="GENCODE 42" genebuild="2014-07" gnomADe="r2.1.1" gnomADg="v3.1.2" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|GENE_PHENO|SIFT|PolyPhen|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|CLIN_SIG|SOMATIC|PHENO">
##VEP-command-line='vep --dir_cache /mnt/scratch_dir/deutekoe/projects/HP/WESPipe/workflow/resources/variants/cache_vep --vcf -i WESP_run/results/variant_calling/RM8398_EF_S16.vcf -o WESP_run/results/variant_annotation/RM8398_EF_S16.annotated.vcf.gz --af --af_1kg --af_gnomade --sift b --polyphen b --gene_phenotype --fork 24 --offline --compress_output gzip'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RM8398_EF_S16
Descriptions of those columns as ##INFO=ID... lines in the header are not included when using the --vcf output flag. You can find information about those columns here.
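Since the per-column descriptions are not in the VCF header, one way to read the INFO column is to pair each value with its name from the CSQ Format string yourself. The following is only a sketch under some assumptions: decode_csq is a hypothetical helper, the column list and data line are shortened for illustration, and it assumes that individual values contain no literal commas or pipes.

```python
def decode_csq(vcf_line, field_names):
    """Return one dict per CSQ entry found in the INFO column of a VCF data line."""
    # INFO is the 8th tab-separated column of a VCF data line.
    info = vcf_line.rstrip("\n").split("\t")[7]
    for item in info.split(";"):
        if item.startswith("CSQ="):
            # Entries (one per transcript) are comma-separated; values are pipe-separated.
            entries = item[len("CSQ="):].split(",")
            return [dict(zip(field_names, entry.split("|"))) for entry in entries]
    return []

# Hypothetical, shortened column list and data line for illustration only.
fields = ["Allele", "Consequence", "IMPACT", "SYMBOL", "AF"]
line = ("chr1\t13613\t.\tT\tA\t14.3\tPASS\t"
        "CSQ=A|non_coding_transcript_exon_variant|MODIFIER|DDX11L1|,"
        "A|downstream_gene_variant|MODIFIER|WASH7P|\tGT\t0/1")
for record in decode_csq(line, fields):
    print(record["SYMBOL"], record["Consequence"], repr(record["AF"]))
```

An empty string for a frequency column, as in this example, means that no value was annotated for that variant and transcript.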
Hope this can help you, @diegomscoelho
I think this issue is now resolved so I'm going to close it off, but please reach out to us if you have any more questions.
Describe the issue
I want to annotate my .vcf (called with DeepVariant) with additional AF values. Extra values are indeed added to the INFO fields (see output file example), but there is no header explaining what the extra fields contain.
Additional information
Please fill in the following sections to help us find the source of your issue as quickly as possible.
System
Full VEP command line
Data files (if applicable)
I think the files might be too big, but here are the important parts (I think) with one example annotation.