Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
452 stars 151 forks source link

Unparseable VEP output #1351

Closed pd3 closed 3 days ago

pd3 commented 1 year ago

Describe the issue

When VEP adds the LoF_info annotation, it does not sanitize its output and allows commas in the LoF_info subfield. For example:

[...]|PERCENTILE:0.7,GERP_DIST:-366.3[...]MAX_ORF:-698.745,GA|frameshift_variant[...]

This makes it impossible to split the consequences by transcript and variant, programs that are designed to extract and query VEP annotations fail (such as bcftools +split-vep).

Additional information

Example of such file and site is chr21:5032064 in https://gnomad-public-us-east-1.s3.amazonaws.com/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz

System

VEP version: 101 (possibly more recent versions as well)

An example of such VEP annotation

header:
##INFO=<ID=vep,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|ALLELE_NUM|DISTANCE|STRAND|VARIANT_CLASS|MINIMISED|SYMBOL_SOURCE|HGNC_ID|CANONICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|GENE_PHENO|SIFT|PolyPhen|DOMAINS|HGVS_OFFSET|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF|LoF_filter|LoF_flags|LoF_info">
##vep_version=v101
data line:
vep=GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000612610|protein_coding|5/7||ENST00000612610.4:c.718_719dup|ENSP00000483732.1:p.Asp240GlufsTer35|896-897|709-710|237|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene|||1|A2||ENSP00000483732|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.773118279569892,GERP_DIST:-366.377766615897,BP_DIST:218,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000620481|protein_coding|4/6||ENST00000620481.4:c.367_368dup|ENSP00000484302.1:p.Asp123GlufsTer35|545-546|358-359|120|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene|||5|||ENSP00000484302|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.635578583765112,GERP_DIST:-366.377766615897,BP_DIST:218,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000623795|protein_coding|4/6||ENST00000623795.1:c.367_368dup|ENSP00000485649.1:p.Asp123GlufsTer35|505-506|358-359|120|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene|||2|||ENSP00000485649|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.659498207885305,GERP_DIST:-372.525567065179,BP_DIST:197,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|3_prime_UTR_variant&NMD_transcript_variant|MODIFIER|FP565260.3|ENSG00000277117|Transcript|ENST00000623903|nonsense_mediated_decay|5/7||ENST00000623903.3:c.*332_*333dup||706-707|||||1||1|insertion||Clone_based_ensembl_gene|||2|||ENSP00000485557||||||||||||||||,GA|frameshift_variant|HIGH|FP565260.3|ENSG00000277117|Transcript|ENST00000623960|protein_coding|5/7||ENST00000623960.4:c.718_719dup|ENSP00000485129.1:p.Asp240GlufsTer35|858-859|709-710|237|G/GX|gga/gGAga|1||1|insertion||Clone_based_ensembl_gene||YES|1|P2|CCDS86973.1|ENSP00000485129|||||||PANTHER:PTHR24100&PANTHER:PTHR24100|10|||||HC||PHYLOCSF_WEAK|PERCENTILE:0.790979097909791,GERP_DIST:-372.525567065179,BP_DIST:197,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,ANN_ORF:-698.745,MAX_ORF:-698.745,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|NM_001363770.2|protein_coding|5/7||NM_001363770.2:c.718_719dup|NP_001350699.1:p.Asp240GlufsTer35|858-859|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||YES||||NP_001350699.1||||||||10|||||HC|||PERCENTILE:0.790979097909791,GERP_DIST:-372.525567065179,BP_DIST:197,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|XM_006723899.2|protein_coding|5/6||XM_006723899.2:c.718_719dup|XP_006723962.1:p.Asp240GlufsTer35|1345-1346|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||||||XP_006723962.1||||||||10|||||HC|||PERCENTILE:0.463571889103804,GERP_DIST:-1141.14512844086,BP_DIST:840,DIST_FROM_LAST_EXON:152,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|XM_011546078.2|protein_coding|5/7||XM_011546078.2:c.718_719dup|XP_011544380.1:p.Asp240GlufsTer35|1345-1346|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||||||XP_011544380.1||||||||10|||||HC|||PERCENTILE:0.662062615101289,GERP_DIST:-354.294564935565,BP_DIST:374,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT,GA|frameshift_variant|HIGH|LOC102723996|102723996|Transcript|XM_011546079.1|protein_coding|5/7||XM_011546079.1:c.718_719dup|XP_011544381.1:p.Asp240GlufsTer35|1345-1346|709-710|237|G/GX|gga/gGAga|1||1|insertion||EntrezGene||||||XP_011544381.1||||||||10|||||HC|||PERCENTILE:0.785792349726776,GERP_DIST:-391.862466733158,BP_DIST:203,DIST_FROM_LAST_EXON:187,50_BP_RULE:PASS,PHYLOCSF_TOO_SHORT

Proposed solution

Replace commas and other special characters in plugins' outputs with the corresponding percent encoded characters, as recommended by the VCF specification v4.3 in section 1.2 (http://samtools.github.io/hts-specs/VCFv4.3.pdf).

dglemos commented 1 year ago

Hi @pd3, You are correct, the output field has to be refactored to not include the commas. Can you please confirm you are getting this results from the loftee plugin? If so the issue should be raised in the loftee repo as it is developed there.

Best wishes, Diana

pd3 commented 1 year ago

I would assume so, but the file was not produced by me, it's gnomAD VCFs.

As for the responsibility, I would argue that both Loftee and VEP should do something about it: Loftee should not produce such output in the first place and VEP should sanitize outputs from all its plugins to prevent problems like this.

It's unfortunate that this went unnoticed and a major resource got affected.

dglemos commented 1 year ago

At the moment VEP does not sanitize the data returned by any of the plugins. We ask anyone developing a plugin to test how the output is displayed for each format and ensure the data is parsable. We understand this is not the ideal solution however an extra parsing of the output can have a significant impact on VEP performance for some of the plugins.

pd3 commented 1 year ago

We understand this is not the ideal solution however an extra parsing of the output can have a significant impact on VEP performance for some of the plugins.

I understand why it's a nuisance. But I don't believe that sanitizing plugins output can have a noticeable performance effect, if done well. Have you done any benchmarking to support that claim?

dglemos commented 3 days ago

Hi @pd3, Thank you for your feedback, and I completely understand your perspective. While we haven’t conducted specific benchmarking for this particular scenario, our experience with similar cases has shown that any additional parsing, even when optimized, tends to increase runtime. We acknowledge the importance of ensuring output data quality and are committed to being more vigilant with plugin outputs in future integrations.

Best wishes, Diana