Closed pd3 closed 3 days ago
Hi @pd3, You are correct, the output field has to be refactored so that it does not include commas. Can you please confirm that you are getting these results from the loftee plugin? If so, the issue should be raised in the loftee repo, as the plugin is developed there.
Best wishes, Diana
I would assume so, but I did not produce the file myself; it comes from the gnomAD VCFs.
As for the responsibility, I would argue that both Loftee and VEP should do something about it: Loftee should not produce such output in the first place and VEP should sanitize outputs from all its plugins to prevent problems like this.
It's unfortunate that this went unnoticed and a major resource got affected.
At the moment, VEP does not sanitize the data returned by any of the plugins. We ask anyone developing a plugin to test how the output is displayed in each format and to ensure the data is parsable. We understand this is not the ideal solution however an extra parsing of the output can have a significant impact on VEP performance for some of the plugins.
> We understand this is not the ideal solution however an extra parsing of the output can have a significant impact on VEP performance for some of the plugins.
I understand why it's a nuisance. But I don't believe that sanitizing plugin output can have a noticeable performance effect, if done well. Have you done any benchmarking to support that claim?
Hi @pd3, Thank you for your feedback, and I completely understand your perspective. While we haven’t conducted specific benchmarking for this particular scenario, our experience with similar cases has shown that any additional parsing, even when optimized, tends to increase runtime. We acknowledge the importance of ensuring output data quality and are committed to being more vigilant with plugin outputs in future integrations.
Best wishes, Diana
Describe the issue
When VEP adds the `LoF_info` annotation, it does not sanitize its output and allows commas in the `LoF_info` subfield. For example:
This makes it impossible to split the consequences by transcript and variant, so programs designed to extract and query VEP annotations (such as `bcftools +split-vep`) fail.
Additional information
An example of such a file and site is chr21:5032064 in https://gnomad-public-us-east-1.s3.amazonaws.com/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz
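To illustrate why an unescaped comma breaks per-transcript splitting, here is a minimal Python sketch. The CSQ string, the three-field layout, and the transcript names are made up for illustration; they are not taken from the gnomAD file, and real CSQ records carry many more subfields.

```python
# Hypothetical CSQ layout: Consequence|Feature|LoF_info (real records have more subfields).
fields = ["Consequence", "Feature", "LoF_info"]

# Two transcript records separated by a comma; the first record's LoF_info
# value itself contains a comma (made-up values, for illustration only).
csq = "stop_gained|TRANSCRIPT_A|GERP_DIST:12.5,BP_DIST:40,missense_variant|TRANSCRIPT_B|"

# A parser that follows the VCF convention splits records on "," and subfields on "|".
records = [rec.split("|") for rec in csq.split(",")]

for rec in records:
    status = "ok" if len(rec) == len(fields) else "malformed"
    print(status, rec)
```

The naive split yields three records instead of two, and the stray middle record has a single subfield, so a consumer can no longer tell which value belongs to which transcript.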
System
VEP version: 101 (possibly more recent versions as well)
An example of such VEP annotation
Proposed solution
Replace commas and other special characters in plugin outputs with the corresponding percent-encoded characters, as recommended by the VCF specification v4.3 in section 1.2 (http://samtools.github.io/hts-specs/VCFv4.3.pdf).
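A minimal sketch of what such percent encoding could look like, written in Python purely for illustration. This is not VEP or loftee code, and the function names `encode_vcf_value` and `decode_vcf_value` are hypothetical. The character table follows the list given in section 1.2 of the VCF v4.3 specification. VEP's CSQ also uses `|` between subfields, which that list does not cover, so it is left out here.

```python
from urllib.parse import unquote

# Characters with special meaning per VCF v4.3 section 1.2, mapped to their
# percent-encoded forms. A single translate() pass avoids double-encoding "%".
VCF_ESCAPES = str.maketrans({
    "%": "%25",
    ":": "%3A",
    ";": "%3B",
    "=": "%3D",
    ",": "%2C",
    "\r": "%0D",
    "\n": "%0A",
    "\t": "%09",
})


def encode_vcf_value(value: str) -> str:
    """Percent-encode VCF special characters in a single value (hypothetical helper)."""
    return value.translate(VCF_ESCAPES)


def decode_vcf_value(value: str) -> str:
    """Reverse the encoding; the standard-library unquote handles any %XX sequence."""
    return unquote(value)


raw = "GERP_DIST:12.5,BP_DIST:40"  # made-up LoF_info-style value
encoded = encode_vcf_value(raw)
print(encoded)
print(decode_vcf_value(encoded) == raw)
```

Because the encoded value no longer contains a literal comma, splitting the annotation on `,` recovers one record per transcript, and consumers can decode individual subfields afterwards.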