Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
437 stars 149 forks source link

NCBI Fasta causing somewhat arbitrary renaming of chromosomes and thus silent dropping of plugin annotations #1635

Closed davmlaw closed 2 months ago

davmlaw commented 3 months ago

If I use a NCBI fasta file that has identifiers like "NC_000003.11" rather than "3" for contigs, then it occasionally renames variants to use the contig from the fasta

# NCBI fasta file obtained via:
FASTA_FILE=GCF_000001405.39_GRCh38.p13_genomic.fna.gz
wget --quiet -O - https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.39_GRCh38.p13/${FASTA_FILE} | gzip -d | bgzip > ${FASTA_FILE}
samtools faidx ${FASTA_FILE}

Attached a GRCh38 file:

vep_synonym.vcf.gz

Command line is:

perlbrew_runner.sh /data/annotation/VEP/ensembl-vep/vep -i vep_synonym.vcf.gz -o /tmp/out.vcf --cache --dir /data/annotation/VEP/vep_cache --assembly GRCh38 --offline --vcf --force_overwrite --use_given_ref --no_stats --refseq --plugin StructuralVariantOverlap,file=/data/annotation/VEP/annotation_data/GRCh38/gnomad.v4.0.sv.merged.vcf.gz --fasta /data/annotation/fasta/GCF_000001405.40_GRCh38.p14_genomic.fna.gz

Removing the "--fasta" argument stops the conversion.

If I download the fasta from the INSTALL.py script - it also goes away

If I modify StructuralVariantOverlap to print what it is sending to tabix (ie print("Sending tabix: $pos_string_chr\n");

And then grep the output for "NC_" I get:

Sending tabix: NC_000008.11:145065749-145065750
Sending tabix: NC_000009.12:999999-1000000
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "##fileformat=VCFv4.2"
Reading "/data/annotation/VEP/annotation_data/GRCh38/gnomad.v4.0.sv.merged.vcf.gz" failed: No such file or directory
Sending tabix: NC_000013.11:999999-1000000
Sending tabix: NC_000014.9:106639379-106639380
Sending tabix: NC_000015.10:101922990-101922991
Sending tabix: NC_000016.10:999999-1000000
Sending tabix: NC_000021.9:999999-1000000
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "##fileformat=VCFv4.2"
Reading "/data/annotation/VEP/annotation_data/GRCh38/gnomad.v4.0.sv.merged.vcf.gz" failed: No such file or directory
Sending tabix: NC_000022.11:999999-1000000
Sending tabix: NC_000023.11:999999-1000000
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "##fileformat=VCFv4.2"
Reading "/data/annotation/VEP/annotation_data/GRCh38/gnomad.v4.0.sv.merged.vcf.gz" failed: No such file or directory
Sending tabix: NC_000024.10:999999-1000000
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "##fileformat=VCFv4.2"
Reading "/data/annotation/VEP/annotation_data/GRCh38/gnomad.v4.0.sv.merged.vcf.gz" failed: No such file or directory

Version

Tested on RefSeq 110 (GRCh37 and GRCh38) Using Perl 5.30.2

Note on changes / synonym files

I originally thought this had to do with the synonyms file, but I used my own synonym file and it made no difference

I then used the fasta from the VEP download, and it appears to have fixed the issue

davmlaw commented 3 months ago

I found this was due to the fasta file used, rather than the synonyms file. I have updated the title and issue

likhitha-surapaneni commented 3 months ago

Hi @davmlaw , thank you for reporting this and giving the details.

We are able to reproduce this on our end. We are investigating this further and will let you know once we have an update.

Kind regards, Likhitha

likhitha-surapaneni commented 2 months ago

Hi @davmlaw , this issue has now been fixed and will be available in the upcoming release. Thank you and please feel free to open a new issue if there are any other problems.