Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
449 stars 151 forks source link

Haplosaurus errors on start codon loss #1549

Open rrlove-cdc opened 11 months ago

rrlove-cdc commented 11 months ago

Describe the issue

VEP fails with the error message "Can't call method "each_seq" on an undefined value at \ensembl-vep/Bio/EnsEMBL/Variation/TranscriptHaplotype.pm line 451, line 75759."

Immediately preceding this is the warning "MSG: cannot align sequences with length less than 2."

The variant in question appears to delete a start codon. This appears to be a very similar issue to #1291 .

Additional information

Please fill in the following sections to help us find the source of your issue as quickly as possible.

System

Full VEP command line

~/ensembl-vep/haplo \
-i inputs/haplosaurus_test.vcf \
-o outputs/haplosaurus_test.110723.txt \
--species anopheles_gambiae \
--cache \
--cache_version 57 \
--offline

Full error message

--------------------- WARNING --------------------- MSG: cannot align sequences with length less than 2 --------------------------------------------------- Can't call method "each_seq" on an undefined value at \ensembl-vep/Bio/EnsEMBL/Variation/TranscriptHaplotype.pm line 451, line 27.

Data files (if applicable)

Contents of haplosaurus_test.vcf:

##fileformat=VCFv4.1
##fileDate=20231018
#CHROM  POS     ID      REF     ALT     QUAL    INFO    FILTER  FORMAT  SAMP1       SAMP2
3L      19867099        .       AT      A       13577.1 .       .       GT      1|1    0|1
dglemos commented 11 months ago

Hi @rrlove-cdc, Thanks for creating a new issue!

Unfortunately, I cannot reproduce the error. How did you download the cache and which commands you used to prepare the cache files?

rrlove-cdc commented 11 months ago

Thanks for taking a look. As best as I can remember, I downloaded the cache using the commands found here under "Non-indexed cache." I did not do any additional preparation after that. So, something like

curl -O https://ftp.ebi.ac.uk/ensemblgenomes/pub/metazoa/current/variation/vep/anopheles_gambiae_vep_57_AgamP4.tar.gz && tar xzf anopheles_gambiae_vep_57_AgamP4.tar.gz
dglemos commented 11 months ago

That looks good, the cache should be ready to be used.

Can you try to run VEP with the indexed cache? You can download it from here: https://ftp.ensemblgenomes.ebi.ac.uk/pub/metazoa/release-57/variation/indexed_vep_cache/

rrlove-cdc commented 10 months ago

I reran with the indexed cache, but unfortunately the jobs are erroring out at the same spot with the same messages.

rrlove-cdc commented 8 months ago

I left an important detail out of my original post: this error only happens when I use haplosaurus with the bioperl-ext package. (Running haplosaurus without bioperl-ext doesn't give the error, but also doesn't complete; a 60 megabase chunk of the genome will run for a long time and/or give an out-of-memory error even with 48 GB memory.) I have also found that the same "MSG: cannot align sequences with length less than 2" error occurs when I supply gff and fasta files instead of using the cache.

rrlove-cdc commented 8 months ago

This error also occurs when using haplosaurus through singularity, with either the cache or with supplied gff and fasta files.

rrlove-cdc commented 7 months ago

This error appears to be triggered by a combination of three conditions:

a.) the first codon is a stop codon, and b.) the sequence contains an indel, and c.) haplosaurus is being run with the bioperl-ext package

I've found two specific edge cases where this occurs, assuming bioperl-ext is being used:

1.) The three bases of the start codon all independently change via SNPs to a stop codon, and there is an indel downstream. This is incredibly unlikely in real data.

2.) A reverse-strand CDS with a second codon starting with "A" has its first base deleted. As a result of this frameshift, the first codon becomes "TGA." This is my original error, as well as that observed in #1291 .

Some other edge cases that do not trigger the same error, and my best guess as to why:

WARNING: genomic coord 60-61 possibly maps across coding/non-coding boundary in gene2.R1

-------------------- WARNING ----------------------
MSG: Haplosaurus can't find transcripts overlapping your variant(s). The output is empty.

The VCF convention is that variants are always recorded relative to the forward strand, and indels are recorded relative to the last unaffected base; therefore, the position of a variant that deletes only the first base of a forward-strand CDS will be upstream of the CDS boundary. I have not found a way of getting haplosaurus to process this kind of deletion, which may be an edge case of its own. However, the position of a variant that deletes the first base of a reverse-strand CDS will be inside the CDS boundary by one base, so it is processed as a deletion.

Once I narrowed down the trigger, I was able to remove the two instances occurring in my input files and rerun the rest of the data successfully.

Jokendo-collab commented 5 months ago

I am still getting this even after removing the problematic genomic coordinate:

--------------------- WARNING ---------------------
MSG: cannot align sequences with length less than 2
---------------------------------------------------
Can't call method "each_seq" on an undefined value at /opt/ensembl-vep/Bio/EnsEMBL/Variation/TranscriptHaplotype.pm line 451, <__ANONIO__> line 23243.
rrlove-cdc commented 5 months ago

I had to remove two sites in my data before haplosaurus completed successfully. Is it possible that there are additional sites in yours?

Jokendo-collab commented 5 months ago

You mean duplicate variants?

rrlove-cdc commented 5 months ago

My data are genome-wide. After I removed the first site that was causing the error, I then had to remove a different site on another chromosome that caused the same error. Because the error message doesn't specify the coordinate of the site causing the error, it can be difficult to tell that you're now getting the same error because of a different site.