Haplosaurus errors on start codon loss

rrlove-cdc commented 11 months ago

Describe the issue

VEP fails with the error message "Can't call method "each_seq" on an undefined value at \ensembl-vep/Bio/EnsEMBL/Variation/TranscriptHaplotype.pm line 451, line 75759."

Immediately preceding this is the warning "MSG: cannot align sequences with length less than 2."

The variant in question appears to delete a start codon. This appears to be a very similar issue to #1291 .

Additional information

Please fill in the following sections to help us find the source of your issue as quickly as possible.

System

VEP version: 110.1
VEP Cache version: 57 (the most recent release of Ensembl Metazoa)
Perl version: v5.16.3
OS: CentOS
tabix installed ? yes

Full VEP command line

~/ensembl-vep/haplo \
-i inputs/haplosaurus_test.vcf \
-o outputs/haplosaurus_test.110723.txt \
--species anopheles_gambiae \
--cache \
--cache_version 57 \
--offline

Full error message

--------------------- WARNING --------------------- MSG: cannot align sequences with length less than 2 --------------------------------------------------- Can't call method "each_seq" on an undefined value at \ensembl-vep/Bio/EnsEMBL/Variation/TranscriptHaplotype.pm line 451, line 27.

Data files (if applicable)

Contents of haplosaurus_test.vcf:

##fileformat=VCFv4.1
##fileDate=20231018
#CHROM  POS     ID      REF     ALT     QUAL    INFO    FILTER  FORMAT  SAMP1       SAMP2
3L      19867099        .       AT      A       13577.1 .       .       GT      1|1    0|1

dglemos commented 11 months ago

Hi @rrlove-cdc, Thanks for creating a new issue!

Unfortunately, I cannot reproduce the error. How did you download the cache and which commands you used to prepare the cache files?

rrlove-cdc commented 11 months ago

Thanks for taking a look. As best as I can remember, I downloaded the cache using the commands found here under "Non-indexed cache." I did not do any additional preparation after that. So, something like

curl -O https://ftp.ebi.ac.uk/ensemblgenomes/pub/metazoa/current/variation/vep/anopheles_gambiae_vep_57_AgamP4.tar.gz && tar xzf anopheles_gambiae_vep_57_AgamP4.tar.gz

dglemos commented 11 months ago

That looks good, the cache should be ready to be used.

Can you try to run VEP with the indexed cache? You can download it from here: https://ftp.ensemblgenomes.ebi.ac.uk/pub/metazoa/release-57/variation/indexed_vep_cache/

rrlove-cdc commented 10 months ago

I reran with the indexed cache, but unfortunately the jobs are erroring out at the same spot with the same messages.

rrlove-cdc commented 8 months ago

I left an important detail out of my original post: this error only happens when I use haplosaurus with the bioperl-ext package. (Running haplosaurus without bioperl-ext doesn't give the error, but also doesn't complete; a 60 megabase chunk of the genome will run for a long time and/or give an out-of-memory error even with 48 GB memory.) I have also found that the same "MSG: cannot align sequences with length less than 2" error occurs when I supply gff and fasta files instead of using the cache.

rrlove-cdc commented 8 months ago

This error also occurs when using haplosaurus through singularity, with either the cache or with supplied gff and fasta files.

rrlove-cdc commented 7 months ago

This error appears to be triggered by a combination of three conditions:

a.) the first codon is a stop codon, and b.) the sequence contains an indel, and c.) haplosaurus is being run with the bioperl-ext package

I've found two specific edge cases where this occurs, assuming bioperl-ext is being used:

1.) The three bases of the start codon all independently change via SNPs to a stop codon, and there is an indel downstream. This is incredibly unlikely in real data.

2.) A reverse-strand CDS with a second codon starting with "A" has its first base deleted. As a result of this frameshift, the first codon becomes "TGA." This is my original error, as well as that observed in #1291 .

Some other edge cases that do not trigger the same error, and my best guess as to why:

A forward-strand CDS with SNPs in each base of the start codon causing it to become a stop codon is correctly output as a "stop_change" if there's no indel elsewhere in the CDS. I believe this is because sequences with and without indels use different alignment algorithms as per https://github.com/Ensembl/ensembl-variation/blob/release/108/modules/Bio/EnsEMBL/Variation/TranscriptHaplotype.pm#L488, and the Bio::Tools::dpAlign is what errors out on an initial stop codon.
A forward-strand CDS with a SNP mutation such that the second codon becomes a stop codon, plus a downstream in-frame indel, completes without error. Presumably this is because the stop codon "counts" towards the sequence length, so the 2-codon sequence is long enough to be aligned.
A forward-strand CDS with a second codon starting with "A" having its first base deleted results in:

WARNING: genomic coord 60-61 possibly maps across coding/non-coding boundary in gene2.R1

-------------------- WARNING ----------------------
MSG: Haplosaurus can't find transcripts overlapping your variant(s). The output is empty.

The VCF convention is that variants are always recorded relative to the forward strand, and indels are recorded relative to the last unaffected base; therefore, the position of a variant that deletes only the first base of a forward-strand CDS will be upstream of the CDS boundary. I have not found a way of getting haplosaurus to process this kind of deletion, which may be an edge case of its own. However, the position of a variant that deletes the first base of a reverse-strand CDS will be inside the CDS boundary by one base, so it is processed as a deletion.

Once I narrowed down the trigger, I was able to remove the two instances occurring in my input files and rerun the rest of the data successfully.

Jokendo-collab commented 5 months ago

I am still getting this even after removing the problematic genomic coordinate:

--------------------- WARNING ---------------------
MSG: cannot align sequences with length less than 2
---------------------------------------------------
Can't call method "each_seq" on an undefined value at /opt/ensembl-vep/Bio/EnsEMBL/Variation/TranscriptHaplotype.pm line 451, <__ANONIO__> line 23243.

rrlove-cdc commented 5 months ago

I had to remove two sites in my data before haplosaurus completed successfully. Is it possible that there are additional sites in yours?

Jokendo-collab commented 5 months ago

You mean duplicate variants?

rrlove-cdc commented 5 months ago

My data are genome-wide. After I removed the first site that was causing the error, I then had to remove a different site on another chromosome that caused the same error. Because the error message doesn't specify the coordinate of the site causing the error, it can be difficult to tell that you're now getting the same error because of a different site.

Ensembl / ensembl-vep