Failure to do a global alignment (`nw`) of really long proteins

althonos / pyopal

Cython bindings and Python interface to Opal, a SIMD-accelerated database search aligner.

https://pyopal.readthedocs.io

MIT License

8 stars 1 forks source link

Failure to do a global alignment (`nw`) of really long proteins #3

Closed valentynbez closed 1 year ago

valentynbez commented 1 year ago

I align some extremely long (>20,000 aa) proteins against AF2DB. Opal fails to do that at ~30,000 aa with RuntimeError: Failed to run search Opal database (code=1) for 2 longest sequences in the dataset, both in score and full mode.

Code to reproduce:

import pyopal
from pysam.libcfaidx import FastxFile

target = [x.sequence for x in FastxFile("longest.fa")]
database = [x.sequence for x in FastxFile("af2db_hits.faa")]
database = pyopal.Database(database)

for sequence in target:
    try:
        results = database.search(sequence, mode="full", algorithm="nw")
    except RuntimeError:
        print("Failure", len(sequence))

data.zip

valentynbez commented 1 year ago

I tried to replicate it with randomly generated sequences, but now it always fails. Is this an alphabet issue?

import pyopal
from random import choice
alphabet = 'ACDEFGHIKLMNPQRSTVWY'
# generate proteins with length from 1k to 36k in steps of 1k
proteins = ["".join([choice(alphabet) for _ in range(i)]) for i in range(1000, 36000, 1000)]
database = pyopal.Database(proteins)

for sequence in proteins:
    try:
        results = database.search(sequence, mode="score", algorithm="nw")
    except RuntimeError:
        print("Failure", len(sequence))

althonos commented 1 year ago

I'll have a look when I have more time, I'm also getting failures with the random sequences.

althonos commented 1 year ago

Overflow detection was turned off for 32-bit integer scoring. This only affected the non-SW alignment modes, as SW uses a different function where overflow worked as expected. I updated the code in my fork of Opal.

althonos commented 1 year ago

Fixed in v0.4.1.