Aborted (core dumped) occurred while doing translate

chtsai0105 commented 1 year ago

Hi - I was trying to load a dna coding sequence and want to further translate to peptide sequence for hmmsearch. I loaded the gzipped fasta dna sequence with pyhmmer.easel.SequenceFile and everything is fine so far, the Alphabet correctly identify it as dna. However when I do translate my jupyter kernel keep crashing. I also tried it with regular script but still got Aborted (core dumped) failure.

I did a quick test to find out what going on. I found that when I can successfully translate the SequenceBlock with the first 5 entries. I can also successfully translate the block from 6th to 20th entries. But when I do it with 0 to 6, it crashed. I also checked the fasta file but cannot found anything that is suspicious.

And this failure seems to be vary between machines. I did the test on another host and it success for the 0 to 6 block. But when I tried to translate the entire block it will eventually failed.

Not sure whether there is anything that I didn't noticed in my fasta so I included the fasta below. Actinomucor_elegans_CBS_100.09.cds.fasta.gz

And here is my test script - it usually failed at the last line but in would failed in seqs[0:6].translate() when running on jupyter. I've already updated the pyhmmer to the latest version 0.10.1.

import pyhmmer

print(pyhmmer.__version__)
seq_file = pyhmmer.easel.SequenceFile("Actinomucor_elegans_CBS_100.09.cds.fasta.gz", digital=True)
seqs = seq_file.read_block()
peps = seqs[0:5].translate()
print("Try to print peptide")
print(peps[-1].sequence)
peps = seqs[6:20].translate()
print("Try to print peptide")
print(peps[0].sequence)
peps = seqs[0:6].translate()
print("Try to print peptide")
print(peps[-1].sequence)

althonos commented 1 year ago

Hi @chtsai0105, thank you for the bug report. I'm also getting a segfault using the test snippet you provided, I'll try to understand what's going on.

althonos commented 1 year ago

The DigitalSequenceBlock.translate method was occasionally writing to the wrong buffer, namely the sequence of the last protein of the block. Depending to how you'd slice, you could end up writing out of the buffer bounds, and get a segfault later! That's why you were seeing different behaviour based on the sequence coordinates.

I have patched the bug, updated tests to pick up any regression, and will make a new release with the patch.

chtsai0105 commented 1 year ago

Thank you for such a quick update! Have confirmed that it behave as expected.

althonos / pyhmmer

Aborted (core dumped) occurred while doing translate #50