althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License

ValueError: Could Not Determine Alphabet of File When Using digital=True in esl.SequenceFile #80

Closed BioGavin closed 3 weeks ago

BioGavin commented 3 weeks ago

Hi, authors. I'm encountering an issue when reading a file with esl.SequenceFile and the digital=True parameter. Here is the code I'm using for testing:

import pyhmmer.easel as esl
in_fasta_path = "test.fa"
sequences = esl.SequenceFile(in_fasta_path, digital=True)
for sequence in sequences:
    print(f"Name: {sequence.name.decode('utf-8')}")
    print(sequence.sequence)

The test.fa file contains the following sequence in FASTA format:

>bgc:465365|cds:8530054|hsp:8934241|18-46
TYYGNGVSCDDKKCTVDWGKAWSCGADR

When I set digital=True, I get the following error:

Traceback (most recent call last):
  File "/home/gavin/bigslice-cj/debug/read_fa.py", line 7, in <module>
    sequences = esl.SequenceFile(in_fasta_path, digital=True)
  File "pyhmmer/easel.pyx", line 6289, in pyhmmer.easel.SequenceFile.__init__
  File "pyhmmer/easel.pyx", line 6283, in pyhmmer.easel.SequenceFile.__init__
ValueError: Could not determine alphabet of file: 'test.fa'

If I don't set digital, the script runs successfully and produces this output:

/home/gavin/miniconda3/envs/bigslice/bin/python /home/gavin/bigslice-cj/debug/read_fa.py 
Name: bgc:465365|cds:8530054|hsp:8934241|18-46
TYYGNGVSCDDKKCTVDWGKAWSCGADR

Process finished with exit code 0

Here is the version information of pyhmmer I used:

Name: pyhmmer
Version: 0.10.15
Summary: Cython bindings and Python interface to HMMER3.
Home-page: https://github.com/althonos/pyhmmer
Author: Martin Larralde
Author-email: martin.larralde@embl.de
License: MIT
Location: /home/gavin/miniconda3/envs/bigslice/lib/python3.8/site-packages
Requires: psutil
Required-by: bigslice

I understand that the digital=True parameter is intended to convert amino acid letters to numeric values in the range 0-19. I have carefully checked my input sequence to ensure there are no invalid amino acid letters; all characters in the sequence conform to the standard protein alphabet. Despite this, I am still encountering the ValueError: Could not determine alphabet of file error. This is quite puzzling, and I would appreciate any guidance or insight you could provide on this issue.

Thank you for your help!

althonos commented 3 weeks ago

Hi @BioGavin

This is most likely because HMMER cannot determine the alphabet of your sequence file: the file is too short for the alphabet to be guessed from its contents. Since digital=True needs an alphabet to encode the sequences, the parser fails in digital mode but not in text mode.

If you know your sequences are always protein sequences, you can provide the alphabet yourself:

import pyhmmer.easel as esl
in_fasta_path = "test.fa"
alphabet = esl.Alphabet.amino()
sequences = esl.SequenceFile(in_fasta_path, digital=True, alphabet=alphabet)
for sequence in sequences:
    print(f"Name: {sequence.name.decode('utf-8')}")
    print(sequence.sequence)
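
If some of your files are long enough for the alphabet to be guessed and others are not, one possible pattern is to try the automatic guess first and only fall back to an explicit alphabet when the ValueError shown above is raised. The sketch below is an illustration under that assumption, not part of pyhmmer: open_digital is a hypothetical helper, and its opener argument is assumed to accept the same path, digital and alphabet arguments as esl.SequenceFile.

```python
def open_digital(path, fallback_alphabet, opener):
    """Open `path` in digital mode, retrying with an explicit alphabet.

    Hypothetical helper, not part of pyhmmer. `opener` is assumed to be
    called like esl.SequenceFile(path, digital=True, alphabet=...).
    """
    try:
        # First attempt: let the library guess the alphabet from the file.
        return opener(path, digital=True)
    except ValueError:
        # Guessing failed (for example on a very short file), so retry
        # with the alphabet supplied by the caller. If the file is broken
        # for another reason, this second call raises again.
        return opener(path, digital=True, alphabet=fallback_alphabet)
```

With pyhmmer imported as in the snippet above, this could be called as open_digital("test.fa", esl.Alphabet.amino(), esl.SequenceFile). Note that the fallback is only safe if you are sure that any file whose alphabet cannot be guessed really contains protein sequences.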

BioGavin commented 3 weeks ago

Thank you for your response. This solution worked perfectly, and the code now runs successfully.