EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org

Error: Invalid alphabet type in target / expected FASTA to start with > #271

Closed (ptrebert closed 2 years ago)

ptrebert commented 2 years ago

Hi, I am trying to run HMMER on a pair of query/target FASTA files. I confirmed that both files contain no characters other than ACGT and that both are formatted as single-line FASTA. The query file has a single entry; the target has several hundred.
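For readers who want to repeat that check on their own files, here is a minimal sketch in Python. The function name `check_single_line_fasta` is hypothetical and not part of HMMER or Easel. It assumes strict single-line FASTA (header and sequence lines alternate) and an uppercase ACGT alphabet, as described above.

```python
def check_single_line_fasta(lines, allowed="ACGT"):
    """Return a list of problems found in single-line FASTA input.

    Assumption: odd-numbered lines are headers starting with '>',
    even-numbered lines are sequences using only `allowed` characters.
    """
    problems = []
    allowed_set = set(allowed)
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if number % 2 == 1:
            # Header line expected here.
            if not line.startswith(">"):
                problems.append(f"line {number}: expected header starting with '>'")
        else:
            # Sequence line expected here.
            bad = sorted(set(line) - allowed_set)
            if not line:
                problems.append(f"line {number}: empty sequence line")
            elif bad:
                problems.append(f"line {number}: unexpected characters {bad}")
    return problems
```

It could be called as `check_single_line_fasta(open("assembly.fasta"))`; an empty list means the file passed both checks.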

Now, I started by running HMMER v3.3.2 installed via Conda, and encountered the invalid alphabet error:

Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.

I found PR #252 and built HMMER from source (latest commit on develop, #8ab8e8b; Easel was included from develop as described in the README). Now, when I rerun the above data with the --dna switch, I get the following error:

Parse failed (sequence file assembly.fasta):
Line 2: unexpected char A; expected FASTA to start with >

The assembly.fasta file starts as follows (line numbers included for readability):

1 >utig4-953
2 CCCTAACCCTAACC[...]
3 >utig4-1407
4 TCCAAGTAACATC[...]
...

I assume a trivial formatting issue is causing this, but the error message is quite confusing. Thanks for your help.

Best, Peter

cryptogenomicon commented 2 years ago

Can you send a reproducible test case w/ files and command line, please?

ptrebert commented 2 years ago

Confidentially, yes - can I share that via Globus?

cryptogenomicon commented 2 years ago

Preferable to create a small and non-confidential test case, if possible.

ptrebert commented 2 years ago

Ok, I truncated both sequences, but the error is still being triggered:

$ nhmmer --cpu 1 -o text.out --tblout table.out -E 1.60E-150 --dna query.fasta target-10k.fasta

Parse failed (sequence file target-10k.fasta):
Line 2: unexpected char A; expected FASTA to start with >

Sequence composition of the target: A 3332, C 5002, G 0, T 1666.

Attachment: testcase.tar.gz
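A composition count like the one above can be obtained with a few lines of Python. This is only a sketch; the helper name `fasta_composition` is hypothetical, and it simply tallies every character on non-header lines.

```python
from collections import Counter


def fasta_composition(lines):
    """Count residues over all sequence lines of a FASTA file."""
    counts = Counter()
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and header lines; count everything else.
        if line and not line.startswith(">"):
            counts.update(line.upper())
    return counts
```

For example, `fasta_composition(open("target-10k.fasta"))["G"]` would return the number of G residues in the target; a `Counter` returns 0 for characters that never occur.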

ptrebert commented 2 years ago

@cryptogenomicon I have another occurrence of the same error in a different sample, in case the test data above are not sufficient to diagnose and fix the problem.

ptrebert commented 2 years ago

Hi @cryptogenomicon, can you estimate when a fix for this issue will be available in develop?

traviswheeler commented 2 years ago

I've just noticed this issue thread, and that the problem is showing up in nhmmer. I can reproduce the error. After a bit of exploration, it looks like the error disappears if a single 'G' is added anywhere in the first half (or so) of the target sequence. That's not the expected behavior (of course). I also confirm that phmmer on the same input does not produce an error.

I can take a deeper look at this tomorrow, unless someone else is knee-deep in the problem.
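To compare the failing and passing cases side by side, one could generate a variant of the target with a single 'G' inserted, as in the sketch below. The helper names and the default insertion point (a quarter of the way into the sequence) are assumptions chosen for illustration; the sketch only builds test inputs and says nothing about the root cause.

```python
def with_inserted_base(sequence, base="G", fraction=0.25):
    """Return `sequence` with `base` inserted at the given relative position.

    Assumption: fraction=0.25 places the base within the first half
    of the sequence, where the insertion was observed to matter.
    """
    position = int(len(sequence) * fraction)
    return sequence[:position] + base + sequence[position:]


def to_fasta(name, sequence):
    """Format one record as single-line FASTA text."""
    return f">{name}\n{sequence}\n"
```

Writing `to_fasta(name, sequence)` and `to_fasta(name, with_inserted_base(sequence))` to two separate files gives a pair of targets that differ by exactly one residue, which can then be passed to the same nhmmer command line.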

ptrebert commented 2 years ago

Thanks @traviswheeler for taking care of this!

npcarter commented 2 years ago

A fix for this issue has been merged into the develop branch.

ptrebert commented 2 years ago

Thanks a lot!