custom alphabet support

jsogin574 commented 3 years ago

I may be misunderstanding some of the user manual and portions of programs, but it seems certain HMMER functions are able to handle pHMMs and databases with custom alphabets. I am attempting to run hmmsearch with a pHMM and database that use a custom alphabet but keep getting the error 'bad file format in HMM file XXXXXXXX'.

Is running hmmsearch using a custom alphabet not supported or is this a true file format error on my end?

cryptogenomicon commented 3 years ago

The code is built to allow for custom alphabets, but this doesn't get thoroughly tested. It's certainly rarely used, and I would not be surprised if it required code modifications for any real work.

If you want to provide a reproducible test case, I can look at it.

jsogin574 commented 3 years ago

That would be great. Thanks.

The premise for my use is to compute a pHMM from Q8 protein structure predictions and then use it to search for remote homologues in a larger database of Q8 predictions. I know some pHMM models account for primary sequence and Q3 structural predictions, but I'm not aware of any similar ones that account for Q8 predictions. Nonetheless, the custom alphabet is {H, B, E, G, I, T, S, L}.

I've attached the pHMM (phmm.txt) and fasta database (database.txt) I am attempting to search with it. The pHMM was computed from an alignment of Q8 structure predictions in the R package aphid (because I could not figure out how to use HMMER to compute it). The database includes the list of Q8 predictions I used to compute the pHMM, so I am expecting all the members of the database to be significant hits. The pHMM is by no means fine-tuned; I am simply going through the motions at this point.

OS: Ubuntu 20.04.2 LTS running on WSL2 (5.10.16.3-microsoft-standard-WSL2; x86-64 architecture)
HMMER version: 3.3.2 compiled from source

command: hmmsearch -o test.out phmm.txt database.txt
output: bad alphabet type: unrecognized

phmm.txt database.txt
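Before searching, it can help to confirm that every record in the database really uses only the eight Q8 symbols listed above. The following is a minimal sketch in Python, not part of HMMER or Easel: the function names `parse_fasta` and `unexpected_symbols` and the example records are hypothetical, and it assumes plain FASTA input with `>` header lines.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# The custom alphabet given in the thread: {H, B, E, G, I, T, S, L}.
Q8_ALPHABET = frozenset("HBEGITSL")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            words = line[1:].split()
            name = words[0] if words else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def unexpected_symbols(
    lines: Iterable[str], alphabet: frozenset = Q8_ALPHABET
) -> Dict[str, Set[str]]:
    """Map each record name to the set of symbols that fall outside `alphabet`."""
    bad: Dict[str, Set[str]] = {}
    for name, seq in parse_fasta(lines):
        extra = set(seq.upper()) - alphabet
        if extra:
            bad[name] = extra
    return bad


if __name__ == "__main__":
    # Hypothetical records, for illustration only; 'C' is not a Q8 symbol here.
    example = [">pred1", "HHHHTTEEEE", "LLSS", ">pred2", "HHHCCC"]
    print(unexpected_symbols(example))
```

To check a real file, the same function can be given an open file handle, for example `unexpected_symbols(open("database.txt"))`. An empty result means every sequence stays within the Q8 alphabet; it says nothing about whether hmmsearch will accept the file.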

cryptogenomicon commented 3 years ago

OK, that makes sense. That's the sort of error I was expecting from hmmsearch, whereas a bad file format error sounded more like a bug. As you said, some HMMER functions and programs accept various alphabets including custom ones -- including the file format parser, which should accept the format of a custom alphabet HMM file fine -- but hmmsearch itself is only designed for proteins. To use hmmsearch with a custom alphabet would require at least minor code modifications. (Possibly including sequence input calls, which are also verifying that the input sequence file contains protein sequence - i.e. it's not only the HMM alphabet that matters here.)
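Since the alphabet declared in the HMM file matters as much as the sequence file, one way to catch a mismatch early is to read the alphabet from the HMM header before calling hmmsearch. The sketch below is in Python and is not HMMER code. It assumes the HMMER3 ASCII save format, in which a header line beginning with `ALPH` names the alphabet and the header ends at the line beginning with `HMM`. The function names and the example header lines are hypothetical, and treating only `amino` as usable simply mirrors the description above rather than hmmsearch's actual checks.

```python
from typing import Iterable, Optional


def hmm_alphabet(lines: Iterable[str]) -> Optional[str]:
    """Return the value on the ALPH header line, or None if no such line is found."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ALPH":
            return fields[1] if len(fields) > 1 else None
        if fields[0] == "HMM":
            # Assumption: the header ends where the model section begins.
            break
    return None


def hmmsearch_compatible(alph: Optional[str]) -> bool:
    """Assumption from the thread: hmmsearch is only designed for protein models."""
    return alph is not None and alph.lower() == "amino"


if __name__ == "__main__":
    # Hypothetical header lines, for illustration only.
    header = ["HMMER3/f [3.3.2 | Nov 2020]", "NAME  demo", "ALPH  amino", "HMM   A C D"]
    alph = hmm_alphabet(header)
    print(alph, hmmsearch_compatible(alph))
```

With a real model the call would be `hmm_alphabet(open("phmm.txt"))`. A value other than `amino` suggests that the search will need the code modifications mentioned above.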

jsogin574 commented 3 years ago

Got it. I can't reproduce 'bad file format in HMM file XXXXXXXX' so maybe it was something weird with the pHMM file. Thanks for the help; I'll dig a bit more into hmmsearch.

jsogin574 commented 3 years ago

The 'bad alphabet type: unrecognized' error was being thrown by easel because it was unable to parse the hmm file.

It seems easel only recognizes RNA, DNA, Amino, Coins, and Dice. The error is thrown specifically at line 63 of esl_alphabet.c:

default: esl_fatal("bad alphabet type: unrecognized"); // violation: must be a code error, not user.

cryptogenomicon commented 3 years ago

OK, thanks, sounds right.

EddyRivasLab / hmmer

custom alphabet support #248