Fix for rare "Invalid alphabet type" nhmmer error

traviswheeler commented 3 years ago

The following error can arise in edge case inputs:

% nhmmer --dna A.fa B.fa Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.

An example input that will produce this error is:

% cat A.fa

seq1 AA

% cat B.fa

seq1 AC


Note: The --dna flag tells nhmmer that the QUERY is in DNA format.
It does not assert anything about the target, and there isn't
a flag that does.

There is a check in the first step of reading the target that aims
to stop a user from providing a protein target sequence to nhmmer:

  esl_sqfile_GuessAlphabet(dbfp, &q_type);
  if (! (q_type == eslDNA || q_type == eslRNA))
    p7_Fail("Invalid alphabet type in target for nhmmer. Expect DNA or RNA.\n");

The problem arises when esl_sqfile_GuessAlphabet() is unable to
guess the target alphabet. This can happen when the target is
not long/diverse enough to provide the guesser with enough
information. Two examples beyond A.fa and B.fa above:

>seq1 - too short  (needs to be >10 characters)
ACGTACGTAC

>seq1 - does not include all 4 nucleotides.
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

These kind of inputs seem likely to be sample inputs to
test that nhmmer runs, not true sequences. In any event,
we'd like nhmmer to run.

We can overcome this error by encoding the approach "if the
guesser can't figure it out, we'll assume that the query
format is the guide". This is acheived by simply allowing
eslUNKNOWN as the result of a guess:

      if (! (q_type == eslDNA || q_type == eslRNA || q_type == eslUNKNOWN))

If the input sequence is guessed as a protein, it'll
still correctly induce the error.

That's the change found in this commit.

Note: it is possible for a user to still sneak past this test, e.g. with the input:

target_seq EE

It has illegal letters for DNA, but is too short for the guesser to guess. As committed, nhmmer will just breeze past that sequence with no match. An input with >10 letters containing any illegal DNA character will produce the previous message:

% cat A.fa

seq1 ACGTACGTAC

% cat E.fa

aa_seq AAAAAAAAAAE

% nhmmer --dna A.fa E.fa Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.



I'm of the opinion that this "short amino sequences don't raise
and error" issue is ok, since that's deep in user error territory
(and such a short sequence would never produce enough score to
yield a match anyway). That said, if we want to be more robust
about invalid amino acid letters in too-short inputs, we could
create an easel function that tests if a supposed string is a
legal match for an input type. To my knowledge, this function
doesn't currently exist. It seems too heavyweight for the problem
at hand, but I'm open to going that route, if you think it's
necessary.

cryptogenomicon commented 3 years ago

Looks good - thanks!

jksull commented 2 years ago

still getting this error even with the develop branch (which was supposedly fixed a few days ago). When manually encoding the fix on the main branch, the error subsides but instead am met with a segmentation fault error for all the libraries where the above 'invalid alphabet' error was previously present.

traviswheeler commented 2 years ago

I have pulled the updates into my local clone of the develop branch, and I do not receive an error when I run a command like: nhmmer --dna query.fa target-10k.fa

You haven't provided much detail, so it's hard to be sure exactly what problem you're running into. Can you please do the following:

run 'git pull' to ensure that you have the fix.
if the error persists: -- provide us with the exact command you are running, and the exact error you are receiving. -- confirm that you are using the files provided in the original issue; if not, tell us that files you are using.

Thanks

EddyRivasLab / hmmer

Fix for rare "Invalid alphabet type" nhmmer error #252