EddyRivasLab / easel

Sequence analysis library used by Eddy/Rivas lab code

Fix inconsistent `esl_sqfile_GuessAlphabet` behaviour on file with empty sequences #61

Closed: althonos closed 2 years ago

althonos commented 2 years ago

Hi!

esl_sqfile_GuessAlphabet is supposed to return eslNOALPHABET when it cannot guess the alphabet of a sequence file, so that is the return code I would expect for a file containing only empty sequences. However, while writing unit tests for pyhmmer, I noticed that it unexpectedly returns eslEOD instead.

It turns out that sqascii_GuessAlphabet calls sqascii_ReadWindow, which can return eslEOD when it reaches the end of a sequence. The caller does not handle that case, so eslEOD falls through to the generic error branch:

  status = sqascii_ReadWindow(sqfp, 0, 4000, sq);
  if      ((status == eslEOF)) { status = eslENODATA; goto ERROR; }
  else if (status != eslOK)  goto ERROR;

This PR adds a unit test to make sure eslNOALPHABET is returned on files with empty sequences, and replaces the code above with:

  status = sqascii_ReadWindow(sqfp, 0, 4000, sq);
  if      ((status == eslEOF)) { status = eslENODATA; goto ERROR; }
  else if ((status != eslOK) && (status != eslEOD))  goto ERROR;

to make sure that eslEOD is not considered an error here.
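To illustrate the difference between the two checks, here is a minimal standalone sketch. It does not use Easel itself: the `demo*` status codes are stand-ins whose names mirror the Easel codes quoted above (their numeric values are placeholders), and `bails_before_fix` / `bails_after_fix` are hypothetical helpers that only model whether each version of the check would jump to `ERROR`.

```c
/* Stand-in status codes. The names mirror the Easel codes discussed in
 * this PR, but the numeric values are placeholders, not Easel's. */
enum demo_status {
  demoOK = 0,
  demoEOF,
  demoEOD,
  demoENODATA,
  demoEFORMAT
};

/* Models the check before this PR.
 * Returns 1 if the code would `goto ERROR`, 0 if it would carry on.
 * *out receives the status that would be in effect afterwards. */
int bails_before_fix(int status, int *out)
{
  if      (status == demoEOF) { *out = demoENODATA; return 1; }
  else if (status != demoOK)  { *out = status;      return 1; }
  *out = status;
  return 0;
}

/* Models the check after this PR: end-of-data (EOD) is tolerated, so
 * execution continues past the check instead of bailing out with EOD. */
int bails_after_fix(int status, int *out)
{
  if      (status == demoEOF) { *out = demoENODATA; return 1; }
  else if ((status != demoOK) && (status != demoEOD)) { *out = status; return 1; }
  *out = status;
  return 0;
}
```

With these helpers, `bails_before_fix(demoEOD, &out)` reports an error and leaves `out` set to the EOD code, which matches the behaviour described above, while `bails_after_fix(demoEOD, &out)` lets execution continue. End-of-file and other non-OK statuses are treated identically by both versions.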

cryptogenomicon commented 2 years ago

@npcarter, could you review this PR for us?