EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org

Fix Iss#195 - confusing error message for badly-formatted target database #200

Closed traviswheeler closed 4 years ago

traviswheeler commented 4 years ago

This PR addresses the concern raised in https://github.com/EddyRivasLab/hmmer/issues/195: nhmmer provides a confusing error message when the target database fails autodetection.

The root cause: nhmmer's target autodetection code follows a path that ends by checking whether the input is in the binary FMindex format, and those functions (in fm_general.c) handled errors by calling esl_fatal(). As a result, nhmmer exited with an FM-specific error even though the file could have been anything (Nick found the error with a misformatted EMBL file and reproduced it with an image file). The fix changes fm_general.c error handling to do what it should always have done: return an error status, so that the calling function can clean up and report the failure as appropriate. That is what this PR does.
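As a rough illustration of that pattern (not the actual fm_general.c code), here is a minimal sketch in which a reader returns a status instead of exiting, and the caller chooses the message. Everything in it is hypothetical: the status codes, the magic bytes, the in-memory buffer interface, and the function names read_fm_meta() and open_target() are stand-ins invented for the example. It also skips the other formats that real autodetection would try first.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes for this sketch (not Easel's actual codes). */
enum status { ST_OK = 0, ST_EFORMAT = 1 };

/* Hypothetical magic bytes; the real FM-index file layout is not shown here. */
static const unsigned char FM_MAGIC[4] = { 'F', 'M', 'D', 'B' };

/* Try to recognize FM metadata at the start of a buffer.
 * On bad input, return an error status rather than terminating the
 * process, so the caller can clean up and pick a sensible message. */
int read_fm_meta(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < sizeof FM_MAGIC)         return ST_EFORMAT;
    if (memcmp(buf, FM_MAGIC, sizeof FM_MAGIC) != 0)  return ST_EFORMAT;
    return ST_OK;
}

/* Caller: the message depends on whether the user forced the FM format
 * (analogous to --tformat fmindex) or relied on autodetection. */
const char *open_target(const unsigned char *buf, size_t len, int forced_fm)
{
    if (read_fm_meta(buf, len) == ST_OK) return "ok";
    return forced_fm
        ? "Failed to read FM meta data from target sequence database"
        : "Failed to autodetect format for target sequence database";
}
```

The point of the design is that only the caller knows whether the user asked for the FM format explicitly, so only the caller can word the error correctly.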

Error handling now gives messages that make sense:

% nhmmer   MADE1.hmm  img.png
Error: Failed to autodetect format for target sequence database img.png
% nhmmer --tformat fmindex   MADE1.hmm  img.png
Error: Failed to read FM meta data from target sequence database img.png

... and it still works for guided and autodetect on target db:

% makehmmerdb Dfam_ID_1st.embl Dfam.fm
% nhmmer   MADE1.hmm  Dfam.fm
   (works)
% nhmmer  --tformat fmindex  MADE1.hmm  Dfam.fm
   (works)

~~

Note: After the changes here, the initial input that raised this issue for Nick still leads to failure, just a different kind:

% nhmmer   MADE1.hmm  Dfam.embl
Error: Failed to autodetect format for target sequence database Dfam.embl

That's because Easel's header_embl() function expects the first line of the file to begin with ID, whereas Dfam.embl leads off with a comment line (CC). Those first CC lines in Dfam.embl serve as general comments about the file (i.e. not specific to any entry), but that appears not to comply with the format: http://us.expasy.org/sprot/userman.html says "The ID ... line is always the first line of an entry" and "comments always appear below the last reference line". I'd say the error accurately reports that the file doesn't match a known format.
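To make the check concrete, here is a minimal sketch of a first-line test of this kind. The helper first_line_is_id() is hypothetical and is not Easel's header_embl(); it only assumes, as described above, that a conforming file starts with a line whose two-letter code is ID.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not Easel's header_embl(): returns 1 if the line
 * begins with the two-letter EMBL line code "ID", and 0 otherwise. */
int first_line_is_id(const char *line)
{
    return line != NULL && strncmp(line, "ID", 2) == 0;
}
```

Under this sketch, a file whose first line starts with CC would be rejected, which matches the autodetection failure reported for Dfam.embl.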

When I fix Dfam.embl to place the ID line first, it works ... until it hits a bug in Easel's sqascii_ReadBlock() function, which fails to properly handle the "//" delimiter at the end of the first block of sequences. That bug is also fixed (in Easel) and will be the subject of an Easel PR in just a moment.

cryptogenomicon commented 4 years ago

Thanks!