The problem was that nhmmer's target autodetection code follows a path that ends by checking whether the input is in the binary FMindex format, and those functions (in fm_general.c) handled errors by calling esl_fatal(). As a result, nhmmer would exit with an FM-specific error even though the file might be anything (Nick found the error with a misformatted EMBL file, and reproduced it with an image file). The fix changes the error handling in fm_general.c to do what it should always have done: return an error status, so that the calling function can clean up as appropriate. That's what this PR does.
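For readers unfamiliar with the pattern, here is a minimal sketch of "return a status instead of exiting". It is not the code in fm_general.c: the function names, the status codes, and the "FMIX" magic bytes are all made up for illustration. Only the two error messages are taken from the nhmmer output quoted in this PR.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical status codes, standing in for Easel-style return codes. */
enum { ST_OK = 0, ST_EFORMAT = 1 };

/* Hypothetical stand-in for an FM-index metadata reader.
 * The "FMIX" magic is invented for this sketch; it is not the real file layout.
 * On bad input it returns a status rather than terminating the program. */
static int
read_fm_meta(FILE *fp)
{
  char magic[4];

  if (fread(magic, 1, sizeof magic, fp) != sizeof magic) return ST_EFORMAT;
  if (memcmp(magic, "FMIX", sizeof magic) != 0)          return ST_EFORMAT;
  return ST_OK;
}

/* Hypothetical caller. Because the reader returns instead of exiting,
 * the caller can close the file and pick a message that fits the context:
 * FM-specific only when the user explicitly asked for that format. */
static int
open_target(const char *path, int fmt_was_requested, char *errbuf, size_t n)
{
  FILE *fp = fopen(path, "rb");
  int   status;

  if (fp == NULL) {
    snprintf(errbuf, n, "Failed to open target sequence database %s", path);
    return ST_EFORMAT;
  }

  status = read_fm_meta(fp);
  fclose(fp);  /* cleanup is possible because control came back to us */

  if (status != ST_OK) {
    if (fmt_was_requested)
      snprintf(errbuf, n, "Failed to read FM meta data from target sequence database %s", path);
    else
      snprintf(errbuf, n, "Failed to autodetect format for target sequence database %s", path);
  }
  return status;
}
```

With a reader that called a fatal-exit routine, the second branch could never be reached, which is roughly why an arbitrary file used to produce an FM-specific message.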
Error handling now gives messages that make sense:
% nhmmer MADE1.hmm img.png
Error: Failed to autodetect format for target sequence database img.png
% nhmmer --tformat fmindex MADE1.hmm img.png
Error: Failed to read FM meta data from target sequence database img.png
... and it still works on a valid target db, both when the format is given explicitly (--tformat) and when it is autodetected.
Note: After the changes here, the initial input that raised this issue for Nick still leads to failure, just a different kind:
% nhmmer MADE1.hmm Dfam.embl
Error: Failed to autodetect format for target sequence database Dfam.embl
That's because Easel's header_embl() function expects the first line of the file to begin with ID, while Dfam.embl leads off with a comment line (CC). Those first CC lines in Dfam.embl are being used as general comments about the file (i.e. not specific to any entry), but that does not appear to comply with the format: http://us.expasy.org/sprot/userman.html says "The ID ... line is always the first line of an entry" and "comments always appear below the last reference line". I'd say the error is accurately complaining that the file doesn't match a known format.
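To make the check concrete, here is a rough sketch of the kind of first-line test described above. It is a hypothetical helper, not Easel's header_embl(), and it assumes that the ID line code must be followed by whitespace.

```c
#include <string.h>

/* Hypothetical helper (not Easel's header_embl()):
 * returns 1 if the text starts with the EMBL "ID" line code followed by
 * a space or tab, and 0 otherwise. The whitespace requirement is an
 * assumption made for this sketch. */
static int
first_line_is_embl_id(const char *text)
{
  /* If the first two bytes are "ID", text[2] is at worst the terminating NUL. */
  return strncmp(text, "ID", 2) == 0 && (text[2] == ' ' || text[2] == '\t');
}
```

Under this check, a file whose first line starts with CC is rejected even when a valid ID line follows on a later line, which matches the behavior seen with Dfam.embl.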
When I fix Dfam.embl to place the ID line first, it works ... until it bumps into a bug in Easel's sqascii_ReadBlock() function, which fails to properly handle the "//" delimiter at the end of the first block of sequences. That bug is also fixed (in Easel), and will be the subject of an Easel PR in just a moment.
This PR addresses the concern raised in https://github.com/EddyRivasLab/hmmer/issues/195: nhmmer provides a confusing error message when the target database fails autodetection.