Fix for iss#171: guessing query file format

traviswheeler commented 4 years ago

This is a much delayed fix for issue #171.

If nhmmer is called without guidance on the input type, it tries to guess whether the query is an HMM, MSA, or unaligned file of seqs. It previously did this by trying to open the file first as an HMM, then as an MSA, and finally as a sequence file - if it succeeded at any point, it would decide this was the correct input type. This over-eager guessing strategy could lead to unexpected results:

when reading a sequence file with a single sequence, nhmmer would treat the input as a one-sequence MSA; this is not intuitive.
when reading a sequence file with many sequences all of the same length, nhmmer would guess that this was an MSA, and smash them into a single query, computing one model for the "MSA"; this is just plain wrong

This PR reimplements the format-guessing framework, still seeking to get the guess right, but insisting that the user provide guidance if the choice isn't completely clear. It also simplifies the logic of the guessing code, updates documentation, and removes the --qhmm, --qmsa, and --qseq flags, which were redundant with --qformat. The new strategy is:

If the user has provided guidance w/ a --qformat value do as instructed.
Otherwise, try to open the file as an HMM. This either works or doesn't, with no risk. (note: if this fails and the file is piped, it can't be rewound, so give a "must specify format" message)
If the file is rewindable, check to see if it's obviously an MSA file or obviously a sequence file. If not obvious, we'll force the user to tell us. That looks like this:
- Try to open it as an MSA file
  - if ok (i.e. it opens and passes the MSA check, including that all sequences are the same length)
    - if it's a FASTA or A2M format, it still might be a sequence file
      - if the "MSA" is a single sequence, then rewind and call it a sequence input. Otherwise give "must specify" message
    - otherwise, it's an MSA; proceed accordingly
  - if not ok (i.e. it's not an MSA file)
    - if it's anything useful, it must be a sequence file, proceed accordingly

npcarter commented 4 years ago

Starting to review this pull request. Should there be userguide changes to go along with the man page and code changes? (Or am I just not seeing them?)

From a quick look at the current H3 userguide, it says that nhmmer takes an HMM file, an alignment file, or a file containing a single sequence as inputs, which seems inconsistent with what the pull request describes. I'm assuming that the intended behavior when nhmmer is passed a multi-sequence file is the same as phmmer: build an HMM from each sequence and search the target database with that HMM. Is that correct?

traviswheeler commented 4 years ago

Sorry, my PR text should've mentioned this also: commit dff4440 provides an update to nhmmer.man.in, which makes the appropriate changes to the Userguide (after a ./config).

You've mentioned that the userguide says nhmmer takes "a single sequence" as input (among other options). I may be missing where it's saying that. The first line on the nhmmer section of the Userguide (p 126 in the current version) says "nhmmer is used to search one or more nucleotide queries" ... then later it says that a "query may be either a profile model built using hmmbuild, a sequence alignment, or a single sequence". So yes, a query can be an hmm, MSA, or single sequence ... but there can be multiple queries. If you think that's still unclear, I can try to clarify.

npcarter commented 4 years ago

Ah, looks like there are a couple of places where this idea is described. On page 53, in the tutorial, I see: "Nhmmer accepts a target DNA sequence database in the same formats as hmmsearch (typically FASTA). For the query, it accepts either a profile file as produced by hmmbuild, or a file containing either one DNA sequence or an alignment of multiple DNA sequences." That's the part I keyed on in the documentation.

traviswheeler commented 4 years ago

That helps, thanks. I'll fix that text soon.

traviswheeler commented 4 years ago

I think the newest commit clears up that misleading text - thanks for catching it.

npcarter commented 4 years ago

Some comments from reading the commit changes: overall, it looks reasonable. I noticed two places where cfg->qfmt is compared to integer constants (0 and 100). The code might be more maintainable if these were converted to #defines like the other values you compare this field to. The changes to the userguide handle the issue I noticed, so all is good there. Now to re-pull the modified changes and run some tests.

Also, it looks like there's no longer a way to tell nhmmer that the query file should be in hmm format. Should there be one? There isn't any possibility for confusion about how a file should be interpreted with HMM files, but I could see it potentially being useful to have such a flag when building analysis pipelines, as a check that the right type of input had been provided.

npcarter commented 4 years ago

HMM input file tests: Verified that single- and multi-HMM files were accepted as query inputs, and that specifying filenames in the command and piping inputs to nhmmer gave the same results (run time and performance numbers differed, but were similar).

Verified that specifying any of the --qformat options and passing an HMM file as input generated useful error messages.

npcarter commented 4 years ago

Stockholm input file tests: (For this test, I'm using the MADE1.sto and MADE1.hmm files from the tutorials directory)

Piping a Stockholm input file to nhmmer works as advertised: useful error message without --qformat=stockholm, functional run with.

If I specify --qformat=stockholm, running nhmmer on MADE1.sto (either piping the input or specifying in the command) gives the same result as running nhmmer on MADE1.hmm, modulo file name information and run time.

However, if I omit the --qformat=stockholm flag or specify one of the other qformat flags, nhmmer seems to treat the query as a database of independent sequences, running one query/sequence instead of converting all of the sequences in the Stockholm file into a single HMM.

npcarter commented 4 years ago

A2m format tests: (Run on a2m file generated by using esl-reformat to convert MADE1.sto to a2m)

If I specify --qformat=a2m, nhmmer runs correctly, generates same output as for the corresponding HMM file.

If I omit the --qformat=a2m, I get "Error: Unable to guess query file type; please specify (--qformat)". Is this what's supposed to happen? I've done some looking in the documentation and there's no mention that this type of file has to be specified manually.

Travis, I'm gonna throw this pull request back to you. It needs a thorough testing of every input option against every qformat option, and ditto for the tformat options.

traviswheeler commented 4 years ago

Thanks for the careful review. Responses to recent comments: (1) The constants (0 and 100) are based on a convention established in esl_msa.h and esl_sqio.h. Your question lead me to discover esl_sqio_IsAlignment(), which removes the magic constants. Fixed. (2) I've fixed the problem you found when omitting the "--qformat=stockholm" flag. The state variable used to control downstream logic was being set with an inverted value (the code "knew" it was an MSA, then captured "not an MSA" in the variable). Tested extensively: this now works, and no other features are broken. (3) For your a2m test, I'm betting your input consisted of more than one sequence, all with the same length. In this case, we want nhmmer to give an error, because we can't tell if it's a MSA or a file made up of several single sequence queries. I've made the error message a bit more descriptive.

npcarter commented 4 years ago

Testing the new version on FASTA files, and it seems to me like the code is working the way it's designed to, but not in a way that's going to make the most sense to users. Searching a single FASTA sequence against a FASTA database works correctly, returning the same results whether or not the --qformat=fasta flag is passed. Piping the file in works when --qformat=fasta is passed, as advertised.

I created a file containing two copies of the same sequence and tried searching that. As expected, without a --qformat=fasta flag an error was raised, with the flag it treated the sequences as separate searches.

The confusing behavior came when I passed --qformat=embl. In that case, the search completed, treating the file as two sequences (and noting that the input file was in embl format in the results). Similar results with --genbank.

Searching with --qformat=stockholm returns an error, but searching with -a2m creates an alignment and works fine.

This seems like it would be counter-intuitive behavior for a user coming at nhmmer without having looked at the code. It's feeling like passing --qformat=fasta, embl, or genbank is really saying "this file is a set of separate sequences to search", not "this file is in the specified format". In particular, passing a FASTA file to a search and having the search results tell me I passed an embl or genbank file as input because I set that format flag seems misleading.

Similarly, if I do have a set of identical-length sequences in FASTA format that I want to treat as an alignment (not sure if this ever comes up, but it feels like it must because we now check for it), the way to tell nhmmer to do that is to set --qformat=a2m? That strikes me as odd.

Maybe Sean could chime in with thoughts on what the "correct" behavior is here? To me, if I specify a file format on the command line, I'm telling the tool exactly what format to expect, and it should flag an error if the file isn't in that format, even if it's one the tool can parse. Is the typical behavior different in biology tools?

traviswheeler commented 4 years ago

You're right, passing --qformat=fasta, embl, or genbank should mean "this file is in the specified format". Fixed (this was working for MSA formats, but not for sequence formats). The result of passing a qformat argument that disagrees with the input file now depends on the combination, and boils down to "how does the corresponding format-reading code in easel deal with being given the wrong format".

The --qformat=a2m thing is working correctly: a2m (https://compbio.soe.ucsc.edu/a2m-desc.html) is a modification of the aligned fasta format (what easel calls "afa"). If you hand in an aligned fasta file, and call nhmmer with "--qformat=a2m", you're saying "the query is an aligned file, so treat it as such". nhmmer will see the fasta file, think "yup, that looks like a2m to me (because it is)", build a model from the alignment, and search with it - the behavior you observed is correct.

Finally: it's pretty common to represent a sequence alignment in fasta format - this results in the "set of identical-length sequences in FASTA format" that you mentioned. nhmmer (and esl-reformat) want this to be distinguished using --qformat=afa (as opposed to --qformat=fasta, which says "treat all sequences individually"). I can see why this may be confusing; I've changed the error message to try to guide the user to set the right qformat.

cryptogenomicon commented 4 years ago

Travis is right, --qformat=<fmt> bypasses format autodetection and asserts that the format is <fmt>, so the input does directly to that format parser. If the input is not in that format, the parser should return an appropriate "parse error on line xx" kind of error.

Once you know the format of the file, you also know whether it's an MSA or an unaligned sequence file.

Your problem in nhmmer is that if the input is an MSA, you don't know whether the user intends to search each individual sequence as a query, or to build a profile of the MSA as a query. Usually they'd expect to make a profile, but we do allow (in phmmer for example) MSAs to be opened as if they're files of unaligned sequences to be read in one at a time. You may need to provide an option for query is an MSA and I want to search each individual seq in it as a query.

Not sure about the A2M vs. AFA part of the above conversation though. A2M is not a subset of AFA (aligned FASTA); they only overlap in a few trivially small cases. In A2M, deletions are -, consensus positions are upper case, insertions are lower case, and pad characters (.) for alignment are removed. So in A2M, sequences are usually different lengths (but each one has the same number of -+upper case). In aligned fasta, case doesn't matter and any of several characters including ., - can be used as gaps, and every sequence has the same number of total chars. If a file could be either A2M or AFA (if it consists of ungapped identical-length upper-case sequences, for example), the Easel format autodetector calls it AFA. In fact, the format autodetector will only try to guess A2M format if the file has a .a2m file suffix, for fear of confusing an AFA file; otherwise A2M always has to be specified with --qformat a2m. (See https://github.com/EddyRivasLab/easel/commit/71f175012549a80296bf6c2b49ce27c5b0c06eee for example)

npcarter commented 4 years ago

Ok, that makes sense and is consistent with what I'd expect from programs. So, I guess the open question is "what flag should someone pass if they want to treat a set of sequences as an alignment" for FASTA or the other sequence data types?

-Nick

On Tue, Jun 30, 2020 at 7:52 AM Sean R. Eddy notifications@github.com wrote:

Travis is right, --qformat= bypasses format autodetection and asserts that the format is , so the input does directly to that format parser. If the input is not in that format, the parser should return an appropriate "parse error on line xx" kind of error.

Once you know the format of the file, you also know whether it's an MSA or an unaligned sequence file.

Your problem in nhmmer is that if the input is an MSA, you don't know whether the user intends to search each individual sequence as a query, or to build a profile of the MSA as a query. Usually they'd expect to make a profile, but we do allow (in phmmer for example) MSAs to be opened as if they're files of unaligned sequences to be read in one at a time. You may need to provide an option for query is an MSA and I want to search each individual seq in it as a query.

Not sure about the A2M vs. AFA part of the above conversation though. A2M is not a subset of AFA (aligned FASTA); they only overlap in a few trivially small cases. In A2M, deletions are -, consensus positions are upper case, insertions are lower case, and pad characters (.) for alignment are removed. So in A2M, sequences are usually different lengths (but each one has the same number of -+upper case). In aligned fasta, case doesn't matter and any of several characters including ., - can be used as gaps, and every sequence has the same number of total chars. If a file could be either A2M or AFA (if it consists of ungapped identical-length upper-case sequences, for example), the Easel format autodetector calls it AFA. In fact, the format autodetector will only try to guess A2M format if the file has a .a2m file suffix, for fear of confusing an AFA file; otherwise A2M always has to be specified with --qformat a2m. (See EddyRivasLab/easel@71f1750 https://github.com/EddyRivasLab/easel/commit/71f175012549a80296bf6c2b49ce27c5b0c06eee for example)

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/EddyRivasLab/hmmer/pull/194#issuecomment-651743953, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDJBZHE3HGST3SHP2YLSCTRZHGWBANCNFSM4OHZMRGQ .

cryptogenomicon commented 4 years ago

That direction doesn't work. You can treat an MSA as a set of unaligned sequences (and read them one at a time, thru Easel's unaligned seq reading functions), but you can't treat a set of unaligned sequences as an MSA. You'd need to infer a multiple alignment first.

npcarter commented 4 years ago

In that case, I'm confused. In the pull request, if you pass nhmmer a FASTA file whose sequences are all the same length, it returns an error telling you that you need to specify the file format. I thought that that was because nhmmer can't auto-detect the difference between a file with N independent sequences that happen to be the same length and one containing an alignment in FASTA (or other sequence) format.

If you pass --qformat=fasta, nhmmer treats the file as N independent sequences. It seemed to me that there should be a corresponding --qformat flag to tell nhmmer to treat the file as an alignment, because otherwise why do we need the user to specify the format in the case where the file contains independent sequences?

Is the answer that some of the alignment files can't be distinguished from FASTA files automatically?

-Nick

On Tue, Jun 30, 2020 at 12:43 PM Sean R. Eddy notifications@github.com wrote:

That direction doesn't work. You can treat an MSA as a set of unaligned sequences (and read them one at a time, thru Easel's unaligned seq reading functions), but you can't treat a set of unaligned sequences as an MSA. You'd need to infer a multiple alignment first.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/EddyRivasLab/hmmer/pull/194#issuecomment-651912705, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDJBZE5K6NHU4BW4J7XWZDRZII2NANCNFSM4OHZMRGQ .

cryptogenomicon commented 4 years ago

I think the confusion is just that FASTA format isn't an alignment file; aligned FASTA is. --qformat=fasta, FASTA format, unaligned sequences, sequences can be of any lengths, no gap chars; --qformat==afa, aligned FASTA format, an MSA, all sequences must be identical length in chars including gap characters.

traviswheeler commented 4 years ago

Regarding Nick's (reasonable) afa/fasta confusion: this is what motivated the updated error message committed in https://github.com/EddyRivasLab/hmmer/pull/194/commits/e65bff1f8ecbafd542a648c797fed196cf041da5. When nhmmer sees an input that might be afa and might be fasta, it fails with the new message:
Error: Query file type could be either aligned or unaligned; please specify (--qformat [afa|fasta])

npcarter commented 4 years ago

Ah, great. Yes, that does sound like a good way to handle this.

-Nick

On Tue, Jun 30, 2020 at 2:52 PM Travis Wheeler notifications@github.com wrote:

Regarding Nick's (reasonable) afa/fasta confusion: this is what motivated the updated error message committed in e65bff1 https://github.com/EddyRivasLab/hmmer/commit/e65bff1f8ecbafd542a648c797fed196cf041da5. When nhmmer sees an input that might be afa and might be fasta, it fails with the new message: Error: Query file type could be either aligned or unaligned; please specify (--qformat [afa|fasta])

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/EddyRivasLab/hmmer/pull/194#issuecomment-651979307, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDJBZAGSMGHK6PUEORNV6LRZIX6PANCNFSM4OHZMRGQ .

traviswheeler commented 4 years ago

I've pushed a commit (https://github.com/TravisWheelerLab/hmmer/commit/4ca1e16246fa2b3f26d21d438b054f9d4ecabe7c) that allows the user to do as Sean suggests: saying "the input is an msa format, but treat it as a source of individual sequences". The flag is --qsingle_seqs. I'd missed that phmmer allows this, and agree that it's proper.

I've followed up with one (https://github.com/EddyRivasLab/hmmer/pull/194/commits/31214231e3b5e45d4704d2922480ffcd18eeebb4) that handles the one confusing case that I found when testing this: when the query is an afa containing gap characters, but the user insists it should be treated as containing single sequences (--qformat fasta ... or --qsingle_seqs). By default, those would cause an error complaining about gap characters. The code now does what the user expects to happen (ignores the gap characters).

cryptogenomicon commented 4 years ago

clarification: it's the Easel unaligned-sequence-reading interface itself (not only phmmer) that will automatically read an alignment file as a set of individual sequences, if you pass an MSA file to unaligned sequence open/read calls. If the user tries to assert that an AFA alignment file (with gaps) is a FASTA file, it's correct to complain with an error. However, --qsingle_seqs should not cause an error; that should instead go into a code path that opens the file as an unaligned seq source, and that will handle an AFA (or other alignment file) fine.

npcarter commented 4 years ago

Travis, could you update the man page with all of the recent changes? I want to be able to come at the testing from the point of view of a new user, and also am having trouble tracing through the various commit messages to piece together how everything should work.

Also, something that may have gotten lost: do we want a format that specifies that the input file is an HMM? This would mostly be for checking as part of a pipeline, IMO.

traviswheeler commented 4 years ago

Good timing - I was just working on that documentation, along with a minor update to match Sean's expectation that nhmmer should raise an error if given an msa but told the format is fasta (this is overridden by --qsingle_seqs). See recent commits.

Regarding having something like a --qhmm flag to indicate that the query is an hmm: I can add one if you think it's preferable, but have intentionally avoided doing so. That (searching with hmm query) is the default, and the first thing we try to autodetect. Format flags are used to override that default. I feel like it's superfluous to give the user a separate flag to say "do the normal thing", and hesitated to add another "type" (e.g. for --qformat) to easel's current crop of aligned/unaligned formats (not least because it would mean changing the existing rules for ranges (sqio: <=100, msa: >100, per ~line 100 esl_sqio.h)

cryptogenomicon commented 4 years ago

I agree with that.

npcarter commented 4 years ago

Travis,

Just wanted to check whether the version on GitHub had all of your planned changes or if you're still working, to avoid spending time testing code with known issues.

-Nick

On Wed, Jul 1, 2020 at 1:25 PM Travis Wheeler notifications@github.com wrote:

Good timing - I was just working on that documentation, along with a minor update to match Sean's expectation that nhmmer should raise an error if given an msa but told the format is fasta (this is overridden by --qsingle_seqs). See recent commits.

Regarding having something like a --qhmm flag to indicate that the query is an hmm: I can add one if you think it's preferable, but have intentionally avoided doing so. That (searching with hmm query) is the default, and the first thing we try to autodetect. Format flags are used to override that default. I feel like it's superfluous to give the user a separate flag to say "do the normal thing", and hesitated to add another "type" (e.g. for --qformat) to easel's current crop of aligned/unaligned formats (not least because it would mean changing the existing rules for ranges (sqio: <=100, msa: >100, per ~line 100 esl_sqio.h)

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/EddyRivasLab/hmmer/pull/194#issuecomment-652549764, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDJBZGC53V5PTMIVKAXZ2LRZNWP7ANCNFSM4OHZMRGQ .

traviswheeler commented 4 years ago

Smart to check - I'm not planning to add anything more to this PR.

npcarter commented 4 years ago

I'll start a test pass, then.

On Thu, Jul 2, 2020 at 1:10 PM Travis Wheeler notifications@github.com wrote:

Smart to check - I'm not planning to add anything more to this PR.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/EddyRivasLab/hmmer/pull/194#issuecomment-653125425, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABDJBZFM4AQOQRPOJSPXWDTRZS5RBANCNFSM4OHZMRGQ .

npcarter commented 4 years ago

Current round of tests: genbank, embl formats work as expected.

FASTA format: (note that this is a DNA fasta file that works fine without a --qformat flag or with --qformat=fasta) src/nhmmer --qformat=stockholm --qsingle_seqs ~/Desktop/oneseq.fasta tutorial/dna_target.fa

Error: Invalid alphabet type in query for nhmmer. Expect DNA or RNA

I'm pretty much out of patience at this point. This is CS 101 input combination testing, and expecting me to do it over and over again instead of doing it before you submit the pull request is unreasonable.

cryptogenomicon commented 4 years ago

Take a deep breath, I'll send a rescue dog with the little barrel of good scotch around his neck, we'll figure it out

traviswheeler commented 4 years ago

Having taken a deep breath: I've tested a few dozen argument/input combinations, but clearly had not tested combinations like the one you've highlighted (a --qformat flag in opposition to the input format, paired with the --qsingle_seqs flag). I hope you'll accept my apology for the oversight.

These inaccurate errors happen because nhmmer depends on early identification of the sequence alphabet, and the various format-specific variants of esl_sqfile_GuessAlphabet() do not catch the discrepancy between directed and actual format. I've solved the problem in esl_msafile_stockholm_GuessAlphabet() and esl_msafile_afa_GuessAlphabet(), but not yet in other parsers. If you'd like, I can submit a PR to Easel's develop branch with what I have - I'm not sure I'll have time to understand the nuances of the various parsers.

cryptogenomicon commented 4 years ago

I don't understand that last bit. You have to either guess or assert the file format first; given the file format, you have to guess or assert the alphabet. There should be no ambiguity about the format by the time you get to alphabet-determination ("do not catch the discrepancy between directed and actual format"), because you can't determine the alphabet without knowing which letters are sequence, not annotation. See the example at the end of esl_sqio.c for the typical pattern.

traviswheeler commented 4 years ago

Sorry to have been confusing. Maybe this will better explain what I mean when I say "format-specific variants of esl_sqfile_GuessAlphabet() do not catch the discrepancy between directed and actual format":

In the case that Nick caught, we've been told that the file is in stockholm format (--qformat=stockholm), but it's actually a FASTA file. Before reading in the first sequence, we guess the alphabet. This calls esl_msafile_stockholm_GuessAlphabet() because of the --qformat flag. That function doesn't verify that the file is in the correct format; it just moves forward under the assumption that the file is in stockholm format, removing "blank lines, annotation, comments", and counting the rest (which includes FASTA header lines). It can't figure out what the alphabet is (because it's including non-sequence characters in its tally), so nhmmer dies there.

A similar situation exists with esl_msafile_afa_GuessAlphabet(), e.g. when "--qformat fasta" is paired with a file that's actually in stockholm format. Because esl_msafile_afa_GuessAlphabet() is assuming an AFA format, it sweeps up non-sequence lines that don't begin with ">", and uses those lines to guess the alphabet.

After reflection, I don't yet have a good solution to this.

cryptogenomicon commented 4 years ago

So long as it does die there, I'm ok with that. The user uses an expert option to assert the file format wrongly; bad move on their part; the alphabet guesser then fails and the error message is (or could be) "couldn't figure out the alphabet of your XXX-format file", causing expert user to go "oh, I see". So maybe just improve the error message from Error: Invalid alphabet type in query for nhmmer. Expect DNA or RNA to include the format, to clue the expert into their mistake.

Or instead if the alphabet-guesser guesses the wrong alphabet somehow (guessing that a protein file is DNA), then tries to read it as DNA, that should fail because the protein sequences can't be digitized as DNA, and the error is (or could be) "your file doesn't seem to contain DNA sequences". (The opposite - reading a DNA file as protein - would be more problematic, that's harder to catch, since ACGT look like amino acids too, but that's not nhmmer's problem.)

In both cases the more direct error message would be "according to my format autodetection, looks like you asserted the wrong format, pal", but since the purpose of the --informat option is for an expert to override the format autodetection in the first place, I'm ok with the errors being a little more of the "hey, supposed expert, somehow you've put me in some sort of wedged state" slightly-more-cryptic variety.

But does this explain all the problems Nick was seeing? We should still be getting half-decent errors off of running nhmmer on Swissprot, and I thought we weren't.

traviswheeler commented 4 years ago

Two parts to this: (1) About "how should nhmmer die if the user asserts a format that doesn't match the file": I agree with your suggestion of bolstering the message to essentially say "couldn't figure out the alphabet of your XXX-format file". I'll get that committed some time today. (2) About the "nhmmer accepts protein searches": that's fixed in the commit I pushed earlier today

traviswheeler commented 4 years ago

Quick note to say: I think my most recent commit puts the code in good shape for giving more suggestive errors across a range of erroneous-arguments-and-formats, but I wouldn't suggest testing yet. I'll want to run through more tests tomorrow before calling it "ready".

traviswheeler commented 4 years ago

After the most recent commit (c36338699e1e2b81d3a9564e1f894343e8cf2b65), nhmmer passes all my tests. In some cases, the errors don't provide total clarity about the cause of the problem, e.g.

  % nhmmer --qformat afa seqs6.stk seqs6.fasta
  Error: Error reading msa from the aligned FASTA-formatted file seqs6.stk (7)

... but there's enough information there to hopefully guide the user to quit incorrectly asserting file format.

Note: a recent commit (9b940fe2b3bccf1c37acb2f488d9ade74b7eec88) resolved issue #196 (nhmmer accepting protein sequence inputs w/o error)

npcarter commented 4 years ago

Valgrind is finding some memory leaks in this version. I ran valgrind --leak-check=full src/nhmmer ~/Desktop/oneseq.fasta tutorial/dna_target.fa

and got: ==127671== ==127671== HEAP SUMMARY: ==127671== in use at exit: 5,013 bytes in 14 blocks ==127671== total heap usage: 23,327 allocs, 23,313 frees, 23,464,484 bytes allocated ==127671== ==127671== 5,013 (416 direct, 4,597 indirect) bytes in 1 blocks are definitely lost in loss record 14 of 14 ==127671== at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==127671== by 0x1878DC: msa_create_mostly (in /home/npcarter/hmmer/h3/src/nhmmer) ==127671== by 0x189FC7: esl_msa_CreateDigital (in /home/npcarter/hmmer/h3/src/nhmmer) ==127671== by 0x19495A: esl_msafile_afa_Read (in /home/npcarter/hmmer/h3/src/nhmmer) ==127671== by 0x1919B4: esl_msafile_Read (in /home/npcarter/hmmer/h3/src/nhmmer) ==127671== by 0x1103E0: main (in /home/npcarter/hmmer/h3/src/nhmmer) ==127671== ==127671== LEAK SUMMARY: ==127671== definitely lost: 416 bytes in 1 blocks ==127671== indirectly lost: 4,597 bytes in 13 blocks ==127671== possibly lost: 0 bytes in 0 blocks ==127671== still reachable: 0 bytes in 0 blocks ==127671== suppressed: 0 bytes in 0 blocks ==127671== ==127671== For lists of detected and suppressed errors, rerun with: -s ==127671== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

npcarter commented 4 years ago

I'm seeing a difference between HMM and afa file searches that I'm not sure whether is a bug or because the different formats have different information. Res0.txt (attached) is the result of searching tutorial/MADE1.hmm against tutorial/dna_target.fa. Res1.txt is the result when I use esl_reformat to convert MADE1.hmm into an afa file and repeat the search. There are some lines that are mostly "x" that are present in the HMM search but not the afa search.

res0.txt res1.txt

npcarter commented 4 years ago

Testing alignment file types: HMM and Stockholm work as expected.

Afa sees the issue mentioned above, and one minor weirdness when you pass --qformat=fasta on an afa file. With --qsingle_seq, it seems to run the search just fine, treating each sequence in the alignment as a separate search. Without --qsingle_seq, I get an error of "Line 2: illegal character ." I'm guessing that this is because inserts aren't allowed in unaligned FASTA, but the individual sequence reader knows how to parse them.

a2m works as expected

npcarter commented 4 years ago

Verified that this version detects protein target database and complains.

cryptogenomicon commented 4 years ago

On the MADE1.hmm => afa note above, you must mean MADE1.sto (the alignment); an HMM isn't convertable to an alignment file.

The x's are ok. That's the reference annotation line (RF), which is present in Stockholm format, but not in aligned FASTA format; you stripped off the annotation when you converted to .afa.

npcarter commented 4 years ago

Sequence file format checks: fasta works as expected embl works as expected genbank works as expected

npcarter commented 4 years ago

Answering Sean's comment: looking back, yes, I went from Stockholm->afa, and thanks for the explanation.

npcarter commented 4 years ago

Verified that the latest change addresses the leak Valgrind found. At this point, all of my issues have been addressed.

traviswheeler commented 4 years ago

Thanks for running this through the paces, and putting up with the back-and-forth.

Addressing this also:

... when you pass --qformat=fasta on an afa file. With --qsingle_seq, it seems to run the search just fine, treating each sequence in the alignment as a separate search. Without --qsingle_seq, I get an error of "Line 2: illegal character ."

This is following the guidance from Sean on June 30:

If the user tries to assert that an AFA alignment file (with gaps) is a FASTA file, it's correct to complain with an error. However, --qsingle_seqs should not cause an error; that should instead go into a code path that opens the file as an unaligned seq source, and that will handle an AFA (or other alignment file) fine.

I agree that it's a little confusing, but I'm not sure there's a non-confusing way to deal with all the conflicting information.

cryptogenomicon commented 4 years ago

Thanks both!

EddyRivasLab / hmmer

Fix for iss#171: guessing query file format #194