EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org
Other
307 stars 69 forks source link

dropping columns that have only one sequence in it can fail at the N-terminus. #331

Open danielhhaft opened 1 month ago

danielhhaft commented 1 month ago

I aligned 38 sequences, average length 72, but including one outlier sequence with a length of 505 and C-terminal homology.

Same thing happens with muscle or belvu. Alignment is long , but I expect the alignment to be short because I expect gappy N-terminal columns to disappear the way gappy internal columns disappear.

But hmmbuild (two versions, most recent 3.4 from Aug 2023) did this:

input alignment file: foo.afa

output HMM file: foo.HMM

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

idx name nseq alen mlen eff_nseq re/pos description

---- -------------------- ----- ----- ----- -------- ------ -----------

1 foo 38 538 508 1.38 0.591

See how the model length is 508, very few columns dropped?

HMMER_BUILD_ERROR.msf.txt

I attached an example of a file (multifasta format) that results in the incorrect HMM length.

I've been building HMMs since June 1, 1998. I think I always trim the N-terminus and C-terminus until column density exceeds 50%, so I don't think any model of mine has been corrupted.

But I have concerns for anyone who builds an alignment without trimming. A single long N-terminal extension can end up contributing 80 percent of the total score.

cryptogenomicon commented 1 month ago

Hi Dan,

This is working as designed, actually - it's the result of the sequence fragment definition rule in hmmbuild. The short sequences are getting defined as fragments relative to the long sequence, so the long sequence is driving the consensus here. If you want to get the behavior you were expecting instead (i.e. the short sequences are the consensus, the long sequence has dangly terminal insertions), turn off sequence fragment definition with --fragthresh 0.

In more detail...

Here's some examples from my notes:

We need fragment definition because of cases like this:

seq1 ACDEF---------------
seq2 -----GHIKL----------
seq3 ----------MNPQR-----
seq4 ---------------STVWY

where the consensus is ACD..VWY but all we have are short-read metagenomic fragments of it, for example. (A case like this was reported by JCVI on an alpha version of HMMER3 that lacked sequence fragment definition - maybe that was you, even? I don't recall now.) When a sequence is a fragment, we don't want to count its termini as real deletions relative to consensus.

Then we have cases like yours:

seq1 ACDEFGHIKLMNPQRSTVWY
seq2 ----------MNPQR-----
seq3 ----------MNPQR-----
seq4 ----------MNPQR-----

It's ambiguous whether the user expects seq2-4 to be considered to be fragments relative to seq1 (perhaps they're from metagenomic short reads), or whether the user expects seq1 to have long dangly terminal insertions relative to seq2-4.

The HMMER3 definition rule is currently by "column coverage": the raw unaligned sequence length L, divided by the alignment length in columns C. If L/C <= threshold t (default 0.5), the sequence is called a fragment. So here seq2-4 are called fragments. And that's what's happening in your example too.

The code has had sequence fragment definition since 2008 (!), and the most recent change in the heuristic rule was in 2011, so it's been this way for a long time.