Very long gap in an alignment, due to extreme mononlucleotide repeat in profile

traviswheeler commented 4 years ago

The attached hmm and target sequence (both with .txt extension to force github to allow them to be uploaded) pair to yield a >9Kb long gap in an alignment.

target.fa.txt DR0224948.hmm.txt

% nhmmer DR0224948.hmm target.fa > test.out

produces a single alignment with these summary stats:

>> DR0054876.1/3700-4000
     score  bias    Evalue   hmmfrom    hmm to     alifrom    ali to      envfrom    env to       sq len      acc
    ------ ----- ---------   -------   -------    --------- ---------    --------- ---------    ---------    ----
  !   97.6   1.4     7e-35      3319     12518 ..       106       182 ..        86       201 ..       301    0.41

In other words: >9Kb of model aligns to 76 target nucleotides - the alignment is basically just one big gap.

The query hmm contains several very long runs of G, including one that's almost 4KB long. That's a strange model, but still shouldn't produce this kind of alignment. I haven't identified a fix yet, but want to be sure this is documented, and am planning to investigate.

cryptogenomicon commented 4 years ago

I've seen this sort of thing when an insert-insert transition parameter is 0, which can happen after rounding off to a scaled integer. There's code to prevent this: scaled int insert parameters are bounded at -1, so there's always a penalty for an insertion. Maybe nhmmer is somehow overriding that?

traviswheeler commented 4 years ago

Poking around a bit - this is not specific to nhmmer. I get the same crazy deletion running hmmsearch:

% hmmsearch  DR0224948.hmm target.fa  
.
.
.
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !  104.8   0.0   1.6e-35   1.6e-35    3319   12517 ..     106     181 ..      86     201 .. 0.41

That's >9000 model states aligned to 75 target sequence characters, which means a whole lot of delete states are getting used. In the HMM, all the delete transitions are the same, roughly 33% and 67%:

    d->m     d->d
   1.09861  0.40547

What I see: p7_domaindef_ByPosteriorHeuristics() identifies a single envelope on the target sequence (86..201). Based on the end of the alignment, this seems probably reasonable (there's a really good alignment from 107..181).

Then in rescore_isolated_domain(), p7_OptimalAccuracy() is asked to fill in the 18309x116 DP matrix (the entire model vs this little target sequence envelope), and then p7_OATrace() traces through those ~75 p7T_M states (sequence positions 181 back to 107) ... followed by ~9000 p7T_D states, then picking up one last p7T_M state. The DP matrix fill+trace is complicated enough that it's a struggle for me to follow - I'm not sure if it's a problem in how ox is being filled, or in how OATrace is reading through that matrix.

At this point, I've run out of juice, and won't have the free time to hunt it any further for a while. Since this is generic to p7_domaindef_ByPosteriorHeuristics(), and impacts hmmsearch (and presumably phmmer) in addition to nhmmer, maybe it'll make its way onto your to-do list?

cryptogenomicon commented 1 year ago

I've looked into this (at last). This is an unusual new case where H3's domain definition heuristics are failing. I won't be able to fix this in HMMER3, but I believe it has been solved by new algorithms that will be in HMMER4. I'm going to leave the issue open, tag it for H4, and we'll revisit it in the future.

The problem arises from the fact that the profile here consists of ~30 multiple tandem repeats of ~380nt (plus a few kb of polyG; the polyG mononuc repeats are a red herring, they are not the problem). The target consists of a piece of one repeat. HMMER isn't designed for this; it's designed for the reverse, for your profile to be one "domain" or repeat, and for the target to consist of one or more domains/repeats. It would usually still work anyway (some Pfam models consist of two tandem domains, to increase signal/noise on some short tandem domain families), but this is a major factor in the problem. So one workaround is to define your profile to be a single ~380nt repeat.

In more detail: the target aligns to this profile in ~30 different places. The marginalized posterior probability matrix, projected onto the target, says that the target definitely has a homologous region (an "envelope"). But the model is uncertain where the alignment is on the profile; with the posterior probability spread over ~30 possible alignments, all places have low posterior probability. You can see this uncertainty in the per-residue posterior probabilities, which are very low (2's; ~20% probable). A Viterbi optimal alignment would picks the highest-scoring possibility. But we do "optimal accuracy" alignment, not Viterbi, by default; OA looks for an alignment that has the greatest sum of posterior decoding probabilities. Because there's ~30 different places where the target envelope can align more or less as well to the profile, it just happens that in this particular example it can cobble together an OA alignment that picks up the first part of a repeat at one place in the profile and the rest of the repeat at a distant place in the profile. The result is a stupid alignment, but it is a "correct" consequence of the H3 domain definition heuristics interacting with OA alignment, as I defined the procedure - i.e. it's not a "bug" per se, I can't do anything about it without changing the algorithm. And that's what H4 will do. One of the main goals of H4 has been to get rid of the domain definition heuristics, which I've seen fail on some other rare cases too.

Biologically, your profile is interesting and unusual. Although you have it annotated as a "putative TE family", this is not likely to be a transposon; it's an array of at least 30 tRNA genes, consisting of 4+ repeats of an Arg/iMet{x5-6} unit of an Arg tRNA followed by 5-6 iMet tRNAs. tRNAscan-SE output on the 18309nt consensus of your profile:

Sequence                    tRNA    Bounds  tRNA    Anti    Intron Bounds   Inf       
Name                    tRNA #  Begin   End     Type    Codon   Begin   End Score   Note
--------                ------  -----   ------  ----    -----   -----   ----    ------  ------
DR0224948-consensus     1   53      124     iMet    CAT 0   0   60.4    
DR0224948-consensus     2   438     509     iMet    CAT 0   0   60.4    
DR0224948-consensus     3   815     886     iMet    CAT 0   0   60.4    
DR0224948-consensus     4   1235    1307    Arg CCT 0   0   67.5    
DR0224948-consensus     5   1314    1385    iMet    CAT 0   0   60.4    
DR0224948-consensus     6   1691    1762    iMet    CAT 0   0   60.4    
DR0224948-consensus     7   2068    2139    iMet    CAT 0   0   60.4    
DR0224948-consensus     8   2443    2514    iMet    CAT 0   0   58.4    
DR0224948-consensus     9   2820    2891    iMet    CAT 0   0   60.4    
DR0224948-consensus     10  3240    3312    Arg CCT 0   0   67.5    
DR0224948-consensus     11  3319    3390    iMet    CAT 0   0   60.4    
DR0224948-consensus     12  3694    3765    iMet    CAT 0   0   60.4    
DR0224948-consensus     13  4071    4142    iMet    CAT 0   0   60.4    
DR0224948-consensus     14  4457    4528    iMet    CAT 0   0   60.4    
DR0224948-consensus     15  4847    4918    iMet    CAT 0   0   60.4    
DR0224948-consensus     16  5224    5295    iMet    CAT 0   0   60.4    
DR0224948-consensus     17  10058   10130   Arg CCT 0   0   67.5    
DR0224948-consensus     18  10137   10208   iMet    CAT 0   0   60.4    
DR0224948-consensus     19  10521   10592   iMet    CAT 0   0   60.4    
DR0224948-consensus     20  10917   10988   iMet    CAT 0   0   60.4    
DR0224948-consensus     21  11289   11360   iMet    CAT 0   0   60.4    
DR0224948-consensus     22  11664   11735   iMet    CAT 0   0   47.4    
DR0224948-consensus     23  12041   12112   iMet    CAT 0   0   51.0    
DR0224948-consensus     24  12361   12435   Arg CCT 0   0   33.5    pseudo
DR0224948-consensus     25  12442   12513   iMet    CAT 0   0   60.4    
DR0224948-consensus     26  16593   16664   iMet    CAT 0   0   60.4    
DR0224948-consensus     27  16971   17042   iMet    CAT 0   0   60.4    
DR0224948-consensus     28  17348   17419   iMet    CAT 0   0   60.4    
DR0224948-consensus     29  17720   17791   iMet    CAT 0   0   60.4    
DR0224948-consensus     30  18105   18176   iMet    CAT 0   0   47.4

The metadata on the profile say there were 164 sequences in the alignment, all from one genome assembly of the redlip mullet Planiliza haematocheilus (a teleost fish; GCA_005024645.1). Are there really 164 copies of this ~18kb unit in that genome? That'd be ~4800 tRNAs; nobody needs that many iMet tRNAs, and (assuming the assembly is correct) there must be an interesting reason they're there.

traviswheeler commented 1 year ago

Thanks - the role of OA in these shenanigans makes sense.

Your question about the tRNAs is a really good one. I'll pass it on to the developers of the model.

EddyRivasLab / hmmer

Very long gap in an alignment, due to extreme mononlucleotide repeat in profile #201