EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org
Other
305 stars 69 forks source link

HMMER3 fails to detect a valid flanking hit. #290

Closed Thernn88 closed 1 year ago

Thernn88 commented 1 year ago

Hello,

I have an edge case issue where HMMER fails to recognize valid hits in the flanking regions of a HMM profile. I've tried tweaking parameters to no avail. (Or is there something I've overlooked?)

In the link below is a test.fa in which I isolated the translations of the problem sequence along with a pHMM and the original (full) output file. The sequence >NODE_124875_length_280 is not recovered when it should be.

BlastX, conversely, does recover this sequence as a hit. However, BlastX is ~11x slower and not a viable approach on larger files.

Let me know if there is any additional information or files I can provide.

test.zip

cryptogenomicon commented 1 year ago

I need more information. What do you think the significant alignment should be (what part of what sequence should be hit, by what part of the profile) and why do you think that? It's not immediately obvious that the parent sequence read contains a coding region here.

One immediate observation: your translated ORFs contain '' characters, which are not legal amino acid characters. HMMER is working around that by treating them as X's. This is not optimal - HMMER is designed to take individual protein sequences as input, and the length of the target sequences is part of what it models. If you do six-frame translation of a DNA sequence, it's better to do that as all separate ORFs, rather than six "full length" ORFs with 's.

Thernn88 commented 1 year ago

Thanks for the quick reply! Aligning the six translations against the references reveals two translations overlap by 28 AAs. Attached is a file with these two aligned against the references. Hopefully, this is helpful.

EOG091G0703.aa.zip

I'm using HMMER outside of the box. To give some background, I'm finalizing development of a bioinformatics pipeline which, in short, takes raw reads, translates them, and scores my reference profiles against the read translations. Afterwards, we clean the reads, remove outliers/contaminants and only then "merge" them.

Happy holidays!

cryptogenomicon commented 1 year ago

I'm sorry, that file is not an alignment; it's a FASTA file containing a concatenation of 71 sequences (from the EOG091G0703 family, I presume?) with the six translated frames of your read, 77 total unaligned sequences. Maybe you attached the wrong file?

I still don't understand where you're expecting the significant alignment to be (which piece of which target ORF sequence, to which part of the profile), and why you think it's correct. Unable to follow up on this unless I understand that first.

Thernn88 commented 1 year ago

Apologies, that was my mistake. I uploaded the unaligned version.

I've uploaded the proper file this time. I made a small change in which I clustered the most similar reference with the two candidates.

Screenshot 2022-12-25 142241

EOG091G0703.aa.zip

I'm expecting an alignment from position 95 onwards for a total overlap of 28 AAs.

cryptogenomicon commented 1 year ago

You can use the --max option to turn off the heuristic filters, which increases sensitivity at the expense of speed. You can also change the default E-value cutoff of 10 with the -E <x> option to set it to something less stringent, like 1000. So for example:

% hmmsearch --max -E 1000 EOG091G0703.hmm bug.fa
# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# query HMM file:                  EOG091G0703.hmm
# target sequence database:        bug.fa
# sequence reporting threshold:    E-value <= 1000
# Max sensitivity mode:            on [all heuristic filters off]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query:       EOG091G0703  [M=644]
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence                              Description
    ------- ------ -----    ------- ------ -----   ---- --  --------                              -----------
    1.4e-05    9.7   0.0    1.5e-05    9.6   0.0    1.0  1  NODE_124875_length_280|[translate(1)]  

Domain annotation for each sequence (and alignments):
>> NODE_124875_length_280|[translate(1)]  
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !    9.6   0.0   1.5e-05   1.5e-05       1      27 [.      66      93 .]      66      93 .] 0.95

  Alignments for each domain:
  == domain 1  score: 9.6 bits;  conditional E-value: 1.5e-05
                            EOG091G0703  1 Mgllsdpsarq.kkltklllklnkklcv 27
                                           Mgll++ps +q +k+  +l+k+++k+c+
  NODE_124875_length_280|[translate(1)] 66 MGLLTNPSLNQrSKFGGILIKHKRKICF 93
                                           ********98879**************7 PP

Internal pipeline statistics summary:
-------------------------------------
Query model(s):                            1  (644 nodes)
Target sequences:                          1  (93 residues searched)
Passed MSV filter:                         1  (1); expected 1.0 (1)
Passed bias filter:                        1  (1); expected 1.0 (1)
Passed Vit filter:                         1  (1); expected 1.0 (1)
Passed Fwd filter:                         1  (1); expected 1.0 (1)
Initial search space (Z):                  1  [actual number of targets]
Domain search space  (domZ):               1  [number of targets reported over threshold]
# CPU time: 0.00u 0.00s 00:00:00.00 Elapsed: 00:00:00.00
# Mc/sec: 29.48
//
[ok]