domain can't be detected in the longer sequence

AlisaGU commented 1 year ago

Hi, When I use hmmsearch to search domain in sequence 10M long, the program returned a blank list.

hmmsearch -o /dev/null --domtblout 10M.dom --noali KRAB.hmm a.fa

[test_10m]$ cat 10M.dom
#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
#
# Program:         hmmsearch
# Version:         3.3 (Nov 2019)
# Pipeline mode:   SEARCH
# Query file:      KRAB.hmm
# Target file:     a.fa
# Option settings: hmmsearch -o /dev/null --domtblout 10M.dom --noali KRAB.hmm a.fa 
# Current dir:     test_10m
# Date:            Sat Apr  8 16:38:39 2023
# [ok]

However, when the domain was against the subsequence of a.fa, a hit was found.

[test_10m]$ seqkit faidx a.fa 10m:9743413-9743433 > b.fa
[test_10m]$ cat b.fa 
>10m:9743413-9743433
EYQTDLYKTVMXQTYWNLWHL
[test_10m]$ hmmsearch -o /dev/null --domtblout 10M.b.dom -E 10000 --domE 10000 --noali /picb/evolgen/users/gushanshan/software/Hmmer/KRAB/KRAB.hmm b.fa
[test_10m]$ cat 10M.b.dom
#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
10m:9743413-9743433  -             21 KRAB                 PF01352.30    42   8.6e-09   21.3   0.2   1   1   8.6e-09   8.6e-09   21.3   0.2    21    41     1    21     1    21 0.91 -
#
# Program:         hmmsearch
# Version:         3.3 (Nov 2019)
# Pipeline mode:   SEARCH
# Query file:      KRAB.hmm
# Target file:     b.fa
# Option settings: hmmsearch -o /dev/null --domtblout 10M.b.dom --noali -E 10000 --domE 10000 /KRAB.hmm b.fa 
# Current dir:     /test_10m
# Date:            Sat Apr  8 16:40:43 2023
# [ok]

Do you know what the reason is? And, what should I do to detect this hit in a.fa?

Best regards,

cryptogenomicon commented 1 year ago

Because it's not statistically significant when it's in the much larger sequence. When you've got 10M-fold more places to find a domain, the E-value is ~10M-fold higher.

AlisaGU commented 1 year ago

Hi, the result was still empty, even the reported evalue was set to 10000000000.

 hmmsearch -o /dev/null --domtblout 10M.dom -E 10000000000 --domE 10000000000 --noali KRAB.hmm a.fa
[test_10m]$ cat 10M.dom 
#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
#
# Program:         hmmsearch
# Version:         3.3 (Nov 2019)
# Pipeline mode:   SEARCH
# Query file:      KRAB.hmm
# Target file:     a.fa
# Option settings: hmmsearch -o /dev/null --domtblout 10M.dom --noali -E 10000000000 --domE 10000000000 KRAB.hmm a.fa 
# Current dir:     test_10m
# Date:            Sat Apr  8 20:36:38 2023
# [ok]

cryptogenomicon commented 1 year ago

Because of the way the acceleration filter methods work in HMMER3, it is generally unable to identify insignificant high E-value hits, no matter what you set -E and --domE to (and you generally wouldn't want to identify a lot of false positives, either).

AlisaGU commented 1 year ago

Thanks for your quick answer. Actually, I used hmmer to search domains against a translated genome. It seems that domains were unlikely to be detected in a longer sequence. To get a more precise result, what is the optimal length to split genome for you?

cryptogenomicon commented 1 year ago

You should translate into proteins/ORFs. HMMER makes assumptions about what proteins typically look like, and how long they are. If you're just "translating" a long genome sequence into only six "open reading frames", with * for stops or something, that's not going to work well.

See http://cryptogenomicon.org/hmmscan-vs-hmmsearch-speed-the-numerology.html for some more information.

AlisaGU commented 1 year ago

This link you shared is especially useful when I choose the program between hmmscan and hmmsearch. Thanks for your advice~

EddyRivasLab / hmmer

domain can't be detected in the longer sequence #296