EddyRivasLab / hmmer

HMMER: biological sequence analysis using profile HMMs
http://hmmer.org
Other
305 stars 69 forks source link

--incdomT cannot control the hit significance #292

Closed ZhangDengwei closed 1 year ago

ZhangDengwei commented 1 year ago

Hi,

I am using jackhmmer to iteratively search my query sequences against UniRef100, with the command as follows:

jackhmmer -N 5 -o parameter_examination/out.jackhmmer_B20 -A parameter_examination/out.MSA_B20 --tblout parameter_examination/out.seq.tblout_B20 -T 20 --domT 20 --incT 20 --incdomT 20 --cpu 20 query.fa uniref100.fasta

I want to control the hit with domain bitscore > 20 instead of sequence bitscore > 20, but I found some significant hits have a sequence bitscore > 20 but domain score < 20, for example,

#                                                                  --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name           accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#   ------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
UniRef100_UPI001BE5AAAA -          Bac_001              -                 13   20.0   5.5        16   19.7   5.5   1.3   1   1   0   1   1   0   0 hypothetical protein n=1 Tax=unclassified Chryseobacterium TaxID=2593645 RepID=UPI001BE5AAAA
UniRef100_UPI0021AF61FD -          Bac_001              -                 13   20.0   0.1        42   18.3   0.1   1.8   1   1   0   1   1   0   0 lactococcin family bacteriocin n=1 Tax=Lactococcus lactis TaxID=1358 RepID=UPI0021AF61FD
UniRef100_A0A4V2F585    -          Bac_001              -                 13   20.0   1.0        18   19.5   1.0   1.3   1   0   0   1   1   0   0 RSAM-modified peptide n=1 Tax=Aquimarina brevivitae TaxID=323412 RepID=A0A4V2F585_9FLAO
UniRef100_UPI000D1338CB -          Bac_001              -                 13   20.0   8.3        28   18.9   8.3   1.6   1   1   0   1   1   0   0 hypothetical protein n=1 Tax=Chryseobacterium aurantiacum TaxID=2116499 RepID=UPI000D1338CB

Even though I only set --incT 20 with missing --incdomT 20, the output did not change. Is there any problem with my command? How can I control the hit according to domain's bitscore rather than sequence's bitscore? Any suggestions provided would be much appreciated!

cryptogenomicon commented 1 year ago

You're looking at a per-sequence tabular output. All these sequences have passed the per-sequence bitscore thresholds you set (-T and --incT).

Look at the --domtblout domain output, per-domain, and you'll see that none of the domains from these sequences have been reported or considered significant, because of your reporting (--domT) and inclusion (--incdomT) thresholds.

HMMER distinguishes between the score and significance of whole sequences vs. individual domains.

ZhangDengwei commented 1 year ago

Thanks for the interpretation. I still have a doubt about the hit inclusion in the first round when using the aforementioned command. I still found that some hits with domain bitscore < 20 were considered significant in the 1st iteration and used for the next round of searching. Did I misunderstand the output of out.jackhmmer_B20?

Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence                Description
    ------- ------ -----    ------- ------ -----   ---- --  --------                -----------
+   4.2e-05   37.1  33.0    2.3e+05    6.5   0.1   12.7  0  UniRef100_UPI001EE5AE5F  uncharacterized protein LOC124274692
cryptogenomicon commented 1 year ago

Same answer. That line is telling you about a sequence, not an individual domain. If you want to see which domains were flagged for inclusion, look at the domain section of the main output, or use --domtblout for tabular output.

See the User Guide for more information.