Open apcamargo opened 7 months ago
A bit of background before I give you the easy answer 😁
So there are two kind of thresholds in HMMER, reporting and inclusion. When you use bitscore cutoffs with the pipeline, you actually change the inclusion thresholds, not the reporting thresholds. That's why you sometimes get hits that are not "included", but all hits should have hit.reported == True
(otherwise they are not kept in the TopHits
, which as its name suggests, only keep the top hits, not all hits).
The reason why there are different thresholds for hits and domains is to account for multi-domain hits, where sometimes a hit passes the inclusion thresholds because the combination of domains do, however individual domains do not pass the threshold (this is reported in HMMER's output, sometimes you'll see "no domain above threshold" or something like that).
Now in your example, to only iterate on included hits, and for each hit on included domains, you can use the included
property to avoid a cascade of if/else:
for hit in hits.included:
for domain in hit.domains.included:
# ...
Thanks for the explanation, Martin! That makes sense.
However, I've seem at least a case where the hit was reported, but it had a single domain that wasn't. If having multiple domains is what makes hit with unreported domains being reported, how can a hit with a single unreported domain be reported?
Here's an example:
>A0A010SHV5 1-984
MLDITTISRQSLGKVVSYYADGADDYYAKDGGAMQWQGAGAEALGLSGEVEQARFRELLDGRISDSTKLMRTVKEADGKV
VSKERLGYDLTFSAPKGVSLQALVHGDASIIEAHDKAVAAAIREAERLSQARITVNKKTGTENTNNLVVAKFRHETSRAL
DPDLHTHAFVLNMTQRSDGEWRALKNDGVFNSSMFLGNVYKAELARELEKAGFQLRYERNGTFDLAHFSDEQIREFSSRS
QQIEAALAAKGLDRSTASYAEKNQAALATRDKKQGGIDREELRQVWLERSRALGIDYHSREWAGVGADAQGGRERNSAAT
PQIEKPLEYRADQVIEFAIKSLTERQAVIGQKELMDTALRHGYGALTIDDVRAGIERRAASGHLIREEPLYASQNPADGK
KGKAAEKARQEAPQLSRKEWVANLVRAGKSRSEAARLVDEGIRTGRLRQGENRFTTHIAQKREREVLQIERMGRGTVEPR
ISKEAAEAFLADRGLKAEQQASVMRIARTQNQFIGVQGFAGVGKSYMTVAAKDLLEANGYRVTSLAPYGSQVKALQAEGL
EARTLQSFLKARDKKIDSNTVVFIDEAGVIPARQMHEAMKTIEAAGARVVFLGDVSQTKAIEAGKPFEQLMKAGMETSRL
TDIQRQKDPQLLEAVKLAAEGKAKQSLPLVNEIHEIKEDGARYQAIVDAYASMPKAERDQALIITGTNASRIEINEGVRE
ALGLKGQGAEYPLLNRLDTTQAERRHSKYYGKGSIVVPEVDYKNGLQRGVQYVVLDTGPGNKLTVRGPEGETLQFTPARC
TKLSVYSVERTELAPGDQVKITRNDAEKDLANGDRFVVKEIHSDRVVLEGDKGRQVELDTASAMFVGLAYASTVHSAQGL
TCDKVLINLETQSRTTAKDVYYVAISRARHAAEIFTDNRQKLGAAISRLNAKAGALDIKQLQRHALERKGHDQAGKQKDA
AQKQQQGQQLPTKQPKEKGRAFGL
print("pfam\tprotein\tdomain_number\thit_included\tdomain_included")
with pyhmmer.easel.SequenceFile("test.faa", digital=True) as seq_file:
seqs = seq_file.read_block()
with pyhmmer.plan7.HMMFile("Pfam-A.hmm") as hmm_file:
for hits in pyhmmer.hmmsearch(hmm_file, seqs, bit_cutoffs="gathering"):
for hit in hits:
for i, domain in enumerate(hit.domains, 1):
print(
f"{hit.name.decode()}\t{hits.query_accession.decode()}\t"
f"{i}\t{hit.included}\t{domain.included}"
)
Output:
pfam protein domain_number hit_included domain_included
A0A010SHV5 PF13245.10 1 True True
A0A010SHV5 PF13245.10 2 True False
A0A010SHV5 PF13604.10 1 True True
A0A010SHV5 PF08751.15 1 True True
A0A010SHV5 PF08751.15 2 True False
A0A010SHV5 PF13538.10 1 True False
A0A010SHV5 PF01443.22 1 True True
A0A010SHV5 PF01443.22 2 True True
As you can see, PF13538.10
has a single domain that was not included, but the hit is. Why is that?
Also interested here!
Maybe useful - the domain.score (26.71) for PF1538.10
seems to be slightly deflated relative to the hit.score (27.91), with the gathering threshold for the profile being 27.4.
This behavior is consistent with using HMMER3 though, where similarly domain.score of 26.7 and the full sequence score is 27.9.
hmmsearch -E 10.0 --domtblout test_hmmer3.txt Pfam-A.hmm test.faa
Results in test_hmmer3.txt:
# --- full sequence --- -------------- this domain ------------- hmm coord ali coord env coord
# target name accession tlen query name accession qlen E-value score bias # of c-Evalue i-Evalue score bias from to from to from to acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
A0A010SHV5 - 984 AAA_19 PF13245.10 135 3.4e-12 32.9 0.8 1 2 1e-11 1e-11 31.3 0.3 8 130 508 618 499 623 0.89 1-984
A0A010SHV5 - 984 AAA_19 PF13245.10 135 3.4e-12 32.9 0.8 2 2 0.79 0.79 -3.9 0.0 32 62 687 718 684 731 0.70 1-984
A0A010SHV5 - 984 AAA_30 PF13604.10 191 1.5e-48 151.2 5.0 1 1 4.3e-48 4.3e-48 149.7 5.0 2 187 495 677 494 680 0.95 1-984
A0A010SHV5 - 984 DUF2563 PF10817.12 104 1.1e-05 12.3 1.8 1 2 0.74 0.74 -3.2 0.1 40 59 106 125 99 142 0.57 1-984
A0A010SHV5 - 984 DUF2563 PF10817.12 104 1.1e-05 12.3 1.8 2 2 2e-05 2e-05 11.4 0.3 35 97 861 928 850 935 0.79 1-984
A0A010SHV5 - 984 Herpes_Helicase PF02689.18 808 1.9e-07 15.4 0.1 1 2 0.036 0.036 -2.0 0.0 171 206 556 591 549 597 0.71 1-984
A0A010SHV5 - 984 Herpes_Helicase PF02689.18 808 1.9e-07 15.4 0.1 2 2 5.6e-07 5.6e-07 13.9 0.1 731 777 870 917 847 927 0.83 1-984
A0A010SHV5 - 984 PhoH PF02562.20 205 1.9e-09 23.4 1.3 1 3 2.2e-05 2.2e-05 10.2 0.1 6 39 497 531 493 543 0.77 1-984
A0A010SHV5 - 984 PhoH PF02562.20 205 1.9e-09 23.4 1.3 2 3 1.1e-05 1.1e-05 11.2 0.1 120 162 579 622 561 646 0.85 1-984
A0A010SHV5 - 984 PhoH PF02562.20 205 1.9e-09 23.4 1.3 3 3 0.28 0.28 -3.3 0.0 81 126 731 776 718 777 0.83 1-984
A0A010SHV5 - 984 PIF1 PF05970.18 223 2.3e-06 13.4 0.3 1 2 2e-05 2e-05 10.4 0.1 5 66 498 557 494 587 0.74 1-984
A0A010SHV5 - 984 PIF1 PF05970.18 223 2.3e-06 13.4 0.3 2 2 0.023 0.023 0.4 0.0 131 174 597 640 590 675 0.73 1-984
A0A010SHV5 - 984 SH3_13 PF18335.5 65 4.5e-07 15.9 0.0 1 1 1.4e-06 1.4e-06 14.3 0.0 23 57 809 844 794 852 0.79 1-984
A0A010SHV5 - 984 TrwC PF08751.15 283 9.4e-91 290.3 5.0 1 2 9.4e-91 9.4e-91 290.3 5.0 3 282 14 292 12 293 0.95 1-984
A0A010SHV5 - 984 TrwC PF08751.15 283 9.4e-91 290.3 5.0 2 2 0.072 0.072 -1.3 1.3 194 231 915 953 891 981 0.55 1-984
A0A010SHV5 - 984 UvrD_C_2 PF13538.10 52 8.5e-11 27.9 0.3 1 1 2e-10 2e-10 26.7 0.3 1 48 868 911 868 914 0.89 1-984
A0A010SHV5 - 984 Viral_helicase1 PF01443.22 228 2.4e-17 49.6 0.5 1 2 1e-08 1e-08 21.4 0.0 3 102 516 623 514 690 0.80 1-984
A0A010SHV5 - 984 Viral_helicase1 PF01443.22 228 2.4e-17 49.6 0.5 2 2 4.1e-10 4.1e-10 25.9 0.2 160 228 852 916 791 916 0.62 1-984
This is more of a question than a bug.
My understanding is that when the user set a cutoff, PyHMMER will not report hits that would have
hit.included == False
. However, the same is not applied at the domain level, so there are cases where the hit is reported because one of the domains passes the cutoffs, but domains that do not satisfy the cutoffs are also reported.For example:
Generates this output:
To be honest, I don't really know what determines if a domain is included or not, given that the cutoffs are applied to the hit. I expected that when a cutoff is applied, domains were also filtered.