althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License
128 stars 12 forks source link

Apply cutoffs for domains #65

Open apcamargo opened 7 months ago

apcamargo commented 7 months ago

This is more of a question than a bug.

My understanding is that when the user set a cutoff, PyHMMER will not report hits that would have hit.included == False. However, the same is not applied at the domain level, so there are cases where the hit is reported because one of the domains passes the cutoffs, but domains that do not satisfy the cutoffs are also reported.

For example:

>P03865 1-315
MATTKLGNTKSASRAINYAEKRAEEKSGLNCDVDYAKSAFKQTRALYGKEDGIQAHTVIQSFKPGEVTPEQCNQLGLELA
EKIAPNHQVAVYTHTDKDHYHNHIVINSVDLETGKKYQSNKKQRDLVKKENDNICREHGLSVTERGIAKMRYTQAEKGIV
FDRDEYSWKDELRDLIENAKTHTSNLETFSEHLEEKGVGVKLRGETISYKPENENKWVRGRTLGSEYEKGAIDHEHERHQ
KQQREPEYADEFKINWDAVEQHTEQLKQRRVERAQETKQAHSKISSRDTRESENQRERAKGNNIRIERGDEGLSR
with open("hmmsearch_plasmid_hallmarks.tsv", "w") as fo:
    fo.write("query\thit\tstart\tend\tevalue\tdomain_score\tdomain_included\n")

    for p in Path(".").glob("*.faa"):
        with pyhmmer.easel.SequenceFile(p, digital=True) as seq_file:
            seqs = seq_file.read_block()

        with pyhmmer.plan7.HMMFile("Pfam-A.hmm") as hmm_file:
            for hits in pyhmmer.hmmsearch(hmm_file, seqs, bit_cutoffs="gathering"):
                for hit in hits:
                    for domain in hit.domains:
                        fout.write(
                            f"{hits.query_accession.decode()}\t"
                            f"{hit.name.decode()}\t"
                            f"{domain.env_from}\t"
                            f"{domain.env_to}\t"
                            f"{hit.evalue:.2E}\t"
                            f"{domain.score:.2f}\t"
                            f"{domain.included}\n"
                        )

Generates this output:

query        hit      start   end   evalue     domain_score   domain_included
PF03432.18   P03865   8       239   3.25E-90   288.01         True
PF03432.18   P03865   240     314   3.25E-90   -0.23          False

To be honest, I don't really know what determines if a domain is included or not, given that the cutoffs are applied to the hit. I expected that when a cutoff is applied, domains were also filtered.

althonos commented 7 months ago

A bit of background before I give you the easy answer 😁

So there are two kind of thresholds in HMMER, reporting and inclusion. When you use bitscore cutoffs with the pipeline, you actually change the inclusion thresholds, not the reporting thresholds. That's why you sometimes get hits that are not "included", but all hits should have hit.reported == True (otherwise they are not kept in the TopHits, which as its name suggests, only keep the top hits, not all hits).

The reason why there are different thresholds for hits and domains is to account for multi-domain hits, where sometimes a hit passes the inclusion thresholds because the combination of domains do, however individual domains do not pass the threshold (this is reported in HMMER's output, sometimes you'll see "no domain above threshold" or something like that).

Now in your example, to only iterate on included hits, and for each hit on included domains, you can use the included property to avoid a cascade of if/else:

for hit in hits.included:
    for domain in hit.domains.included:
        # ...
apcamargo commented 7 months ago

Thanks for the explanation, Martin! That makes sense.

However, I've seem at least a case where the hit was reported, but it had a single domain that wasn't. If having multiple domains is what makes hit with unreported domains being reported, how can a hit with a single unreported domain be reported?

Here's an example:

>A0A010SHV5 1-984
MLDITTISRQSLGKVVSYYADGADDYYAKDGGAMQWQGAGAEALGLSGEVEQARFRELLDGRISDSTKLMRTVKEADGKV
VSKERLGYDLTFSAPKGVSLQALVHGDASIIEAHDKAVAAAIREAERLSQARITVNKKTGTENTNNLVVAKFRHETSRAL
DPDLHTHAFVLNMTQRSDGEWRALKNDGVFNSSMFLGNVYKAELARELEKAGFQLRYERNGTFDLAHFSDEQIREFSSRS
QQIEAALAAKGLDRSTASYAEKNQAALATRDKKQGGIDREELRQVWLERSRALGIDYHSREWAGVGADAQGGRERNSAAT
PQIEKPLEYRADQVIEFAIKSLTERQAVIGQKELMDTALRHGYGALTIDDVRAGIERRAASGHLIREEPLYASQNPADGK
KGKAAEKARQEAPQLSRKEWVANLVRAGKSRSEAARLVDEGIRTGRLRQGENRFTTHIAQKREREVLQIERMGRGTVEPR
ISKEAAEAFLADRGLKAEQQASVMRIARTQNQFIGVQGFAGVGKSYMTVAAKDLLEANGYRVTSLAPYGSQVKALQAEGL
EARTLQSFLKARDKKIDSNTVVFIDEAGVIPARQMHEAMKTIEAAGARVVFLGDVSQTKAIEAGKPFEQLMKAGMETSRL
TDIQRQKDPQLLEAVKLAAEGKAKQSLPLVNEIHEIKEDGARYQAIVDAYASMPKAERDQALIITGTNASRIEINEGVRE
ALGLKGQGAEYPLLNRLDTTQAERRHSKYYGKGSIVVPEVDYKNGLQRGVQYVVLDTGPGNKLTVRGPEGETLQFTPARC
TKLSVYSVERTELAPGDQVKITRNDAEKDLANGDRFVVKEIHSDRVVLEGDKGRQVELDTASAMFVGLAYASTVHSAQGL
TCDKVLINLETQSRTTAKDVYYVAISRARHAAEIFTDNRQKLGAAISRLNAKAGALDIKQLQRHALERKGHDQAGKQKDA
AQKQQQGQQLPTKQPKEKGRAFGL
print("pfam\tprotein\tdomain_number\thit_included\tdomain_included")
with pyhmmer.easel.SequenceFile("test.faa", digital=True) as seq_file:
    seqs = seq_file.read_block()
    with pyhmmer.plan7.HMMFile("Pfam-A.hmm") as hmm_file:
        for hits in pyhmmer.hmmsearch(hmm_file, seqs, bit_cutoffs="gathering"):
            for hit in hits:
                for i, domain in enumerate(hit.domains, 1):
                    print(
                        f"{hit.name.decode()}\t{hits.query_accession.decode()}\t"
                        f"{i}\t{hit.included}\t{domain.included}"
                    )

Output:

pfam          protein       domain_number    hit_included    domain_included
A0A010SHV5    PF13245.10                1            True               True
A0A010SHV5    PF13245.10                2            True              False
A0A010SHV5    PF13604.10                1            True               True
A0A010SHV5    PF08751.15                1            True               True
A0A010SHV5    PF08751.15                2            True              False
A0A010SHV5    PF13538.10                1            True              False
A0A010SHV5    PF01443.22                1            True               True
A0A010SHV5    PF01443.22                2            True               True

As you can see, PF13538.10 has a single domain that was not included, but the hit is. Why is that?

raufs commented 3 weeks ago

Also interested here!

Maybe useful - the domain.score (26.71) for PF1538.10 seems to be slightly deflated relative to the hit.score (27.91), with the gathering threshold for the profile being 27.4.

This behavior is consistent with using HMMER3 though, where similarly domain.score of 26.7 and the full sequence score is 27.9.

hmmsearch -E 10.0 --domtblout test_hmmer3.txt Pfam-A.hmm test.faa

Results in test_hmmer3.txt:

#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
A0A010SHV5           -            984 AAA_19               PF13245.10   135   3.4e-12   32.9   0.8   1   2     1e-11     1e-11   31.3   0.3     8   130   508   618   499   623 0.89 1-984
A0A010SHV5           -            984 AAA_19               PF13245.10   135   3.4e-12   32.9   0.8   2   2      0.79      0.79   -3.9   0.0    32    62   687   718   684   731 0.70 1-984
A0A010SHV5           -            984 AAA_30               PF13604.10   191   1.5e-48  151.2   5.0   1   1   4.3e-48   4.3e-48  149.7   5.0     2   187   495   677   494   680 0.95 1-984
A0A010SHV5           -            984 DUF2563              PF10817.12   104   1.1e-05   12.3   1.8   1   2      0.74      0.74   -3.2   0.1    40    59   106   125    99   142 0.57 1-984
A0A010SHV5           -            984 DUF2563              PF10817.12   104   1.1e-05   12.3   1.8   2   2     2e-05     2e-05   11.4   0.3    35    97   861   928   850   935 0.79 1-984
A0A010SHV5           -            984 Herpes_Helicase      PF02689.18   808   1.9e-07   15.4   0.1   1   2     0.036     0.036   -2.0   0.0   171   206   556   591   549   597 0.71 1-984
A0A010SHV5           -            984 Herpes_Helicase      PF02689.18   808   1.9e-07   15.4   0.1   2   2   5.6e-07   5.6e-07   13.9   0.1   731   777   870   917   847   927 0.83 1-984
A0A010SHV5           -            984 PhoH                 PF02562.20   205   1.9e-09   23.4   1.3   1   3   2.2e-05   2.2e-05   10.2   0.1     6    39   497   531   493   543 0.77 1-984
A0A010SHV5           -            984 PhoH                 PF02562.20   205   1.9e-09   23.4   1.3   2   3   1.1e-05   1.1e-05   11.2   0.1   120   162   579   622   561   646 0.85 1-984
A0A010SHV5           -            984 PhoH                 PF02562.20   205   1.9e-09   23.4   1.3   3   3      0.28      0.28   -3.3   0.0    81   126   731   776   718   777 0.83 1-984
A0A010SHV5           -            984 PIF1                 PF05970.18   223   2.3e-06   13.4   0.3   1   2     2e-05     2e-05   10.4   0.1     5    66   498   557   494   587 0.74 1-984
A0A010SHV5           -            984 PIF1                 PF05970.18   223   2.3e-06   13.4   0.3   2   2     0.023     0.023    0.4   0.0   131   174   597   640   590   675 0.73 1-984
A0A010SHV5           -            984 SH3_13               PF18335.5     65   4.5e-07   15.9   0.0   1   1   1.4e-06   1.4e-06   14.3   0.0    23    57   809   844   794   852 0.79 1-984
A0A010SHV5           -            984 TrwC                 PF08751.15   283   9.4e-91  290.3   5.0   1   2   9.4e-91   9.4e-91  290.3   5.0     3   282    14   292    12   293 0.95 1-984
A0A010SHV5           -            984 TrwC                 PF08751.15   283   9.4e-91  290.3   5.0   2   2     0.072     0.072   -1.3   1.3   194   231   915   953   891   981 0.55 1-984
A0A010SHV5           -            984 UvrD_C_2             PF13538.10    52   8.5e-11   27.9   0.3   1   1     2e-10     2e-10   26.7   0.3     1    48   868   911   868   914 0.89 1-984
A0A010SHV5           -            984 Viral_helicase1      PF01443.22   228   2.4e-17   49.6   0.5   1   2     1e-08     1e-08   21.4   0.0     3   102   516   623   514   690 0.80 1-984
A0A010SHV5           -            984 Viral_helicase1      PF01443.22   228   2.4e-17   49.6   0.5   2   2   4.1e-10   4.1e-10   25.9   0.2   160   228   852   916   791   916 0.62 1-984