althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License
128 stars 12 forks source link

De-increment nincluded/nreported after merge #47

Closed zdk123 closed 1 year ago

zdk123 commented 1 year ago

Stub of a fix for #46. It would probably be better if there was a true fix in libhmmer.

Does this look safe to do? I can write a formal test if so.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.12% :tada:

Comparison is base (43e7d76) 76.52% compared to head (0d85341) 76.65%. Report is 2 commits behind head on master.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #47 +/- ## ========================================== + Coverage 76.52% 76.65% +0.12% ========================================== Files 7 7 Lines 6965 6968 +3 ========================================== + Hits 5330 5341 +11 + Misses 1635 1627 -8 ``` | Flag | Coverage Δ | | |---|---|---| | v3.10 | `76.65% <100.00%> (+0.12%)` | :arrow_up: | | v3.11 | `76.65% <100.00%> (+0.12%)` | :arrow_up: | | v3.7 | `76.62% <100.00%> (+0.12%)` | :arrow_up: | | v3.8 | `76.65% <100.00%> (+0.12%)` | :arrow_up: | | v3.9 | `76.65% <100.00%> (+0.12%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Martin+Larralde#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files Changed](https://app.codecov.io/gh/althonos/pyhmmer/pull/47?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Martin+Larralde) | Coverage Δ | | |---|---|---| | [pyhmmer/plan7.pyx](https://app.codecov.io/gh/althonos/pyhmmer/pull/47?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Martin+Larralde#diff-cHlobW1lci9wbGFuNy5weXg=) | `73.87% <100.00%> (+0.26%)` | :arrow_up: |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

althonos commented 1 year ago

You can actually set nincluded and nreported to zero for all hits before calling p7_tophits_Threshold, I think the threshold will set the counts to the right number on its own.

zdk123 commented 1 year ago

Close this in favor of linking to the fixed hmmer library?

althonos commented 1 year ago

Nah, until a new numbered release of HMMER3 I'm not gonna bump the local version, so a temporary patch is welcome. It would be great If you can update the code as suggested above, and add a reminder to remove the patch later, similar to: https://github.com/althonos/pyhmmer/blob/e8f70f813ab43bbb012de78284cb35bc9cc7e7f7/pyhmmer/plan7.pyx#L7881-L7882

Thanks :pray:

zdk123 commented 1 year ago

@althonos Follow on bug and a question for you.

I am attempting to write a test but realize that users don't have read access to tophits._th.hit - this only gets exposed by the internal hmmer domain table writer.

I thought instead to expose nincluded/nreported for all the hits in the state dictionary.

This okay to add here?

        for i in range(self._th.N):
            offset = (<ptrdiff_t> self._th.hit[i] - <ptrdiff_t> &self._th.unsrt[0]) // sizeof(P7_HIT)
            hits.append(offset)
            hits_nreported.append(self._th.hit[i].nreported)
            hits_nincluded.append(self._th.hit[i].nincluded)

I've done this locally in a branch and found that, actually, to my surprise that nreported/nincluded might not being set for each hit after re-serializing in TopHits.__setstate__. I think we need the following addition here:

        for i, offset in enumerate(state["hit"]):
            self._th.hit[i] = &self._th.unsrt[offset]
            self._th.hit[i].nreported = state["hits_nreported"][i]
            self._th.hit[i].nincluded = state["hits_nincluded"][i]

Is this desirable to have or do you want to have another way to return/set this data other than the state dictionary?

althonos commented 1 year ago

You can get the nreported and nincluded using len(hit.domains.included) and len(hit.domains.reported) because I made these properties return a sized iterator :relaxed: These numbers should be read-only because the TopHits instances are immutable on the Python side, ignoring the option to sort the hits.

That's indeed a problem if the attributes are not reset after a __setstate__.

althonos commented 1 year ago

I've done this locally in a branch and found that, actually, to my surprise that nreported/nincluded might not being set for each hit after re-serializing in TopHits.setstate. I think we need the following addition here:

Actually that should be handle by p7_hit_Deserialize for each hit I think! So no need to do it in the Cython code.

zdk123 commented 1 year ago

Looks like my updated test caught another bug on the number of included domains, when merging with an empty TopHit object (and maybe more generally?)

from my branch:

from pyhmmer.plan7 import HMMFile, Pipeline, TopHits
from pyhmmer.easel import SequenceFile

thioesterase = HMMFile("pyhmmer/tests/data/hmms/db/Thioesterase.hmm").read()

with SequenceFile('pyhmmer/tests/data/seqs/938293.PRJEB85.HG003687.faa', digital=True) as seqfile:
    proteins = seqfile.read_block()

pli = Pipeline(thioesterase.alphabet)

hits = pli.search_hmm(thioesterase, proteins)

[len(hit.domains.reported) for hit in hits]
# 1
[len(hit.domains.included) for hit in hits]
# 0

empty = TopHits()
empty_merged = empty.merge(hits)
[len(hit.domains.reported) for hit in empty_merged]
# 1
[len(hit.domains.included) for hit in empty_merged]
# 1

It looks like like empty TopHit gets initialized with inclusion thresholds of domain score with something other than the Pipeline defaults:

empty.incdomT
# 0.0
empty_merged.incdomT
# 0.0
hits.incdomT
# None

I understand this is extremely edge case and probably could just remove the test, but noting it here because it looks like merging TopHit objects with different inclusion/reporting cutoffs isn't being explicitly handled and maybe that should raise an Error?

althonos commented 1 year ago

Good catch! Merging TopHits with different parameters should indeed be disallowed. But I'm thinking how to fix this when TopHits are created from the Python constructor, I can't think of a simple solution...

althonos commented 1 year ago

Okay, thanks a lot :+1: I'm gonna merge as-is, then try to figure out a way to fix the second issue you raised :)