Closed zdk123 closed 1 year ago
Patch coverage: 100.00%
and project coverage change: +0.12%
:tada:
Comparison is base (
43e7d76
) 76.52% compared to head (0d85341
) 76.65%. Report is 2 commits behind head on master.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
You can actually set nincluded
and nreported
to zero for all hits before calling p7_tophits_Threshold
, I think the threshold will set the counts to the right number on its own.
Close this in favor of linking to the fixed hmmer library?
Nah, until a new numbered release of HMMER3 I'm not gonna bump the local version, so a temporary patch is welcome. It would be great If you can update the code as suggested above, and add a reminder to remove the patch later, similar to: https://github.com/althonos/pyhmmer/blob/e8f70f813ab43bbb012de78284cb35bc9cc7e7f7/pyhmmer/plan7.pyx#L7881-L7882
Thanks :pray:
@althonos Follow on bug and a question for you.
I am attempting to write a test but realize that users don't have read access to tophits._th.hit
- this only gets exposed by the internal hmmer domain table writer.
I thought instead to expose nincluded/nreported for all the hits in the state dictionary.
This okay to add here?
for i in range(self._th.N):
offset = (<ptrdiff_t> self._th.hit[i] - <ptrdiff_t> &self._th.unsrt[0]) // sizeof(P7_HIT)
hits.append(offset)
hits_nreported.append(self._th.hit[i].nreported)
hits_nincluded.append(self._th.hit[i].nincluded)
I've done this locally in a branch and found that, actually, to my surprise that nreported
/nincluded
might not being set for each hit after re-serializing in TopHits.__setstate__
. I think we need the following addition here:
for i, offset in enumerate(state["hit"]):
self._th.hit[i] = &self._th.unsrt[offset]
self._th.hit[i].nreported = state["hits_nreported"][i]
self._th.hit[i].nincluded = state["hits_nincluded"][i]
Is this desirable to have or do you want to have another way to return/set this data other than the state dictionary?
You can get the nreported
and nincluded
using len(hit.domains.included)
and len(hit.domains.reported)
because I made these properties return a sized iterator :relaxed: These numbers should be read-only because the TopHits
instances are immutable on the Python side, ignoring the option to sort the hits.
That's indeed a problem if the attributes are not reset after a __setstate__
.
I've done this locally in a branch and found that, actually, to my surprise that nreported/nincluded might not being set for each hit after re-serializing in TopHits.setstate. I think we need the following addition here:
Actually that should be handle by p7_hit_Deserialize
for each hit I think! So no need to do it in the Cython code.
Looks like my updated test caught another bug on the number of included domains, when merging with an empty TopHit object (and maybe more generally?)
from my branch:
from pyhmmer.plan7 import HMMFile, Pipeline, TopHits
from pyhmmer.easel import SequenceFile
thioesterase = HMMFile("pyhmmer/tests/data/hmms/db/Thioesterase.hmm").read()
with SequenceFile('pyhmmer/tests/data/seqs/938293.PRJEB85.HG003687.faa', digital=True) as seqfile:
proteins = seqfile.read_block()
pli = Pipeline(thioesterase.alphabet)
hits = pli.search_hmm(thioesterase, proteins)
[len(hit.domains.reported) for hit in hits]
# 1
[len(hit.domains.included) for hit in hits]
# 0
empty = TopHits()
empty_merged = empty.merge(hits)
[len(hit.domains.reported) for hit in empty_merged]
# 1
[len(hit.domains.included) for hit in empty_merged]
# 1
It looks like like empty TopHit gets initialized with inclusion thresholds of domain score with something other than the Pipeline defaults:
empty.incdomT
# 0.0
empty_merged.incdomT
# 0.0
hits.incdomT
# None
I understand this is extremely edge case and probably could just remove the test, but noting it here because it looks like merging TopHit objects with different inclusion/reporting cutoffs isn't being explicitly handled and maybe that should raise an Error?
Good catch! Merging TopHits with different parameters should indeed be disallowed. But I'm thinking how to fix this when TopHits are created from the Python constructor, I can't think of a simple solution...
Okay, thanks a lot :+1: I'm gonna merge as-is, then try to figure out a way to fix the second issue you raised :)
Stub of a fix for #46. It would probably be better if there was a true fix in libhmmer.
Does this look safe to do? I can write a formal test if so.