althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License

Get AllHits instead of TopHits #77

Open jpjarnoux opened 2 months ago

jpjarnoux commented 2 months ago

Hi!

I suggest adding the possibility of getting all hits instead of only the TopHits. As you explained in other issues (#65 and #66), not all hits are reported. So when I compare the results with hmmsearch, some hits reported in its domtblout are missing from the domtblout produced by pyhmmer.hmmsearch.

The best way to handle this would be an argument in pyhmmer.hmmsearch that sets the number of hits to report. If it defaults to None, the current behaviour is kept; if it is set to 0, all hits are reported.

Let me know if I have understood correctly how it works.

althonos commented 1 month ago

The problem is that HMMER itself does not report all hits, in order to save space; PyHMMER can only give you the top hits because that is all it gets from the internal HMMER pipeline. By default, HMMER and PyHMMER use a reporting threshold of E=10, so you get all significant hits plus roughly 10 expected false positives.

I think the difference in E-value computation discussed in #75 may also be the source of the different number of reported hits you're getting. In theory you should get the same number of hits from HMMER and PyHMMER (I'm testing for that in the unit tests); however, if the E-values are not computed the same way, some hits may end up above the threshold and not get reported.
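To illustrate how an E-value difference can change the number of reported hits, here is a minimal sketch in plain Python. It does not call PyHMMER; it only assumes the usual HMMER convention that a hit's E-value is its P-value multiplied by the number of target sequences Z, and that a hit is reported when its E-value is at most the threshold E. The `Hit` type, the `reported_hits` helper, and the P-values are made up for the example.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical stand-in for a search hit (not a PyHMMER class)."""
    name: str
    pvalue: float


def reported_hits(hits: List[Hit], Z: int, E: float = 10.0) -> List[str]:
    """Return the names of hits whose E-value (pvalue * Z) is at most E.

    Assumption: E-value = P-value * Z, with Z the number of target sequences.
    """
    return [hit.name for hit in hits if hit.pvalue * Z <= E]


# Illustrative P-values only; they do not come from a real search.
hits = [Hit("seqA", 1e-12), Hit("seqB", 2e-4), Hit("seqC", 5e-3)]

# Same hits and same threshold, but two different database sizes.
small_db = reported_hits(hits, Z=1_000)
large_db = reported_hits(hits, Z=100_000)

print(small_db)
print(large_db)
```

With the same hits and the same E=10 threshold, the smaller Z keeps all three hits while the larger Z keeps only the strongest one, so two tools that disagree on Z (or on any other part of the E-value computation) can report different hit counts for identical input.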

jpjarnoux commented 1 month ago

Hi!

Yes, I found out that it is not possible in HMMER either. The problem came from my database, so there is no problem with pyhmmer here.

Thanks