althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License
125 stars 12 forks source link

[Bug] Pickling and unpickling an HMM object changes the cutoff values #67

Closed jolespin closed 6 months ago

jolespin commented 6 months ago

It took me a while to figure out what was happening or how to get a minimum reproducible example but I think I finally have one.

import pyhmmer
from pyhmmer.plan7 import HMMFile
from pyhmmer.easel import SequenceFile
from pyhmmer import hmmsearch
import sys, os, gzip,  pickle

# Versions
sys.version
# '3.9.18 | packaged by conda-forge | (main, Dec 23 2023, 16:36:46) \n[Clang 16.0.6 ]'

print(pyhmmer.__version__)
# 0.10.11

# Load in HMMs
name_to_hmm = dict()
with HMMFile("/Users/jolespin/Databases/Pfam/Pfam-A.hmm.gz") as hmm_file:
    for hmm in list(hmm_file):
        name_to_hmm[hmm.accession.decode()] = hmm

# Before pickling
query_hmm = name_to_hmm["PF09847.13"]
print(query_hmm.cutoffs.gathering)
# (33.20000076293945, 33.20000076293945)

# Write pickle
with gzip.open("PF09847.13.pkl.gz", "wb") as f:
    pickle.dump(query_hmm, f)

# Read pickle
with gzip.open("PF09847.13.pkl.gz", "rb") as f:
    query_hmm_reloaded = pickle.load(f)

# After pickling
print(query_hmm_reloaded.cutoffs.gathering)
# (0.07795137166976929, 0.010516392067074776)

For some reason, when I'm pickling my HMMs the cutoffs change and I'm not sure exactly why or how this could happen.

Can you give it a try on your system?

I'm just using Pfam-A.hmm.gz from https://www.ebi.ac.uk/interpro/download/Pfam/

Let me know if I can provide anymore context.

althonos commented 6 months ago

Hi @jolespin, thanks for the code to reproduce, I'm getting the same issue on my local system. What's strange is that the pickling of the cutoff is tested (I'm using PF02826, and there it works), but indeed for PF09847 the cutoffs get messed up. I'll have to figure out what's going on :sweat:

althonos commented 6 months ago

The culprits:

https://github.com/althonos/pyhmmer/blob/c5234e58fe0e0d6065073467fc728299eab448a4/pyhmmer/plan7.pyx#L2528-L2534

In the last memcpy, there is a typo, the model compositions are actually copied into the cutoffs, and overwrites the actual cutoffs. This probably wasn't an issue with PF02826 because it doesn't have a model composition set.

althonos commented 6 months ago

Fixed in v0.10.12.

jolespin commented 6 months ago

Excellent! Thank you for handling this so quickly. I was sitting with it for a day or two trying to figure out if I was making an obvious mistake. Glad the deep dive wasn't for nothing and could help with your development.

Took a look at your PR, why does the switch from t2pks and RREFam fix the pickling issue (IIRC the .h3m are one of the indexed HMM files)?

"with pyhmmer.plan7.HMMFile(\"data/hmms/bin/t2pks.h3m\") as hmms:\n",
"with pyhmmer.plan7.HMMFile(\"data/hmms/bin/RREFam.h3m\") as hmms:\n",
althonos commented 6 months ago

Ah no, the switch is for something different, I just wanted to reduce the size of the test data (mostly because I'm starting to hit the PyPI.org project size limit given that the wheels are getting big and I have 30+ wheels per release) so I used smaller HMMs.

The fix is just in 7b1ffb8 :smile: