althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License
128 stars 12 forks source link

[Feature Request] Internally assigning Z for ```HMM``` class #57

Open erfanshekarriz opened 11 months ago

erfanshekarriz commented 11 months ago

Hello!

Thank you for the awesome and much-needed package.

I wanted to ask if there could be an internal way to determine the Z value or if it can be stored as an internal attribute of the plan7. HMMFile class. For example :

from pyhmmer.plan7 import HMMFile
from pyhmmer.plan7 import SequenceFile

with HMMFile("path/to/hmm/db.hmm") as hmm_fiel:
   hmms = hmm_file.read()
   Z_val = hmms.Z 

   with SequenceFile(proteins, digital=True) as seqs:
      prots = seq_file.read_block()
      pli = Pipeline(hmms.alphabet, Z=Z_val, bit_cutoffs="gathering")
      hits = pli.search_hmm(hmms, prots)

Sometimes I am working with hmm databases where I don't know the z value in advance. I usually find out using

z_score=$(grep -c "NAME" ${input.db})

but it would be nice if it could be easily internally handled!

Best,

Erfan

erfanshekarriz commented 11 months ago

I've found an easy workaround with:

 with HMMFile(hmmdb) as hmm_file:
   hmms = list(hmm_file)
   Z_val = len(hmms)

but I'll still keep the issue open since an attribute for the class might make it more intuitive.

Also, correct me if I'm wrong, but can we pass the Z_val and bit_cuttoffs to pyhmmer.hmmer.hmmsearch as follow:

pyhmmer.hmmer.hmmsearch(hmms, prots, cpus=cores_n, Z=Z_val, bit_cutoffs="gatherings")

Not sure because I couldn't find it anywhere in the documentation. I could only find inputs of Z and bit_cutoff in Pipeline class but would be nice to also include in the HMMER class options too.

Thanks!

althonos commented 11 months ago

I wanted to ask if there could be an internal way to determine the Z value or if it can be stored as an internal attribute of the plan7. HMMFile class.

Unfortunately I this cannot be done in PyHMMER, because I'm wrapping the original HMMER implementation and HMM file format, so I cannot change what is getting serialized and stored in there. I would recommend that you just store the Z value somewhere else if you can control the HMMs beings used (such as versioned Pfam).

I've found an easy workaround ...

The big problem here is that you load all HMMs into a list when you're only interested in counting them :wink: If you need to pre-compute the Z value, consider:

 with HMMFile(hmmdb) as hmm_file:
   Z = sum(1 for hmm in hmm_file)
althonos commented 11 months ago

Oh and regarding this:

Not sure because I couldn't find it anywhere in the documentation. I could only find inputs of Z and bit_cutoff in Pipeline class but would be nice to also include in the HMMER class options too.

The hmmsearch documentation states that:

Any additional arguments passed to the hmmsearch function will be passed transparently to the Pipeline to be created.

Which is what happens there; by passing additional keyword arguments, you configure the Pipeline and make it use the provided Z and bit_cutoffs arguments.