althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License
129 stars 12 forks source link

Provide the description of an HMM #76

Closed oschwengers closed 1 month ago

oschwengers commented 2 months ago

Hi @althonos and many thanks for maintaining and pushing forward this wonderful project!

I just started to use PyHMMER in Bakta for user-provided HMMs. Here, I'd like to provide the users with an option to feed some extra information on each HMM into the pipeline via a custom format for the description field in the HMM file.

However, there seems no such field in the API of the TopHits object. For NAME, ACC and LENG there are TopHits.query_name, TopHits.query_accession and TopHits.query_length, respectively.

Hence, could you maybe add TopHits.query_description for the related DESC?

Thanks again!

PS: Just in case I overlooked something, could please point me to the right API object?

jpjarnoux commented 2 months ago

Hi!

I checked and query_length seems to already be in the object. https://pyhmmer.readthedocs.io/en/stable/api/plan7/results.html#pyhmmer.plan7.TopHits.query_length

top_hits = hmmsearch(**kwargs)
query_length = top_hits.query_length

Hope it helps you.

oschwengers commented 2 months ago

Thanks @jpjarnoux for the prompt reply. However, there was a typo (wrongly copy-pasted) in the text above. Instead of TopHits.query_length (which indeed already exists), I meant TopHits.query_description for the related DESC field of the HMM files.

I amended my initial query above.

jpjarnoux commented 2 months ago

Ok, so a way to do what you want is:

top_hits = hmmsearch(**kwargs)
hit = next(top_hits)
query_desc = hit.query_desc

I have not tested this, but I think it works.

Obviously, an attribute could be really useful here.

oschwengers commented 1 month ago

@jpjarnoux I've just tested it. There is no such attribute hit.query_desc. Can you point me to an entry in the docs: https://pyhmmer.readthedocs.io/en/stable/api/index.html ?

jpjarnoux commented 1 month ago

Hi! Sorry about that I made a confusion somehow. If you want the description of the target https://pyhmmer.readthedocs.io/en/stable/api/plan7/results.html#pyhmmer.plan7.Hit.description :

top_hits = hmmsearch(**kwargs)
hit = next(top_hits)
target_desc = hit.description

This is not possible for the query. I swear I saw something, but I must have been confused with another thing.

oschwengers commented 1 month ago

Thanks @jpjarnoux for the clarification.

Unfortunately, I also haven't seen anything close to the query description (the DESC line in the HMM files). However, for Bakta we need exactly this field to provide users with an option to feed-in additional information into the workflow via a customized format, just like what, for example, Prokka does (https://github.com/tseemann/prokka?tab=readme-ov-file#fasta-database-format, using the original hmmsearch).

@althonos I'd love to use Pyhmmer for this in Bakta. Hence, would it be possible to add such a query description field TopHits.query_description to the API? I haven't looked into the code, but I could image that it should not be that much of a deal, but a huge advantage for many users.

althonos commented 1 month ago

Hi Oliver,

It depends how you are calling the pipeline, but by design the hmmsearch function yields the hits in the same order the HMMs are passed. This means that with the right use of zip and itertools.tee you can get an iterator on both the query HMM and the TopHits:

queries = HMMFile("...")
hmms1, hmms2 = itertools.tee(queries)
for hmm, hits in zip(hmms1, hmmsearch(hmms2, target_sequences)):
   print(hmm.description, len(hits)))
oschwengers commented 1 month ago

Thanks @althonos for the explanation and sorry for being so persistent on this, but wouldn't it be much easier for Pyhmmer users to simply provide the HMM description in the TopHits object?

For instance, in v0.10.5, the TopHits.query_length attribute was added as part of the API. Wouldn't it be possible (and much easier to use) to add a related TopHits.query_description attribute, as well?

Or is there maybe a technical reason why some fields (query_name, query_accession, query_length) are available in the TopHits object, but description is not? It would make life so much easier to use Pyhmmer. If this is not possible, I think I have to go back using the original hmmsearch - though I really don't want to and would love you use Pyhmmer.

oschwengers commented 1 month ago

OK, I found an acceptable workaround:

with pyhmmer.plan7.HMMFile(user_hmms_path) as hmms_fh:
        hmms = list(hmms_fh)
        hmms_id_desc = { hmm.accession.decode(): hmm.description.decode() for hmm in hmms }
        for hmm_query_hits in pyhmmer.hmmsearch(hmms, proteins):

However, I still feel like having a direct access to the TopHits.query_description field would be a plus.

Please, feel free to just close this, or leave it upon just in case you consider adding the desc field. Anyway, thank you very much for this great Python wrapper (and many others) - very much appreciated!

althonos commented 1 month ago

Or is there maybe a technical reason why some fields (query_name, query_accession, query_length) are available in the TopHits object, but description is not?

The main problem is that for every field I am reserving more space in each TopHits object which I was reluctant to do because of memory usage. But I think it may actually better to add a reference to the original query rather than manually handling all the fields. I'll experiment a bit to see if this can be done properly, so that you could do e.g. TopHits.query.description.

oschwengers commented 1 month ago

That would be awesome, thanks a lot!

oschwengers commented 1 month ago

Awesome! Thank you so much!

althonos commented 1 month ago

I'm making a release now :smile: