Closed oschwengers closed 1 month ago
Hi!
I checked and query_length seems to already be in the object. https://pyhmmer.readthedocs.io/en/stable/api/plan7/results.html#pyhmmer.plan7.TopHits.query_length
top_hits = hmmsearch(**kwargs)
query_length = top_hits.query_length
Hope it helps you.
Thanks @jpjarnoux for the prompt reply.
However, there was a typo (wrongly copy-pasted) in the text above. Instead of TopHits.query_length
(which indeed already exists), I meant TopHits.query_description
for the related DESC
field of the HMM files.
I amended my initial query above.
Ok, so a way to do what you want is:
top_hits = hmmsearch(**kwargs)
hit = next(top_hits)
query_desc = hit.query_desc
I have not tested this, but I think it works.
Obviously, an attribute could be really useful here.
@jpjarnoux I've just tested it. There is no such attribute hit.query_desc
. Can you point me to an entry in the docs: https://pyhmmer.readthedocs.io/en/stable/api/index.html ?
Hi! Sorry about that I made a confusion somehow. If you want the description of the target https://pyhmmer.readthedocs.io/en/stable/api/plan7/results.html#pyhmmer.plan7.Hit.description :
top_hits = hmmsearch(**kwargs)
hit = next(top_hits)
target_desc = hit.description
This is not possible for the query. I swear I saw something, but I must have been confused with another thing.
Thanks @jpjarnoux for the clarification.
Unfortunately, I also haven't seen anything close to the query description (the DESC
line in the HMM files). However, for Bakta we need exactly this field to provide users with an option to feed-in additional information into the workflow via a customized format, just like what, for example, Prokka does (https://github.com/tseemann/prokka?tab=readme-ov-file#fasta-database-format, using the original hmmsearch
).
@althonos I'd love to use Pyhmmer for this in Bakta. Hence, would it be possible to add such a query description field TopHits.query_description
to the API? I haven't looked into the code, but I could image that it should not be that much of a deal, but a huge advantage for many users.
Hi Oliver,
It depends how you are calling the pipeline, but by design the hmmsearch function yields the hits in the same order the HMMs are passed. This means that with the right use of zip and itertools.tee you can get an iterator on both the query HMM and the TopHits:
queries = HMMFile("...")
hmms1, hmms2 = itertools.tee(queries)
for hmm, hits in zip(hmms1, hmmsearch(hmms2, target_sequences)):
print(hmm.description, len(hits)))
Thanks @althonos for the explanation and sorry for being so persistent on this, but wouldn't it be much easier for Pyhmmer users to simply provide the HMM description in the TopHits
object?
For instance, in v0.10.5, the TopHits.query_length
attribute was added as part of the API. Wouldn't it be possible (and much easier to use) to add a related TopHits.query_description
attribute, as well?
Or is there maybe a technical reason why some fields (query_name
, query_accession
, query_length
) are available in the TopHits
object, but description
is not? It would make life so much easier to use Pyhmmer. If this is not possible, I think I have to go back using the original hmmsearch
- though I really don't want to and would love you use Pyhmmer.
OK, I found an acceptable workaround:
with pyhmmer.plan7.HMMFile(user_hmms_path) as hmms_fh:
hmms = list(hmms_fh)
hmms_id_desc = { hmm.accession.decode(): hmm.description.decode() for hmm in hmms }
for hmm_query_hits in pyhmmer.hmmsearch(hmms, proteins):
However, I still feel like having a direct access to the TopHits.query_description
field would be a plus.
Please, feel free to just close this, or leave it upon just in case you consider adding the desc field. Anyway, thank you very much for this great Python wrapper (and many others) - very much appreciated!
Or is there maybe a technical reason why some fields (query_name, query_accession, query_length) are available in the TopHits object, but description is not?
The main problem is that for every field I am reserving more space in each TopHits
object which I was reluctant to do because of memory usage. But I think it may actually better to add a reference to the original query rather than manually handling all the fields. I'll experiment a bit to see if this can be done properly, so that you could do e.g. TopHits.query.description
.
That would be awesome, thanks a lot!
Awesome! Thank you so much!
I'm making a release now :smile:
Hi @althonos and many thanks for maintaining and pushing forward this wonderful project!
I just started to use PyHMMER in Bakta for user-provided HMMs. Here, I'd like to provide the users with an option to feed some extra information on each HMM into the pipeline via a custom format for the
description
field in the HMM file.However, there seems no such field in the API of the TopHits object. For
NAME
,ACC
andLENG
there areTopHits.query_name
,TopHits.query_accession
andTopHits.query_length
, respectively.Hence, could you maybe add
TopHits.query_description
for the relatedDESC
?Thanks again!
PS: Just in case I overlooked something, could please point me to the right API object?