althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License

Hmmsearch callback tqdm update #60

Closed jpjarnoux closed 8 months ago

jpjarnoux commented 8 months ago

Hi, I have a question about the callback in the hmmsearch function. I would like to update my progress bar after each query, but my code does not work as expected.

bar = tqdm(range(len(hmm_list)), unit="hmm", desc="Align gene families to HMM", disable=disable_bar)
options = {"bit_cutoffs": bit_cutoffs, 'callback': lambda p: bar.update()}
for top_hits in pyhmmer.hmmsearch(hmm_list, gf_sequences, cpus=threads, **options):

Maybe I do not understand how to use it.

For now, I update the bar manually at the end of the for loop to make it work, but I would also like to use the callback to log the name of the HMM at debug level (with the logging package). So defining a callback function seems like a good idea.

Thanks for your help

althonos commented 8 months ago

Hi Jérôme,

The callback needs to take two arguments: the HMM object and the total number of currently loaded HMMs. The total is useful if you're reading the HMMs from a file, in which case it is not known in advance and you can update it as you go (tqdm doesn't support that, but rich does).

In your snippet, that means:

options = {"bit_cutoffs": bit_cutoffs, 'callback': lambda hmm, total: bar.update()}
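Since you also want to log the HMM name, a named function may be easier to work with than a lambda. Here is a minimal sketch of that idea, not something taken from the pyhmmer docs: make_callback, advance and the logger name are hypothetical, and the stand-in objects at the bottom only mimic an HMM by exposing a name attribute. It handles the name being either bytes or str, since I'm assuming that may depend on the pyhmmer version.

```python
import logging
from types import SimpleNamespace


def make_callback(advance, logger):
    """Build a callback with the two-argument signature (hmm, total).

    `advance` is any callable that moves a progress bar forward by n steps
    (for instance the bound method `bar.update` of a tqdm bar).
    Both `make_callback` and `advance` are hypothetical names.
    """
    def callback(hmm, total):
        # Assumption: the query object exposes a `name` attribute that may
        # be either bytes or str depending on the library version.
        name = getattr(hmm, "name", None)
        if isinstance(name, bytes):
            name = name.decode()
        logger.debug("Query %s done (%d queries loaded so far)", name, total)
        advance(1)
    return callback


# Stand-in demo that needs neither pyhmmer nor tqdm: `calls.append` plays
# the role of `bar.update`, and SimpleNamespace plays the role of an HMM.
calls = []
callback = make_callback(calls.append, logging.getLogger("hmmsearch-demo"))
callback(SimpleNamespace(name=b"PF00001"), 1)
callback(SimpleNamespace(name="PF00002"), 2)
```

In your snippet this would presumably be wired in as 'callback': make_callback(bar.update, logging.getLogger(__name__)) in the options dict.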

If I use only one argument like you did, the callback raises an exception, so the progress bar is never updated. Since that exception is silenced, the code then enters a deadlock (the worker threads die on the exception, while the main thread still tries to pass them queries to process).

althonos commented 8 months ago

I've patched the deadlock, so now with the code above you'd actually get the error and traceback:

  0%|                                                | 0/20795 [00:00<?, ?hmm/s]Traceback (most recent call last):
  File "/home/althonos/Code/pyhmmer/issue.py", line 18, in <module>
    for top_hits in pyhmmer.hmmsearch(hmms, sequences, cpus=2, callback=callback):
  File "/home/althonos/Code/pyhmmer/pyhmmer/hmmer.py", line 520, in _multi_threaded
    yield results[0].get()
          ^^^^^^^^^^^^^^^^
  File "/home/althonos/Code/pyhmmer/pyhmmer/hmmer.py", line 122, in get
    raise self.exception
  File "/home/althonos/Code/pyhmmer/pyhmmer/hmmer.py", line 215, in run
    hits = self.process(chore.query)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/althonos/Code/pyhmmer/pyhmmer/hmmer.py", line 232, in process
    self.callback(query, self.query_count.value)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: <lambda>() takes 1 positional argument but 2 were given

I'll publish a patch shortly for the deadlock issue, but you don't need to wait for it: just change the callback signature and your code will work :+1:
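To illustrate the general pattern visible in the traceback (the worker stores the exception, and the main thread re-raises it when it asks for the result), here is a simplified sketch using only the standard library. It is not pyhmmer's actual implementation: Result, worker and run are hypothetical names, and the "search" is replaced by a trivial string operation.

```python
import queue
import threading


class Result:
    """Holds either a value or an exception produced by a worker thread."""

    def __init__(self):
        self._done = threading.Event()
        self._value = None
        self._exception = None

    def set(self, value):
        self._value = value
        self._done.set()

    def fail(self, exception):
        self._exception = exception
        self._done.set()

    def get(self):
        # Block until the worker is finished, then surface its error, if any.
        self._done.wait()
        if self._exception is not None:
            raise self._exception
        return self._value


def worker(tasks, callback):
    while True:
        item = tasks.get()
        if item is None:  # sentinel: no more work
            break
        query, result = item
        try:
            callback(query, 1)          # a bad signature raises TypeError here
            result.set(query.upper())   # stand-in for the real search
        except Exception as exc:
            # Record the error instead of letting the thread die silently,
            # so the main thread is never left waiting forever.
            result.fail(exc)


def run(queries, callback):
    tasks = queue.Queue()
    thread = threading.Thread(target=worker, args=(tasks, callback), daemon=True)
    thread.start()
    results = []
    for query in queries:
        result = Result()
        tasks.put((query, result))
        results.append(result)
    tasks.put(None)
    try:
        return [result.get() for result in results]
    finally:
        thread.join()
```

With this structure, run(["a"], lambda p: None) raises a TypeError in the calling thread instead of hanging, which is the behaviour the traceback above shows.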

jpjarnoux commented 8 months ago

Hi, thank you very much for your quick reply. In my case, I have only one HMM per file, so I assume I can use the length of my pyhmmer.plan7.HMM list as the total number of HMMs. Could you tell me how to get the HMM object from the TopHits or Hit object? It's not clear to me.

althonos commented 8 months ago

You basically have two choices: