althonos / pyhmmer

Cython bindings and Python interface to HMMER3.
https://pyhmmer.readthedocs.io
MIT License
128 stars 12 forks source link

Running the hmmsearch #79

Open SergeyBaikal opened 1 month ago

SergeyBaikal commented 1 month ago

Dear authors, I find it difficult to run the program without knowing python. How can I run a similar command with the pyhmmer and extract only the best bit score? Could you please help me?

hmmsearch --tblout /home/Results_vog.txt -E 1e-5 --notextw --cpu 8 /home/Vog_219.hmm /home/proteins.fasta

Results_vog.txt > one_target_best_bit_score.tsv
althonos commented 1 week ago

Hi Sergey, unfortunately I'm a bit at loss here, the whole point of PyHMMER is to make it easier to run HMMER from Python ; I have not started making a CLI so at the moment it's not so straightforward to run directly through the command line.

Unless you have more than one HMM in your query file, you will not benefit from multiprocessing either, so I'd still encourage using the original hmmsearch for that purpose.

Nevertheless, I can provide you with a Python script that does what you want if that's helpful. Do you need to get the best bit score per target sequence for a single HMM?

SergeyBaikal commented 1 week ago

Dear althonos

Sorry for the long reply. I guess I didn't really understand what the PyHMMER is for. Yes, I was looking for a script to extract the best matches for the hmmsearch. But eventually came across a simple command awk '!x[$3]++' ./My_data/Results.txt > ./My_data/besthit.txt using hmmscan --tblout. Maybe that's not quite right. If you have the ability to make a parser to extract the best result from hmmsearch I would appreciate it.

I found part of the solution here https://github.com/linnabrown/run_dbcan/issues/27#issue-520657606 But I wanted a thing that would just show the best result from the results hmmsearch I got and not remove some matches.

althonos commented 4 days ago

Hi @SergeyBaikal,

This would probably work if you wanted to use PyHMMER: it uses a dictionary to store the best hit for each target, and then displays a small TSV table:

import pyhmmer

# load queries
alphabet = pyhmmer.easel.Alphabet.amino()
with pyhmmer.easel.SequenceFile("/home/proteins.fasta", digital=True, alphabet=alphabet) as seq_file:
    sequences = seq_file.read_block()

# record best hits by score
best_hits: dict[str, pyhmmer.plan7.Hit] = {}

# launch hmmer and collect best hit per target
with pyhmmer.plan7.HMMFile("/home/Vog_219.hmm") as hmm_file:
    for hits in pyhmmer.hmmer.hmmsearch(hmm_file, sequences):
        for hit in hits:
            hit_name = hit.name.decode()
            if hit_name not in best_hits or best_hits[hit_name].score < hit.score:
                best_hits[hit_name] = hit

# display best hits
for hit in best_hits.values():
    _decode = lambda x: None if x is None else x.decode()
    print(
        _decode(hit.hits.query.name),
        _decode(hit.hits.query.accession),
        _decode(hit.name),
        _decode(hit.accession),
        hit.evalue,
        hit.score,
        hit.bias,
        sep="\t"
    )