Possibility to filter alignment results on coverage between HMM and target

althonos / pyhmmer

Cython bindings and Python interface to HMMER3.

https://pyhmmer.readthedocs.io

MIT License

130 stars 12 forks source link

Possibility to filter alignment results on coverage between HMM and target #27

Open jpjarnoux opened 2 years ago

jpjarnoux commented 2 years ago

Hi !

I was searching if that's possible to report the alignment coverage on the HMM and the target. I'm using the PADLOC-DB as HMM database and I notice that they include hmm.coverage.threshold and target.coverage.threshold to filter results in there metadata file.

So I was searching if these values were in the HIT object, but they're not, and I don't find them in the documentation. Did I miss something ?

Thanks

althonos commented 2 years ago

Hi @jpjarnoux

If I'm not mistaken, this option doesn't exist in the original HMMER either, because of the difficulty to compute coverage for an alignment which can contain more than one domain. I like this example (from the documentation examples):

jpjarnoux commented 1 year ago

Hi, Thanks, I will try to catch up why there is this value in the metadata and how they're using it.

jpjarnoux commented 1 year ago

Hi, I come back to this question. I have some new information. To compute the target and hmm coverage, they choose the best domain. I tried to reproduce, but unfortunately the length of the query and the target (HMM and sequence) are not saved and accessible in the HIT. I would to take it when I digitalized my sequences and when I'm reading my HMM, but the length is not an attribute in HMM object and DigitalSequence. For the sequence I can easily get the length by another way, but for the HMM, it could be really easier if the length was read and add as an attribute (in HMM file there is LENG key). Do you think you could add those lengths as attribute in Sequence, HMM and Hit or Alignment object ?

apcamargo commented 1 year ago

Hi @jpjarnoux

You can get the length of HMMs from HMM objects:

hmm_lengths = {}
with pyhmmer.plan7.HMMFile("Pfam-A.h3m") as hmm_file:
    for hmm in hmm_file:
        hmm_lengths[hmm.name] = len(hmm.consensus)

Then:

n_aligned_positions = len(
    hit.best_domain.alignment.hmm_sequence
) - hit.best_domain.alignment.hmm_sequence.count(".")
hmm_coverage = (
    n_aligned_positions / hmm_lengths[hit.best_domain.alignment.hmm_name]
)

As @althonos said, this is an oversimplification because you can have multiple domains in the hit. But it can be pretty useful for HMMs of full-length proteins.

jpjarnoux commented 1 year ago

Sorry, I forget to reply. That works for me, thank you.

apcamargo commented 7 months ago

A less dumb approach:

def get_hmm_coverage(domain):
    n_aligned_positions = domain.alignment.hmm_to - domain.alignment.hmm_from + 1
    return n_aligned_positions / domain.alignment.hmm_length

with pyhmmer.plan7.HMMFile("Pfam-A.h3m") as hmm_file:
    for hits in pyhmmer.hmmsearch(hmm_file, seqs, bit_cutoffs="gathering"):
        for hit in hits:
            for domain in hit.domains.included:
                hmm_coverage = get_hmm_coverage(domain)