Scoring a set of sequences with HMM

paoslaos commented 1 year ago

Thanks for this great library! I just started recently to dig into using HMMs so my question is very very basic. I have build an HMM from an alignment to a specific protein family, now I want to use the HMM to score a list of other protein sequences (all of them).

I tried using the TraceAligner with something like this because hmmalign() would not give the desired metrics (?):

    seq_block = pyhmmer.easel.DigitalSequenceBlock(alphabet, seq)
    aligner = pyhmmer.plan7.TraceAligner(posteriors=True)
    traces = aligner.compute_traces(hmm=hmm, sequences=seq_block)
    msa = aligner.align_traces(hmm=hmm, sequences=seq_block, traces=traces)
    scores = [trace.expected_accuracy() for trace in traces]

Is there an alternative that you recommend to get various alignment scores for my query sequences? I need all of them scored and typically I am interested in alignment scores, e-values, and other easily available metrics.

Great appreciation!

Peace Paos

althonos commented 1 year ago

Hi Paos, sounds like you want to use hmmsearch or hmmscan here! hmmalign is a very specific case where you have a HMM and sequence which you already know belong to the same sequence family, and you want to use the HMM as a guide for aligning all of them.

Using hmmsearch you will get one TopHits object per input HMM, which will summarize the hits it got to all of your sequences databases, including bitscore, p-value and e-values.

paoslaos commented 1 year ago

Dear Martin,

thanks for your help! To clarify a bit:

singel HMM
single set of unaligned sequences
goal: get a scores for the unaligned sequences from single hmm
all sequences are already pre-filtered and should have enough similarity to the protein family from the HMM for a meaningful scoring

First, I wasnt able to get a score for each of my sequences. It took me a bit to realize the right filtering options because changing E, T didnt seem to do the trick.

For the output below, a few notes: Interestingly, sequence[-2] slipped into my test set in my uniprot query to build my test case so it turns out to be farer away from the other sequences. sequence[-1] is a random amino acid sequence added for testing the cutoffs. I also have duplicates in the set.

For reference, here is my snippet:

    pli = pyhmmer.plan7.Pipeline(alphabet)
    result = pyhmmer.hmmer.hmmscan(hmm_sequences, [hmm_model], bias_filter=False, F1=1.0, F2=1.0, F3=1.0)
    result_list  = list(result)
    print(f"N sequences to score: {len(sequences)}")
    print("idx score evalue pvalue total_n current_n")
    for idx, hit_iter in enumerate(result_list):
        for hit in hit_iter:
            print(idx+1, np.round(hit.score, 2), np.round(hit.evalue, 2), np.round(hit.pvalue, 2),len(result_list), len(hit_iter))

Which gives:

N sequences to score: 14
idx score evalue pvalue total_n current_n
1 1226.81 0.0 0.0 14 1
2 1226.81 0.0 0.0 14 1
3 1222.29 0.0 0.0 14 1
4 1214.66 0.0 0.0 14 1
5 1175.53 0.0 0.0 14 1
6 1198.71 0.0 0.0 14 1
7 1195.83 0.0 0.0 14 1
8 1190.86 0.0 0.0 14 1
9 1197.9 0.0 0.0 14 1
10 1184.9 0.0 0.0 14 1
11 1038.67 0.0 0.0 14 1
12 1226.81 0.0 0.0 14 1
13 -3.27 0.11 0.11 14 1
14 -4.82 0.32 0.32 14 1

If you have any other hint for me, I am happy to learn more, or see if this makes sense to you. Otherwise this can be closed!

Thank you so much!!!!! Paos

althonos commented 1 year ago

First, I wasnt able to get a score for each of my sequences. It took me a bit to realize the right filtering options because changing E, T didnt seem to do the trick.

Yes exactly, the typical usecase of HMMER is to find some hits for a HMM in a lot of proteins so there are several pre-filtering heuristics to kick out very distant sequences (with a cheaper version of the algorithm) without actually computing scores for the remaining ones. If your last two sequences are quite distant it's very likely they were being kicked by the filter first, unless you disable it as you did.

For reference, here is my snippet:

In here i'd recommend using hmmsearch, it's usually faster, and then you get a single TopHits for your query, so extracting the results is easier (something like):

hits = next(pyhmmer.hmmer.hmmscan(hmm_model, hmm_sequences, bias_filter=False, F1=1.0, F2=1.0, F3=1.0)
print("name score evalue pvalue total_n current_n")
for hit in hits:
    print(hit.name, round(hit.score, 2), round(hit.evalue), round(hit.pvalue))

Otherwise I guess you're settled! :)

althonos / pyhmmer

Scoring a set of sequences with HMM #55