Closed paoslaos closed 1 year ago
Hi Paos, sounds like you want to use hmmsearch
or hmmscan
here! hmmalign
is a very specific case where you have a HMM and sequence which you already know belong to the same sequence family, and you want to use the HMM as a guide for aligning all of them.
Using hmmsearch
you will get one TopHits
object per input HMM, which will summarize the hits it got to all of your sequences databases, including bitscore, p-value and e-values.
Dear Martin,
thanks for your help! To clarify a bit:
First, I wasnt able to get a score for each of my sequences. It took me a bit to realize the right filtering options because changing E, T didnt seem to do the trick.
For the output below, a few notes: Interestingly, sequence[-2] slipped into my test set in my uniprot query to build my test case so it turns out to be farer away from the other sequences. sequence[-1] is a random amino acid sequence added for testing the cutoffs. I also have duplicates in the set.
For reference, here is my snippet:
pli = pyhmmer.plan7.Pipeline(alphabet)
result = pyhmmer.hmmer.hmmscan(hmm_sequences, [hmm_model], bias_filter=False, F1=1.0, F2=1.0, F3=1.0)
result_list = list(result)
print(f"N sequences to score: {len(sequences)}")
print("idx score evalue pvalue total_n current_n")
for idx, hit_iter in enumerate(result_list):
for hit in hit_iter:
print(idx+1, np.round(hit.score, 2), np.round(hit.evalue, 2), np.round(hit.pvalue, 2),len(result_list), len(hit_iter))
Which gives:
N sequences to score: 14
idx score evalue pvalue total_n current_n
1 1226.81 0.0 0.0 14 1
2 1226.81 0.0 0.0 14 1
3 1222.29 0.0 0.0 14 1
4 1214.66 0.0 0.0 14 1
5 1175.53 0.0 0.0 14 1
6 1198.71 0.0 0.0 14 1
7 1195.83 0.0 0.0 14 1
8 1190.86 0.0 0.0 14 1
9 1197.9 0.0 0.0 14 1
10 1184.9 0.0 0.0 14 1
11 1038.67 0.0 0.0 14 1
12 1226.81 0.0 0.0 14 1
13 -3.27 0.11 0.11 14 1
14 -4.82 0.32 0.32 14 1
If you have any other hint for me, I am happy to learn more, or see if this makes sense to you. Otherwise this can be closed!
Thank you so much!!!!! Paos
First, I wasnt able to get a score for each of my sequences. It took me a bit to realize the right filtering options because changing E, T didnt seem to do the trick.
Yes exactly, the typical usecase of HMMER is to find some hits for a HMM in a lot of proteins so there are several pre-filtering heuristics to kick out very distant sequences (with a cheaper version of the algorithm) without actually computing scores for the remaining ones. If your last two sequences are quite distant it's very likely they were being kicked by the filter first, unless you disable it as you did.
For reference, here is my snippet:
In here i'd recommend using hmmsearch
, it's usually faster, and then you get a single TopHits
for your query, so extracting the results is easier (something like):
hits = next(pyhmmer.hmmer.hmmscan(hmm_model, hmm_sequences, bias_filter=False, F1=1.0, F2=1.0, F3=1.0)
print("name score evalue pvalue total_n current_n")
for hit in hits:
print(hit.name, round(hit.score, 2), round(hit.evalue), round(hit.pvalue))
Otherwise I guess you're settled! :)
Thanks for this great library! I just started recently to dig into using HMMs so my question is very very basic. I have build an HMM from an alignment to a specific protein family, now I want to use the HMM to score a list of other protein sequences (all of them).
I tried using the TraceAligner with something like this because hmmalign() would not give the desired metrics (?):
Is there an alternative that you recommend to get various alignment scores for my query sequences? I need all of them scored and typically I am interested in alignment scores, e-values, and other easily available metrics.
Great appreciation!
Peace Paos