TurtleTools / caretta

A software-suite to perform multiple protein structure alignment and structure feature extraction.
BSD 3-Clause "New" or "Revised" License

Is it possible to output Tm scores? #14

Closed donal1 closed 2 years ago

donal1 commented 2 years ago

Hi there,

I was wondering if it's possible to output the TM-scores between proteins. Also, what exactly is the distance matrix that is output?

All the best

akdel commented 2 years ago

Hi! We do have a function for computing TM-scores in the Python API (not in the command-line tool). It is a method of the StructureMultiple class called make_rmsd_coverage_tm_matrix, and it returns the TM-scores as the third object. You can save the StructureMultiple class using the command-line tool with the --class flag, and load it with Python.

The computed distance matrix is based on geometricus as described in our paper, or based on the caretta score if using the --full flag.

donal1 commented 2 years ago

I could be confused, but is the pairwise TM-score of proteins the same as the TM-score over a multiple alignment? I looked at the reported TM-score for the AAA family in HOMSTRAD, computed with mTM-align: 0.622. They also output a pairwise TM-score, but this seems to be different. Is it possible to compute the TM-score for the multiple alignment as well as for the pairwise alignment? Thanks again.

akdel commented 2 years ago

In general, the pairwise scores are different from the multiple alignment scores. The pairwise score is the optimal alignment score (based on the distance matrix and gap parameters) between the two structures. The multiple alignment algorithm, however, uses heuristics (a guide tree and a progressive alignment step) to compute the multiple alignment efficiently.

For the second part of your question: yes, you can get pairwise TM-scores from the multiple sequence alignment, but as I mentioned above, this won't be the optimal result for each pair, as the multiple sequence alignment is meant to be used to compare all structures at once.
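To make "pairwise scores from the multiple alignment" concrete, here is a minimal sketch of how a pair of rows from a multiple alignment induces a pairwise alignment: only columns where both rows hold a residue count as aligned. The function name `aligned_pairs`, the `-` gap character, and the example rows are illustrative assumptions, not part of the caretta API.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return (index_in_a, index_in_b) for columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("rows of a multiple alignment must have equal length")
    pairs = []
    pos_a = pos_b = 0  # residue counters, advanced only on non-gap characters
    for char_a, char_b in zip(row_a, row_b):
        if char_a != gap and char_b != gap:
            pairs.append((pos_a, pos_b))
        if char_a != gap:
            pos_a += 1
        if char_b != gap:
            pos_b += 1
    return pairs


# Hypothetical rows taken from a larger multiple alignment.
row_1 = "MK-LVE"
row_2 = "M-ALV-"
print(aligned_pairs(row_1, row_2))  # [(0, 0), (2, 2), (3, 3)]
```

Because the gaps in each row were placed to accommodate every structure in the set, the induced pairing is generally not the one a dedicated pairwise aligner would pick, which is why the resulting scores can be lower.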

donal1 commented 2 years ago

Okay, this is what I want to know. I'm trying to compare the reported mTM-align results with caretta, just to verify. The HOMSTRAD dataset has a reported TM-score for the multiple alignment of the proteins in each family, and mTM-align also has these reported scores. I am trying to do the same for caretta and obviously I'm missing something. What is the function to get the multiple alignment TM-score?

donal1 commented 2 years ago

Also, just to clarify, since I think I'm missing something: the pairwise score from caretta isn't the same as the mTM-align pairwise matrix? As mTM-align is very time-consuming and caretta is quite fast, I was using caretta to get TM-scores between proteins.

akdel commented 2 years ago

> Okay, this is what I want to know. I'm trying to compare the reported mTM-align results with caretta, just to verify. The HOMSTRAD dataset has a reported TM-score for the multiple alignment of the proteins in each family, and mTM-align also has these reported scores. I am trying to do the same for caretta and obviously I'm missing something. What is the function to get the multiple alignment TM-score?

You can get the pairwise tm-scores from the python api by using the StructureMultiple class method make_rmsd_coverage_tm_matrix. If needed, I can make an example gist to show how you can use it.

> Also, just to clarify, since I think I'm missing something: the pairwise score from caretta isn't the same as the mTM-align pairwise matrix? As mTM-align is very time-consuming and caretta is quite fast, I was using caretta to get TM-scores between proteins.

Do you mean the commandline score matrix from caretta? If so, then those are caretta scores and not tm-scores. If you used the make_rmsd_coverage_tm_matrix then I will take a look at why there's a discrepancy between the outputs from the two tools (assuming the alignments are identical).

donal1 commented 2 years ago

Thanks, I know how to get the pairwise scores, but https://yanglab.nankai.edu.cn/mTM-align/benchmark/homstrad.html reports a TM-score from the multiple alignment; it doesn't report a pairwise score. The "ATPase family associated with various cellular activities" (AAA) entry reports a single score of 0.513. What I want is to report a single TM-score for a family, not pairwise scores.

What I am trying to do is see whether the TM-score reported from the caretta multiple alignment is better or worse than mTM-align's, and compare this score to the reported HOMSTRAD score, as it can be taken as the ground truth.

akdel commented 2 years ago

It's mentioned in the mtm-align paper that the single tm-score is the mean of the pairwise tm-scores. Then you could do the same to the pairwise output from caretta.
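A minimal sketch of that averaging step, assuming the pairwise TM-scores are available as a square matrix (for example, the third object returned by make_rmsd_coverage_tm_matrix). The helper name `mean_pairwise_score` and the example values are made up for illustration; the diagonal is skipped because a structure compared with itself always scores 1.

```python
def mean_pairwise_score(matrix):
    """Average the off-diagonal entries of a square score matrix."""
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two structures")
    total = 0.0
    count = 0
    for i in range(n):
        for j in range(n):
            if i != j:  # skip self-comparisons
                total += matrix[i][j]
                count += 1
    return total / count


# Hypothetical pairwise TM-scores for a family of three structures.
tm_matrix = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.7],
    [0.5, 0.7, 1.0],
]
print(round(mean_pairwise_score(tm_matrix), 3))  # 0.6
```

Whether this number is comparable to a published one depends on how each pairwise score was normalised, so it is worth checking that both tools use the same convention before comparing.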

donal1 commented 2 years ago

Ahh, that's great, thank you very much. I missed that; I thought it was something more complex. Thanks again for the quick replies, you guys are saving me so much time. Have a really great weekend.

donal1 commented 2 years ago

Apparently it isn't exactly the mean: it's the mean, normalised by the smaller protein. It's unclear from the paper what they mean. Would you have any idea?

I've narrowed it down to the mtmalign.cpp file.

donal1 commented 2 years ago

Sorry this may be a long post with a few questions.

  1. Caretta produces pairwise TM-scores, but these scores do not come from an all-vs-all pairwise alignment as in mTM-align. As you said, this won't be the optimal result for each pair, as the multiple sequence alignment is meant to be used to compare all structures at once. So the pairwise matrices produced by each method are different? mTM-align produces the optimal scores, and caretta finds the conserved and variable residues across a set of proteins.

  2. Does normalisation occur twice in mTM-align? The TM-score formula divides by the length of the target protein and also by d0, the distance scale. This happens in caretta as well as in mTM-align to produce the TM-score between proteins. But mTM-align also seems to perform a second normalisation. As reported in their paper: "We can calculate the number of structurally equivalent residues (Lali), the associated Root-Mean-Square Deviation (RMSD) and TM-score. Here the TM-score is normalized by the length of the smaller protein. Because the reference MSTAs are available for the HOMSTRAD dataset, we can define another metrics accuracy (ACC)." So if one wants to compare the accuracy of caretta against mTM-align on the HOMSTRAD dataset, would the caretta results need to be normalised by the smaller protein? If the caretta results were not normalised this way, I think they would look very bad in comparison to mTM-align, but I need to be sure.

So I just want to check the ability of caretta to output a similarity score on a cluster of proteins and compare this to mTM align.

donal1 commented 2 years ago

I went through the mTM-align code. They seem to only compute the TM-score normalised by the smaller protein, in the get_TMscore_from_seqxa function. This is also what they state in the paper.

Whereas you guys, as seen in the tm_score function in multiple_alignment.py, compute the TM-score for both and return the higher one. This means you are normalising by both lengths and choosing the max TM-score.

Also, I don't understand why you compute the square root of the squared common coordinates in lines 47 and 48 of tm_score. This appears to be different from the reported TM-score algorithm.

Ninjani commented 2 years ago

> Caretta produces pairwise TM-scores, but these scores do not come from an all-vs-all pairwise alignment as in mTM-align. As you said, this won't be the optimal result for each pair, as the multiple sequence alignment is meant to be used to compare all structures at once. So the pairwise matrices produced by each method are different? mTM-align produces the optimal scores, and caretta finds the conserved and variable residues across a set of proteins.

mTM align also produces multiple structure alignment TM scores as far as I know, as it is a multiple structure alignment algorithm, though with a different underlying method than caretta. For pairwise TM scores you would have to use TM-align (https://zhanggroup.org/TM-align/) instead.

> Whereas you guys, as seen in the tm_score function in multiple_alignment.py, compute the TM-score for both and return the higher one. This means you are normalising by both lengths and choosing the max TM-score.

The so-called "target" protein for the TM-score calculation could be chosen as the shortest or the longest, or one can take the max over both, as we do. We don't use the TM-score in Caretta, so we didn't experiment with changing this.
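For reference, the three normalisation choices can be compared side by side. The sketch below uses the standard TM-score formula, TM = (1/L) * sum_i 1 / (1 + (d_i / d0(L))^2) with d0(L) = 1.24 * (L - 15)^(1/3) - 1.8, where d_i are the distances between aligned residue pairs after superposition and L is the normalising length. It is not Caretta's or mTM-align's actual code: the function names, the 0.5 floor on d0 for short chains, and the use of a fixed superposition (rather than searching for the best one) are all assumptions made for illustration.

```python
def d0(length):
    """Distance scale for a given normalising length, floored at 0.5 (assumed)."""
    if length <= 15:
        return 0.5  # the cube root below is undefined for non-positive values
    return max(0.5, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, norm_length):
    """TM-score of one fixed superposition, normalised by norm_length.

    distances: distances (in angstroms) between aligned residue pairs.
    """
    scale = d0(norm_length)
    return sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances) / norm_length


def tm_score_variants(distances, length_1, length_2):
    """Return the score under the three normalisation choices discussed here."""
    by_shorter = tm_score(distances, min(length_1, length_2))
    by_longer = tm_score(distances, max(length_1, length_2))
    return {
        "shorter": by_shorter,
        "longer": by_longer,
        "max": max(by_shorter, by_longer),
    }


# Hypothetical example: 80 aligned pairs, each 1 angstrom apart,
# between chains of 100 and 150 residues.
print(tm_score_variants([1.0] * 80, 100, 150))
```

The same set of aligned distances yields noticeably different values depending on the normalising length, so scores from two tools are only comparable when both use the same convention.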

> Also, I don't understand why you compute the square root of the squared common coordinates in lines 47 and 48 of tm_score. This appears to be different from the reported TM-score algorithm.

This was a bug; I've fixed it on master now. Thanks for spotting it!

donal1 commented 2 years ago

I am actually happy with the non-pairwise score. I want a TM-score as it relates to a group of proteins, so the optimal pairwise score is unimportant. What matters is that, if the pairwise scores from the multiple structure alignment are output, they are on par with mTM-align's, which they should be if your results are correct.

donal1 commented 2 years ago

I just want to compare the tm score given for the homstrad families and the reported tm scores from mtm align.

The mTM-align and HOMSTRAD family scores are reported here: https://yanglab.nankai.edu.cn/mTM-align/benchmark/homstrad.html

I've also emailed you the mean pairwise scores that caretta outputs for each family. The scores are quite bad, and I'm sure it's due to the normalisation or bugs.

donal1 commented 2 years ago

Yeah, thanks for all the help. I realise that the HOMSTRAD scores I was comparing to were inappropriate. It would make more sense to compare against accuracy.

lingnus1 commented 2 years ago

> > Okay, this is what I want to know. I'm trying to compare the reported mTM-align results with caretta, just to verify. The HOMSTRAD dataset has a reported TM-score for the multiple alignment of the proteins in each family, and mTM-align also has these reported scores. I am trying to do the same for caretta and obviously I'm missing something. What is the function to get the multiple alignment TM-score?

> You can get the pairwise tm-scores from the python api by using the StructureMultiple class method make_rmsd_coverage_tm_matrix. If needed, I can make an example gist to show how you can use it.

> > Also, just to clarify, since I think I'm missing something: the pairwise score from caretta isn't the same as the mTM-align pairwise matrix? As mTM-align is very time-consuming and caretta is quite fast, I was using caretta to get TM-scores between proteins.

> Do you mean the commandline score matrix from caretta? If so, then those are caretta scores and not tm-scores. If you used the make_rmsd_coverage_tm_matrix then I will take a look at why there's a discrepancy between the outputs from the two tools (assuming the alignments are identical).

Hello, may I have an example for the usage of make_rmsd_coverage_tm_matrix? Thank you :)

Ninjani commented 2 years ago

@lingnus1 see #18