UCLOrengoGroup / cath-tools

Protein structure comparison tools such as SSAP and SNAP
http://cath-tools.readthedocs.io
GNU General Public License v3.0

Consider adding MDA information to cath-resolve-hits #28

Open tonyelewis opened 7 years ago

tonyelewis commented 7 years ago

Jon says:

One thing that might be useful for researchers in general could be a summary of the different MDAs with counts, or a ranked list of how often different models are used in the final set of resolved MDAs.
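To make the request concrete, here is a minimal sketch of the two summaries, written in Python rather than as part of cath-resolve-hits. The input layout (tuples of query ID, start residue, model ID), the function name `summarise_mdas` and the example IDs `protein_1` and `model_B` are all assumptions for illustration; they are not CRH's actual output format or API.

```python
from collections import Counter, defaultdict


def summarise_mdas(resolved_hits):
    """Count MDAs and model usage over a set of resolved hits.

    resolved_hits: iterable of (query_id, start_residue, model_id) tuples.
    This layout is hypothetical, not CRH's real output format.
    """
    # Group the resolved hits by the query protein they belong to
    hits_by_query = defaultdict(list)
    for query_id, start, model_id in resolved_hits:
        hits_by_query[query_id].append((start, model_id))

    mda_counts = Counter()
    model_counts = Counter()
    for hits in hits_by_query.values():
        # An MDA here is simply the model IDs ordered by start residue
        ordered = [model_id for _, model_id in sorted(hits)]
        mda_counts[",".join(ordered)] += 1
        model_counts.update(ordered)
    return mda_counts, model_counts


# Made-up example data
hits = [
    ("protein_1", 10, "1cukA01__2.40.50.140"),
    ("protein_1", 120, "model_B"),
    ("protein_2", 5, "1cukA01__2.40.50.140"),
    ("protein_2", 90, "model_B"),
    ("protein_3", 1, "model_B"),
]
mda_counts, model_counts = summarise_mdas(hits)
ranked_mdas = mda_counts.most_common()      # MDAs, most frequent first
ranked_models = model_counts.most_common()  # models, most used first
```

Ordering by start residue alone ignores discontinuous domains, so this only covers the simple case.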

tonyelewis commented 7 years ago

Jon and I have been talking about CRH not knowing the matches' families, which is what we'd ideally want for the MDAs.

Jon pointed out that Pfam IDs already include the family ID and that he's thinking of adding the CATH superfamilies to the IDs in his Gene3D pipeline.

Using those IDs alone for the MDAs would make family-equivalent MDAs look different (e.g. 1cukA01__2.40.50.140 vs 1bvsA01__2.40.50.140) but Jon says they'd still be useful.

We've also discussed the idea of getting the actual family IDs into CRH by having it read a match_id → family_id file and/or allowing the user to specify a regexp to convert model IDs to family IDs. Jon thinks this would be very useful.
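A rough sketch of how the two options could combine, in Python for illustration only. The two-column whitespace-separated mapping layout, the function names and the regexp are assumptions, not an existing CRH option or file format; the regexp assumes IDs of the form `<model>__<superfamily>` as in the examples above.

```python
import re

# Hypothetical regexp: capture a four-part CATH superfamily ID after "__"
SF_PATTERN = r"__(\d+\.\d+\.\d+\.\d+)$"


def parse_mapping(lines):
    """Parse 'match_id family_id' lines (assumed layout) into a dict."""
    mapping = {}
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and comment lines
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


def family_id_for(model_id, mapping=None, pattern=None):
    """Return a family ID for model_id.

    An explicit mapping entry wins; otherwise the first capture group of
    the regexp is used; otherwise the model ID is returned unchanged.
    """
    if mapping and model_id in mapping:
        return mapping[model_id]
    if pattern:
        match = re.search(pattern, model_id)
        if match:
            return match.group(1)
    return model_id


mapping = parse_mapping(["# match_id family_id", "model_X 3.30.70.270"])
fam_a = family_id_for("1cukA01__2.40.50.140", mapping, SF_PATTERN)
fam_b = family_id_for("1bvsA01__2.40.50.140", mapping, SF_PATTERN)
fam_x = family_id_for("model_X", mapping, SF_PATTERN)
```

With this, the two family-equivalent IDs above reduce to the same family ID, so their MDAs would no longer look different. Letting the file take precedence over the regexp is just one possible design choice.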

This issue is complicated by the use of discontinuous domains in the new Gene3D pipeline, though Jon says that the above stuff would still be useful, even before we've implemented a decent solution to the "discontinuous" issue.

Thinking about it, one approach would be to put the discontinuous domain's sub-MDA as its family in the match_id → family_id file (in the same format as CRH uses) so the final generated MDA is correct (though this would be trickier if the MDAs numbered domains to make the MDAs completely explicit, not just of the form SF_A[seg1],SF_A[seg1],SF_A[seg1],SF_A[seg2],SF_A[seg2],SF_A[seg2]).

Edit: Jon has had to explain to me (not, I think, for the first time :disappointed:) that the approach I suggested in that last paragraph is insufficient because:

sometimes not all of the discontinuous HMM is matched (e.g. if some splice variant truncates a protein), so that's why, unfortunately, you really need to look at the HMM alignment to get the superfamilies implied by the discontinuous HMM hit