UCLOrengoGroup / cath-tools

Protein structure comparison tools such as SSAP and SNAP
http://cath-tools.readthedocs.io
GNU General Public License v3.0
59 stars 14 forks source link

Accessibility problems prevent cath-ssap aligning two similar structures #8

Open tonyelewis opened 8 years ago

tonyelewis commented 8 years ago

Natalie has found an interesting cath-ssap failure on 4tp8U01 versus 2vhpU01. The program runs but generates zero scores despite the pair being very similar:

4tp8U01  2vhpU01    0    0   0.00    0    0    0   0.00

Running under debug mode (--debug) reveals:

Rummaging in the code reveals that :

It would make sense that this would cause problems with Natalie's pair of structures because the accessibilities that DSSP gives for their residues (which is what cath-ssap is using) are high.

Changing the code to use the accessibility_difference lines makes it work correctly, with a good SSAP score, a good RMSD, 100% sequence identity:

4tp8U01  2vhpU01   39   39  81.94   39  100  100   1.65
LRRFKRSCEKAGVLAEVRRREFYEKPTTERKRAKASAVK
LRRFKRSCEKAGVLAEVRRREFYEKPTTERKRAKASAVK

and also a good superposition:

4tp8u01_2vhpu01

So this example gives motivation to look more carefully at this and consider fixing it. It's worth noting that the calculation is a bit more complicated because the buried_i and buried_j variables that are differenced also involve the DSSP accessibility scores (see here). If we are able to look into this, it would be important to at least compare before/after ROC curves to avoid a regression in quality.

Note:

tonyelewis commented 8 years ago

The bad news is that it doesn't look like making the change above would be a clear, unalloyed improvement as we might have hoped.

Exhibit A

On a benchmark of ~20,000 pairs (half homologue pairs; half non-homologue pairs), the scores from the modified version (y-axis) are not simply like those of the standard, DSSP-based version but with a few fixed failures:

scatter_plot

Exhibit B

Run on the same benchmark, the modified version (blue) is a bit worse on a precision-recall curve than the standard, DSSP-based SSAP:

precision_recall_curve

Exhibit C

For some examples from that benchmark, the change breaks cath-ssap on some pairs for which it previously worked:

dom1    dom2    prev_ssap_score
3b2vA01 3b2vA04 66.95
1dqbA01 1kkeA01 64.96
1aiwA00 3b2vA04 54.16
1ce3A00 1mt1A00 52.81
1b13A00 1qzfA02 48.50
1a1tA00 1cfhA00 47.12
1be3I00 1h6wA02 45.49
1gpzA02 2hrvA01 39.47
1h6wA02 1mmsA01 38.35
1cf4B00 1jeyA03 35.13
1ocyA02 2b1yA00 31.72
1jfwA00 2mev400 31.60
1kf6A04 2vqeN00 30.89
1ocyA02 2ehbD00 30.25
1fi8D00 1ocyA02 11.28

Looking at the case of 3b2vA01 versus 3b2vA04, it looks like the problem is as follows:

Perhaps this indicates that the fix can be made more successful by adjusting the relevant global_res_sim_cutoff value to achieve a roughly similar proportion of passes over a decent number of pairs of structures. This might fix the problems described here and might also prevent/mitigate the change slowing the algorithm down (which I have not yet investigated).

Or perhaps not.

More work is required.

tonyelewis commented 7 years ago

Another pair that appear to fail due to this issue: 4c5qG00 versus 3grzA01.

Output is:

2017-03-16 14:38:58.521369 [cath-ssap|warning] When populating upper_score_matrix (residue; pass 0), chose no residue pairs out of a possible 1691 to compare. This may relate to https://github.com/UCLOrengoGroup/cath-tools/issues/8 - please see that issue for more information and please add a comment if it's causing you problems (or open a new issue if this message is spurious).
2017-03-16 14:38:58.594028 [cath-ssap|warning] When populating upper_score_matrix (residue; pass 0), chose no residue pairs out of a possible 1691 to compare. This may relate to https://github.com/UCLOrengoGroup/cath-tools/issues/8 - please see that issue for more information and please add a comment if it's causing you problems (or open a new issue if this message is spurious).
2017-03-16 14:38:58.594989 [cath-ssap|warning] When populating upper_score_matrix (residue; pass 0), chose no residue pairs out of a possible 1691 to compare. This may relate to https://github.com/UCLOrengoGroup/cath-tools/issues/8 - please see that issue for more information and please add a comment if it's causing you problems (or open a new issue if this message is spurious).
4c5qG00  3grzA01    0    0   0.00    0    0    0   0.00   0.00
tonyelewis commented 7 years ago

More pairs:

1vszI00 1hgvA00
3vtoA01 1gk6B00
3vtoA01 3u88M00
3vtoA01 3rl0h00
3vtoA01 2p22D00
3vtoA01 1hgvA00
1vszI00 3r1fE02
3vtoA01 3ksyA04
3vtoA01 3onqC01
3vtoA01 2e52A02
3vtoA01 2x51A07
tonyelewis commented 7 years ago

More:

1fdoA05 3o6qC01
1zkcB02 3o6qC01
2axcA02 2o01N01
2o01N01 3o6qD01
3h25A02 3o6qC01
3i38I02 3o6qC01