Open tonyelewis opened 8 years ago
The bad news is that it doesn't look like making the change above would be a clear, unalloyed improvement as we might have hoped.
On a benchmark of ~20,000 pairs (half homologue pairs; half non-homologue pairs), the scores from the modified version (y-axis) are not simply like those of the standard, DSSP-based version but with a few fixed failures:
Run on the same benchmark, the modified version (blue) is a bit worse on a precision-recall curve than the standard, DSSP-based SSAP:
For some examples from that benchmark, the change breaks cath-ssap on some pairs for which it previously worked:
dom1 dom2 prev_ssap_score
3b2vA01 3b2vA04 66.95
1dqbA01 1kkeA01 64.96
1aiwA00 3b2vA04 54.16
1ce3A00 1mt1A00 52.81
1b13A00 1qzfA02 48.50
1a1tA00 1cfhA00 47.12
1be3I00 1h6wA02 45.49
1gpzA02 2hrvA01 39.47
1h6wA02 1mmsA01 38.35
1cf4B00 1jeyA03 35.13
1ocyA02 2b1yA00 31.72
1jfwA00 2mev400 31.60
1kf6A04 2vqeN00 30.89
1ocyA02 2ehbD00 30.25
1fi8D00 1ocyA02 11.28
Looking at the case of 3b2vA01 versus 3b2vA04, it looks like the problem is as follows:
residues_have_similar_area_angle_props()
is now using the difference of accessibilities rather than the sum, many more pairs are meeting the criteriacompare_upper_cell
MIN_LOWER_MAT_RES_SCORE
Perhaps this indicates that the fix can be made more successful by adjusting the relevant global_res_sim_cutoff
value to achieve a roughly similar proportion of passes over a decent number of pairs of structures. This might fix the problems described here and might also prevent/mitigate the change slowing the algorithm down (which I have not yet investigated).
Or perhaps not.
More work is required.
Another pair that appear to fail due to this issue: 4c5qG00
versus 3grzA01
.
Output is:
2017-03-16 14:38:58.521369 [cath-ssap|warning] When populating upper_score_matrix (residue; pass 0), chose no residue pairs out of a possible 1691 to compare. This may relate to https://github.com/UCLOrengoGroup/cath-tools/issues/8 - please see that issue for more information and please add a comment if it's causing you problems (or open a new issue if this message is spurious).
2017-03-16 14:38:58.594028 [cath-ssap|warning] When populating upper_score_matrix (residue; pass 0), chose no residue pairs out of a possible 1691 to compare. This may relate to https://github.com/UCLOrengoGroup/cath-tools/issues/8 - please see that issue for more information and please add a comment if it's causing you problems (or open a new issue if this message is spurious).
2017-03-16 14:38:58.594989 [cath-ssap|warning] When populating upper_score_matrix (residue; pass 0), chose no residue pairs out of a possible 1691 to compare. This may relate to https://github.com/UCLOrengoGroup/cath-tools/issues/8 - please see that issue for more information and please add a comment if it's causing you problems (or open a new issue if this message is spurious).
4c5qG00 3grzA01 0 0 0.00 0 0 0 0.00 0.00
More pairs:
1vszI00 1hgvA00
3vtoA01 1gk6B00
3vtoA01 3u88M00
3vtoA01 3rl0h00
3vtoA01 2p22D00
3vtoA01 1hgvA00
1vszI00 3r1fE02
3vtoA01 3ksyA04
3vtoA01 3onqC01
3vtoA01 2e52A02
3vtoA01 2x51A07
More:
1fdoA05 3o6qC01
1zkcB02 3o6qC01
2axcA02 2o01N01
2o01N01 3o6qD01
3h25A02 3o6qC01
3i38I02 3o6qC01
Natalie has found an interesting cath-ssap failure on 4tp8U01 versus 2vhpU01. The program runs but generates zero scores despite the pair being very similar:
Running under debug mode (
--debug
) reveals:2vhpU01.sec
only lists one secondary structure.Rummaging in the code reveals that :
residues_have_similar_area_angle_props()
residues_have_similar_area_angle_props()
has issues (including that the comments currently say there are two problems and then list three).residues_have_similar_area_angle_props()
compares whether the sum of various property differences is below some threshold. Except that one of the properties, accessibility, is summed rather than differenced. It's clear that I have previously noticed this as strange when tidying up the legacy code I inherited - see the variable names, comments and the commentedaccessibility_difference
lines of code at the top and bottom of this chunk of code.It would make sense that this would cause problems with Natalie's pair of structures because the accessibilities that DSSP gives for their residues (which is what cath-ssap is using) are high.
Changing the code to use the
accessibility_difference
lines makes it work correctly, with a good SSAP score, a good RMSD, 100% sequence identity:and also a good superposition:
So this example gives motivation to look more carefully at this and consider fixing it. It's worth noting that the calculation is a bit more complicated because the
buried_i
andburied_j
variables that are differenced also involve the DSSP accessibility scores (see here). If we are able to look into this, it would be important to at least compare before/after ROC curves to avoid a regression in quality.Note: