Open shouldsee opened 7 years ago
From Natalie,
Hi all,
Earlier we were discussing whether or not we thought a particular protein chain should be a whole-chain domain because it is borderline in terms of how well packed it is (e.g. http://update.cathdb.info/cgi-bin/DomChop.pl?chain_id=4ug4D).
I suggested writing a Python script to evaluate how well packed a particular protein chain/domain is, using the Modeller Python modules to calculate a score (i.e. a DOPE score).
In cases where we aren't sure whether a domain is packed well enough to be chopped, it would be great to calculate the normalised DOPE (nDOPE) score and z-score using Modeller's Python libraries, which will quantify how well packed the structure is.
The nDOPE score is typically used to assess the quality of structural models built by homology-modelling methods (such as MODELLER), as it measures how native-like a particular PDB structure is. However, we can also use it in this context to look at packing.
More details can be found in this manual regarding calculating the nDOPE and z-scores: https://salilab.org/modeller/9.15/manual.pdf
Also, this Python script would be very useful for picking out domains already in CATH that are poorly packed and should probably be removed from the CATH classification. This would tie in nicely with the work Ian mentioned on looking for errors in CATH superfamilies.
Thanks, Natalie
Example: examples/assessment/assess_normalized_dope.py
# Example for: model.assess_normalized_dope()
from modeller import *
from modeller.scripts import complete_pdb
env = environ()
env.libs.topology.read(file='$(LIB)/top_heav.lib')
env.libs.parameters.read(file='$(LIB)/par.lib')
# Read a model previously generated by Modeller's automodel class
mdl = complete_pdb(env, '../atom_files/1fdx.B99990001.pdb')
zscore = mdl.assess_normalized_dope()
Sounds great - how do you want me to run the tests?
Nothing obvious in the README, so I tried:
$ git clone https://github.com/CATH-summer-2017/domchop.git
$ cd domchop
$ git checkout nDOPE
$ ./run_tests.sh
and got:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/cath/homes2/ucbcisi/git/domchop/tests/toy_test.py", line 5, in <module>
from toy import *
@sillitoe
> Sounds great - how do you want me to run the tests?
> Nothing obvious in the README, so I tried:
I have updated the tests and the README.md. Running ./run_tests.sh
should work better now. Please do make sure Modeller is configured as described in codes/README.md.
BTW, the traceback you posted seems incomplete; I hope it has gone away now.
I was able to put together a list of CATH domain entries with an nDOPE score > 1.0. The latest cath-domain-pdb-S35.tgz, containing 21155 structures (#7), was used in this test. Please check whether there is anything interesting.
To look up the structure for a domain ID, you can do one of the following:
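As one option, a minimal sketch of splitting a domain ID into its parts (assuming the standard seven-character CATH domain ID layout: four-character PDB code, one chain character, two-digit domain number):

```python
# Sketch: split a CATH domain ID into its PDB code, chain and domain number.
# Assumes the standard seven-character layout, e.g. "4ug4D01".
def parse_domain_id(domain_id):
    if len(domain_id) != 7:
        raise ValueError("expected a 7-character CATH domain ID: %r" % domain_id)
    pdb_code = domain_id[:4]
    chain = domain_id[4]
    domain_num = int(domain_id[5:])
    return pdb_code, chain, domain_num

print(parse_domain_id("4ug4D01"))  # -> ('4ug4', 'D', 1)
```

From the PDB code and chain you can then locate the parent structure in the PDB or in DomChop.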
My observation:
Symbols explained:
To-do:
@CATH-summer-2017/all
This list looks really useful @shouldsee. Having the CATHSOLID ids as well as the domain length and resolution is great.
In terms of starting to look through the list, it would be helpful to group these really bad nDOPE scores by superfamily and identify which superfamilies have the highest proportion of badly scoring S35 reps. For example, it would be good to know the superfamily ID, the number of S35 reps in the superfamily, the number of S35 reps with bad nDOPE scores (> 1), and the proportion of the superfamily's S35 reps with bad scores.
Once these proportions have been ordered so that the superfamilies with the highest proportion of bad scoring S35 reps are at the top of the list, one method of analysis could then be to look into these 'worst' superfamilies first.
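The grouping and ranking described above could be sketched in plain Python (a sketch only; the record layout and field names here are placeholders for whatever columns the real list carries):

```python
from collections import defaultdict

# Sketch: given (domain_id, superfamily_id, ndope) records, count S35 reps per
# superfamily, count how many score badly (nDOPE > 1.0), and rank superfamilies
# by the proportion of badly scoring reps, worst first.
def rank_superfamilies(records, threshold=1.0):
    total = defaultdict(int)
    bad = defaultdict(int)
    for domain_id, sfam, ndope in records:
        total[sfam] += 1
        if ndope > threshold:
            bad[sfam] += 1
    rows = [(sfam, total[sfam], bad[sfam], bad[sfam] / float(total[sfam]))
            for sfam in total]
    rows.sort(key=lambda row: row[3], reverse=True)  # worst superfamilies first
    return rows

# Toy data, not real CATH entries:
records = [("d1", "1.10.10.10", 1.5), ("d2", "1.10.10.10", 0.2),
           ("d3", "3.40.50.300", 2.1)]
for sfam, n_reps, n_bad, frac in rank_superfamilies(records):
    print(sfam, n_reps, n_bad, round(frac, 2))
```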
@nataliedawson Glad you find it useful. Your suggestion sounds like basic aggregation of the data. I will be happy to implement it in the near future, but for now I will focus on writing SQL-like operations in object-oriented Python. The list was generated earlier with a mixture of bash and SQL scripts, but I need to rewrite them in Python to render the database as structured HTML.
Also, as pointed out by @ilsenatorov, we do not understand exactly how nDOPE works and how it handles multiple fragments. This makes it hard to make sense of the nDOPE scores of some structures (tagged with "why?"). It would therefore be useful to establish some general routines for interpreting the nDOPE score. This task, nevertheless, should become easier with a better data browser.
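One routine that might help with the multiple-fragments question is a quick chain-break check on a domain's ATOM records. This is a sketch only: it estimates fragments from gaps in CA residue numbering, which is a rough heuristic and not Modeller's own definition of a fragment.

```python
# Sketch: estimate the number of backbone fragments in a PDB chain by counting
# gaps in the residue numbering of CA atoms (rough heuristic only).
def count_fragments(pdb_lines):
    resnums = []
    for line in pdb_lines:
        # Standard PDB columns: atom name in 13-16, residue number in 23-26.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resnums.append(int(line[22:26]))
    if not resnums:
        return 0
    fragments = 1
    for prev, cur in zip(resnums, resnums[1:]):
        if cur - prev > 1:  # numbering gap => likely chain break
            fragments += 1
    return fragments

# Synthetic ATOM records: residues 1-3 and 10-11, i.e. two fragments.
lines = ["ATOM      1  CA  ALA A%4d      0.000   0.000   0.000" % i
         for i in (1, 2, 3, 10, 11)]
print(count_fragments(lines))  # -> 2
```

A domain with more than one fragment by this count would be a candidate for the "why?" tag.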
This will be a good project to practice TDD (Test-driven development) and collaborative coding.
A starting point may be score_modeller.py. We also need to figure out how to install Modeller automatically in order to automate build tests.
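For the automated install, one possible route is the conda package from the Sali Lab channel; this is a sketch only, assuming conda is available and you have a valid Modeller license key from salilab.org (the key below is a placeholder):

```shell
# Sketch: unattended Modeller install via the salilab conda channel.
# The package reads the license key from the KEY_MODELLER environment variable.
export KEY_MODELLER="your-license-key"   # placeholder, not a real key
conda install -y -c salilab modeller
# Smoke test: the import should succeed once the key is accepted.
python -c "import modeller"
```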