CATH-summer-2017 / domchop


Integrate DOPE from modeller to help with the domchop process. #1

Open shouldsee opened 7 years ago

shouldsee commented 7 years ago

This will be a good project to practice TDD (Test-driven development) and collaborative coding.

A starting point may be score_modeller.py. We also need to figure out how to install Modeller automatically so that we can automate the build tests.

shouldsee commented 7 years ago

From Natalie,

Hi all,

Earlier we were discussing whether or not we thought a particular protein chain should be a whole-chain domain because it is borderline in terms of how well packed it is (e.g. http://update.cathdb.info/cgi-bin/DomChop.pl?chain_id=4ug4D).

I suggested writing a python script to evaluate how well packed a particular protein chain/domain is using the Modeller python modules to calculate a score (i.e. a DOPE score).

In such cases where we aren't sure whether a domain is packed well enough to be chopped, it would be great to calculate the normalised DOPE (nDOPE) score and z-score using Modeller's python libraries, which will quantify how well packed the structure is.

The nDOPE score is typically used to assess the quality of structural models built through homology-modelling methods (such as MODELLER), as it assesses how native-like a particular PDB structure is. However, we can also use it in this context to look at packing.

More details can be found in this manual regarding calculating the nDOPE and z-scores: https://salilab.org/modeller/9.15/manual.pdf

Also, this python script would be very useful in picking out domains already in CATH that are un-packed and that should probably be removed from the CATH classification. This would tie in nicely with the work that Ian mentioned on looking for errors in CATH superfamilies.

Thanks, Natalie

shouldsee commented 7 years ago

Example: examples/assessment/assess_normalized_dope.py

# Example for: model.assess_normalized_dope()
from modeller import *
from modeller.scripts import complete_pdb
env = environ()
env.libs.topology.read(file='$(LIB)/top_heav.lib')
env.libs.parameters.read(file='$(LIB)/par.lib')
# Read a model previously generated by Modeller's automodel class
mdl = complete_pdb(env, '../atom_files/1fdx.B99990001.pdb')
zscore = mdl.assess_normalized_dope()
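
For reference, here is a minimal sketch of how those same calls could be wrapped to score an arbitrary domain PDB from the command line. The function name and the command-line handling are illustrative assumptions, not the actual score_modeller.py:

# Sketch of a command-line wrapper around the manual's example.
# The function name and argument handling are assumptions for illustration.
import sys
from modeller import *
from modeller.scripts import complete_pdb

def normalized_dope(pdb_path):
    """Return the nDOPE z-score for a single PDB/domain file."""
    env = environ()
    env.libs.topology.read(file='$(LIB)/top_heav.lib')
    env.libs.parameters.read(file='$(LIB)/par.lib')
    # complete_pdb fills in missing atoms, so raw CATH domain files can be scored
    mdl = complete_pdb(env, pdb_path)
    return mdl.assess_normalized_dope()

if __name__ == '__main__':
    print(normalized_dope(sys.argv[1]))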
sillitoe commented 7 years ago

Sounds great - how do you want me to run the tests?

Nothing obvious in the README, so I tried:

$ git clone https://github.com/CATH-summer-2017/domchop.git
$ cd domchop
$ git checkout nDOPE
$ ./run_tests.sh

and got:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/cath/homes2/ucbcisi/git/domchop/tests/toy_test.py", line 5, in <module>
    from toy import *
shouldsee commented 7 years ago

@sillitoe

Sounds great - how do you want me to run the tests?

Nothing obvious in the README, so I tried:

I have updated the tests and the README.md. Running ./run_tests.sh should work better now. Please make sure Modeller is configured as described in codes/README.md.

BTW, the traceback you posted seems incomplete; I hope the error has gone away by now.
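
Once Modeller is configured, a test along these lines could also guard the scoring code itself. This is a hypothetical pytest-style sketch; the module, function and data file names are assumptions, not the actual tests in the repo:

# Hypothetical pytest-style test; module, function and file names are assumptions.
import pytest

pytest.importorskip('modeller')  # skip cleanly on machines without Modeller

from score_modeller import normalized_dope  # assumed wrapper, as sketched above

def test_well_packed_domain_scores_below_cutoff():
    # A compact, well-packed domain should sit well below the nDOPE > 1.0 cut-off
    assert normalized_dope('tests/data/example_domain.pdb') < 1.0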

shouldsee commented 7 years ago

I was able to put together a list of CATH domain entries with an nDOPE score > 1.0. The latest cath-domain-pdb-S35.tgz, containing 21155 structures (#7), was used in this test. Please check whether there is anything interesting.
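
For context, the list was produced with roughly this kind of loop. This is only a sketch reusing the hypothetical normalized_dope wrapper from earlier; the directory layout and output file are assumptions:

# Sketch of batch-scoring the unpacked S35 domain PDBs and keeping nDOPE > 1.0.
# Directory name, output file and the wrapper module are assumptions.
import glob
import os
from score_modeller import normalized_dope  # hypothetical wrapper, sketched above

with open('high_ndope.tsv', 'w') as out:
    for pdb in glob.glob('dompdb/*'):    # unpacked cath-domain-pdb-S35.tgz
        domain_id = os.path.basename(pdb)
        try:
            score = normalized_dope(pdb)
        except Exception:
            continue                     # skip files Modeller cannot read
        if score > 1.0:
            out.write('%s\t%.2f\n' % (domain_id, score))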

To look up the structure for a given domain ID, you can do one of the following:

  1. Download the latest S35 PDB collection from the CATH download page. You need to unzip it before you can browse it.
  2. Search CATH for a particular domain ID without downloading.

My observations:

  1. Many unpacked helices in Class 1.
  2. Non-sanitised PDB files with multiple overlapping structures give very high nDOPE scores (4.0 to 6.0).
  3. Entries with a larger S35 index are often more interesting.
  4. Entries with a larger domain length (domain_len) are often more interesting.

Symbols explained:

  1. "?": unclear reason for the high nDOPE score
  2. "!": noteworthy case with a high nDOPE score
  3. extrusion, loop: the structure contains an insertion that looks unstable
  4. fragmented: the structure is very discontinuous, with break points
  5. separate: the structure contains separable domains

To-do:

  1. Sort the list. (What do you think is the best way to sort?)
    • maybe sort by domain length and then S35 index (see the sketch below)
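
If we settle on that ordering, the sort itself is small. A sketch, where the dictionary keys are assumptions about how each parsed row is represented:

# Sketch: sort parsed entries by domain length, then S35 index (both descending,
# since larger values tended to be more interesting in the observations above).
# The dictionary keys are assumptions about the list's columns.
entries = [
    {'domain_id': '1abcA01', 'domain_len': 250, 's35_idx': 12, 'ndope': 1.4},
    {'domain_id': '2xyzB02', 'domain_len': 90, 's35_idx': 3, 'ndope': 2.1},
]
entries.sort(key=lambda e: (-e['domain_len'], -e['s35_idx']))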

@CATH-summer-2017/all

nataliedawson commented 7 years ago

This list looks really useful @shouldsee. Having the CATHSOLID ids as well as the domain length and resolution is great.

In terms of starting to look through the list, it would be helpful to group these really bad nDOPE scores by superfamily and identify which superfamilies have the highest proportion of badly scoring S35 reps. For example, it would be good to know the superfamily ID, the number of S35 reps in the superfamily, the number of those reps with bad nDOPE scores > 1, and then the proportion of the superfamily's S35 reps with bad scores.

Once these proportions have been ordered so that the superfamilies with the highest proportion of bad scoring S35 reps are at the top of the list, one method of analysis could then be to look into these 'worst' superfamilies first.
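
One way to get those proportions, assuming the scores are exported to a tab-separated file and that pandas is acceptable here (the file name and column names are assumptions about the list's format):

# Sketch: per-superfamily proportion of S35 reps with nDOPE > 1.0.
# File name and column names are assumptions about the exported list.
import pandas as pd

reps = pd.read_csv('s35_ndope_scores.tsv', sep='\t')  # one row per S35 rep
# Superfamily ID is the first four fields of the CATHSOLID code (C.A.T.H)
reps['superfamily'] = reps['cathsolid'].str.split('.').str[:4].str.join('.')

grouped = reps.groupby('superfamily')['ndope']
summary = pd.DataFrame({
    'n_s35': grouped.size(),
    'n_bad': grouped.apply(lambda s: (s > 1.0).sum()),
})
summary['bad_fraction'] = summary['n_bad'] / summary['n_s35']
print(summary.sort_values('bad_fraction', ascending=False).head(20))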

shouldsee commented 7 years ago

@nataliedawson Glad you find it useful. Your suggestion sounds like basic aggregation of the data. I will be happy to implement it in the near future, but for now I am focusing on writing SQL-like operations in object-oriented Python. The list was generated earlier with a mixture of bash and SQL scripts, and I need to rewrite them in Python to render the database as structured HTML.

Also, as pointed out by @ilsenatorov, we do not understand exactly how nDOPE works and how it handles multiple fragments. This makes it hard to interpret the nDOPE scores of some structures (tagged with "?"). Thus it would be useful to establish some general routines for making sense of the nDOPE score. This task should nevertheless become easier with a better data browser.
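
As a rough idea of what the SQL-to-Python rewrite might look like (purely illustrative; the helper names and record fields are assumptions), a GROUP BY can be expressed with a dictionary of lists, and the result dumped as an HTML table using only the standard library:

# Sketch: SQL-like GROUP BY in plain Python plus a minimal HTML table dump.
# Helper names and field names are assumptions for illustration.
from collections import defaultdict

def group_by(rows, key):
    """Rough equivalent of SQL GROUP BY: map each key value to its rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups

def to_html_table(rows, columns):
    """Render a list of dicts as a bare HTML table."""
    head = ''.join('<th>%s</th>' % c for c in columns)
    body = ''.join(
        '<tr>%s</tr>' % ''.join('<td>%s</td>' % row[c] for c in columns)
        for row in rows)
    return '<table><tr>%s</tr>%s</table>' % (head, body)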