choderalab / ensembler

Automated omics-scale protein modeling and simulation setup.
http://ensembler.readthedocs.io/
GNU General Public License v2.0

Add an option to specify (or "pin") the protonation states / variants of specified residues #35

Open jchodera opened 9 years ago

jchodera commented 9 years ago

@sonyahanson would like to select the protonation states of some residues (such as the DFG Asp in kinases).

Currently, the code takes the model.pdb.gz of the highest-sequence-identity model as a reference and uses its residue variants to assign the protonation states of all residues in all models.

Altering these manually therefore means editing the model.pdb.gz of the highest-sequence-identity model(s) after the modelling stage to change residue names as needed (e.g. change ASP to ASH to select the protonated aspartic acid).
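As a rough illustration of the manual edit described above, the sketch below renames one residue in the ATOM/HETATM records of a PDB file. The function name `pin_residue_variant` and its arguments are hypothetical and not part of ensembler; the only thing taken as given is the standard fixed-column PDB layout (residue name in columns 18-20, chain ID in column 22, residue number in columns 23-26).

```python
def pin_residue_variant(pdb_lines, resseq, new_resname, chain_id=None):
    """Return a copy of pdb_lines with one residue renamed.

    Hypothetical helper, not an ensembler function.
    resseq is the residue number as written in the PDB file (columns 23-26),
    which is not necessarily the 0-based index into the target sequence.
    """
    if len(new_resname) != 3:
        raise ValueError("residue names must be exactly three characters")
    edited = []
    for line in pdb_lines:
        is_atom_record = line.startswith(("ATOM  ", "HETATM")) and len(line) >= 26
        if is_atom_record:
            same_residue = line[22:26].strip() == str(resseq)
            same_chain = chain_id is None or line[21] == chain_id
            if same_residue and same_chain:
                # Columns 18-20 (string indices 17..19) hold the residue name.
                line = line[:17] + new_resname + line[20:]
        edited.append(line)
    return edited
```

To apply this to a compressed model, the lines could be read with `gzip.open(path, "rt")` and written back the same way with mode `"wt"`. This only changes the residue name; it does not add or remove any atoms.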

In future, it would be useful to allow these alterations to be specified via the command line, a YAML file, or something similar.

danielparton commented 9 years ago

Sure, shouldn't be too much work. My plan is to implement a "project_settings.yaml" file to contain this sort of info. This would go in the project top-level directory, and would be created by ensembler init. It would also be a good way to specify pH, since the default is 7.0 and we will want to use 8.0 for many of our projects. I'm thinking of the following format:

all_targets:
    pH: 8.0
per_target:
    DDR1_HUMAN_D0_PROTONATED:
        custom_residue_variants:
            # keyed by 0-based residue index
            35: ASH
    CSK21_HUMAN_D0:
        # this value would take priority over the value in "all_targets" (just for illustration)
        pH: 6.5
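
To make the intended priority rule concrete, here is a minimal sketch of how per-target values could override the "all_targets" defaults. It assumes the file has already been parsed into a plain dict (for example with PyYAML's `yaml.safe_load`, which would read the key `35` as an integer); the function name `resolve_target_settings` is hypothetical, and nothing here is existing ensembler code.

```python
def resolve_target_settings(settings, target_id):
    """Merge the defaults in "all_targets" with any per-target overrides.

    Hypothetical helper: per-target keys replace the defaults wholesale
    (a shallow merge), and unknown targets simply get the defaults.
    """
    resolved = dict(settings.get("all_targets") or {})
    overrides = (settings.get("per_target") or {}).get(target_id) or {}
    resolved.update(overrides)
    return resolved


# The proposed file, written out as the dict a YAML parser would return.
example_settings = {
    "all_targets": {"pH": 8.0},
    "per_target": {
        "DDR1_HUMAN_D0_PROTONATED": {"custom_residue_variants": {35: "ASH"}},
        "CSK21_HUMAN_D0": {"pH": 6.5},
    },
}
```

With this rule, `resolve_target_settings(example_settings, "CSK21_HUMAN_D0")` gives a pH of 6.5, while DDR1_HUMAN_D0_PROTONATED keeps the project-wide pH of 8.0 and gains its pinned residue variant.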

jchodera commented 9 years ago

Nice!

Any chance we could use UniProt residue numbering instead? Or is that just a pain?

danielparton commented 9 years ago

What if the target is not a standard UniProt sequence?

jchodera commented 9 years ago

How are nonstandard sequences specified now? FASTA file? Is there some provision for numbering in those? Or must they be zero indexed?

danielparton commented 9 years ago

Yes, currently nonstandard sequences would have to be added manually in a FASTA file, which does not allow for residue numbering.

There is a script for outputting topologies with residue numbers according to UniProt (the targetid must match a UniProt entry name) - it would be simple to modify this to accept a custom numbering scheme. If you just wanted to ensure these residue numbers are used for F@h projects, this might be the quickest/simplest approach.
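
For illustration only, the relationship between 0-based residue indices and a numbering scheme (an offset into a UniProt sequence, or a fully custom list) could be sketched as follows. Both function names are hypothetical and this is not the existing renumbering script.

```python
def residue_numbers(n_residues, first_number=1, custom_numbers=None):
    """Return the residue number assigned to each 0-based residue index.

    Hypothetical helper. With custom_numbers, that list is used as given;
    otherwise residues are numbered consecutively from first_number
    (e.g. the UniProt position of the first residue of a domain).
    """
    if custom_numbers is not None:
        if len(custom_numbers) != n_residues:
            raise ValueError("custom_numbers must have one entry per residue")
        return list(custom_numbers)
    return [first_number + i for i in range(n_residues)]


def index_for_number(numbers, residue_number):
    """Map a residue number back to its 0-based index (ValueError if absent)."""
    return numbers.index(residue_number)
```

A user could then pin a residue by its UniProt number and have it translated to the 0-based index that the settings file is keyed by.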

jchodera commented 9 years ago

Is there a standard besides FASTA?

I wonder if a SEQRES block (with the additional DBREF information to get numbering) would be more general.

Both the UniProt and FASTA sequence ingestion schemes could in principle generate this format, and it would be easier for users to modify both the numbering and the three-letter residue codes.
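
As a sketch of what generating such a block from a one-letter sequence might look like, the code below formats SEQRES records using the standard PDB column layout (13 residues per record). The function name `seqres_records` and the `overrides` argument are hypothetical; DBREF records are not generated here, and a variant name such as ASH is a force-field convention, not a standard PDB residue code.

```python
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def seqres_records(sequence, chain_id="A", overrides=None):
    """Format a one-letter sequence as PDB SEQRES records.

    Hypothetical helper. overrides maps 0-based residue indices to
    three-letter names, e.g. {35: "ASH"} to pin a protonated aspartate.
    """
    names = [ONE_TO_THREE[aa] for aa in sequence.upper()]
    for index, name in (overrides or {}).items():
        names[index] = name
    records = []
    for serial, start in enumerate(range(0, len(names), 13), start=1):
        chunk = names[start:start + 13]
        # Columns: 1-6 "SEQRES", 8-10 serial, 12 chain, 14-17 residue count,
        # then residue names starting at column 20, four columns apart.
        records.append(
            "SEQRES %3d %s %4d  %s" % (serial, chain_id, len(names), " ".join(chunk))
        )
    return records
```

Because each residue appears as an editable three-letter code, a user could change a single name in the block by hand instead of passing `overrides`.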

danielparton commented 9 years ago

Ok, could you write this as a separate issue?

jchodera commented 9 years ago

> Ok, could you write this as a separate issue?

Done! #39