Closed rafwiewiora closed 8 years ago
Tagging @jchodera
what is called the I-SET domain - that's between the SET domain and the post-SET domain - helix important in binding of the peptide, it 'hugs' it from one side and likely to be important in recognition of particular sequences on histones - does a bunch of unwindings and rewindings in the simulation.
I don't understand this part. It seems like a fragment of something else?
how do we put the notion of 'biologically relevant' conformational changes into what we're getting out here? or should we set up a YANK calculation with a known inhibitor for the most populated states? (here the problem is the C-terminal loop that we lose without the cofactor - so this dataset might only be useful for allosteric binding, or at least any binding not in that cofactor site...)
Are there any configurations that have a fully-formed binding site in the dataset? I think this is something you have to establish by computing some metric for all the frames---maybe a minimum RMSD of the residues that make up just the binding site, or some contacts between binding site residues. If there are no configurations in which the binding site is formed, that means either that the forcefield is doing a poor job of modeling the system or that the free energy cost to form the binding site is enormous (which would seem questionable).
Did we start from any configurations that had the binding site fully formed? If not, can we add some? If so, did they all just come apart?
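A minimal sketch of the contact-based variant of that check, assuming each frame has already been reduced to one representative coordinate per residue (in Å). The function names, the dict-based frame layout, and the 80% threshold are hypothetical choices for illustration, not something from the pipeline; in practice the coordinates would come from the trajectory files and the contact list from a holo crystal structure.

```python
import math

def fraction_contacts_formed(frame, contact_pairs, cutoff=4.0):
    """Fraction of reference binding-site contacts present in one frame.

    frame: dict mapping residue id -> (x, y, z) in Angstrom (assumed layout).
    contact_pairs: list of (res_i, res_j) pairs in contact in a reference structure.
    """
    formed = sum(
        1 for i, j in contact_pairs
        if math.dist(frame[i], frame[j]) <= cutoff
    )
    return formed / len(contact_pairs)

def frames_with_formed_site(frames, contact_pairs, cutoff=4.0, min_fraction=0.8):
    """Indices of frames in which at least min_fraction of the contacts are formed."""
    return [
        idx for idx, frame in enumerate(frames)
        if fraction_contacts_formed(frame, contact_pairs, cutoff) >= min_fraction
    ]

# Toy data: residue 2 moves away in the second frame, breaking one of two contacts.
toy_frames = [
    {1: (0.0, 0.0, 0.0), 2: (3.0, 0.0, 0.0), 3: (0.0, 3.0, 0.0)},
    {1: (0.0, 0.0, 0.0), 2: (10.0, 0.0, 0.0), 3: (0.0, 3.0, 0.0)},
]
toy_pairs = [(1, 2), (1, 3)]
formed_frames = frames_with_formed_site(toy_frames, toy_pairs)
```

If this returns an empty list for every trajectory, that is the situation described above: either the site never forms in the simulations or the contact definition needs revisiting.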
how do we rank the MSMs we get out - been reading on the GMRQ stuff and we've planned to work through those papers with @maxentile while in Boston when we have time
Our basic plan is:
next TO DOs: exploring clustering approaches, minRMSD clustering, K Medoids clustering of the min res distance
Definitely give minRMSD a go. You can use the code from here as an example, playing with the generator ratio and maybe trying minibatch K-medoids clustering in minRMSD instead of equitemporal generator selection as well.
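For orientation, here is a rough sketch of plain alternating K-medoids with a pluggable distance function. It is not the minibatch variant and not the msmbuilder implementation; `k_medoids` and its parameters are hypothetical names, and the absolute difference used in the toy example only stands in for a real minRMSD between conformations.

```python
import random

def k_medoids(items, k, distance, n_iter=20, seed=0):
    """Alternate between assigning items to medoids and re-picking medoids.

    distance(a, b) is any pairwise metric (a stand-in for minRMSD here).
    Returns (medoid indices, cluster label per item).
    """
    rng = random.Random(seed)
    medoids = rng.sample(range(len(items)), k)  # random initial generators

    def assign(current):
        # Label each item with the index of its closest medoid.
        return [
            min(range(k), key=lambda c: distance(item, items[current[c]]))
            for item in items
        ]

    for _ in range(n_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # The new medoid minimizes the summed distance to its cluster members.
            best = min(
                members,
                key=lambda m: sum(distance(items[m], items[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)

# Toy 1-D example with two well-separated groups.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
toy_medoids, toy_labels = k_medoids(points, 2, lambda a, b: abs(a - b))
```

Unlike equitemporal generator selection, the medoids here move toward the centre of each cluster, at the cost of many more distance evaluations; that cost is what a minibatch scheme is meant to reduce.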
MOST IMPORTANT THING RIGHT NOW: trying to make one more figure for the poster, something with small molecule == drug design, maybe a quick YANK run? or I could run an allosteric site finder? -- or at least an idea for going forward - how do we take the ready stationary distribution of the MSM and go ahead with calculating the free energies for known inhibitors? Should sketch a more detailed plan for those stages.
Running an allosteric site finder on a few configurations sampled from each lumped metastable state with low free energy would be neat. Greg Bowman will be at the meeting, so alternatively, you can ask him what tool he uses to do this and just pitch that step as something you want to do. Maybe we should even collaborate with him on that!
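Drawing a few configurations from each lumped state could look roughly like the sketch below. It assumes the discrete trajectories are plain lists of macrostate labels, one list per trajectory; `sample_frames_per_state` is a hypothetical helper, not something from msmbuilder or pyemma.

```python
import random

def sample_frames_per_state(dtrajs, n_per_state, seed=0):
    """Pick up to n_per_state random (trajectory index, frame index) pairs per state."""
    rng = random.Random(seed)
    frames_by_state = {}
    for traj_idx, dtraj in enumerate(dtrajs):
        for frame_idx, state in enumerate(dtraj):
            frames_by_state.setdefault(state, []).append((traj_idx, frame_idx))
    samples = {}
    for state, frames in frames_by_state.items():
        # Sample without replacement; small states contribute all their frames.
        samples[state] = rng.sample(frames, min(n_per_state, len(frames)))
    return samples

# Toy discrete trajectories over three macrostates.
toy_dtrajs = [[0, 0, 1, 1, 2], [2, 2, 0]]
toy_samples = sample_frames_per_state(toy_dtrajs, n_per_state=2)
```

The returned index pairs can then be used to pull the matching frames out of the trajectory files before handing them to whichever site finder is chosen.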
I would focus on trying to figure out how many metastable states > 50 ns there are, and what the ladder of free energies looks like. I'm not sure if @maxentile has scripts to do this already---that was supposed to be the real deliverable from the MSM pipeline he was engineering, but he may not have finished this yet.
Extract a lumped model with a number of metastable states determined in the previous step. I actually like the MCSA (Monte Carlo simulated annealing) scheme I used in my PhD thesis---@maxentile has implemented this, but I don't think he has yet contributed it back to msmbuilder or pyemma (which is something we need to do).
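The "ladder of free energies" in the first step follows directly from the stationary populations of the metastable states via F_i = -kT ln(pi_i). A minimal sketch, assuming the populations are already available as a list of positive numbers; the function name and the choice to shift the lowest state to zero are illustrative assumptions.

```python
import math

def free_energy_ladder(populations, kT=1.0):
    """Relative free energies F_i = -kT * ln(pi_i), shifted so the lowest is zero.

    populations: positive stationary populations, one per metastable state.
    kT: thermal energy; 1.0 gives results in units of kT
        (about 0.596 kcal/mol at 300 K).
    Returns (state index, free energy) pairs sorted from lowest to highest.
    """
    total = sum(populations)
    energies = [-kT * math.log(p / total) for p in populations]
    lowest = min(energies)
    return sorted(
        ((state, f - lowest) for state, f in enumerate(energies)),
        key=lambda pair: pair[1],
    )

# Toy populations for three states.
toy_ladder = free_energy_ladder([0.25, 0.5, 0.25])
```

Sorting by free energy makes it easy to see which states are low enough to be worth sampling for follow-up calculations.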
Yep! I re-implemented this method in Python here, along with some of the additional metastability objectives suggested in your paper: https://github.com/maxentile/automatic-state-decomposition/blob/master/decompose-py/lumping.py
If we'd like to incorporate this into msmbuilder / pyemma I'll probably want to make sure the default hyperparameters are sensible (I've made no effort to optimize the MCSA annealing schedule yet, for example).
John, is there anything that needs to be added or modified before I contribute this?
I would focus on trying to figure out how many metastable states > 50 ns there are, and what the ladder of free energies looks like. I'm not sure if @maxentile has scripts to do this already---that was supposed to be the real deliverable from the MSM pipeline he was engineering, but he may not have finished this yet.
To estimate the number of macrostates more metastable than some threshold, does something like this suffice? sum(msm.timescales()>metastability_threshold)
That's what's currently in the pipeline, with a default threshold of 100ns: https://github.com/maxentile/msm-pipeline/blob/master/generate_report.py#L97-L98
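Spelled out as a standalone sketch, the counting step is just a threshold on the implied timescales. Here the timescales are assumed to be a plain list already converted to nanoseconds; the function name is hypothetical and the toy values are made up for illustration.

```python
def count_slow_processes(timescales_ns, threshold_ns=100.0):
    """Number of implied timescales slower than the metastability threshold."""
    return sum(1 for t in timescales_ns if t > threshold_ns)

# Made-up implied timescales in ns, slowest first.
toy_timescales = [850.0, 320.0, 120.0, 60.0, 12.0]
n_slow_100 = count_slow_processes(toy_timescales)        # 100 ns default
n_slow_50 = count_slow_processes(toy_timescales, 50.0)   # 50 ns threshold
```

A common convention is that m slow processes separate m + 1 metastable states, so it is worth checking whether the count is used directly or incremented by one downstream. Units matter too: the threshold and the timescales have to be in the same unit (frames or ns) for the comparison to mean anything.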
If we'd like to incorporate this into msmbuilder / pyemma I'll probably want to make sure the default hyperparameters are sensible (I've made no effort to optimize the MCSA annealing schedule yet, for example).
I think we should be adding things into msmbuilder/pyemma as a matter of course, even for research tools! This scheme was also published, so we can simply implement the same hyperparameters that were used in the paper if we wanted sensible defaults.
John, is there anything that needs to be added or modified before I contribute this?
I'd clean it up to match the implementation scheme described in the paper.
Note that your definition of "metastability" here is at odds with the literature definition (and the docstring and code don't match) because you normalize the metastability. I would keep the literature definition and not normalize it.
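To make the distinction concrete: the unnormalized metastability of a lumping is usually taken as the sum of the self-transition probabilities of the coarse-grained transition matrix, so it ranges from 0 to the number of macrostates. The sketch below assumes a microstate count matrix given as nested lists and a simple row-normalized estimate of the lumped transition matrix; the function names are hypothetical and this is not the code in lumping.py.

```python
def lump_counts(counts, assignment, n_macro):
    """Sum microstate transition counts into macrostate transition counts."""
    lumped = [[0.0] * n_macro for _ in range(n_macro)]
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            lumped[assignment[i]][assignment[j]] += c
    return lumped

def metastability(counts, assignment, n_macro):
    """Unnormalized metastability Q = sum_a T_aa of the lumped model.

    T is estimated by row-normalizing the lumped counts (an assumption made
    for this sketch; a reversible estimator could be used instead).
    """
    lumped = lump_counts(counts, assignment, n_macro)
    q = 0.0
    for a in range(n_macro):
        row_sum = sum(lumped[a])
        if row_sum > 0:
            q += lumped[a][a] / row_sum
    return q

# Toy counts: microstates 0-1 and 2-3 interconvert quickly, the two pairs slowly.
toy_counts = [
    [80, 18, 2, 0],
    [18, 80, 0, 2],
    [2, 0, 80, 18],
    [0, 2, 18, 80],
]
q_good = metastability(toy_counts, [0, 0, 1, 1], 2)  # kinetically sensible lumping
q_bad = metastability(toy_counts, [0, 1, 0, 1], 2)   # mixes fast and slow pairs
```

Dividing Q by the number of macrostates gives the normalized variant; it makes models with different numbers of states look comparable, but it no longer matches the definition in the papers.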
That's what's currently in the pipeline, with a default threshold of 100ns:
Thanks, @maxentile!
@rafwiewiora: If you're looking to make some plots, you can probably use these functions that @maxentile has assembled.
Analyzing here the dataset as it is in munged-with-time right now - some quick notes I made on trajectory lengths:
closest-heavy residue min. distance featurization done - reduced to 1000 dimensions using @maxentile's script that chooses the top 1000 distances crossing the 4 Å cutoff. Then tICA with lag time 50 frames.
sincos=True featurization, tICA with lag 50 frames and 400 frames (100 ns).
Thanks to @maxentile for helping me out a lot with this! Here are notebooks with work so far and links to .dcd's on HAL for structures.
Implied timescales for 500 clusters:
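For reference, implied timescales come from the transition-matrix eigenvalues via t_i = -tau / ln|lambda_i(tau)|. A minimal sketch, assuming the eigenvalues are already in hand as a list; the 0.25 ns per frame conversion is inferred from "400 frames (100 ns)" in this thread, and the function name is hypothetical.

```python
import math

NS_PER_FRAME = 100.0 / 400.0  # inferred from 400 frames == 100 ns

def implied_timescales(eigenvalues, lag_frames, ns_per_frame=NS_PER_FRAME):
    """Implied timescales in ns, t_i = -tau / ln|lambda_i|, slowest first.

    Eigenvalues with magnitude 1 (the stationary process) or 0 are skipped,
    since their timescales are infinite or undefined.
    """
    lag_ns = lag_frames * ns_per_frame
    timescales = []
    for lam in eigenvalues:
        magnitude = abs(lam)
        if 0.0 < magnitude < 1.0:
            timescales.append(-lag_ns / math.log(magnitude))
    return sorted(timescales, reverse=True)

# Toy eigenvalues: the stationary one plus two relaxation processes.
toy_eigenvalues = [1.0, math.exp(-0.1), math.exp(-1.0)]
toy_its = implied_timescales(toy_eigenvalues, lag_frames=400)
```

Repeating this over a range of lag times and looking for where the slowest timescales level off is the usual way to justify a lag-time choice.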
went with lag time 400 frames // 100 ns - looked at the 10 slowest relaxation processes - left eigenvectors plotted in the notebook - and example frames for the argmax and argmin states for the 10 processes (argmax first, then argmin) are here:
(and the topology:
)
and 1 frame for each of the 473ish active states:
here's where one of my biggest understanding problems now begins - I'm looking at those states and I can see things are changing, but how do we proceed in inputting 'biological relevance' into this??
Some general observations: the N-terminal helix is pretty flexible and the small C-terminal helix is very flexible. That C-terminal helix in fact makes up 1/3 of the SAM binding domain, but it appears to largely adopt the conformation where it encloses the SAM from one side only in the presence of SAM (NMR data estimated a 1 / sec timescale for going into that cofactor-enclosing conformation without the cofactor present). So:
core of the construct is a few Beta sheets - there's a bunch of interesting moves happening - check out the trajectories on the links above
what is called the I-SET domain - that's between the SET domain and the post-SET domain - helix important in binding of the peptide, it 'hugs' it from one side and likely to be important in recognition of particular sequences on histones - does a bunch of unwindings and rewindings in the simulation.
I also checked out coarse-graining of the MSM into an HMM - that only worked ok starting with a 100-state MSM, coarse-grained to 3 states - notebook: https://github.com/rafwiewiora/pimento/blob/new_sims_for_setd8/MSM/SETD8/dtrajs100_unitime_msm.ipynb
here are samples from the 3 states - 10 per state:
and again my question is - what's next? what's biologically relevant?
Ok so that's the update for now. What is in my interest zone now is:
MOST IMPORTANT THING RIGHT NOW: trying to make one more figure for the poster, something with small molecule == drug design, maybe a quick YANK run? or I could run an allosteric site finder? -- or at least an idea for going forward - how do we take the ready stationary distribution of the MSM and go ahead with calculating the free energies for known inhibitors? Should sketch a more detailed plan for those stages.