SETD8 MSM discussion - Githubissues

rafwiewiora commented 8 years ago

Analyzing here the dataset as is in the munged-with-time right now - some quick notes I made on trajectory lengths:

That’s 969 clones active, 3920256 frames total, /4: 980.064 microseconds. 

Longest clone: 5.330 microseconds
Shortest clone: 0.009 microseconds (9 nanoseconds) 

Mean: 1.011 microseconds

400 ns cutoff - how much data do we keep: 926.574 microseconds - pretty good. That’s in 457 trajectories only!! 457 / 969 = 47%!!!

So considering only that at least 400 ns (1600 frames) set - mean is: 2.0275 microseconds.

closest-heavy residue min. distance featurization done - reduced to 1000 dimensions using @maxentile 's script that chooses the top 1000 distances crossing the 4 A cutoff. Then tICA with lag time 50 frames.
phi, psi, chi1 sincos=True featurization, tICA with lag 50 frames and 400 frames (100 ns).
Clustering only done with those Euclidan metrics - minRMSD clustering I did with the bug - TODO: repeat now PyEmma is ok.
So far tried RegTime and KMeans - tried 100 and 500 centers
Generally RegTime gives a larger active set in the MSM (~95%) vs 70-75% KMeans -- that's the active states fraction (but active counts fraction around 95% with Kmeans vs 99% RegTime)

Thanks to @maxentile for helping me out a lot with this! Here's notebooks with work so far and links to .dcd's on HAL for structures

Min. residue distance featurization as described above, regular time clustering - 100 and 500 clusters: https://github.com/rafwiewiora/pimento/blob/new_sims_for_setd8/MSM/SETD8/setd8_analysis.ipynb

Implied timescales for 500 clusters:

went with lag time 400 frames // 100 ns - looked at 10th slowest relaxation processes - left eigenvectors plotted in the notebook - and example frames for the argmax and argmin states for the 10 processes (argmax first, then argmin) are here:

/cbio/jclab/home/rafal.wiewiora/uploads/pimento/MSM/SETD8/resmindist/argmaxmin_20frames.dcd

(and the topology:

/cbio/jclab/home/rafal.wiewiora/uploads/pimento/MSM/SETD8/top.pdb

)

and 1 frame per each 473ish active states:

/cbio/jclab/home/rafal.wiewiora/uploads/pimento/MSM/SETD8/resmindist/all_states.dcd

here's where one of my biggest understanding problems now begins - I'm looking at those states and I can see things are changing, but how do we proceed in inputting 'biological relevance' into this??

Some general observations: N-terminal helix pretty flexible, C-terminal small helix very flexible - that in fact makes 1/3 of the SAM binding domain, but appears to only largely bind in the conformation where that C-terminal helix encloses the SAM from one side only in the presence of SAM (NMR data estimated a 1 / sec timescale for going into that C-terminal helix enclosing the cofactor binding site conformation without the cofactor present.) So:

are we not going to be able to tell anything about the SAM binding site (and that's the binding site for all inhibitors that have been proposed) from this simulation?? We lost 1/3 of the binding site essentially. -- I'm now starting the whole cycle of SETD8 - but do we essentially already see that we gotta have that cofactor or some general 'mimic of a cofactor of drug molecule' or some restraints put on if we want to sample the conformations around the active site?

core of the construct is a few Beta sheets - there's a bunch of interesting moves happening - check out the trajectories on the links above

what is called the I-SET domain - that's between the SET domain and the post-SET domain - helix important in binding of the peptide, it 'hugs' it from one side and likely to be important in recognition of particular sequences on histones - does a bunch of unwindings and rewindings in the simulation.

I also checked out coarse-graining of the MSM into HMM - that only worked ok starting with a 100 state-MSM, and coarse grained to 3 states - notebook: https://github.com/rafwiewiora/pimento/blob/new_sims_for_setd8/MSM/SETD8/dtrajs100_unitime_msm.ipynb

here samples from the 3 states - 10 per state:

/cbio/jclab/home/rafal.wiewiora/uploads/pimento/MSM/SETD8/resmindist/10samples_hmm_trajs.dcd

and again my question is - what's next? what's biologically relevant

As for dihedrals: nothing that's working as well as the above yet - here one example that led all the way to a coarse-grained HMM but it's weird - noisely estimating some v. slow process there it seems and one state of 500 has over 60% stationary occupancy - so just mentioning this, nothing very useful yes. Regular time clustering, 500 states, MSM coarse-grained to 4 state HMM. Notebook: https://github.com/rafwiewiora/pimento/blob/new_sims_for_setd8/MSM/SETD8/unitime500.ipynb and samples of the states - 10 per state on HAL:

/cbio/jclab/home/rafal.wiewiora/uploads/pimento/MSM/SETD8/dihedrals/10samples_hmm_structures_unitime.dcd

Ok so that's the update for now. What is in my interest zone now is:

how do we put the notion of 'biologically relevant' conformation changes into what we're getting out here? or should we set up some YANK calculation with a known inhibitor for the most populated states? (here the problem is the C-terminal loop that we loose without the cofactor - so maybe this dataset might only be useful for allosteric binding, or at least any binding not in that cofactor site...)
how to we rank the MSMs we get out - been reading on the GMRQ stuff and we've planned to work through those papers with @maxentile while in Boston when we have time
next TO DOs: exploring clustering approaches, minRMSD clustering, K Medoids clustering of the min res distance

MOST IMPORTANT THING RIGHT NOW: trying to make one more figure for the poster, something with small molecule == drug design, maybe a quick YANK run? or I could run an allosteric site finder? -- or at least an idea for going forward - how do we take the ready stationary distribution of the MSM and go ahead with calculating the free energies for known inhibitors? Should sketch a more detailed plan for those stages.

rafwiewiora commented 8 years ago

Tagging @jchodera

jchodera commented 8 years ago

what is called the I-SET domain - that's between the SET domain and the post-SET domain - helix important in binding of the peptide, it 'hugs' it from one side and likely to be important in recognition of particular sequences on histones - does a bunch of unwindings and rewindings in the simulation.

I don't understand this part. It seems like a fragment of something else?

jchodera commented 8 years ago

how do we put the notion of 'biologically relevant' conformation changes into what we're getting out here? or should we set up some YANK calculation with a known inhibitor for the most populated states? (here the problem is the C-terminal loop that we loose without the cofactor - so maybe this dataset might only be useful for allosteric binding, or at least any binding not in that cofactor site...)

Are there any configurations that have a fully-formed binding site in the dataset? I think this is something you have to establish by computing some metric to all the frames---maybe a minimum RMSD of the residues that make up just the binding site, or some contacts between binding site residues. If there are no residues that make up the binding site, that means that the forcefield is doing a poor job of modeling the system or the free energy cost to form the binding site is enormous (which would seem questionable).

Did we start from any configurations that had the binding site fully formed? If not, can we add some? If so, did they all just come apart?

jchodera commented 8 years ago

how to we rank the MSMs we get out - been reading on the GMRQ stuff and we've planned to work through those papers with @maxentile while in Boston when we have time

Our basic plan is:

Figure out how many eigenvalues (metastable states) have a timescale longer than some cutoff that we won't be able to reach in a free energy calculation. You set your threshold at 400 frames (100 ns), which is actually pretty long since our YANK calculations are usually not quite that long. We may want to set this to tau = 20 ns (80 frames) and then see how many states are slower than ~50 ns since we're unlikely to overcome dynamic interconvertion times longer than that in a single YANK calculation.
Extract a lumped model with a number of metastable states determined in the previous step. I actually like the MCSA (Monte Carlo simulated annealing) scheme I used in my PhD thesis---@maxentile has implemented this, but I don't think he has yet contributed it back to msmbuilder or pyemma (which is something we need to do).
Plot the free energies of these lumped macrostates in order from the lowest to highest. We want to use a threshold of maybe 6 kT or 10 kT---whatever price we think we can afford to pay with a ligand
We then try to run binding free energy calculations starting from structures sampled from these lumped macrostates. We can also do docking/design against them with Perses.

jchodera commented 8 years ago

next TO DOs: exploring clustering approaches, minRMSD clustering, K Medoids clustering of the min res distance

Definitely give minRMSD a go. You can use the code from here as an example, playing with the generator ratio and maybe trying minibatch K-medoids clustering in minRMSD instead of equitemporal generator selection as well.

jchodera commented 8 years ago

MOST IMPORTANT THING RIGHT NOW: trying to make one more figure for the poster, something with small molecule == drug design, maybe a quick YANK run? or I could run an allosteric site finder? -- or at least an idea for going forward - how do we take the ready stationary distribution of the MSM and go ahead with calculating the free energies for known inhibitors? Should sketch a more detailed plan for those stages.

Running an allosteric site finder on a few configurations sampled from each lumped metastable state with low free energy would be neat. Greg Bowman will be at the meeting, so alternatively, you can ask him what tool he uses to do this and just pitch that step as something you want to do. Maybe we should even collaborate with him on that!

I would focus on trying to figure out how many metastable states > 50 ns there are, and what the ladder of free energies looks like. I'm not sure if @maxentile has scripts to do this already---that was supposed to be the real deliverable from the MSM pipeline he was engineering, but he may not have finished this yet.

maxentile commented 8 years ago

Extract a lumped model with a number of metastable states determined in the previous step. I actually like the MCSA (Monte Carlo simulated annealing) scheme I used in my PhD thesis---@maxentile has implemented this, but I don't think he has yet contributed it back to msmbuilder or pyemma (which is something we need to do).

Yep! I re-implemented this method in Python here, along with some of the additional metastability objectives suggested in your paper: https://github.com/maxentile/automatic-state-decomposition/blob/master/decompose-py/lumping.py

If we'd like to incorporate this into msmbuilder / pyemma I'll probably want to make sure the default hyperparameters are sensible (I've made no effort to optimize the MCSA annealing schedule yet, for example).

John, is there anything that needs to be added or modified before I contribute this?

I would focus on trying to figure out how many metastable states > 50 ns there are, and what the ladder of free energies looks like. I'm not sure if @maxentile has scripts to do this already---that was supposed to be the real deliverable from the MSM pipeline he was engineering, but he may not have finished this yet.

To estimate the number of macrostates more metastable than some threshold, does something like this suffice? sum(msm.timescales()>metastability_threshold)

That's what's currently in the pipeline, with a default threshold of 100ns: https://github.com/maxentile/msm-pipeline/blob/master/generate_report.py#L97-L98

jchodera commented 8 years ago

If we'd like to incorporate this into msmbuilder / pyemma I'll probably want to make sure the default hyperparameters are sensible (I've made no effort to optimize the MCSA annealing schedule yet, for example).

I think we should be adding things into msmbuilder/pyemma this as a matter of course, even for research tools! This scheme was also published, so we can simply implement the same hyperparameters that were used in the paper if we wanted sensible defaults.

jchodera commented 8 years ago

John, is there anything that needs to be added or modified before I contribute this?

I'd clean it up to match the implementation scheme described in the paper.

Note that your definition of "metastability" here is at odds with the literature definition (and the docstring and code don't match) because you normalize the metastability. I would keep the literature definition and not normalize it.

jchodera commented 8 years ago

That's what's currently in the pipeline, with a default threshold of 100ns:

Thanks, @maxentile!

@rafwiewiora: If you're looking to make some plots, you can probably use these function that @maxentile has assembled.

choderalab / pimento

SETD8 MSM discussion #10