choderalab / MSMs

Markov state models

Use contact map features? #8


maxentile commented 9 years ago

John relayed Frank's suggestion to use a subset of inter-residue contacts as input features for MSM-building: we would like to use just "the subset of contacts that change during a simulation." I've been looking into this a bit for modeling the data from 10468 (Abl).

tl;dr -- promising!

My biggest question is:

How should we choose the contact threshold?

To determine whether a given inter-residue contact "changes," we binarize each inter-residue distance (1 if it is <= a threshold distance, 0 otherwise) and then select any residue pairs whose contact state is not all 0s or all 1s throughout our simulations. PyEMMA's default threshold for this is 5 angstroms. However, the subset of contacts that appear to be "changing" clearly depends on the threshold you use:
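For concreteness, here is a minimal sketch of that selection procedure using MDTraj and NumPy; `traj` is an assumed mdtraj.Trajectory, and the threshold value is illustrative:

```python
import mdtraj as md
import numpy as np

# Sketch of the "changing contacts" selection described above.
# `traj` is assumed to be an mdtraj.Trajectory loaded elsewhere.
threshold = 0.5  # nm (PyEMMA's default contact threshold is 5 angstroms)

# All inter-residue distances; MDTraj's default scheme is 'closest-heavy'.
distances, residue_pairs = md.compute_contacts(traj, contacts='all')

# Binarize: 1 if the pair is "in contact" (distance <= threshold), else 0.
contacts = (distances <= threshold).astype(int)

# Keep only pairs whose contact state is not constant over all frames
# (i.e. neither all 0s nor all 1s).
changing = np.where(~np.all(contacts == contacts[0], axis=0))[0]
selected_pairs = residue_pairs[changing]
print(len(selected_pairs), 'contacts change at threshold', threshold, 'nm')
```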

[Animation: contact map vs. contact threshold (Abl)]

Note that the set of contacts selected at a threshold x + epsilon is not simply a superset of the contacts selected at a lower threshold x:

[figure]

Here I've just set the threshold to 1.0 nm, since that yields an acceptably small number of features (~4000) that seem to cover the extent of the protein, but I don't know how to choose the threshold in a principled way.

[figure]

Comparing feature sets:

A feature set that does a good job at linearly representing the slow relaxation processes of a system will have a high GMRQ, i.e. a large sum of leading eigenvalues in the tICA solution.

I compared 3 feature sets in terms of their GMRQ for fixed-rank approximation (m=100) of the lag_time=100 propagator. I also computed the maximum cumulative sum of leading eigenvalues.
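As a rough sketch of how this comparison might be scored with PyEMMA (assuming `featurized_trajs` is a list of (n_frames, n_features) arrays for one feature set):

```python
import numpy as np
import pyemma.coordinates as coor

# Sketch of the eigenvalue-sum comparison described above.
# `featurized_trajs` is an assumed list of (n_frames, n_features) arrays.
tica_obj = coor.tica(featurized_trajs, lag=100)
eigenvalues = np.asarray(tica_obj.eigenvalues)

# Fixed-rank (m=100) score: sum of the 100 leading eigenvalues.
m = 100
score_m = eigenvalues[:m].sum()

# Maximum cumulative sum of leading eigenvalues (it stops growing once
# the eigenvalues turn negative).
max_score = np.cumsum(eigenvalues).max()

print('sum of first %d eigenvalues: %.1f' % (m, score_m))
print('maximum cumulative eigenvalue sum: %.1f' % max_score)
```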

Feature set 1: Backbone torsions (all available: ‘phi’, ‘psi’, ‘omega’, ‘chi1’, ‘chi2’, ‘chi3’, ‘chi4’)

The sum of the first 100 eigenvalues = 55.7.

The sum of the first 542 eigenvalues = 76.3.

[eigenvalue plot]

Feature set 2: Inter-residue distances between contacts that change (threshold = 1.0 nm), evaluated on a subset of 10 trajectories.

The sum of the first 100 eigenvalues = 61.9.

The sum of the first 1999 eigenvalues = 92.2.

[eigenvalue plots]

Feature set 3: Kabsch-Sander hydrogen-bond energies (see the sketch below).

The sum of the first 100 eigenvalues = 40.4.

The sum of the first 468 eigenvalues = 48.5.

[eigenvalue plots]
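(For reference, a minimal sketch of how the Kabsch-Sander hydrogen-bond energies in feature set 3 might be turned into a feature matrix with MDTraj; `traj` is an assumed mdtraj.Trajectory and the flattening/filtering choices are illustrative.)

```python
import mdtraj as md
import numpy as np

# Sketch: Kabsch-Sander hydrogen-bond energies as features.
# `traj` is assumed to be an mdtraj.Trajectory.
ks_matrices = md.kabsch_sander(traj)  # one sparse (n_res, n_res) matrix per frame

# Flatten each frame's energy matrix, then drop donor-acceptor pairs whose
# energy is zero in every frame.
dense = np.array([m.toarray().ravel() for m in ks_matrices])
ever_nonzero = np.any(dense != 0.0, axis=0)
features = dense[:, ever_nonzero]
print(features.shape)  # (n_frames, n_hbond_features)
```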

Visualizing the results

From the eigenvalue plots above we can see that the first 2 tICs represent only a tiny fraction of the kinetic behavior of the kinase, for any of these choices of input features -- projecting onto the first 2 tICs is a very lossy representation of the tICA model. Can we create a more trustworthy 2D picture?

Here I'm using the first 100 tICs as inputs to t-SNE, which produces a nonlinear 2D embedding that preserves local neighborhoods at the cost of introducing global distortions. This is useful if we want a picture of the cluster structure of a dataset, but it means we can't read the x and y axes as "components." Each dot represents a simulation frame, dots are colored by the order in which they were collected, and time-adjacent frames are connected by a thin grey line. Areas that are visited more than once should contain a mix of colors; areas that are color-homogeneous appear to have been visited during just one contiguous chunk of trajectory.
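A minimal sketch of this embedding step, assuming `tica_trajs` is a list of tICA-projected trajectories with at least 100 columns:

```python
import numpy as np
from sklearn.manifold import TSNE

# Sketch of the "top-100 tICs -> t-SNE" embedding described above.
# `tica_trajs` is an assumed list of (n_frames, n_tics) tICA projections.
X = np.concatenate([y[:, :100] for y in tica_trajs])

# 2D nonlinear embedding; perplexity is left at scikit-learn's default here.
embedding = TSNE(n_components=2).fit_transform(X)

# When plotting, color by frame index (collection order), e.g.
# plt.scatter(embedding[:, 0], embedding[:, 1], c=np.arange(len(X)), s=2)
```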

Backbone torsions: [t-SNE embedding]

Contact maps: [t-SNE embedding]

Note that the above pictures were computed using just 10 out of the ~500 trajectories available.

If we recompute the "Contact map --> top-100 tICs --> t-SNE" projection for all available simulation data, then we get the following picture:

[t-SNE embedding of all trajectories]

This picture suggests that the majority of the available data samples transitions between large relevant states (the dense multi-colored regions in the map), and that there may be a large number of negligibly occupied microstates that are visited ~once (the little "sprinkles" on the periphery). Caveat: it's possible that t-SNE is over-partitioning the dataset (i.e. hallucinating clusters). Although a quick sanity check showed that the k-nearest-neighbor graph (for k=30) is mostly identical in the 2D embedding and in the 100D tICA model, there are some additional sanity checks I can run before we read too much into these pictures.
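One way that k-nearest-neighbor check can be made concrete (all variable names here are assumptions; `X_tica` and `X_2d` are the 100D tICA coordinates and the 2D embedding, aligned row-for-row):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Sketch of the k-NN sanity check: how similar are the k=30 neighborhoods
# in the 100D tICA space and in the 2D t-SNE embedding?
k = 30
neighbors_tica = NearestNeighbors(n_neighbors=k).fit(X_tica).kneighbors(
    X_tica, return_distance=False)
neighbors_2d = NearestNeighbors(n_neighbors=k).fit(X_2d).kneighbors(
    X_2d, return_distance=False)

# Average fraction of shared neighbors across frames.
overlap = np.mean([len(set(a) & set(b)) / float(k)
                   for a, b in zip(neighbors_tica, neighbors_2d)])
print('mean k-NN overlap: %.2f' % overlap)
```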

Other comments

MDTraj offers several schemes for computing inter-residue distances. I've been using scheme='ca' (which returns the distance between the residues' alpha carbons), since it's significantly cheaper than scheme='closest-heavy' (which I presume is in turn cheaper than scheme='closest'), but I don't know which of the available schemes is preferred here.
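For reference, the three schemes as they appear in mdtraj.compute_contacts (`traj` and the selected residue `pairs` are assumed from earlier):

```python
import mdtraj as md

# The distance definitions mentioned above, as accepted by mdtraj.compute_contacts.
# `traj` and `pairs` (the selected residue pairs) are assumed from earlier.
d_ca, _ = md.compute_contacts(traj, contacts=pairs, scheme='ca')
d_closest_heavy, _ = md.compute_contacts(traj, contacts=pairs, scheme='closest-heavy')
d_closest, _ = md.compute_contacts(traj, contacts=pairs, scheme='closest')
```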

jchodera commented 9 years ago

This is fantastic! Thanks so much for putting this together.

Some observations:

Projections onto the top few tICs are useless. As the figures showing cumulative eigenvalue sum vs. number of tICs show, we need to use many tICs to capture the slow subspace, so let's stop making projections onto the first few tICs. Instead, we should think of tICA as a way of both eliminating some of the fast, irrelevant degrees of freedom and removing some of the correlations between degrees of freedom.

Inter-residue distances are somewhat better than torsions. The sum of the first 100 eigenvalues is higher for the distance-based tICs than for the torsion-based tICs. I think these distances are also more natural to interpret/plot/understand, so there may be value in focusing on them instead of torsions. The difference isn't huge, but the sum of all the positive eigenvalues is much higher, which may mean the distances are much more useful when the kinetic distance is used.

Clustering in spaces involving many tICs (or using tIC-derived kinetic distances) may be useful. We might be able to build useful discrete-state MSMs by clustering in spaces that involve many tICs. Computing tIC-eigenvalue-weighted "kinetic distances" may be the best way to do this, since it eliminates the need to worry about how many tICs to use.
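A rough sketch of what that could look like with PyEMMA, scaling each tIC by its eigenvalue before clustering (lag time, dimensionality, and cluster count are illustrative, and `featurized_trajs` is assumed as before):

```python
import numpy as np
import pyemma.coordinates as coor

# Sketch: eigenvalue-weighted ("kinetic distance"-like) tICA coordinates,
# followed by k-means clustering. `featurized_trajs` is an assumed list of
# (n_frames, n_features) arrays; all parameter values are illustrative.
n_tics = 100
tica_obj = coor.tica(featurized_trajs, lag=100, dim=n_tics)
tica_trajs = tica_obj.get_output()

# Scale each tIC by its eigenvalue so that Euclidean distances in the
# scaled space emphasize the slow processes.
eigenvalues = np.asarray(tica_obj.eigenvalues)[:n_tics]
scaled_trajs = [y * eigenvalues for y in tica_trajs]

clustering = coor.cluster_kmeans(scaled_trajs, k=500, max_iter=50)
dtrajs = clustering.dtrajs  # discrete trajectories for MSM estimation
```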

We will need to think about how best to define the set of distances to use. As @maxentile notes, the number of contacts that change during the simulation depends on the cutoff, and peaks for some intermediate cutoff (~30 Å, at ~12K contacts). 30 Å is much larger than what one would typically think of as a "contact" -- something like 3-8 Å is generally what one considers for a contact threshold, depending on the distance definition. (@maxentile: Which distance definition did you use here? The minimum distance between any atoms in the residues?) One thing I'd be curious to see is how well the dominant eigenspaces match up when different numbers of contacts are used. To be more precise: as we change the contact threshold, do the dominant tICs change much? This is a bit difficult to define precisely because the number of contacts changes, but we could look at the scalar products between eigenvectors restricted to the set of shared contacts, though we really want to compare the spaces spanned by the dominant eigenvectors, or some eigenvalue-weighted form of this.
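A rough sketch of the restricted-scalar-product comparison (everything here is an assumption: `pairs_a`/`pairs_b` are the residue pairs selected at two thresholds, and `V_a`/`V_b` are matrices whose columns are the corresponding dominant tICA eigenvectors, e.g. the first 100 columns of each model's eigenvector matrix):

```python
import numpy as np

# Sketch: compare dominant tICA eigenvectors from two contact thresholds,
# restricted to the residue pairs the two feature sets share.
# pairs_a, pairs_b: (n_pairs, 2) arrays of selected residue pairs (assumed).
# V_a, V_b: (n_features, n_tics) matrices of dominant eigenvectors (assumed).
order_a = {tuple(p): i for i, p in enumerate(pairs_a)}
order_b = {tuple(p): i for i, p in enumerate(pairs_b)}
shared = sorted(set(order_a) & set(order_b))

A = V_a[[order_a[p] for p in shared], :]  # rows restricted to shared contacts
B = V_b[[order_b[p] for p in shared], :]

# Normalize the restricted eigenvectors, then take all pairwise scalar products.
A = A / np.linalg.norm(A, axis=0)
B = B / np.linalg.norm(B, axis=0)
overlaps = A.T.dot(B)  # overlaps[i, j] = <v_i(threshold a), v_j(threshold b)>
print(np.abs(overlaps).max(axis=1))  # best match for each eigenvector of model a
```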

The t-SNE projection is interesting, but I am not sure it is useful. Really, there's only so much we can do with 2D projections. These may not be useful ways of trying to understand the system. I think we should focus on trying to extract useful properties from the discrete-state MSMs we obtain this way.

How do clustering-by-kinetic-distance MSMs compare to RMSD-clustered MSMs? I still think there is great value in RMSD-clustering-based MSMs (even with generators selected uniformly spaced in time). I'd love to see how these compare to MSMs built by kinetic-distance clustering.

Are there ways we can turn this work into a paper exploring the best ways to build MSMs for kinases? We are accumulating a decent number of datasets here, so it may be worth trying to plan a useful benchmark and write it up.