choderalab / msm-pipeline

A pipeline for MSMs.
GNU Lesser General Public License v3.0

Alanine dipeptide data looks problematic #22

Open jchodera opened 8 years ago

jchodera commented 8 years ago

The output from the alanine dipeptide test looks pretty awful. [plots: alanine_its_0, alanine_its_1] Note that the time axis is lying (it's some number of picoseconds, but the code assumes 250 ps/frame--see #21), but there should be several slow resolvable processes.

Is this because we are using contact features here (which may not be good for alanine)?

jchodera commented 8 years ago

Whoa, we're using backbone torsions?

    import pyemma

    feat = pyemma.coordinates.featurizer(top)  # top: topology file or object
    feat.add_backbone_torsions(cossin = True)
    n_features = len(feat.describe())

That...is worrisome? Backbone torsions should do really well on alanine dipeptide.

maxentile commented 8 years ago

I removed contact features yesterday because they were taking too long to test. What do you suggest doing differently here?

Will fix the timestep issue.

jchodera commented 8 years ago

I see there is no longer any code that uses contact features. Perhaps we will want to add an option to the pipeline later that will let us select which featurization approach to use?

I'm mainly puzzled why backbone torsions give such poor results for alanine dipeptide. Good clustered feature sets should give plots that look like this, with multiple slow timescales resolved (five, to be precise, implying six metastable states can be resolved): [plot: alanine-dipeptide-timescales-minrmsd]

This particular plot was from minRMSD clustering, but hand clustering of (phi,psi) should give similar behavior. Somehow, only one slow process is being resolved by your pipeline (implying only two states). I wonder if it is the choice of features or some other part of the pipeline.
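The counting here (five slow timescales implying six metastable states) follows from the spectrum of the transition matrix: a system with k metastable states has k eigenvalues near 1, one of which is the stationary eigenvalue λ = 1, leaving k-1 slow implied timescales. A toy numpy sketch (illustrative matrix, not the pipeline's output):

```python
import numpy as np

# Toy 3-state row-stochastic transition matrix with two metastable
# basins: states 0 and 1 interconvert quickly; state 2 exchanges slowly.
P = np.array([[0.90, 0.09, 0.01],
              [0.09, 0.90, 0.01],
              [0.01, 0.01, 0.98]])

eigvals = np.sort(np.linalg.eigvals(P).real)[::-1]
# eigvals[0] = 1 exactly (stationarity); eigvals[1] is near 1 (the slow
# exchange with state 2); eigvals[2] decays quickly (fast 0<->1 mixing).

lag = 1  # lag time in frames
timescale = -lag / np.log(eigvals[1])  # implied timescale of the slow process
```

Two eigenvalues near 1 means two metastable basins but only one slow implied timescale, which is the same bookkeeping as five timescales for six states.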

I think I now have minRMSD clustering with equitemporal spacing down to just over 1 hour (for Abl single structure [10472]) with 12 threads, so it may be viable to add this possibility as well. Minibatch K-means is still not fully parallelized.
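For reference, equitemporal (regular-time) center selection is simple to sketch: take frames at evenly spaced indices as cluster centers and assign every frame to its nearest center. The snippet below uses Euclidean distance as a stand-in for minRMSD, and all names are illustrative rather than pipeline code:

```python
import numpy as np

def regular_time_centers(traj, n_centers):
    """Pick cluster centers at evenly spaced frame indices."""
    idx = np.linspace(0, len(traj) - 1, n_centers).astype(int)
    return traj[idx]

def assign(traj, centers):
    """Assign each frame to its nearest center (Euclidean here;
    minRMSD would replace this distance in practice)."""
    d = np.linalg.norm(traj[:, None, :] - centers[None, :, :], axis=-1)
    return d.argmin(axis=1)

traj = np.random.RandomState(0).randn(1000, 5)  # fake featurized trajectory
centers = regular_time_centers(traj, 20)
dtraj = assign(traj, centers)  # discrete trajectory of cluster labels
```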

maxentile commented 8 years ago

The defaults here weren't chosen with alanine dipeptide in mind-- this was just a small dataset we could conveniently download during tests to make sure that plots were being generated / the correct matplotlib backend was being used. For example, the lag-time used for both MSM-estimation and tICA transformation is 50 frames by default, and only about 1 resolvable process in this test system is slower than 50 frames. I would guess that only this process is really resolvable in the tICA projection. That default sounds reasonable for kinases, but is not reasonable for alanine dipeptide.
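To put numbers on the lag-time point: at lag τ, a process with implied timescale t contributes an eigenvalue λ = exp(-τ/t) to the transition matrix, so anything much faster than the lag decays to near-zero amplitude and cannot be resolved. A quick check with illustrative timescales (the 30- and 500-frame values below are made up for the example):

```python
import numpy as np

lag = 50.0     # default lag time, in frames
t_fast = 30.0  # a process faster than the lag (illustrative)
t_slow = 500.0 # a genuinely slow process (illustrative)

lam_fast = np.exp(-lag / t_fast)  # small: buried in statistical noise
lam_slow = np.exp(-lag / t_slow)  # near 1: resolvable at this lag
```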

Yes, that timescales plot you posted is what I would expect for a good discretization of alanine dipeptide as well: [plot from https://github.com/maxentile/automatic-state-decomposition/blob/master/decompose-py/Alanine%20benchmark%20%2B%20performance%20comparison.ipynb]

from msmbuilder.cluster import MiniBatchKMedoids

# ala: list of mdtraj.Trajectory objects for alanine dipeptide
kmeds = MiniBatchKMedoids(n_clusters=200, metric='rmsd', batch_size=200)
kmeds.fit(ala)

Would probably be good to add regular-time clustering for comparison. Initial results you posted suggest a few caveats to watch out for here: difference between GMRQ test vs. GMRQ validate seems very large with this discretization of another kinase system, and we'll likely need a different heuristic for choosing n_macrostates.

jchodera commented 8 years ago

For example, the lag-time used for both MSM-estimation and tICA transformation is 50 frames by default, and only about 1 resolvable process in this test system is slower than 50 frames.

Aha! That totally explains it.

Would probably be good to add regular-time clustering for comparison. Initial results you posted suggest a few caveats to watch out for here: difference between GMRQ test vs. GMRQ validate seems very large with this discretization of another kinase system, and we'll likely need a different heuristic for choosing n_macrostates.

The difference is very large even when using twofold splits for cross-validation? Hm...
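For concreteness, a twofold split over whole trajectories (rather than over frames, which would leak temporal correlations between train and validate sets) can be sketched like this; the function name and shuffle count are illustrative, not the pipeline's API:

```python
import numpy as np

def twofold_splits(n_trajs, n_shuffles=5, seed=0):
    """Yield (train, validate) index pairs, splitting the shuffled
    trajectory list in half each time."""
    rng = np.random.RandomState(seed)
    idx = np.arange(n_trajs)
    for _ in range(n_shuffles):
        rng.shuffle(idx)
        half = n_trajs // 2
        yield idx[:half].copy(), idx[half:].copy()

splits = list(twofold_splits(10))
# each split uses half the trajectories for training, half for validation
```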

maxentile commented 8 years ago

Aha! That totally explains it.

Yep, I think so!

The difference is very large even when using twofold splits for cross-validation? Hm...

Yep! Details in the Slack post. In that example, GMRQ train was ~1500 and GMRQ validate ~600. Another thing I looked at was how much the trace of the estimated transition matrix depended on the quantity of data it saw: with all the data the trace was ~3000, and with half the data it was ~1600. This seems like an issue to me, since the trace appeared to be roughly linearly proportional to the amount of data included. I would expect that, if our clusters represent metastable states that are visited more than once, the trace of the estimated transition matrix should eventually stop increasing as we see more data. I'll plan to redo this sanity check for the Abl and Src discretizations in the morning.
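The trace is a sensible diagnostic because trace(P) equals the sum of P's eigenvalues: 1 for the stationary process plus one eigenvalue per dynamical process. Once every metastable state has been visited and revisited, additional data should not introduce new near-unit eigenvalues, so the trace should plateau. A minimal numpy check of the identity (toy matrix, not the actual discretization):

```python
import numpy as np

# Toy row-stochastic transition matrix
P = np.array([[0.95, 0.04, 0.01],
              [0.04, 0.95, 0.01],
              [0.02, 0.02, 0.96]])

trace = np.trace(P)
eig_sum = np.linalg.eigvals(P).real.sum()
# trace == sum of eigenvalues == 1 + (non-stationary eigenvalues)
```

A state visited only once looks absorbing (a diagonal entry of 1), which is how undersampling inflates the trace.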

As an example when the system is very thoroughly sampled, this quantity is constant over a large range for the 200-state discretization of alanine dipeptide, using k-medoids w.r.t. RMSD: [plot]

jchodera commented 8 years ago

Which Slack post are you referring to for details?

maxentile commented 8 years ago

Oops, yes, it was sort of buried up there: https://choderalab.slack.com/files/josh.fass/F1YF3TGFR/Updates_on_MSM_model_selection___metastable_state_identification

Also, I had made that "fraction of trajectories used vs. trace(msm.P)" plot for both reg-time RMSD and minibatch K-medoids w.r.t. RMSD discretizations: [plot]

import numpy as np
import matplotlib.pyplot as plt
import pyemma

def dependence_plot(dtrajs, lag = 10, n_replicates = 10):
    """Plot trace(msm.P) vs. fraction of trajectories used for estimation."""
    n_trajs = len(dtrajs)
    traj_indices = np.arange(n_trajs)

    traces = []

    for size in range(1, n_trajs):
        trace = np.zeros(n_replicates)
        for i in range(n_replicates):
            # estimate an MSM from a random subset of `size` trajectories
            np.random.shuffle(traj_indices)
            msm = pyemma.msm.estimate_markov_model([dtrajs[j] for j in traj_indices[:size]], lag)
            trace[i] = np.trace(msm.P)
        traces.append(trace)

    # MSM from the full dataset (appended with zero error bar)
    full_msm = pyemma.msm.estimate_markov_model(dtrajs, lag)
    fractions = 1.0 * np.arange(1, n_trajs + 1) / n_trajs  # was hard-coded to 10 trajectories
    plt.errorbar(fractions,
                 [np.mean(t) for t in traces] + [np.trace(full_msm.P)],
                 [np.std(t) for t in traces] + [0])
    plt.ylim(0, np.max(traces))
    plt.xlabel('Fraction of trajectories used')
    plt.ylabel('Trace of estimated transition matrix')