choderalab / msm-pipeline

A pipeline for MSMs.

Lumping strategy #25

Open jchodera opened 8 years ago

jchodera commented 8 years ago

I seem to be making good progress with the minRMSD-based equitemporal clustering. You can see my results for the matrix of Src/Abl single/ensembler projects (cc: #23) here on the cluster:

/cbio/jclab/home/chodera/github/choderalab/msms/MSMs.jchodera/jchodera/abl-*
/cbio/jclab/home/chodera/github/choderalab/msms/MSMs.jchodera/jchodera/src-*

My equitemporal minRMSD clustering script takes only about 1 hour on 12 threads, as opposed to over 12 hours for the current torsion-based featurization and minibatch K-medoids clustering in the msm-pipeline.
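For reference, here is a minimal sketch of the clustering step (assuming mdtraj; the names are illustrative and the actual script differs in details such as the backbone atom selection):

import mdtraj as md
import numpy as np

# pick n_clusters frames at regular time intervals as generators, then
# assign every frame to its minimum-RMSD generator; each md.rmsd call is
# independent, so this parallelizes trivially across threads
def equitemporal_minrmsd_dtraj(traj, n_clusters):
    generator_indices = np.linspace(0, traj.n_frames - 1, n_clusters).astype(int)
    generators = traj[generator_indices]
    # distances[i, t] = RMSD of frame t to generator i
    distances = np.array([md.rmsd(traj, generators, i) for i in range(n_clusters)])
    return distances.argmin(axis=0)  # the discrete trajectory (dtraj)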

I'm running just the post-clustering analysis part of msm-pipeline on the minRMSD-generated dtrajs for abl-10472 in

/cbio/jclab/home/chodera/github/choderalab/msms/MSMs.jchodera/jchodera/abl-10472/pyemma

but it seems to be taking forever. Output is:

Running pipeline
Trace of transition matrix: 1098.476
Active count fraction: 0.992
Active state fraction: 0.992
Estimated n_macrostates: 211
20-08-16 10:09:47 pyemma.msm.estimators.maximum_likelihood_msm.MaximumLikelihoodMSM[0] WARNING  Requested coarse-grained model with 211 metastable states at lag=50. The ratio of relaxation timescales between 211 and 212 states is only 1.00254527534 while we recommend at least 1.5. It is possible that the resulting HMM is inaccurate. Handle with caution.

Looks like the MSM lumping is just incredibly freakishly slow. Any idea why? Can we use something that is orders of magnitude faster and get something useful out?

jchodera commented 8 years ago

It looks like only the HMM approach is available in coarse_grain, which is what we use in our coarse-graining step:

# coarse-grain the estimated MSM into n_macrostates metastable states
# (by default this fits a maximum-likelihood HMM, initialized from PCCA)
hmm = msm.coarse_grain(n_macrostates)
@maxentile: Would it be easy for you to plug your SASA code into this (via a PR) to give a faster alternative? And I wonder why they haven't tried PCCA yet.

jchodera commented 8 years ago

The MSM lumping warning also suggests taking only places where the ratio of successive eigenvalues lambda_i / lambda_{i+1} > 1.5 as cut points for n_macrostates. I wonder what options that gives us for our kinds of systems. If there are a sufficient number of these "good" cut points, we might be able to do something like "find the largest n_macrostates from this set where the timescales are greater than X".
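A sketch of what that selection rule could look like (the indexing convention and the timescale_to_ns unit conversion are assumptions on my part):

import numpy as np

ts = msm.timescales()      # implied timescales of the estimated MSM, slowest first
ratios = ts[:-1] / ts[1:]  # ratios of successive timescales

# retaining n macrostates resolves the n-1 slowest processes, so the gap
# to check for a given n is ts[n-2] / ts[n-1], i.e. ratios[n-2]
candidate_n = np.where(ratios > 1.5)[0] + 2

# largest n_macrostates from this set where the timescales are greater than X
X_ns = 100.0
good = [n for n in candidate_n if ts[n - 2] * timescale_to_ns > X_ns]
n_macrostates = max(good) if good else None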

jchodera commented 8 years ago

Even manually setting n_macrostates to 10 failed to complete in several hours. I suspect the MLHMM is just way too slow when all the data is used.

For my minRMSD equitemporal clustering of abl-10472 using backbone RMSD, there are plenty of timescales longer than 100 ns, and only a few places where the timescale ratio is > 1.5.

[Attached plots: abl_timescales, abl_timescale_ratios]

maxentile commented 8 years ago

Sorry for my slow response!

I seem to be making good progress with the minRMSD-based equitemporal clustering. You can see my results for the matrix of Src/Abl single/ensembler projects (cc: #23) here on the cluster:

Thanks, John! Fetched them to take a look...

My equitemporal minRMSD clustering script takes only about 1 hour on 12 threads, as opposed to over 12 hours for the current torsion-based featurization and minibatch K-medoids clustering in the msm-pipeline.

Yes, regular-time RMSD clustering should be faster and more easily parallelizable than iterative clustering algorithms, since all the distance computations we need to do are independent. The current proposed "pipeline" expends additional computation to pick the metric (using tICA) and the exemplars (using k-means), and can typically get away with fewer exemplars; in regular-time RMSD clustering, we pick the metric and exemplars without doing any computation. If we can get away with not spending any computation on those steps for kinase systems, that would be convenient.

I've mentioned on the other thread some potential caveats I noticed in the preliminary regular-time RMSD results, but these are probably not showstoppers and can likely be mitigated by suitable coarse-graining.
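For contrast, the current discretization step is roughly the following (a hedged sketch only; the parameter values are placeholders, not the pipeline's actual settings):

import pyemma

feat = pyemma.coordinates.featurizer(topology_file)
feat.add_backbone_torsions(cossin=True)         # torsion featurization
source = pyemma.coordinates.source(trajectory_files, features=feat)
tica = pyemma.coordinates.tica(source, lag=50)  # spend computation to learn the metric
y = tica.get_output()                           # trajectories projected into tICA space
clustering = pyemma.coordinates.cluster_mini_batch_kmeans(y, k=500)  # pick exemplars
dtrajs = clustering.dtrajs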

I'm running just the post-clustering analysis part of msm-pipeline on the minRMSD-generated dtraj

Hmm, at this point, I think it would probably be useful to split up our analysis into pre- and post-discretization, since a lot of what we're doing is looking at the statistics of discrete trajectories...

Looks like the MSM lumping is just incredibly freakishly slow. Any idea why? Can we use something that is orders of magnitude faster and get something useful out?

Well, by default, msm.coarse_grain(n_macrostates) estimates a maximum likelihood HMM, using PCCA to initialize, and takes up to 1000 passes over the data in the EM algorithm, which will take a long time on large datasets.
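If we stay on the HMM route, one relevant knob is the EM iteration cap; a hedged sketch (the stride and maxit values here are illustrative, not recommendations):

import pyemma

hmm = pyemma.msm.estimate_hidden_markov_model(
    dtrajs, n_macrostates, lag=50,
    stride='effective',  # subsample trajectories to reduce the per-pass cost
    maxit=100,           # cap the number of EM passes (default is 1000)
)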

Something I had started looking into the first time this came up was a fast method-of-moments estimator for HMMs, due to Anandkumar, Hsu, and Kakade (a more streamlined presentation is here). Essentially, it takes a single pass over the data to accumulate an n_states x n_states x n_states tensor of third-order moments, then estimates HMM parameters via a rank-n_macrostates decomposition. The result can be used directly, or can be used to initialize the EM algorithm, which could reduce the number of iterations needed for convergence. At the time, I think you said this was a distraction, but I still think this could be promising in making HMMs / projected Markov models more usable for big datasets. Probably @jhprinz has a much clearer idea than I do of where this can be sped up!
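To make the single-pass idea concrete, a rough sketch of just the moment-accumulation step (the tensor decomposition itself, per Anandkumar/Hsu/Kakade, is not shown):

import numpy as np

# count state triples (x_t, x_{t+tau}, x_{t+2*tau}) across all dtrajs
def third_order_moments(dtrajs, n_states, tau=1):
    T3 = np.zeros((n_states, n_states, n_states))
    for dtraj in dtrajs:
        for t in range(len(dtraj) - 2 * tau):
            T3[dtraj[t], dtraj[t + tau], dtraj[t + 2 * tau]] += 1.0
    return T3 / T3.sum()  # empirical third-order moment tensor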

In the meantime, we can use msm.pcca(n_macrostates), and PCCA is also implemented in MSMBuilder. I can also put together PRs for including metastability optimization in MSMBuilder and PyEMMA this week, depending on what you see as the priorities.
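For reference, the PCCA route looks like this in PyEMMA:

pcca = msm.pcca(n_macrostates)             # runs PCCA+ on the estimated MSM
memberships = msm.metastable_memberships   # fuzzy microstate memberships
assignments = msm.metastable_assignments   # crisp microstate -> macrostate map
sets = msm.metastable_sets                 # microstate indices per macrostate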

And I wonder why they haven't tried PCCA yet.

PCCA is implemented in both MSMBuilder and PyEMMA, and is used in the coarse_grain method to initialize the HMM estimator.

The MSM lumping warning also suggests taking only places where the ratio of successive eigenvalues lambda_i / lambda_{i+1} > 1.5 as cut points for n_macrostates. I wonder what options that gives us for our kinds of systems. If there are a sufficient number of these "good" cut points, we might be able to do something like "find the largest n_macrostates from this set where the timescales are greater than X". For my minRMSD equitemporal clustering of abl-10472 using backbone RMSD, there are plenty of timescales longer than 100 ns, and only a few places where the timescale ratio is > 1.5.

Hmm, I thought the point of choosing 100 ns as the threshold was that it's roughly the feasible simulation time in the subsequent free energy calculations. Choosing n_macrostates near a separation of timescales appears necessary for the HMM coarse-graining approach, but if the macrostates contain internal kinetic barriers > 100 ns, will that be a problem later? Also, do you have an idea why 1.5 is the threshold here, vs. any other number?