Acellera / htmd

HTMD: Programming Environment for Molecular Discovery
https://software.acellera.com/docs/latest/htmd/index.html
Other
254 stars 58 forks source link

How to choose the TICA lag time? #230

Closed yongwangCPH closed 7 years ago

yongwangCPH commented 7 years ago

I found the final MSM is very dependent on the choice of the TICA lag time. In the tutorials, I noticed that you used 2ns for both binding and folding cases. Any reason to chose this value? Any comments are appreciated.

stefdoerr commented 7 years ago

Both papers that came out introducing TICA said it doesn't affect the results much. From our experience and apparently yours it does seem to affect them. I have not seen any comment on it from Frank's group or Pande's. Maybe Guillermo @gph82 has some new ideas on it? This is more a discussion than an issue so I'll close it but we can keep talking here.

stefdoerr commented 7 years ago

Actually now I remember Guillermo showing some heuristic for picking the lagtime based on autocorrelations. But the recent publications I saw focused more on finding the best number of dimensions than the best lag-time.

gph82 commented 7 years ago

This remains a question hard to answer in one line. There are definitively wrong lagtimes (too short or too long), but everything in between is essentially equally valid, as long as you arrive at an MSM/HMM that's robust, reproducible and helpful. @yongwangCPH , can you clarify what "very dependent" means?

yongwangCPH commented 7 years ago

Thanks a lot for your comments. @gph82 The MFPT could be different in one or two orders of magnitude when I used different TICA lag times (e.g. 2ns, 50ns). Anyway, it is good to know it is still a kind of open question.

gph82 commented 7 years ago

I am sorry but that sentence alone is not really helpful. You could be in a sort of "oranges vs. apples" situation. Varying the tica lagtime by one order of magnitude certainly might lead to different subspaces, particularly if you're keeping few TICs. How many TICs are you keeping at lagtime? What does the spectrum look like?

On 23.01.2017 22:52, yongwangCPH wrote:

Thanks a lot for your comments. @gph82 https://github.com/gph82 The MFPT could be different in one or two orders of magnitude when I used different TICA lag times (e.g. 2ns, 50ns). Anyway, it is good to know it is still a kind of open question.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Acellera/htmd/issues/230#issuecomment-274629102, or mute the thread https://github.com/notifications/unsubscribe-auth/AHK3NN_s3NKlfuS_9WpZC50BbVYkPBXOks5rVSEQgaJpZM4LrOCN.

-- Dr. Guillermo Pérez-Hernández Freie Universität Berlin Institute for Mathematics Arnimallee 6 D-14195 Berlin tel 0049 30 838 75775 http://userpage.fu-berlin.de/gph82/

stefdoerr commented 7 years ago

@gph82 Here is an example from Adria of how the lagtime affects the timescales. Tested with 2, 3 and 5ns lagtimes. http://nbviewer.jupyter.org/gist/stefdoerr/aa937f2f8e513d78d3482db841fa76ba

stefdoerr commented 7 years ago

Admittedly the simulations are 14ns so he could have chosen a larger lagtime at the beginning. Still, as we can see, finding the best lagtime is a bit of a trial and error procedure.

gph82 commented 7 years ago

Hi, it's a bit confusing with all the warnings and the output, but I guess you want me to compare the its plots for TICA done at 20, 30, 50., right?

What I don't get is...why do tica at 20, 30, 50 if you're computing MSMs between 0 and 15? Are this the same units?

stefdoerr commented 7 years ago

No, sorry for the units. We have improved this in new versions. The timescale plots are in ns units and the TICA arguments are in simulation steps. So 30 tica lag = 3ns lag

stefdoerr commented 7 years ago

So I would probably build the model at lag 4ns = 40 frames

gph82 commented 7 years ago

ok so we have tica at 2, 3, 5 ns and its between 1 and 15 ns?

stefdoerr commented 7 years ago

yes

gph82 commented 7 years ago

can you produce a tica-its plot? I suspect there's a process missing between 30 and 40. I'd say 40 and 50 are similar. Plus, very big point...who says 3 projections are enough? How does the TICA-spectrum look?

stefdoerr commented 7 years ago

TICA-spectrum as in the kinetic variance? I will check that. How do you do TICA its in pyemma?

gph82 commented 7 years ago

Every tica object has an .timescales atttribute. Stack them and plot them

stefdoerr commented 7 years ago

Ok, I'll report back when I have those :)

stefdoerr commented 7 years ago

I only did the first one but is this what you meant? I plotted the cumvar.

However for timescales they are only computed at the given lagtime. I would need to compute TICA for multiple lag-times to make a TICA its plot right? Any method for that? I could not find a timescales_tica function in PyEMMA. I can do it manually I guess. But TICA is quite slow so it would take lots of time in any case.

http://nbviewer.jupyter.org/gist/stefdoerr/17b5cafde1e20b808225b3fbe4018152

stefdoerr commented 7 years ago

By the way this is what I pretty much always see with the cumvar. That I need hundreds to thousands of dimensions to reach 95% variance.

gph82 commented 7 years ago

Well you had already 3 tica objects so you could do it. Plus, can you rescale the plots in cells 12, 20 to be xlim([0,50])

stefdoerr commented 7 years ago

Nah I had restarted the kernel. Ok will do

stefdoerr commented 7 years ago

http://nbviewer.jupyter.org/gist/stefdoerr/1b1ec03057dcb4eabfad97cdd85c5a2e

gph82 commented 7 years ago

there are some issues with the kinetic variance criterion always being useful, particularly in high-dimensional spaces. This discussion essentially boils down to: Is there a timescale gap? And there isn't...we can fiddle around all day, but this is a hard problem...could I get an .npy with the orginal features and make my point?

stefdoerr commented 7 years ago

Why no timescale gap? I see a very big gap between the first eigenvalue and the second.

http://pub.htmd.org/FAAH_features.zip Here you go. It's 500MB zipped

stefdoerr commented 7 years ago

Sorry that the gap was not totally visible. Should have fixed the xlim to [-1, 50] instead of [0, 50]. But you can see it in the plots if you look close.

gph82 commented 7 years ago

thanks for the file. I'll take a look and see if I can make my point and perhaps help. It's been day #2 of the workshop and I'm out-pythoned

stefdoerr commented 7 years ago

Hahah sure, take your time. Good luck with the workshop.

j3mdamas commented 7 years ago

Guillermo was on fire today with his new projX module :p

On Feb 21, 2017 17:23, "Stefan Doerr" notifications@github.com wrote:

Hahah sure, take your time. Good luck with the workshop.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Acellera/htmd/issues/230#issuecomment-281394754, or mute the thread https://github.com/notifications/unsubscribe-auth/AKTzQlbdklcYBiXmDhFqr2pXZ6XELmZPks5rew-YgaJpZM4LrOCN .

and-tos commented 6 years ago

Hi, I came across this thread and was wondering if you have any advice on choosing the lag time for TICA. What exactly is meant by timescale gap? Is there any criteria I can use to decide whether my TICA lag time is acceptable?

stefdoerr commented 6 years ago

On the question of timescale gap: image If you calculate the eigenvalues of the transition probability matrix built at one lag time you usually see a separation between some large eigenvalues and some small ones. The large values correspond to the slow processes and the small ones to fast processes. Each eigenvalue gives you the timescale of the corresponding process. This gap between the eigenvalues or timescales is what is called the timescale gap. If you do the typical ITS (implied timescale) plot where you calculate the matrix at multiple lag times and plot the eigenvalues as lines image you can see that there is a separation of 2-3 slow processes in this plot and the rest which are very fast.

stefdoerr commented 6 years ago

As for choosing the TICA lagtime I don't have much of a tip to give you other than:

  1. Obviously choose one that's smaller than the lag-time of the model you want to build
  2. It depends on the length of your simulations. If your simulations are 2ns you cannot choose a lagtime of 3ns so go for something like 1ns lagtime (?)
  3. If you have more time, play around and see what gives better results
hima111997 commented 6 months ago

Hi! I want to perform dimensionality reduction on an MD simulation using TICA. My goal is to generate a free energy surface using the first three TICA components, similar to how PCA is often used. Since I'm not building an MSM, could you suggest a good starting point for the lag time value and explain the factors I should consider when choosing it?

stefdoerr commented 6 months ago

http://docs.markovmodel.org/lecture_tica.html Especially the references at the end of the tutorial

stefdoerr commented 6 months ago

The conclusion at least in the original paper was that the selected lag time did not affect strongly the results. We usually set it to something like 2ns