coarse-graining / cgnet

learning coarse-grained force fields
BSD 3-Clause "New" or "Revised" License
57 stars 26 forks source link

dt parameter unit in Simulation() #188

Closed jasonzzj97 closed 3 years ago

jasonzzj97 commented 3 years ago

Hi,

After training a CGnet model, I'm using the predicted forces to conduct CG simulations for comparison by the simulation() function in the package. The time step size parameter dt is described as follows:

dt: float (default=5e-4) The integration time step for Langevin dynamics. Units are determined by the frame striding of the original training data simulation.

What's the physical unit of parameter dt in the simulation codes, as the default number is 5e-4? (picosecond, femtosecond?) What do you mean by "Units are determined by the frame striding of the original training data simulation"? Suppose I have the all-atomic mapped CG results, which are used as ground truth in training, with time step size 1.0 ps between every frame, what should I choose to put into the function simulation() as parameter dt?

Another related question, how long we should run, in terms of time steps length, for the evaluation CG simulations to compare the free energy calculation by the TICA method? Do you have a specific criterion to measure if the CG system has reached the equilibrium state?

Thank you for you help!

nec4 commented 3 years ago

Hi Jason,

Thanks for reaching out. The dt parameter is actually a rescaled "CG time". So you can fix the units to be whatever you like (say, the units of the all-atom simulation data), but if you want to relate it back to the original system you may have to estimate a scaling factor between the timescales between the two systems. Our (force-matching) method does NOT guarantee accurate recovery of kinetics or timescales (this includes absolute timescales and relative timescales). You should pick a dt that allows for reasonably fast integration while still resulting in a stable simulation. For your specific CG mapping, you may have to experiment.

It is typically hard to get converged simulations for high dimensional systems. If you can see transitions between major metastable states then that is usually a good sign. You can also try parallel tempering techniques to explore the free energy surface more efficiently and fully. A truly converged simulation has an equilibrium density that does not change over long periods of time. In terms of MSM/TICA analysis, you can check the implied timescales as a function of the lag time to see if they converge after a certain lag time. Again, depending on your system, you may have to experiment.

Hope that helps!

jasonzzj97 commented 3 years ago

Hi Nick,

Thanks for your explanation! I did a few experiments with different dt values in the CG simulation, with my ground truth training set which has 1.0 picosecond as frame stride. I tried dt=1.0, 0.1, 0.01, and they all caused unstable simulation (NaN found in dihedrals or distances). I used dt=1e-6 and the simulation could run for a longer time, like 1M steps, with converged potentials. In this case, does that mean this CG simulation has 1.0*1e-6 ps=1e-3 fs as time step size for physical time scales in Langevin Dynamics? If I want to compare this simulation with my original ground truth dataset, should I do a time averaging and turn 1M frames into 1 frame (to make the simulated dataset also have 1.0 ps as frame stride)?

nec4 commented 3 years ago

Hello,

I'm glad that you could get some good simulations. The scaling is not that simple. If you really want to compare the kinetics to your original simulations, what I would do is the following: perform MSM/TICA analysis separately on both your reference and cg simulations. Rescale the CG simulations so that the longest timescales more or less overlap (again, because our method does not guarantee recovery of accurate kinetics, you may have to decide your own criteria). This rescaling factor can then be used to relate the units your original system to your CG system. Preservation of kinetics in coarse graining is an area of active research.

Hope that helps!

jasonzzj97 commented 3 years ago

Thank you for your inspiring suggestions! I would really appreciate it if you could explain a lot bit more details about

This rescaling factor can then be used to relate the units your original system to your CG system.

Because I looked into your example Jupyter notebooks for example "Training-A-Coarse-Grained-Force-Field.ipynb". The CG simulation is using the default dt which is 5e-4. But in the later free energy plot comparisons of original and CG trajectories, I didn't see any rescale operations. Could you please correct me if I'm wrong?

nec4 commented 3 years ago

Of course. We do not rescale anything here because we are only interested in comparing the free energy landscape between the two systems, which is an (equilibrium) thermodynamic quantity and not a kinetic quantity. The idea behind the method of force matching is recover the free energy surface from simulating a CG representation of the system (at least as represented by variables that can mutually describe the free energy surfaces in both the original and the CG systems). That is, the simulation of the CG system is just a means of sampling the free energy surface to see if it can be reproduced. There is no kinetic analysis (at least that compares or relates the CG and the original systems) in the example notebooks.

jasonzzj97 commented 3 years ago

Great. That's what I thought! I actually also have exactly the same interests as you show in the notebook, to compare the free energy landscape instead of kinetic energy. And my goal is also the same: to make my CG system present or restore similar free energy surfaces as in the original system. I know the CG simulation only serves to reach the equilibrium states for sampling but I just want to measure the length that how long it has run/simulated in real physics unit. This is why I care this dt parameter value so much because only knowing the length step number (1M steps) cannot convert to simulated physics time. Is there a possible way to do it?

Thank you so much for your time!

nec4 commented 3 years ago

It's no problem. But for that I would again recommend performing MSM/TICA analyses separately on the two simulations to extract a series of timescales associated with important transitions between metastable states. Once you have a set of timescales for each system you can can globally rescale the CG timescales (for example) to overlap with the original system's timescales. This will at least give you a good estimate of the rescaling factor if your first few timescales can be overlapped after the rescaling. Consider matching just the first timescale. Are you familiar with PyEmma? This is a great package for performing these kinds of analyses. They have great tutorials too:

http://www.emma-project.org/latest/tutorial.html

To my knowledge, there is no simple way of relating CG time to physical time, and the method of force matching does not guarantee that a trained model preserves the original kinetics (it is derived entirely from equilibrium statistical mechanics). In general, CG models will typically explore free energy landscapes faster than all-atom models, and this effect nontrivially depends on your choice of CG mapping, your simulation integrator, and the functional form of your CG Hamiltonian. There are efforts to try train models to reproduce kinetics, such as spectral matching, but these can be difficult to use in high dimensions. In sum, I would only use force matching to investigate how well equilibrium properties can be recovered.

jasonzzj97 commented 3 years ago

I see. I have been using PyEmma for the TICA analysis and MSM construction of my MD data. Thank you for the valuable remarks, these are very helpful and I think other researchers could also find them useful!