Colvars / colvars

Collective variables library for molecular simulation and analysis programs
http://colvars.github.io/
GNU Lesser General Public License v3.0

ABF histogram overflow shortly after 2 microseconds #639

Closed mvondomaros closed 7 months ago

mvondomaros commented 7 months ago

Hi!

I have several independent simulations (Colvars 2023-10-03, Gromacs 2023.2, WTM-eABF with a distanceZ colvar) that run fine for 2 microseconds but eventually show some sort of overflow in the ABF histogram. See the attached screenshot. The histogram did not have a spike when the run started. Contrary to what one might infer from the histogram, the colvars trajectory is not stuck at this position. The wrong counts eventually lead to artifacts in the free energy, presumably because CZAR uses them.

Since all my simulations show this issue shortly after 2 microseconds (1 fs timestep), and since the histogram appears to overflow at the bin corresponding to the value of the collective variable at step ~2^32/2 (that is, 2^31), I suspect some sort of Int32 overflow, but with my limited knowledge of the code, I haven't found a likely candidate.

[Attached images: colvars, abf count, colvars traj] Raw files are on Google Drive.

jhenin commented 7 months ago

Thank you @mvondomaros ! This does look like an overflow. I'll look into it and let you know.

jhenin commented 7 months ago

The histogram of the actual (not extended) variable is also concentrated there, with some smoothing, which would point towards a physical tendency to reside there. However, as you noted, that is not visible in the colvar trajectory. [image]

jhenin commented 7 months ago

The colvars trajectory file you provided ends with step 2147483600, which is the largest multiple of 100 before 2^31. Did it stop there or did you stop it intentionally?
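The numbers in this thread can be checked with a short calculation. The following is a minimal sketch, not code from Colvars or Gromacs; the names `kMaxInt32Step`, `simulated_time_us` and `last_multiple_at_or_below` are hypothetical helpers introduced only for illustration. It confirms that a signed 32-bit step counter runs out at about 2.147 microseconds with a 1 fs timestep, and that 2147483600 is the last multiple of 100 that still fits.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Largest step number a signed 32-bit counter can hold: 2^31 - 1.
constexpr std::int64_t kMaxInt32Step = std::numeric_limits<std::int32_t>::max();

// Simulated time in microseconds after `steps` steps of `dt_fs` femtoseconds
// (1 microsecond = 1e9 femtoseconds).
inline double simulated_time_us(std::int64_t steps, double dt_fs) {
    assert(steps >= 0 && dt_fs > 0.0);
    return static_cast<double>(steps) * dt_fs / 1.0e9;
}

// Largest multiple of `freq` that is at or below `limit`
// (for example, the last output step for a given output frequency).
constexpr std::int64_t last_multiple_at_or_below(std::int64_t limit, std::int64_t freq) {
    return (limit / freq) * freq;
}

// Compile-time checks of the values quoted in the thread.
static_assert(kMaxInt32Step == 2147483647, "int32 max is 2^31 - 1");
static_assert(last_multiple_at_or_below(kMaxInt32Step, 100) == 2147483600,
              "last multiple of 100 below 2^31");
```

Calling `simulated_time_us(kMaxInt32Step, 1.0)` gives roughly 2.147, which matches the "shortly after 2 microseconds" observation for a 1 fs timestep.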

jhenin commented 7 months ago

Your simulations may be impacted by a bug that prevented the writing of Colvars information to the checkpoint file: https://github.com/Colvars/colvars/commit/255b1f1fd6fc6d7a37ae6d8264ff5bb182667cdd That was only in the repo for a limited time, but your version was just before the patch. It is not obvious whether and how that could have caused this particular issue. The ABF restart data was read explicitly from the state file. One possible related issue is with PBC unwrapping, as this relies on coordinates that are stored by Colvars in the checkpoint file - but this does not look like a PBC issue, especially the apparent discrepancy between the trajectory and histogram.

mvondomaros commented 7 months ago

@jhenin Thanks for looking into this so quickly, Jérôme. I did not stop at this step intentionally. I suspected that it plays a role, which is why I narrowed the data down to this region. The simulation continues until step 2150000000 and would happily run further (at least visual inspection did not show anything odd).

I am working on providing a somewhat more constrained simulation example if that is any help.

I am also trying a run with the initial step in Gromacs set to 0, which I suspect will work.

Michael

mvondomaros commented 7 months ago

> Your simulations may be impacted by a bug that prevented the writing of Colvars information to the checkpoint file: 255b1f1 That was only in the repo for a limited time, but your version was just before the patch. It is not obvious whether and how that could have caused this particular issue. The ABF restart data was read explicitly from the state file. One possible related issue is with PBC unwrapping, as this relies on coordinates that are stored by Colvars in the checkpoint file - but this does not look like a PBC issue, especially the apparent discrepancy between the trajectory and histogram.

OK, I'll check an up-to-date version tomorrow/later this week, just to be sure.

As I mentioned before, this happens in three independent simulations, always at the same number of steps. I doubt that it is PBC related, but I can check regardless. I suppose wrapping in VMD and restarting from there should do the trick?

mvondomaros commented 7 months ago

I ran the following test with Gromacs 2023.3 and Colvars 2023-12-04 (this time on an ARM64 platform).

I just simulated a single water molecule for 1000 steps, but set the initial step in Gromacs to 2147483000. My collective variable is distanceZ of the water molecule COM with respect to a dummy atom in the center of the box. Same colvars configuration as before, but this time, I am printing after every step. Here is an excerpt from the colvars trajectory.

  2147483645    2.46377649766062e+00  2.75753411096014e+00 -7.32731008373043e+03  0.00000000000000e+00
  2147483646    2.46151376358575e+00  2.76503069888354e+00 -7.57074063752979e+03  0.00000000000000e+00
  2147483647    2.45967146335560e+00  2.77220941549210e+00 -7.79575535937939e+03  0.00000000000000e+00
  2147483648    2.45826158459925e+00  2.77905475353238e+00 -8.00166843376015e+03  0.00000000000000e+00
  2147483648    2.45729600822559e+00  2.77905475353238e+00 -8.02575317975150e+03  0.00000000000000e+00
  2147483648    2.45677593966753e+00  2.77905475353238e+00 -8.03872545150602e+03  0.00000000000000e+00

The step counter does not increase after reaching 2^32/2 = 2147483648. In this case, the extended variable gets stuck at 2.7790 (where it just happens to be when reaching this step) and the physical variable fluctuates around this value. This is consistent with an increased bin count in this bin. I do not observe this behavior when I start from initial step 0.

jhenin commented 7 months ago

Thank you, this is now quite clear! The extended Lagrangian integrator decides whether time is moving forward by watching the timestep number. If that gets stuck, it thinks this is a repeated timestep and stops integrating, so the extended variable stands still. Checking all the data types involved, I found an internal variable tracking the previously known timestep that was an int! Creating a PR to fix this.
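The logic described above can be illustrated with a minimal sketch. This is not the Colvars implementation: `ExtendedIntegrator`, `on_step` and `count_updates` are hypothetical names, and the sketch only assumes that the integrator skips any step whose number equals the previous one. Feeding it the step numbers from the trajectory excerpt shows the extended variable advancing once at 2147483648 and then standing still, while a correctly increasing 64-bit sequence keeps advancing.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for an extended-Lagrangian integrator; not the Colvars code.
struct ExtendedIntegrator {
    std::int64_t prev_step = -1;  // 64-bit, so steps beyond 2^31 - 1 stay representable
    int updates = 0;              // how many times the extended variable was advanced

    // Advance only if the step number differs from the previous one.
    // A repeated step number is treated as a re-evaluation of the same step.
    void on_step(std::int64_t step) {
        assert(step >= 0);
        if (step == prev_step) return;  // looks like a repeated step: do not integrate
        prev_step = step;
        ++updates;
    }
};

// Feed a sequence of reported step numbers and count the integrator updates.
inline int count_updates(const std::vector<std::int64_t>& steps) {
    ExtendedIntegrator integrator;
    for (std::int64_t s : steps) integrator.on_step(s);
    return integrator.updates;
}

// A signed 32-bit int cannot represent 2^31 = 2147483648, which is why any
// 32-bit variable in the step-tracking path is a problem past this point.
static_assert(std::numeric_limits<std::int32_t>::max() == 2147483647,
              "int32 max is 2^31 - 1");
```

With the stuck sequence 2147483646, 2147483647, 2147483648, 2147483648, 2147483648, `count_updates` reports three updates for five calls. With steps that keep increasing past 2147483648, it reports one update per call.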

jhenin commented 7 months ago

@mvondomaros This issue got closed automatically, but please do let us know if that does fix your issue, and of course if anything else comes up.

mvondomaros commented 7 months ago

@jhenin Thanks a bunch for the patch, Jérôme. I ran some tests and the issue seems resolved. Cheers, Michael!