bioinfkaustin / gromacs-on-colab

Google Colab notebooks for running molecular dynamics simulations with GROMACS
GNU Affero General Public License v3.0

Checkpoint becomes inconsistent during long production simulations #3

bioinfkaustin closed this 11 months ago

bioinfkaustin commented 11 months ago

When a simulation is continued with gmx mdrun, new frames are appended to the existing output trajectory file by default. For long simulations (hundreds of nanoseconds), this file grows to several gigabytes.

Currently, GROMACS-on-Colab performs the simulation in a series of small batches, typically 1 ns long. This allows the software to upload a copy of the trajectory and "checkpoint" (a file GROMACS uses to resume the simulation) to Google Drive every batch, limiting the amount of data lost if the session is cut off early by Google Colab.

When the trajectory is very large, it takes substantially longer to upload than the checkpoint file does. This means there is a window in which, if the session ends, the uploaded files will be in an inconsistent state -- i.e. the (old) trajectory will not match the (new) checkpoint.

Although the old trajectory file still contains valid, usable data for the simulation so far, the mismatch can make it impossible to resume or extend the simulation.

Fix:

Instead of each batch appending to the trajectory, each batch should make a small partial trajectory, called a "part-file". This behavior is enabled by passing the -noappend option to gmx mdrun. This limits the size of data that must be uploaded upon the completion of each batch.
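As a rough illustration of the option described above, the following sketch builds the command line for one batch. The helper name `mdrun_batch_command` and the `md` file prefix are hypothetical and are not taken from the notebooks; only the `gmx mdrun` flags themselves (`-deffnm`, `-cpi`, `-noappend`) are real GROMACS options. The length of each batch is assumed to be configured elsewhere.

```python
import shlex


def mdrun_batch_command(deffnm="md", checkpoint=None):
    """Build the argument list for one batch (hypothetical helper).

    -noappend makes mdrun write new output files for this batch
    instead of appending to the existing trajectory.
    """
    cmd = ["gmx", "mdrun", "-deffnm", deffnm, "-noappend"]
    if checkpoint is not None:
        # Resume from the checkpoint written by the previous batch.
        cmd += ["-cpi", checkpoint]
    return cmd


# Print the command for a continuation batch; pass the list to
# subprocess.run(...) on a machine where GROMACS is installed.
print(shlex.join(mdrun_batch_command(checkpoint="md.cpt")))
```

With -noappend, GROMACS adds a part number to the output file names (for example md.part0002.xtc), so each batch leaves behind its own small file to upload.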

Furthermore, combining the checkpoint and the part-file into a single partXXXX.tar.gz archive ensures that they are uploaded together, so an upload can never pair a new checkpoint with an old trajectory. If an uploaded archive is corrupted or truncated, that can be detected using gunzip --test.
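The archive step could look something like the sketch below, which uses only the Python standard library. The function names, file names, and the four-digit numbering are assumptions made for illustration, not the code from the fix. The integrity check reads the whole gzip stream, which is roughly what gunzip --test does.

```python
import gzip
import tarfile
import zlib
from pathlib import Path


def pack_batch(part_number, checkpoint, part_file, out_dir):
    """Bundle one batch's checkpoint and part-file into partNNNN.tar.gz."""
    archive = Path(out_dir) / f"part{part_number:04d}.tar.gz"
    tmp = archive.with_name(archive.name + ".tmp")
    with tarfile.open(tmp, "w:gz") as tar:
        for path in (checkpoint, part_file):
            # Store bare file names so the archive unpacks into one directory.
            tar.add(path, arcname=Path(path).name)
    # Rename only once the archive is complete, so a half-written
    # file never carries the final name.
    tmp.replace(archive)
    return archive


def archive_is_intact(archive):
    """Return True if the gzip stream decompresses cleanly to the end."""
    try:
        with gzip.open(archive, "rb") as stream:
            while stream.read(1 << 20):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

When resuming, one option is to take the highest-numbered archive for which `archive_is_intact` returns True and continue from the checkpoint inside it.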

bioinfkaustin commented 11 months ago

Fixed in 6530aee.