ShuklaGroup / MaxEntVAMPNet

Repository for Maximum Entropy VAMPNet Adaptive Sampling
MIT License

Memory Error When Pickling #3

Open msinclair-py opened 11 months ago

msinclair-py commented 11 months ago

I am consistently getting a MemoryError when writing out checkpoints, around the time 40 simulations have run (70k atoms, 8 rounds of 5 trajectories at 5 ns each) using MaxEnt. I am using g5.xlarge nodes on AWS (NVIDIA A10G; 16 GB RAM), and the pickle dump crashes around 5 GB written, which does not add up to me. Per the author of dill (in this thread: https://stackoverflow.com/questions/17513036/pickle-dump-huge-file-without-memory-error), it seems that the klepto package might be more suitable for writing these checkpoints out.
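For what it's worth, one thing that helped me elsewhere is streaming the pickle straight to a file handle with the highest protocol, rather than building the full byte string in memory first. This is only a sketch; `save_checkpoint`/`load_checkpoint` are hypothetical helper names, not part of MaxEntVAMPNet:

```python
import pickle

def save_checkpoint(obj, path):
    """Stream the pickle directly to disk (highest protocol) instead of
    materializing the whole serialized byte string in RAM first."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_checkpoint(path):
    """Read a checkpoint written by save_checkpoint."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

The same pattern works with `dill.dump`/`dill.load` if the simulation object contains things the stdlib pickler can't handle.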

In the meantime is it possible to restart from the log files rather than the checkpoints? I was writing out checkpoints every 5 rounds and it would be nice to not have to restart from 5 rounds ago, but either way I plan to run without checkpointing for the time being.

diegoeduardok commented 11 months ago

Serializing the entire simulation object might be impractical for large systems. I used this approach for convenience, but I will work on writing a method to restart a simulation object from the logs only. This would basically be an alternative constructor that reads the pickled log file and performs some operations on the saved trajectory files. It will require that all the trajectory files are available exactly where the file handler object expects them. Please be patient, as adding these changes might take a while because I have other responsibilities.
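Conceptually, the alternative constructor would look something like the sketch below. Everything here is illustrative: `Sampler`, `from_logs`, and the log keys `round`/`traj_files` are stand-ins, not the actual MaxEntVAMPNet API.

```python
import pickle

class Sampler:
    """Minimal stand-in for the sampling class (names are illustrative)."""

    def __init__(self, traj_files=None, round_=0):
        self.traj_files = traj_files or []
        self.round = round_

    @classmethod
    def from_logs(cls, log_file):
        """Rebuild state from the pickled log instead of a full checkpoint."""
        with open(log_file, "rb") as fh:
            log = pickle.load(fh)
        obj = cls(traj_files=log["traj_files"], round_=log["round"])
        # The real implementation would also re-featurize the saved
        # trajectory files, which must exist where the log says they are.
        return obj
```

The key point is that only lightweight metadata lives in the log; the heavy arrays are recomputed from the trajectory files on restart.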

I was not aware of the klepto package. I still have to assess how easy it would be to replace dill with this package, but I will look into it!

msinclair-py commented 11 months ago

klepto is by the same author as dill and, from my understanding, actually uses dill under the hood. They gave an example of using it to serialize a large object on the Stack Overflow page I linked.

I had one other question regarding how memory is handled in the MaxEnt workflow: are all of the OpenMM trajectories being stored in memory? My runs crash somewhere in the range of 100-250 ns pretty consistently and I suspect this may be to blame, but I am new to OpenMM so I am not sure. I should have mentioned this in my original post, but I am running explicit-solvent simulations with a class I wrote that inherits from the InVacuoSim class, in the same manner as the implicit-solvent class (and as such my system is ~80k atoms).

diegoeduardok commented 11 months ago

I'm already working on the new restart procedure that uses the log files. That should be sorted out pretty soon.

About your second question: the trajectories are not stored in memory, but their projections in the collective variable space and the VAMPNet-transformed space are. If you are using a large number of collective variables (for example, all residue-pair distances) for a large protein, then you would probably run into issues. In this case, I suggest that you narrow down the collection of collective variables that go into the EntropyBasedSampling.features parameter.
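To see why the projections alone can exhaust RAM, here is a back-of-envelope estimate. All the numbers are illustrative (they are not taken from your setup), assuming float32 storage and all residue-pair distances as features:

```python
# Rough memory estimate for in-memory CV projections (illustrative numbers).
n_trajs = 40                 # trajectories collected so far
frames_per_traj = 2500       # e.g. 5 ns saved every 2 ps
n_res = 500                  # residues in a hypothetical protein
n_features = n_res * (n_res - 1) // 2          # all residue-pair distances
bytes_total = n_trajs * frames_per_traj * n_features * 4  # 4 bytes per float32
print(f"{n_features} features -> {bytes_total / 1e9:.1f} GB")
```

For this toy case that is roughly 50 GB, which is why trimming the feature set passed to `EntropyBasedSampling.features` (e.g. distances for a small set of residues of interest) can make such a large difference.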

I am looking at ways to improve the data efficiency so that there is no memory bloating. However, this will involve a significant amount of writing/reading .npy files to/from disk, which might result in slower execution.
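The rough idea would be something like the following sketch: one .npy file per trajectory, re-opened lazily via memory mapping so only the slices actually touched are paged into RAM. The helper names are hypothetical:

```python
import numpy as np

def store_projection(path, proj):
    """Write one trajectory's projection to disk as a .npy file."""
    np.save(path, proj)

def load_projection(path):
    """Re-open a stored projection without reading it all into RAM.

    mmap_mode="r" memory-maps the file, so slicing pulls in only the
    pages that are actually accessed.
    """
    return np.load(path, mmap_mode="r")
```

The trade-off is exactly the one mentioned above: disk I/O on every access instead of a one-time cost at featurization.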

diegoeduardok commented 11 months ago

OK, I have added the ability to restart from log files.

```python
# Restart from logfile
adaptive_run = EntropyBasedSampling(log_file="path/to/logfile.pkl")
adaptive_run.run_round(n_select=trajs_per_round, n_steps=traj_len, n_repeats=1)  # Continues from previous round
```

A warning: I had to change the information saved in the log file to be able to rebuild the EntropyBasedSampling object, so don't try to use this with older log files.

I tested only EntropyBasedSampling and VampNetLeastCounts. Although it has also been implemented for some of the other methods, I haven't tried them yet. Still, I figured the main interest is in EntropyBasedSampling, so I think it is better to release it now. I'll leave the issue open until I finish with the other methods, but I will prioritize the memory optimizations next.

msinclair-py commented 11 months ago

This is great, thank you! I appreciate all the time you have spent on this; the framework has been very useful so far.