Exscientia / physicsml

A package for all physics based/related models
MIT License

MACE models and openMM OOM error #28

Closed wiederm closed 2 months ago

wiederm commented 2 months ago

Hi everyone!

We have been experiencing memory leaks when running molecular dynamics simulations with models trained with physicsml and run through the physicsml openMM interface. I suspect that the leak originates from the physicsml openMM interface.

I have attached a zipped folder containing the simulation script (simulation.py), the input topology and coordinate files, and the trained MACE model.

simulation.zip

best regards, marcus

wardhaddadin1 commented 2 months ago

Thanks for posting this. Will have a look now!

Just to check: is it a CUDA memory leak? And roughly after how many steps does it break?

wardhaddadin1 commented 2 months ago

Hey! So I tested this out and it doesn't seem to fail at the same point (as I would expect from a memory leak). One time I ran it, it failed after a couple of minutes (~2k steps); another time it ran for ~10k steps before failing. The GPU memory usage was constant in both runs until failure (screenshot below).

I also tested the TorchScript model file physicsml_model.pt (which is created for use in openMM) by running it (outside of openMM) on the initial positions for a while, and the memory was stable as well.
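
A minimal sketch of that kind of standalone check (the call signature assumed below, positions in and energy out, is an assumption and may not match the inputs the physicsml openMM wrapper actually passes to the exported model):

```python
import torch

# Evaluate the exported TorchScript model repeatedly on fixed positions
# while watching CUDA memory. The call signature here is illustrative only.
model = torch.jit.load("physicsml_model.pt").to("cuda")

# 125 TIP3P waters = 375 atoms; random positions stand in for the real ones.
positions = torch.rand(375, 3, device="cuda", dtype=torch.float32)

for step in range(10_000):
    energy = model(positions)
    if step % 1_000 == 0:
        print(f"step {step}: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB allocated")
```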

My suspicion is this: the error I saw when it failed once came from the neighbour list asking to allocate a lot of memory. This happens when there are a lot of edges (many more than a usual point cloud would have). I suspect that at some point in the simulation there was an instability which led to the positions getting very close to each other, resulting in a nearly fully connected graph which doesn't fit on the GPU (this is also probably why it doesn't happen at the same number of steps).
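
To illustrate the scaling argument (this is not physicsml's actual neighbour-list code, just a naive all-pairs sketch): while the atoms stay spread out, the number of edges within the cutoff grows roughly linearly with the number of atoms, but if the positions collapse onto each other it approaches N², and so does the memory the neighbour list has to allocate.

```python
import torch

def count_edges(positions: torch.Tensor, cutoff: float) -> int:
    # Naive all-pairs neighbour count: every pair closer than `cutoff`
    # becomes an edge in the graph the model runs message passing over.
    dists = torch.cdist(positions, positions)
    mask = (dists < cutoff) & ~torch.eye(len(positions), dtype=torch.bool)
    return int(mask.sum())

n_atoms, cutoff = 375, 5.0  # 125 waters, a typical MACE cutoff in Angstrom

spread_out = torch.rand(n_atoms, 3) * 16.0  # well-behaved box
collapsed = torch.rand(n_atoms, 3) * 2.0    # atoms piled on top of each other

print("normal box edges:", count_edges(spread_out, cutoff))
print("collapsed box edges:", count_edges(collapsed, cutoff))  # approaches n_atoms**2
```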

[Screenshot: GPU memory usage over time, constant until failure]

wiederm commented 2 months ago

Thank you, Ward! Yes, it is a GPU memory problem.

adambaskerville commented 2 months ago

I ran the example you posted, Marcus, and I also ran into a CUDA memory issue where it tried to allocate 29.40 GiB of CUDA memory. I analysed the 28 frames which were saved before crashing, and nothing seems overly suspicious in terms of the physics (see profile below), aside from some longer bond lengths than one might expect, but I doubt this is responsible for the memory issue.

I will do some more investigation and try to verify the behaviour Ward describes, as that sounds like a sensible explanation for the crash (kind of a "not a bug but a feature" situation, in that the crash may be indicative of the model giving poor predictions for some configurations of water molecules).

[Plot: bond-length/geometry profile of the 28 saved frames]
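
For anyone who wants to repeat the check, a rough sketch of the bond-length analysis (assuming MDTraj; the trajectory and topology file names below are placeholders, not the ones from simulation.zip):

```python
import mdtraj as md
import numpy as np

# Load the frames saved before the crash (placeholder file names).
traj = md.load("trajectory.dcd", top="waterbox.pdb")

# Collect the O-H bonds from the topology (two per TIP3P water).
oh_pairs = np.array(
    [(a.index, b.index) for a, b in traj.topology.bonds
     if {a.element.symbol, b.element.symbol} == {"O", "H"}]
)

# compute_distances returns nanometres; the equilibrium O-H length is ~0.096 nm,
# so per-frame maxima drifting well above that would flag the instability.
oh_lengths = md.compute_distances(traj, oh_pairs)
print("max O-H length per frame (nm):", oh_lengths.max(axis=1))
```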

wiederm commented 2 months ago

We have investigated this further and observed that, in the frames before the GPU OOM error is thrown, the temperature and potential energy increase dramatically. @AnnaPicha will post the plots in a minute.

AnnaPicha commented 2 months ago

Hi everyone! Thanks a lot for taking a closer look at our issue. As described in the script Marcus attached above, we ran a simulation of a 125-molecule TIP3P water box with a 0.5 fs integration step, using our self-trained MACE model. After around 14k steps, we observe (as shown in the plot) a dramatic increase in temperature and energy. Note that the plot does not display the entire simulation but only starts at simulation step 14,220.

[Plot: temperature and potential energy vs. simulation step, rising sharply from step 14,220]
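
A minimal sketch of how such thermodynamic logging can be set up in openMM (assuming the `simulation` object built in simulation.py; the reporter settings here are illustrative, not necessarily what the attached script uses):

```python
from openmm.app import StateDataReporter

# Log step, potential energy and temperature every 100 steps so an
# instability like the one above shows up well before the OOM.
simulation.reporters.append(
    StateDataReporter(
        "thermo.log",
        100,
        step=True,
        potentialEnergy=True,
        temperature=True,
        speed=True,
    )
)
simulation.step(20_000)
```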

wardhaddadin1 commented 2 months ago

I'm closing this now (hopefully the PBC fix will sort this out). If it still persists, we can reopen it. Thanks, everyone!