zhang-ivy closed this issue 2 years ago
@zhang-ivy I've been playing a bit with this and I can confirm that the system running out of memory is real. I didn't know that OpenMM was capable of keeping data in both system memory ("CPU" RAM) and the GPU. When it runs out of memory on the GPU, it starts using system memory (at the cost of performance, of course: the more that falls on the CPU, the slower it gets). So one "quick" fix for your issue is asking for the largest GPU and making sure your job requests enough memory (I used 32 GB and was able to run without issues for an hour).
I played with different GPUs and configurations and the summary is as follows.
GPU model | GPU nominal RAM | Used "CPU" RAM | Errored? |
---|---|---|---|
gtx 1080 | 8 GB | 7 GB | Yes -- KeyError: (-1249640991601170458, 7408306308948268275) and openmm.OpenMMException: Error initializing FFT: 5 |
gtx 2080 | 8 GB | 22 GB | No -- TERM_RUNLIMIT: job killed after reaching LSF run time limit. |
rtx 6000 | 16 GB | 10 GB | No -- TERM_RUNLIMIT: job killed after reaching LSF run time limit. |
gtx 2080 Ti | 11 GB | 9 GB | Yes -- KeyError: (388033134317850309, -862127015919603845) and openmm.OpenMMException: Error creating array pmeGrid1: CUDA_ERROR_OUT_OF_MEMORY (2) |
You can see it varies (no idea why). I also reproduced the CUDA_ERROR_OUT_OF_MEMORY on my local workstation with an nvidia gtx Titan X (6 GB). I tried asking for the Tesla V100 GPUs in Lilac, but that job is still in the queue.
Note: maybe a quick hotfix is asking for an rtx 6000 GPU directly. That is the `ly-gpu` queue in Lilac.
It does seem that there is something going on with how the memory is being handled, since it keeps increasing over time. I'm still not an expert using openmmtools contexts but maybe playing with that could also help. So maybe if you expect your job to be running for much longer, you would need to ask for even more memory.
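A cheap way to confirm the monotonic growth on the host side is to log the process's peak RSS between iterations. A minimal stdlib sketch (my own helper, not part of openmmtools or perses; note `ru_maxrss` is reported in kilobytes on Linux):

```python
import resource
import time

def log_rss(label=""):
    """Print and return the process's peak resident set size in KB.

    Call this between replica-exchange iterations; if the peak keeps
    climbing iteration after iteration, something is leaking host memory.
    """
    # ru_maxrss is the high-water mark of resident memory (KB on Linux).
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{time.strftime('%H:%M:%S')} {label} peak RSS: {peak_kb / 1024:.1f} MB")
    return peak_kb
```

For the GPU side, watching `nvidia-smi` alongside this would show whether device memory grows in step with host memory.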
We must be leaking memory or OpenMM `Context` objects. When running `ReplicaExchangeSampler`, the context cache should only keep a few `Context` objects cached to keep memory usage constant. When using `DummyContextCache`, it should just create and destroy `Context` objects as needed, taking up the same amount of GPU memory no matter how many replicas there are. There's either a bug in the context cache or in OpenMM itself causing the leak.
@ijpulidos can you try using the dummy cache and see if memory use still grows without bounds? Also try setting the cache to keep only ~12 contexts (assuming 12 replicas):

```python
context_cache = cache.ContextCache(capacity=12, time_to_live=None)
```

and see if that stops the memory from growing.
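To illustrate why a fixed `capacity` should bound memory, here is a toy LRU cache in plain Python (a hypothetical `BoundedContextCache`, not the actual openmmtools implementation): evicting the least recently used entry guarantees at most `capacity` contexts stay alive, so GPU memory use stays flat.

```python
from collections import OrderedDict

class BoundedContextCache:
    """Toy LRU cache sketching the capacity-bounded behavior described above.

    Hypothetical class, for illustration only: openmmtools' real
    ContextCache keys on (thermodynamic state, integrator) and holds
    OpenMM Context objects.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self._contexts = OrderedDict()  # insertion order == recency order

    def get(self, key, create):
        if key in self._contexts:
            # Cache hit: mark this entry as most recently used.
            self._contexts.move_to_end(key)
            return self._contexts[key]
        context = create()  # e.g. build a fresh OpenMM Context
        self._contexts[key] = context
        if len(self._contexts) > self.capacity:
            # Evict the least recently used entry, freeing its memory.
            self._contexts.popitem(last=False)
        return context
```

If memory still grows with a bounded cache like this, the leak is below the cache, e.g. in OpenMM itself.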
@mikemhenry The outcome is the same as far as I can see: it fills GPU memory and then continues into CPU memory, monotonically increasing the memory used. This time it didn't fail, for whatever reason, but I think it's still a sign that something is leaking.
Our friends at relaytx have also reported that when GPU memory fills, the system starts to use CPU memory and the CPU, which is much slower. Their systems are 454 protein residues (7460 protein atoms). The ligands range from 45-65 atoms each, and they are using `n_states: 12`. I need to make sure I don't need to redact anything, but I've got some logs of memory use on A100 and V100 GPUs.
> Our friends at relaytx have also reported that when GPU memory fills, the system starts to use CPU memory and CPU which is much slower.
OH. This may not be using "CPU memory" (I don't think that's actually what is happening); it might be using the `CPU` platform, which is indeed much slower! There must be some sort of `Context` or memory leak with `ContextCache` here; we really need to look into this carefully.
Using `DummyContextCache` does indeed help: I can now consistently run the test case for an hour without getting errors, even in my local setup. It seems to be a good workaround right now. That still doesn't really explain why using `ContextCache(capacity=None, time_to_live=None)` is increasing the used memory over time.
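The dummy-cache strategy can be modeled with a toy sketch (hypothetical class and method names, not the real openmmtools API): because nothing is ever retained, at most one context's worth of memory is live at a time, regardless of how many replicas are processed.

```python
class DummyCacheSketch:
    """Toy model of the DummyContextCache idea: create a context on demand,
    use it, and let it go immediately instead of caching it.

    Hypothetical class for illustration; the live/peak counters only exist
    here to demonstrate the bounded-memory property.
    """
    def __init__(self):
        self.live = 0   # contexts currently alive
        self.peak = 0   # high-water mark of live contexts

    def with_context(self, create, use):
        context = create()          # build a fresh context for this evaluation
        self.live += 1
        self.peak = max(self.peak, self.live)
        result = use(context)       # e.g. compute energies for one replica
        self.live -= 1              # release the context right away
        return result
```

The trade-off is rebuilding a `Context` on every evaluation, which is slower than a working cache, but memory stays constant, matching the behavior reported above.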
It definitely looks like the bug is with `ContextCache` then.
I'm trying to run replica exchange (REST) on a system with 185K atoms, and it runs fine when I use a REST region of ~70 atoms with 12 replicas. However, I get the following error when I run with a REST region of ~1000 atoms with 24 or 36 replicas.
I'm currently using this as the context cache:

```python
context_cache = cache.ContextCache(capacity=None, time_to_live=None)
```
To reproduce the issue, I can point to where the files are located on lilac (the files are large, so it doesn't make sense to drop them here). In `/data/chodera/zhangi/perses_benchmark/neq/14/147/for_debugging/`:

- `perses-rbd-ace2-direct.yml` -- a yaml of the env I'm using
- `generate_rest_cache_interface.py` -- rest script
- `run_rest2_complex_for_debugging.sh` -- bash script
- `147_complex_1.pickle` -- pickled RepartitionedHybridTopologyFactory needed for the rest script
- `147_complex_1_rest.pickle` -- pickled RESTTopologyFactory needed for the rest script

To run, create a `147/` directory and place the factories and scripts in that directory. The `147/` directory also has xmls for the system (the REST system), the state (a thermodynamic state at temp 298 K), and the integrator (attached to the context cache, though I don't think this is actually being used anywhere).

Some observations I made:
- `lt`, `ly`, `lx`, and `lu`. I've also seen absence of the error on these nodes.