zhang-ivy closed this issue 2 years ago
@zhang-ivy I've been playing a bit with this and I can confirm that the system running out of memory is real. I didn't know that OpenMM was capable of keeping data in both system memory ("CPU" RAM) and the GPU. When it runs out of memory on the GPU, it starts using system memory (at the cost of performance, of course: the more that falls on the CPU, the slower it gets). So one "quick" fix for your issue is asking for the largest GPU and making sure your job requests enough memory (I used 32 GB and was able to run without issues for an hour).
I played with different GPUs and configurations and the summary is as follows.
GPU model | GPU nominal RAM | Used "CPU" RAM | Errored? |
---|---|---|---|
gtx 1080 | 8 GB | 7 GB | Yes -- KeyError: (-1249640991601170458, 7408306308948268275) and openmm.OpenMMException: Error initializing FFT: 5 |
gtx 2080 | 8 GB | 22 GB | No -- TERM_RUNLIMIT: job killed after reaching LSF run time limit. |
rtx 6000 | 16 GB | 10 GB | No -- TERM_RUNLIMIT: job killed after reaching LSF run time limit. |
gtx 2080 Ti | 11 GB | 9 GB | Yes -- KeyError: (388033134317850309, -862127015919603845) and openmm.OpenMMException: Error creating array pmeGrid1: CUDA_ERROR_OUT_OF_MEMORY (2) |
You can see it varies (no idea why). I also reproduced the CUDA_ERROR_OUT_OF_MEMORY on my local workstation with an nvidia gtx Titan X (6 GB). I tried asking for the Tesla V100 GPUs in Lilac, but that job is still in the queue.
Note: maybe a quick hotfix is asking for an rtx 6000 GPU directly. That is the `ly-gpu` queue in Lilac.
It does seem that there is something going on with how the memory is being handled, since it keeps increasing over time. I'm still not an expert using openmmtools contexts but maybe playing with that could also help. So maybe if you expect your job to be running for much longer, you would need to ask for even more memory.
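A cheap way to confirm the monotonic growth on the host side is to log the process's peak RSS between iterations. A minimal stdlib sketch (my own helper, not part of openmmtools or perses; note `ru_maxrss` is reported in kilobytes on Linux):

```python
import resource
import time

def log_rss(label=""):
    """Print and return the process's peak resident set size in KB.

    Call this between replica-exchange iterations; if the peak keeps
    climbing iteration after iteration, something is leaking host memory.
    """
    # ru_maxrss is the high-water mark of resident memory (KB on Linux).
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{time.strftime('%H:%M:%S')} {label} peak RSS: {peak_kb / 1024:.1f} MB")
    return peak_kb
```

For the GPU side, watching `nvidia-smi` alongside this would show whether device memory grows in step with host memory.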
We must be leaking memory or OpenMM `Context` objects. When running `ReplicaExchangeSampler`, the context cache should only keep a few `Context` objects cached to keep memory usage constant. When using `DummyContextCache`, it should just create and destroy `Context` objects as needed, taking up the same amount of GPU memory no matter how many replicas there are. There's either a bug in the context cache or in OpenMM itself causing the leak.
@ijpulidos can you try using the dummy cache and see if memory use still grows without bounds? Also try setting the cache to keep only ~12 contexts (assuming 12 replicas):

```python
context_cache = cache.ContextCache(capacity=12, time_to_live=None)
```

and see if that stops the memory from growing.
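To illustrate why a fixed `capacity` should bound memory, here is a toy LRU cache in plain Python (a hypothetical `BoundedContextCache`, not the actual openmmtools implementation): evicting the least recently used entry guarantees at most `capacity` contexts stay alive, so GPU memory use stays flat.

```python
from collections import OrderedDict

class BoundedContextCache:
    """Toy LRU cache sketching the capacity-bounded behavior described above.

    Hypothetical class, for illustration only: openmmtools' real
    ContextCache keys on (thermodynamic state, integrator) and holds
    OpenMM Context objects.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self._contexts = OrderedDict()  # insertion order == recency order

    def get(self, key, create):
        if key in self._contexts:
            # Cache hit: mark this entry as most recently used.
            self._contexts.move_to_end(key)
            return self._contexts[key]
        context = create()  # e.g. build a fresh OpenMM Context
        self._contexts[key] = context
        if len(self._contexts) > self.capacity:
            # Evict the least recently used entry, freeing its memory.
            self._contexts.popitem(last=False)
        return context
```

If memory still grows with a bounded cache like this, the leak is below the cache, e.g. in OpenMM itself.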
@mikemhenry The outcome is the same as far as I can see: it fills GPU memory and then continues into CPU memory, monotonically increasing the memory used. This time it didn't fail, for whatever reason, but I think it's still a sign that something is leaking.
Our friends at relaytx have also reported that when GPU memory fills, the system starts to use CPU memory and the CPU, which is much slower. Their systems are 454 protein residues (7460 protein atoms). The ligands range from 45-65 atoms each, and they are using `n_states: 12`. I need to make sure I don't need to redact anything, but I've got some logs of memory use on A100 and V100 GPUs.
> Our friends at relaytx have also reported that when GPU memory fills, the system starts to use CPU memory and CPU which is much slower.
OH. This may not be using "CPU memory" (I don't think that's actually what is happening); it might be using the `CPU` platform, which is indeed much slower! There must be some sort of `Context` or memory leak with `ContextCache` here; we really need to look into this carefully.
Using `DummyContextCache` does indeed help: I can now consistently run the test case for an hour without getting errors, even in my local setup. It seems to be a good workaround right now. That still doesn't really explain why using `ContextCache(capacity=None, time_to_live=None)` is increasing the used memory over time.
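The dummy-cache strategy can be modeled with a toy sketch (hypothetical class and method names, not the real openmmtools API): because nothing is ever retained, at most one context's worth of memory is live at a time, regardless of how many replicas are processed.

```python
class DummyCacheSketch:
    """Toy model of the DummyContextCache idea: create a context on demand,
    use it, and let it go immediately instead of caching it.

    Hypothetical class for illustration; the live/peak counters only exist
    here to demonstrate the bounded-memory property.
    """
    def __init__(self):
        self.live = 0   # contexts currently alive
        self.peak = 0   # high-water mark of live contexts

    def with_context(self, create, use):
        context = create()          # build a fresh context for this evaluation
        self.live += 1
        self.peak = max(self.peak, self.live)
        result = use(context)       # e.g. compute energies for one replica
        self.live -= 1              # release the context right away
        return result
```

The trade-off is rebuilding a `Context` on every evaluation, which is slower than a working cache, but memory stays constant, matching the behavior reported above.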
It definitely looks like the bug is with `ContextCache` then.
I'm trying to run replica exchange (REST) on a system with 185K atoms, and it runs fine when I use a REST region of ~70 atoms with 12 replicas. However, I get the following error when I run with a REST region of ~1000 atoms with 24 or 36 replicas.
I'm currently using this as the context cache:

```python
context_cache = cache.ContextCache(capacity=None, time_to_live=None)
```
To reproduce the issue, I can point to where the files are located on lilac (the files are large, so it doesn't make sense to drop them here). In `/data/chodera/zhangi/perses_benchmark/neq/14/147/for_debugging/`:

- `perses-rbd-ace2-direct.yml` -- a yaml of the env I'm using
- `generate_rest_cache_interface.py` -- rest script
- `run_rest2_complex_for_debugging.sh` -- bash script
- `147_complex_1.pickle` -- pickled RepartitionedHybridTopologyFactory needed for the rest script
- `147_complex_1_rest.pickle` -- pickled RESTTopologyFactory needed for the rest script

To run, create a `147/` directory and place the factories and scripts in that directory. The `147/` directory also has xmls for the system (the REST system), the state (a thermodynamic state at temp 298 K), and the integrator (attached to the context cache, though I don't think this is actually being used anywhere).

Some observations I made:
- `lt`, `ly`, `lx`, and `lu`. I've also seen absence of the error on these nodes.