choderalab / openmmtools

A batteries-included toolkit for the GPU-accelerated OpenMM molecular simulation engine.
http://openmmtools.readthedocs.io
MIT License

ReplicaExchangeSampler failing silently on single A100 GPU when too many replicas are used #732

Open · k2o0r opened this issue 4 months ago

k2o0r commented 4 months ago

Hi, I've been trying to use openmmtools for some HREX simulations recently and ran into some unusual behaviour.

I had been trying to use MPI to distribute the replicas over multiple GPUs, but I had issues getting this to work.

That seems to be related to the OpenMPI build on my cluster rather than anything to do with openmmtools, so I next tried running all 16 replicas on a single A100-SXM-80GB card (with 32 CPU cores and 250 GB RAM also allocated). However, jobs would run for a long time (~16 hours) without the reporter file ever growing, and even very short simulations (e.g. 3 iterations) never finished.

The exact same code (16 replicas on one card) ran on my workstation, albeit quite slowly, and I can also run 12 replicas on the cluster with the setup above. I therefore suspect the problem is the memory required to store all the contexts simultaneously, but it's strange that I don't get any kind of error: the job just runs until it hits its time limit without ever writing data to the reporter files.
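
If it really is all 16 contexts sitting on the card at once, one thing I'm considering is capping openmmtools' shared context cache so that only a few OpenMM Contexts stay resident at a time. I'm not sure this is the supported way to do it, and the numbers below are guesses:

```python
# Possible mitigation (untested): bound how many OpenMM Contexts the shared
# cache keeps alive at once, trading context re-creation cost for a smaller
# GPU memory footprint. Capacity/time-to-live values are guesses.
from openmmtools import cache

cache.global_context_cache.capacity = 4       # keep at most 4 contexts resident
cache.global_context_cache.time_to_live = 50  # evict contexts unused for 50 cache accesses
```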

Has this kind of issue been reported before? Do you think it would be possible to add some kind of check that files are actually being written, or that the sampler is actually progressing through iterations?
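
For now the only check I have is an external one: opening the reporter file read-only from a separate process and asking it for the last iteration it recorded. This assumes MultiStateReporter.read_last_iteration behaves as its docstring suggests, and the path is just a placeholder for my setup:

```python
# External progress check: open the reporter read-only and print the last
# iteration it contains. Run from outside the simulation job.
from openmmtools.multistate import MultiStateReporter

reporter = MultiStateReporter('hrex_output.nc', open_mode='r')  # placeholder path
try:
    last_iteration = reporter.read_last_iteration(last_checkpoint=False)
    print(f'Last iteration written to storage: {last_iteration}')
finally:
    reporter.close()
```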

mikemhenry commented 4 months ago

I think it is worth keeping this issue open to see if others have similar reports, but fundamentally, a request like this:

Do you think it would be possible to add some kind of check that files are actually being written, or that the sampler is actually progressing through iterations?

really reduces to the halting problem; we can't programmatically tell whether something is just taking a long time (like a big simulation) or is stuck in an infinite loop.
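
The closest practical thing is a heuristic watchdog outside the sampler, e.g. flagging the job if the storage file hasn't grown for some generous window. A minimal sketch, with a placeholder path and an arbitrary threshold that would need tuning to your iteration time:

```python
# Heuristic stall detector: warn if the storage file stops growing for a
# generous window. Path and threshold are placeholders, not recommendations.
import os
import time

STORAGE = 'hrex_output.nc'   # placeholder path
STALL_WINDOW = 2 * 3600      # seconds without growth before we complain

last_size, last_change = os.path.getsize(STORAGE), time.time()
while True:
    time.sleep(300)
    size = os.path.getsize(STORAGE)
    if size != last_size:
        last_size, last_change = size, time.time()
    elif time.time() - last_change > STALL_WINDOW:
        print(f'WARNING: {STORAGE} has not grown for {STALL_WINDOW} s; '
              'the sampler may be stalled.')
        break
```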

I think your intuition here:

The exact same code (16 replicas on one card) ran on my workstation, albeit quite slowly, and I can also run 12 replicas on the cluster with the setup above. I therefore suspect the problem is the memory required to store all the contexts simultaneously, but it's strange that I don't get any kind of error.

is likely correct. The lack of an error may be because the card tries to use some form of virtual memory, allocating more pages than it can actually hold, which then cycle in and out of device memory very slowly; I can't remember exactly what tricks GPUs do these days. In general I find that when trying to parallelize things you just have to figure it out empirically: throw 2x, 4x, 8x, 16x the resources at the problem and see how it scales. At some point communication overhead and memory swapping eat into the gains and, as you observed, result in a decrease in performance.
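
If it helps, a rough version of that kind of scaling check with openmmtools might look like the sketch below: time a few replica-exchange iterations at increasing replica counts on a toy parallel-tempering system. The replica counts, move settings, and iteration count are arbitrary placeholders to adapt to your HREX systems:

```python
# Rough scaling sweep: time a few replica-exchange iterations at increasing
# replica counts on a toy system. All settings here are placeholders.
import os
import tempfile
import time

from openmm import unit
from openmmtools import mcmc, states, testsystems
from openmmtools.multistate import MultiStateReporter, ParallelTemperingSampler

testsystem = testsystems.AlanineDipeptideImplicit()
reference_state = states.ThermodynamicState(system=testsystem.system,
                                            temperature=300.0 * unit.kelvin)
move = mcmc.LangevinDynamicsMove(timestep=2.0 * unit.femtoseconds,
                                 collision_rate=1.0 / unit.picosecond,
                                 n_steps=500)

for n_replicas in (2, 4, 8, 16):
    storage = os.path.join(tempfile.mkdtemp(), f'scaling_{n_replicas}.nc')
    sampler = ParallelTemperingSampler(mcmc_moves=move, number_of_iterations=3)
    sampler.create(reference_state,
                   states.SamplerState(testsystem.positions),
                   storage=MultiStateReporter(storage, checkpoint_interval=10),
                   min_temperature=300.0 * unit.kelvin,
                   max_temperature=400.0 * unit.kelvin,
                   n_temperatures=n_replicas)
    start = time.time()
    sampler.run()
    print(f'{n_replicas:2d} replicas: {time.time() - start:.1f} s for 3 iterations')
```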