choderalab / yank

An open, extensible Python framework for GPU-accelerated alchemical free energy calculations.
http://getyank.org
MIT License

replica exchange performance #1044

Open jlincoff opened 6 years ago

jlincoff commented 6 years ago

I've been working with the replica exchange module, trying to run standard temperature REMD. I set up an OpenMM script following the format proposed in the openmmtools issue "Missing feature for Replica Exchange?". I'm running on Titan, which is set up with single-GPU nodes with K80 cards and aprun for job submission. I have openmm, openmmtools, and yank all installed via conda.

I'm finding that the speed drops when switching from plain openmm to a "single replica" yank setup, and drops further when adding replicas. I expect some of this comes from the different output file formats and requirements, plus communication between nodes/cards when replicas are added, but how much slowdown should be expected? The degree of slowdown makes me suspect something isn't configured correctly, e.g. that each replica isn't being assigned to its own node/GPU. I'm submitting test jobs with e.g. aprun -n 2 -N 1 python yank_test.py for two replicas.

Thanks!

jchodera commented 6 years ago

If you're talking about ORNL TITAN rather than a GTX Titan, you will need to make sure that each YANK process starts on a separate node, since each node only has one GPU.

Here's an example of a batch queue script for doing this: https://github.com/choderalab/kinase-resistance-mutants/blob/master/hauser-abl-benchmark/yank/run-pbs-titan.sh

Note the use of

aprun -n $PBS_NUM_NODES -N 1 -d 16 yank script --yaml=allmuts-sams.yaml

where each YANK process will run on a separate 16-core node so that each process has access to one GPU.
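A quick way to confirm that aprun is really placing one process per node is a tiny mpi4py check before launching YANK itself. This is a generic sketch, not something from the YANK codebase, and the script name sanity_check_mpi.py is made up:

# sanity_check_mpi.py -- hypothetical helper, not part of YANK.
# Run with e.g. `aprun -n 2 -N 1 python sanity_check_mpi.py`; with -N 1,
# every rank should report a different node name.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank {} of {} is running on node {}".format(
    comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))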

I found I had to build and install a special version of mpi4py that ran with Cray aprun. To do that:

# Remove mpi4py and install special version for titan
# Make sure to remove glib, since it breaks `aprun`
conda remove --yes --force glib mpi mpich mpi4py

# Build and install special mpi4py for titan
cd $SOFTWARE
wget https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-3.0.0.tar.gz -O mpi4py-3.0.0.tar.gz
tar zxf mpi4py-3.0.0.tar.gz
cd mpi4py-3.0.0

# Point mpi4py at Cray MPICH and the cc/CC compiler wrappers
cat >> mpi.cfg <<EOF
[cray]
mpi_dir              = /opt/cray/mpt/7.6.3/gni/mpich-gnu/4.9/
mpicc                = cc
mpicxx               = CC
extra_link_args      = -shared
include_dirs         = %(mpi_dir)s/include
libraries            = mpich
library_dirs         = %(mpi_dir)s/lib/shared:%(mpi_dir)s/lib
runtime_library_dirs = %(mpi_dir)s/lib/shared
EOF

python setup.py build --mpi=cray
python setup.py install

Let me know if you're still running into trouble and we can see if we have any more tricks up our sleeve for TITAN! It's unfortunately a very old, very sick machine, so I wouldn't expect too much from its five-year-old GPUs and horrifically customized OS.

jchodera commented 6 years ago

Also, if you need to install the latest dev version of OpenMM compiled against CUDA 9.1 (which TITAN currently uses), you can do that with

conda install -c omnia/label/cuda91 openmm==7.3.0

We're still planning to migrate the multistate samplers to openmmtools at some point---hopefully soon!

andrrizzi commented 6 years ago

If you're doing temperature REMD (and you're not doing this already), then you could also try to use ParallelTempering instead of the ReplicaExchange class. We haven't used it much, but the computation of the MBAR energy matrix at each iteration should be faster.

Both the Gibbs sampling procedure and the MBAR energy matrix computation scale superlinearly with the number of states, and the I/O operations scale more or less linearly, so some slowdown is to be expected, although I'm not sure about the actual numbers.

Also, I'd make sure you are not using GHMCMove, which is presented as an example in the snippet on that thread, unless you require exact sampling of the distribution.
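For reference, here is a rough sketch of what a ParallelTempering driver with a Langevin move could look like. It assumes the multistate samplers still live in yank.multistate (as in YANK releases from around this time; they were later migrated to openmmtools), and the test system, temperatures, and move parameters below are placeholders:

from simtk import unit
from openmmtools import mcmc, states, testsystems
from yank.multistate import MultiStateReporter, ParallelTemperingSampler

# Placeholder system; substitute your own System and positions.
testsystem = testsystems.AlanineDipeptideImplicit()

# Reference thermodynamic state at the lowest temperature.
reference_state = states.ThermodynamicState(system=testsystem.system,
                                            temperature=300.0 * unit.kelvin)

# Langevin dynamics rather than GHMC, per the suggestion above.
move = mcmc.LangevinDynamicsMove(timestep=2.0 * unit.femtoseconds,
                                 collision_rate=1.0 / unit.picoseconds,
                                 n_steps=500)

simulation = ParallelTemperingSampler(mcmc_moves=move, number_of_iterations=50)
reporter = MultiStateReporter('parallel_tempering.nc', checkpoint_interval=10)

# ParallelTemperingSampler takes a single reference state plus a temperature
# range, rather than a full list of thermodynamic states.
simulation.create(reference_state,
                  states.SamplerState(testsystem.positions),
                  reporter,
                  min_temperature=300.0 * unit.kelvin,
                  max_temperature=400.0 * unit.kelvin,
                  n_temperatures=4)
simulation.run()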

jlincoff commented 6 years ago

Thank you! This is very helpful. Yes, this is on ORNL.

I had already been working with the CUDA 9.1 version of openmm, and have now switched to the Cray build of mpi4py. I'd been using a Langevin move, and switching from ReplicaExchange to ParallelTempering has also sped things up!

Oddly enough, the mpi4py switch on its own didn't improve the speeds and actually slowed down standalone openmm a little, but with ParallelTempering and the other changes, multi-card runs are faster than they had been. Would you recommend switching to a yaml script instead of plain python?

For reference, the single-card/base openmm speed I'm getting is 57 ns/day for 30,000 atoms and a 2 fs timestep. With two replicas it drops to about 22 ns/day, and it falls further with additional replicas.

andrrizzi commented 5 years ago

Would you recommend switching to a yaml script instead of plain python?

Sorry @jlincoff, I just noticed the question. Going through the YAML script is useful only because it takes care of setting up the simulation automatically using our best practices for alchemical calculations, but it doesn't affect the performance, and if you are doing parallel tempering, I think you're stuck with the Python API.

Two replicas goes down to about 22

Is this two replicas on two parallel processes or on a single one? Could you post the log of an iteration? It usually contains timing information for the individual parts of the algorithm. If you are using the Python API, you may have to configure the python logger in your driver script to enable logging.DEBUG-level information.
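For example, something along these lines at the top of the driver script should surface the per-part timings in the output (a generic logging snippet, not YANK-specific):

import logging

# Emit DEBUG-level messages (including per-iteration timing breakdowns)
# from the yank/openmmtools loggers.
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(name)s: %(message)s')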

jchodera commented 5 years ago

@jlincoff : Can you provide more information to help us debug this?