choderalab / openmmtools

A batteries-included toolkit for the GPU-accelerated OpenMM molecular simulation engine.
http://openmmtools.readthedocs.io
MIT License

[Question] How would one use MPI with the ReplicaExchangeSampler? #738

Open PSMusicalRoc opened 1 month ago

PSMusicalRoc commented 1 month ago

Hi there!

I've been futzing around with the ReplicaExchangeSampler class, trying my best to figure out how the Python code in my main file should be laid out to actually run a simulation with MPI. Is it possible to get a minimal working example of a simulation running under MPI so that I can adjust it to fit my needs?

For context, I would be running a simulation on the SLURM job management system, parallelized across CPU cores. GPUs would not be a part of this simulation.

Thank you in advance!

schuhmc commented 2 weeks ago

I would also be interested in this. I am trying to run a ReplicaExchangeSampler using multiple nodes with one GPU each for free energy calculations.

~~The following code seems to work for me and changes a global parameter 'lambda_en'. I do see simultaneous usage of all used GPUs when running the simulation, but the speedup is minimal (400 ns/day on a single node vs 600 ns/day on 4 nodes). I would have expected that parallelizing the simulations of the different replicas would lead to a much larger speedup, essentially running one replica per node at a time. But maybe I am misunderstanding the MPI implementation here? To me it seems more like one simulation is spread out over multiple nodes instead.~~

Update: I managed to get it working. I now get around 1700 ns/day using 4 nodes and 8 replicas, which is very close to perfect scaling.

The only thing I needed to change was the group_size parameter inside the openmmtools code: https://github.com/choderalab/openmmtools/blob/9fc8ab74f16f957dbb74215bf616c70aeafbf13f/openmmtools/multistate/multistatesampler.py#L1301-L1302

Simply append 'group_size=1' to the function call.

As far as I can see, it otherwise calls the function with group_size=None and mpiplus tries to figure it out by itself. In my case that led to the behavior described above. I am not sure if that is intended, or if I'm just missing some kind of environment variable, but it seems to work for me.
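
To make concrete what that parameter changes, here is a small self-contained sketch (not the openmmtools code itself; propagate_replica is just a dummy stand-in for the per-replica propagation) of how mpiplus.distribute splits tasks across MPI ranks when group_size=1 is passed:

import mpiplus

def propagate_replica(replica_id):
    # Dummy stand-in for the expensive per-replica MD propagation.
    return replica_id ** 2

# group_size=1 assigns each task (replica) to a single MPI rank, so different
# ranks propagate different replicas in parallel; send_results_to='all' makes
# every rank receive the gathered output at the end.
output = mpiplus.distribute(propagate_replica, range(8),
                            send_results_to='all', group_size=1)
print(output)

When launched with mpirun across several ranks, each rank then works on a disjoint subset of the eight dummy replicas, which is the behavior I wanted for the replica exchange simulation.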

If desired, I can submit a pull request for a corresponding change.

Anyways, here is the code I came up with so far:


from mpi4py import MPI
from openmm import unit, XmlSerializer, app
import openmm as mm
from openmmtools import states, mcmc, multistate
import numpy as np
import os

# Deserialize the system and load the equilibrated PDB file
with open('system.xml', 'r') as f:
    system = XmlSerializer.deserializeSystem(f.read())
pdbFile = app.PDBFile('eq.pdb')

n_replicas = 8
lambdas = np.round(np.linspace(0, 1, n_replicas), 8)

class LambdaState(states.GlobalParameterState):
    lambda_en = states.GlobalParameterState.GlobalParameter('lambda_en', 0.)

# Create thermodynamic states
thermodynamic_states = []
for lambda_value in lambdas:
    thermodynamic_state = states.ThermodynamicState(system=system, temperature=300*unit.kelvin)
    lambda_state = LambdaState.from_system(system)
    lambda_state.lambda_en = lambda_value
    compound_state = states.CompoundThermodynamicState(thermodynamic_state, composable_states=[lambda_state])
    thermodynamic_states.append(compound_state)

# MCMC move setup
mcmc_move = mcmc.LangevinDynamicsMove(timestep=2*unit.femtosecond, n_steps=50000)

# ReplicaExchangeSampler setup
simulation = multistate.ReplicaExchangeSampler(
    mcmc_moves=mcmc_move,
    number_of_iterations=10,
    online_analysis_interval=1,
    online_analysis_target_error=0.,
    replica_mixing_scheme='swap-neighbors')

reporter = multistate.MultiStateReporter('output.nc', checkpoint_interval=1)

# Initialize simulation
simulation.create(thermodynamic_states=thermodynamic_states,
                  sampler_states=states.SamplerState(pdbFile.positions, box_vectors=system.getDefaultPeriodicBoxVectors()),
                  storage=reporter)

# Run simulation
simulation.equilibrate(2)
simulation.run()

I submit this to our SLURM cluster with the following script, using sbatch run.sh:

#!/bin/bash
#SBATCH --job-name=hrex_test
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=24:00:00
#SBATCH --gpus-per-task=1
#SBATCH --partition=NNNGN

source /home/schuhmarc/anaconda3/etc/profile.d/conda.sh
conda activate openMMCluster

mpirun python hrex_cluster_test.py

(I have also tested this on CPU nodes and it seems to work there as well)

Note that I am not experienced in MPI at all, so take all this with a grain of salt. I would be very happy about additional input here as well.

k2o0r commented 2 weeks ago

I tested the modification in multistatesampler.py, and with the settings and software on my cluster it did not seem to lead to the correct behavior with just mpirun python script.py when attempting to use MPI to distribute replicas across 3 GPUs on a single node.

I'm wondering if in your case you can get the correct scaling in this way because you seem to have only 1 GPU per node on your cluster? If you want to use MPI within a node, you have to use a command similar to this:

bash generate_files.sh
mpirun --hostfile hostfile --app appfile

Where the hostfile contains the name of the node repeated once for each process:

gpu022
gpu022
gpu022

And the appfile contains:

-np 1 -x CUDA_VISIBLE_DEVICES=0 python script.py
-np 1 -x CUDA_VISIBLE_DEVICES=1 python script.py
-np 1 -x CUDA_VISIBLE_DEVICES=2 python script.py

Here I've got 3 GPUs in one node, so I'm launching 3 processes, each explicitly allocated one GPU. This approach should also work fine for multiple nodes (depending on cluster settings, of course); we just have to launch one process per GPU.

I've made a simple bash script to generate files in the correct format for my hardware/MPI setup, with 3 GPUs/node and OpenMPI.
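
In case it is useful, a Python equivalent of such a generator could look roughly like this (the GPU count, output file names, and script.py are just placeholders for my setup, not anything defined by openmmtools):

import socket

# Hypothetical helper that writes the OpenMPI hostfile/appfile layout shown
# above: one process per GPU, each pinned to a different device.
n_gpus = 3
hostname = socket.gethostname()

with open('hostfile', 'w') as f:
    # The node name repeated once per process.
    f.write('\n'.join([hostname] * n_gpus) + '\n')

with open('appfile', 'w') as f:
    for gpu_id in range(n_gpus):
        f.write(f'-np 1 -x CUDA_VISIBLE_DEVICES={gpu_id} python script.py\n')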

For MPICH, you can use the clusterutils package (also from the Chodera lab) to generate MPICH-compatible hostfiles and configfiles, which should detect the number and configuration of GPUs automatically. The command then becomes something like:

build_mpirun_configfile python script.py
mpiexec.hydra --hostfile hostfile --configfile configfile

An important point is that you need to install/load an mpi4py version that's compatible with your cluster's MPI build for this to work properly. I haven't tested it, but one should be able to use a similar approach to get this working with the CPU platform.
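
A quick way to check that pairing (just a suggestion on my part, not something openmmtools requires) is to ask mpi4py which MPI library it was built against and compare that with the MPI module loaded on the cluster:

# Run this under the cluster's mpirun/srun and compare the reported library
# and vendor with the MPI module you loaded.
from mpi4py import MPI

print(MPI.Get_library_version())
print(MPI.get_vendor())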

Regarding the Python script: you use the same script whether running with MPI or not. I've used openmmtools for repex without MPI quite a lot because this MPI setup can be quite fiddly, and the Python scripts are identical in both cases.

schuhmc commented 2 weeks ago

> I'm wondering if in your case you can get the correct scaling in this way because you seem to have only 1 GPU per node on your cluster?

That is true; we only have single-GPU nodes in our cluster. If you look at previous issues, it seems that some kind of GPU masking is indeed necessary otherwise.

While investigating, I have also tried running with the option --host node01,node02, although without the --app option. However, without the modification in multistatesampler.py the simulations did not seem to run in parallel and the speedup of using multiple nodes was minimal.

With the modification I get about 1150 ns/day on three nodes; without it, performance drops to 576 ns/day on the same three nodes.

I am not sure whether you need to define the hosts you are running on; according to the mpirun documentation, this should not be necessary in a SLURM environment (see here). Note that this should also be true for older versions (the version we are using is 3.1.3).

On that note, I was not able to start my jobs with mpirun --hostfile hostfile --app appfile using a corresponding hostfile and appfile as you specified, but this is probably an issue on my end.

However, there is also some interesting discussion about this in #713 with a script gpu_bind.sh which should also be able to handle GPU masking.

> For MPICH you can use the clusterutils package also from the Chodera lab

Just note that this has not been updated in 7 years.

Anyways, could you maybe check whether my proposed change above makes a difference in performance for you in both setups, i.e. single-node multi-GPU and multi-node with one GPU each? I don't have access to any multi-GPU nodes at the moment.