choderalab / openmmtools

A batteries-included toolkit for the GPU-accelerated OpenMM molecular simulation engine.
http://openmmtools.readthedocs.io
MIT License

[Question] How to parallelize replicas in REMD simulations? #648

Open felixmusil opened 1 year ago

felixmusil commented 1 year ago

Hi,

I would like to run REMD simulations using the ReplicaExchangeSampler from this library. I have quite a few replicas, and I would like to simulate them in parallel to reduce the overall runtime. I see that the current implementation uses mpiplus to distribute the workload over MPI workers, and it works fine with a standard MPI launch over CPU workers. Is it possible to assign different GPUs to different replicas using the current infrastructure? The best I could achieve is several MPI workers simulating different replicas on the same GPU. If it's not possible at the moment, how can I set a different DeviceIndex on the platform associated with each thermodynamic_state?
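The kind of per-rank GPU assignment I have in mind would look roughly like this (a minimal sketch, untested, assuming OpenMM >= 7.6 import paths, that mpi4py is available, and that the sampler respects the platform set on openmmtools' global context cache):

# Minimal sketch (untested): assign one GPU per MPI rank by setting a default
# DeviceIndex on the CUDA platform and routing openmmtools' context creation
# through that platform. Whether this is honored for every replica is exactly
# what I'm unsure about.
from mpi4py import MPI
from openmm import Platform
from openmmtools import cache

rank = MPI.COMM_WORLD.Get_rank()

platform = Platform.getPlatformByName("CUDA")
# Pick the GPU for this rank (assumes one MPI rank per GPU on the node).
platform.setPropertyDefaultValue("DeviceIndex", str(rank))

# All contexts created through the global cache will use this platform.
cache.global_context_cache.platform = platform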

Dan-Burns commented 1 year ago

Based on issue 516 and some of the replex discussion and scripts from @zhang-ivy, it's clear you have to run it with a hostfile and a configfile.

You can generate both files with the clusterutils script build_mpirun_configfile.py.

The file contents are very simple and can be reused: just change the hostfile contents to whatever node name your job is running on, and you can skip build_mpirun_configfile.py for subsequent runs.

It appears to be running successfully for me - unless I actually have 4 separate instances of replica exchange all writing to the same file.

When I was playing around with it in an interactive session, I had to change the "srun" call in build_mpirun_configfile.py (line 215) to "mpirun" and manually set two environment variables to get the script to run:

export SLURM_JOB_NODELIST=$HOSTNAME
export CUDA_VISIBLE_DEVICES=0,1,2,3

(or whatever the available device ids are)

The hostfile just contains the node name, repeated once for each GPU you'll use (assuming you're on a single node):

exp-17-59
exp-17-59
exp-17-59
exp-17-59

and configfile contains:

-np 1 -env CUDA_VISIBLE_DEVICES 0 python hremd_start.py :
-np 1 -env CUDA_VISIBLE_DEVICES 1 python hremd_start.py :
-np 1 -env CUDA_VISIBLE_DEVICES 2 python hremd_start.py :
-np 1 -env CUDA_VISIBLE_DEVICES 3 python hremd_start.py

Running it just looks like this:

build_mpirun_configfile.py --configfilepath configfile --hostfilepath hostfile "python hremd_start.py"
mpiexec.hydra -f hostfile -configfile configfile

I'll follow up if it turns out I was dreaming.

Dan-Burns commented 1 year ago

Of course, the Slurm output makes it look like 4 independent processes started, since several print statements are repeated 4 times. Those are simulation setup steps, though, and even if they are being executed several times, ReplicaExchangeSampler's run() is probably running as a single instance, because if I do

lsof -p xxxxx | grep nc

for each GPU process ID, only one of the GPU processes is accessing and writing to the log files.

Seems like a good sign.

VasilyAkulovSBP commented 1 year ago

Hi @felixmusil and @Dan-Burns! I am trying to set up HREX MD with MPI, but the tutorial only shows how to run all replicas on a single core. I would like to run each replica independently on several cores, but with all replicas on the same GPU (we have 1 GPU per node). Do you know how to set up such a run? When I use mpiexec.hydra -n 8 python HREX.py, I get 8 copies of the HREX system (each consisting of 8 thermodynamic_states) and consequently no speedup.
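In case it helps diagnose this: a minimal check of whether the MPI environment is actually visible inside the script (a sketch, assuming mpiplus relies on mpi4py) would be

# Minimal sanity check (sketch): if MPI is being picked up, each of the 8
# processes launched by mpiexec.hydra should report size 8 and a distinct rank;
# if every process reports size 1, the script is running as 8 independent copies.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()}")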