choderalab / yank

An open, extensible Python framework for GPU-accelerated alchemical free energy calculations.
http://getyank.org
MIT License

Stalling on multiple SLURM nodes #1305

Closed · yanze039 closed this issue 1 year ago

yanze039 commented 1 year ago

Hi,

I am trying to use yank on 2 nodes with 2 GPU cards on each node.

I can successfully run the yank example on a single node with 2 GPU cards, but when I run it on 2 nodes with 4 cards in total, the code stalls here forever.

For example, below is the experiments.log:

2023-08-25 00:51:28,908 - DEBUG - yank.experiment - DSL string for the ligand: "resname MOL"
2023-08-25 00:51:28,909 - DEBUG - yank.experiment - DSL string for the solvent: "auto"
2023-08-25 00:51:28,910 - INFO - yank.experiment - Reading phase complex
2023-08-25 00:51:28,910 - DEBUG - yank.pipeline - prmtop: p-xylene-explicit-output-8-cards.4/setup/systems/t4-xylene/complex.prmtop
2023-08-25 00:51:28,910 - DEBUG - yank.pipeline - inpcrd: p-xylene-explicit-output-8-cards.4/setup/systems/t4-xylene/complex.inpcrd
2023-08-25 00:51:37,228 - DEBUG - mpiplus.mpiplus - Node 2/4: waiting for broadcast of <function find_alchemical_counterions at 0x7fbb5f3a9790>
2023-08-25 00:51:37,263 - WARNING - openmmtools.multistate.multistatesampler - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:51:37,305 - DEBUG - openmmtools.multistate.multistatesampler - CUDA devices available: ('uuid, name, compute_mode', 'GPU-6f07f7d1-b74b-ade4-fb35-ff23d46bc4d9, Tesla V100-PCIE-32GB, Default', 'GPU-8485ce30-7bf7-9e47-bdcd-fe7f6156e32c, Tesla V100-PCIE-32GB, Default')
2023-08-25 00:51:37,323 - INFO - yank.experiment - Reading phase solvent
2023-08-25 00:51:37,324 - DEBUG - yank.pipeline - prmtop: p-xylene-explicit-output-8-cards.4/setup/systems/t4-xylene/solvent.prmtop
2023-08-25 00:51:37,324 - DEBUG - yank.pipeline - inpcrd: p-xylene-explicit-output-8-cards.4/setup/systems/t4-xylene/solvent.inpcrd
2023-08-25 00:51:37,734 - DEBUG - mpiplus.mpiplus - Node 2/4: waiting for broadcast of <function find_alchemical_counterions at 0x7fbb5f3a9790>
2023-08-25 00:51:37,735 - WARNING - openmmtools.multistate.multistatesampler - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:51:37,768 - DEBUG - openmmtools.multistate.multistatesampler - CUDA devices available: ('uuid, name, compute_mode', 'GPU-6f07f7d1-b74b-ade4-fb35-ff23d46bc4d9, Tesla V100-PCIE-32GB, Default', 'GPU-8485ce30-7bf7-9e47-bdcd-fe7f6156e32c, Tesla V100-PCIE-32GB, Default')
2023-08-25 00:51:37,828 - WARNING - openmmtools.multistate.multistatereporter - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:51:37,828 - DEBUG - openmmtools.multistate.multistatereporter - Initial checkpoint file automatically chosen as p-xylene-explicit-output-8-cards.4/experiments/complex_checkpoint.nc
2023-08-25 00:51:41,076 - DEBUG - yank.yank - Creating receptor-ligand restraints...
2023-08-25 00:51:41,076 - DEBUG - yank.yank - There are undefined restraint parameters. Trying automatic parametrization.
2023-08-25 00:51:41,156 - DEBUG - yank.restraints - Restraint Harmonic: Automatically picked restrained ligand_atoms atom: ligand_atoms
2023-08-25 00:51:41,188 - DEBUG - yank.restraints - Restraint Harmonic: Automatically picked restrained receptor_atoms atom: receptor_atoms
2023-08-25 00:51:41,271 - DEBUG - yank.restraints - Spring constant sigma, s = 0.522 nm
2023-08-25 00:51:41,271 - DEBUG - yank.restraints - K = 0.0 kcal/mol/A^2
2023-08-25 00:51:44,406 - DEBUG - yank.restraints - Standard state correction: 0.297 kT
2023-08-25 00:51:44,407 - DEBUG - openmmtools.utils.utils - Restraint Harmonic: Computing standard state correction took    0.130s
2023-08-25 00:51:44,407 - DEBUG - yank.yank - Creating alchemically-modified states...
2023-08-25 00:51:45,359 - DEBUG - openmmtools.alchemy - Dictionary of interacting alchemical regions: frozenset()
2023-08-25 00:51:45,359 - DEBUG - openmmtools.alchemy - Using 1 alchemical regions
2023-08-25 00:51:51,735 - DEBUG - openmmtools.alchemy - Adding steric interaction groups between  and the environment.
2023-08-25 00:51:51,747 - DEBUG - openmmtools.alchemy - Adding a steric interaction group between group  and .
2023-08-25 00:51:53,042 - DEBUG - openmmtools.utils.utils - Create alchemically modified system took    7.678s
2023-08-25 00:52:00,778 - DEBUG - yank.yank - Creating expanded cutoff states...
2023-08-25 00:52:01,717 - DEBUG - yank.yank - Setting cutoff for fully interacting system to 16 A. The minimum box dimension is 9.5541862 nm.
2023-08-25 00:52:10,506 - DEBUG - yank.yank - Setting cutoff for fully interacting system to 16 A. The minimum box dimension is 9.5541862 nm.
2023-08-25 00:52:14,348 - DEBUG - yank.yank - Creating sampler object...
2023-08-25 00:52:14,348 - DEBUG - mpiplus.mpiplus - Node 2/4: waiting for broadcast of <bound method MultiStateReporter.storage_exists of <openmmtools.multistate.multistatereporter.MultiStateReporter object at 0x7fbcd7724e20>>
2023-08-25 00:52:14,496 - DEBUG - mpiplus.mpiplus - Node 2/4: waiting for barrier after <function MultiStateSampler._initialize_reporter at 0x7fbb66e56f70>

My SLURM script looks like this:

#!/bin/bash
#SBATCH -p xeon-g6-volta
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --export=ALL
#SBATCH --gres=gpu:volta:2
#SBATCH -t 3-24

source /etc/profile
module load anaconda/2023a
source activate yank
set -e

# Run the simulation with verbose output:
echo "Running simulation via MPI..."
export PREFIX="p-xylene-explicit"
build_mpirun_configfile --hostfilepath $PREFIX.hostfile --configfilepath $PREFIX.configfile "yank script --yaml=$PREFIX.yaml"
mpiexec.hydra -f $PREFIX.hostfile -configfile $PREFIX.configfile
date

I can't figure out from the log file why it stalls here, but I want to determine whether it is because of the code or the configuration of my compute cluster.

mikemhenry commented 1 year ago

This is a good question. Looking at the output, it looks like we only see 2 GPU cards. Does the build_mpirun_configfile output look sensible?

yanze039 commented 1 year ago

Thanks for your quick response!

build_mpirun_configfile looks good.

The p-xylene-explicit.configfile looks like this:

-np 1 -env CUDA_VISIBLE_DEVICES GPU-8485ce30-7bf7-9e47-bdcd-fe7f6156e32c yank script --yaml=p-xylene-explicit.yaml :
-np 1 -env CUDA_VISIBLE_DEVICES GPU-6f07f7d1-b74b-ade4-fb35-ff23d46bc4d9 yank script --yaml=p-xylene-explicit.yaml :
-np 1 -env CUDA_VISIBLE_DEVICES GPU-12db7fbe-3e48-a8f0-ae7e-68aa43429796 yank script --yaml=p-xylene-explicit.yaml :
-np 1 -env CUDA_VISIBLE_DEVICES GPU-4d3b6e4c-5eb4-3cf1-6061-94194700d8ae yank script --yaml=p-xylene-explicit.yaml 
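
As a sanity check (just a sketch, not something I ran), one could list the GPUs that each allocated node actually reports and compare the UUIDs against the CUDA_VISIBLE_DEVICES entries above:

# Sketch: print the hostname and GPU UUIDs on each of the 2 allocated nodes
# (assumes this is run from inside the same SLURM allocation).
srun --nodes=2 --ntasks-per-node=1 bash -c 'echo "== $(hostname) =="; nvidia-smi -L'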

I also have several other experiments_x.log files; for example, experiments_1.log looks like this:

2023-08-25 00:51:28,908 - DEBUG - yank.experiment - DSL string for the ligand: "resname MOL"
2023-08-25 00:51:28,909 - DEBUG - yank.experiment - DSL string for the solvent: "auto"
2023-08-25 00:51:28,910 - INFO - yank.experiment - Reading phase complex
2023-08-25 00:51:28,910 - DEBUG - yank.pipeline - prmtop: p-xylene-explicit-output-8-cards.4/setup/systems/t4-xylene/complex.prmtop
2023-08-25 00:51:28,910 - DEBUG - yank.pipeline - inpcrd: p-xylene-explicit-output-8-cards.4/setup/systems/t4-xylene/complex.inpcrd
2023-08-25 00:51:37,228 - DEBUG - mpiplus.mpiplus - Node 2/4: waiting for broadcast of <function find_alchemical_counterions at 0x7fbb5f3a9790>
2023-08-25 00:51:37,263 - WARNING - openmmtools.multistate.multistatesampler - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:51:37,305 - DEBUG - openmmtools.multistate.multistatesampler - CUDA devices available: ('uuid, name, compute_mode', 'GPU-6f07f7d1-b74b-ade4-fb35-ff23d46bc4d9, Tesla V100-PCIE-32GB, Default', 'GPU-8485ce30-7bf7-9e47-bdcd-fe7f6156e32c, Tesla V100-PCIE-32GB, Default')
2023-08-25 00:51:37,323 - INFO - yank.experiment - Reading phase solvent
2023-08-25 00:51:37,324 - DEBUG - yank.pipeline - prmtop: p-xylene-explicit-output-8-cards.4/setup/systems/t4-xylene/solvent.prmtop
2023-08-25 00:51:37,324 - DEBUG - yank.pipeline - inpcrd: p-xylene-explicit-output-8-cards.4/setup/systems/t4-xylene/solvent.inpcrd
2023-08-25 00:51:37,734 - DEBUG - mpiplus.mpiplus - Node 2/4: waiting for broadcast of <function find_alchemical_counterions at 0x7fbb5f3a9790>
2023-08-25 00:51:37,735 - WARNING - openmmtools.multistate.multistatesampler - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:51:37,768 - DEBUG - openmmtools.multistate.multistatesampler - CUDA devices available: ('uuid, name, compute_mode', 'GPU-6f07f7d1-b74b-ade4-fb35-ff23d46bc4d9, Tesla V100-PCIE-32GB, Default', 'GPU-8485ce30-7bf7-9e47-bdcd-fe7f6156e32c, Tesla V100-PCIE-32GB, Default')
2023-08-25 00:51:37,828 - WARNING - openmmtools.multistate.multistatereporter - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:51:37,828 - DEBUG - openmmtools.multistate.multistatereporter - Initial checkpoint file automatically chosen as p-xylene-explicit-output-8-cards.4/experiments/complex_checkpoint.nc
2023-08-25 00:51:41,076 - DEBUG - yank.yank - Creating receptor-ligand restraints...
2023-08-25 00:51:41,076 - DEBUG - yank.yank - There are undefined restraint parameters. Trying automatic parametrization.
2023-08-25 00:51:41,156 - DEBUG - yank.restraints - Restraint Harmonic: Automatically picked restrained ligand_atoms atom: ligand_atoms
2023-08-25 00:51:41,188 - DEBUG - yank.restraints - Restraint Harmonic: Automatically picked restrained receptor_atoms atom: receptor_atoms
2023-08-25 00:51:41,271 - DEBUG - yank.restraints - Spring constant sigma, s = 0.522 nm
2023-08-25 00:51:41,271 - DEBUG - yank.restraints - K = 0.0 kcal/mol/A^2
2023-08-25 00:51:44,406 - DEBUG - yank.restraints - Standard state correction: 0.297 kT
2023-08-25 00:51:44,407 - DEBUG - openmmtools.utils.utils - Restraint Harmonic: Computing standard state correction took    0.130s
2023-08-25 00:51:44,407 - DEBUG - yank.yank - Creating alchemically-modified states...
2023-08-25 00:51:45,359 - DEBUG - openmmtools.alchemy - Dictionary of interacting alchemical regions: frozenset()
2023-08-25 00:51:45,359 - DEBUG - openmmtools.alchemy - Using 1 alchemical regions
2023-08-25 00:51:51,735 - DEBUG - openmmtools.alchemy - Adding steric interaction groups between  and the environment.
2023-08-25 00:51:51,747 - DEBUG - openmmtools.alchemy - Adding a steric interaction group between group  and .
2023-08-25 00:51:53,042 - DEBUG - openmmtools.utils.utils - Create alchemically modified system took    7.678s
2023-08-25 00:52:00,778 - DEBUG - yank.yank - Creating expanded cutoff states...
2023-08-25 00:52:01,717 - DEBUG - yank.yank - Setting cutoff for fully interacting system to 16 A. The minimum box dimension is 9.5541862 nm.
2023-08-25 00:52:10,506 - DEBUG - yank.yank - Setting cutoff for fully interacting system to 16 A. The minimum box dimension is 9.5541862 nm.
2023-08-25 00:52:14,348 - DEBUG - yank.yank - Creating sampler object...
2023-08-25 00:52:14,348 - DEBUG - mpiplus.mpiplus - Node 2/4: waiting for broadcast of <bound method MultiStateReporter.storage_exists of <openmmtools.multistate.multistatereporter.MultiStateReporter object at 0x7fbcd7724e20>>
2023-08-25 00:52:14,496 - DEBUG - mpiplus.mpiplus - Node 2/4: waiting for barrier after <function MultiStateSampler._initialize_reporter at 0x7fbb66e56f70>

But experiments_2.log looks like this:

2023-08-25 00:52:26,899 - DEBUG - yank.experiment - DSL string for the ligand: "resname MOL"
2023-08-25 00:52:26,900 - DEBUG - yank.experiment - DSL string for the solvent: "auto"
2023-08-25 00:52:26,907 - INFO - yank.experiment - Reading phase solvent
2023-08-25 00:52:26,907 - DEBUG - yank.pipeline - prmtop: p-xylene-explicit-output-8-cards.4/setup/systems/t4-xylene/solvent.prmtop
2023-08-25 00:52:26,907 - DEBUG - yank.pipeline - inpcrd: p-xylene-explicit-output-8-cards.4/setup/systems/t4-xylene/solvent.inpcrd
2023-08-25 00:52:27,564 - DEBUG - mpiplus.mpiplus - Node 3/4: waiting for broadcast of <function find_alchemical_counterions at 0x7fe1ace23790>
2023-08-25 00:52:27,567 - WARNING - openmmtools.multistate.multistatesampler - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:52:27,593 - DEBUG - openmmtools.multistate.multistatesampler - CUDA devices available: ('uuid, name, compute_mode', 'GPU-4d3b6e4c-5eb4-3cf1-6061-94194700d8ae, Tesla V100-PCIE-32GB, Default', 'GPU-12db7fbe-3e48-a8f0-ae7e-68aa43429796, Tesla V100-PCIE-32GB, Default')
2023-08-25 00:52:27,597 - WARNING - openmmtools.multistate.multistatereporter - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:52:27,597 - DEBUG - openmmtools.multistate.multistatereporter - Initial checkpoint file automatically chosen as p-xylene-explicit-output-8-cards.4/experiments/complex_checkpoint.nc
2023-08-25 00:52:27,972 - DEBUG - openmmtools.multistate.multistatereporter - analysis_particle_indices != on-file analysis_particle_indices!Using on file analysis indices of [   0    1    2 ... 2626 2627 2628]
2023-08-25 00:52:28,015 - WARNING - openmmtools.multistate.multistatereporter - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:52:28,015 - DEBUG - openmmtools.multistate.multistatereporter - Initial checkpoint file automatically chosen as p-xylene-explicit-output-8-cards.4/experiments/complex_checkpoint.nc
2023-08-25 00:52:28,037 - DEBUG - openmmtools.multistate.multistatereporter - analysis_particle_indices != on-file analysis_particle_indices!Using on file analysis indices of [   0    1    2 ... 2626 2627 2628]
2023-08-25 00:52:28,043 - WARNING - openmmtools.multistate.multistatereporter - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:52:28,043 - DEBUG - openmmtools.multistate.multistatereporter - Initial checkpoint file automatically chosen as p-xylene-explicit-output-8-cards.4/experiments/complex_checkpoint.nc
2023-08-25 00:52:28,063 - DEBUG - openmmtools.multistate.multistatereporter - analysis_particle_indices != on-file analysis_particle_indices!Using on file analysis indices of [   0    1    2 ... 2626 2627 2628]
2023-08-25 00:52:28,068 - WARNING - openmmtools.multistate.multistatereporter - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:52:28,068 - DEBUG - openmmtools.multistate.multistatereporter - Initial checkpoint file automatically chosen as p-xylene-explicit-output-8-cards.4/experiments/complex_checkpoint.nc
2023-08-25 00:52:28,088 - DEBUG - openmmtools.multistate.multistatereporter - analysis_particle_indices != on-file analysis_particle_indices!Using on file analysis indices of [   0    1    2 ... 2626 2627 2628]
2023-08-25 00:52:28,106 - WARNING - openmmtools.multistate.multistatesampler - Warning: The openmmtools.multistate API is experimental and may change in future releases
2023-08-25 00:52:28,134 - DEBUG - openmmtools.multistate.multistatesampler - CUDA devices available: ('uuid, name, compute_mode', 'GPU-4d3b6e4c-5eb4-3cf1-6061-94194700d8ae, Tesla V100-PCIE-32GB, Default', 'GPU-12db7fbe-3e48-a8f0-ae7e-68aa43429796, Tesla V100-PCIE-32GB, Default')
2023-08-25 00:52:28,137 - DEBUG - openmmtools.multistate.multistatesampler - Reading storage file p-xylene-explicit-output-8-cards.4/experiments/complex.nc...
2023-08-25 00:52:41,028 - DEBUG - openmmtools.utils.utils - Reading thermodynamic states from storage took    8.689s
2023-08-25 00:52:41,295 - DEBUG - openmmtools.multistate.multistatereporter - read_replica_thermodynamic_states: iteration = 0
2023-08-25 00:52:41,364 - DEBUG - mpiplus.mpiplus - Node 3/4: execute _compute_replica_energies(2)
2023-08-25 00:53:06,842 - DEBUG - mpiplus.mpiplus - Node 3/4: execute _compute_replica_energies(6)
2023-08-25 00:53:06,875 - DEBUG - mpiplus.mpiplus - Node 3/4: execute _compute_replica_energies(10)
2023-08-25 00:53:06,906 - DEBUG - mpiplus.mpiplus - Node 3/4: execute _compute_replica_energies(14)
2023-08-25 00:53:06,936 - DEBUG - mpiplus.mpiplus - Node 3/4: execute _compute_replica_energies(18)
2023-08-25 00:53:06,967 - DEBUG - mpiplus.mpiplus - Node 3/4: execute _compute_replica_energies(22)

They are different.

yanze039 commented 1 year ago

What I am guessing is that if I distribute different experiments on different nodes, that's fine, because replicas only communicate within one node. But here I have only one experiment, and the dispatcher distributes replicas across different nodes. The step seems to stall on the communication between nodes (I guess). Could it be a communication issue between the nodes?

yanze039 commented 1 year ago

What's more interesting is that when I ssh into the compute node, there are two CPU cores loaded at 100% while the GPUs show no load.

Super weird that the CPUs are fully occupied!

mikemhenry commented 1 year ago

It does look that way. I don't really have a hello world example to point to, but I would see if you can find one that checks whether MPI can talk across nodes. I will tag in another software scientist who does more MPI stuff than I do: @ijpulidos
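
For reference, a minimal check along those lines (just a sketch, not tested here; it assumes mpi4py is installed in the same conda environment as yank and reuses the hostfile generated above) could look like:

# Sketch of a cross-node MPI check (assumes mpi4py is available in the yank env).
# Each rank prints its host, then all ranks join a barrier; if this hangs,
# MPI cannot communicate across the two nodes.
mpiexec.hydra -f $PREFIX.hostfile -np 4 python -c "
from mpi4py import MPI
comm = MPI.COMM_WORLD
print(f'rank {comm.rank}/{comm.size} on {MPI.Get_processor_name()}', flush=True)
comm.Barrier()
if comm.rank == 0:
    print('barrier passed: all ranks can talk to each other', flush=True)
"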

I will say that yank doesn't get as much attention as some of our other projects right now, so we can only provide support as best as we can.

yanze039 commented 1 year ago

Below is what I mean about the CPU load. I am not sure what the CPUs are working on while the whole process stalls here...

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND 
1327718 ywang3    20   0   18.3g   1.5g 395972 R 100.0   0.4 818:06.11 yank                                                                                
1327719 ywang3    20   0   18.3g   1.5g 396484 R 100.0   0.4 818:03.64 yank 
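
If it helps, a stack dump would show exactly where these processes are spinning. Just a sketch (assumes py-spy is installed, e.g. via pip install py-spy), run on the compute node against one of the PIDs above:

# Sketch: dump the Python call stack of the stalled yank process
# without interrupting it (requires py-spy).
py-spy dump --pid 1327718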

yanze039 commented 1 year ago

Hi @mikemhenry

I finally solved the bug here.

My solution was to add more threads by setting:

# for example
export OPENMM_NUM_THREADS=16

I am not very sure why this works; below is just my guess. I am using a SLURM cluster. I can successfully run my code on a single node, but it failed when using multiple nodes (on both the GPU and CPU platforms). I checked the compute node and found the CPU load was 100% and the GPU load 0%, but GPU memory was occupied. This looks like a deadlock where comm.gather() can't hear from the other nodes. Finally I found that the hang begins in MultiStateSampler.run() when it executes _compute_replica_energies(). So probably there is some resource imbalance that SLURM can't handle with a single thread per rank across different nodes. I am not sure.

But anyway, on a SLURM cluster, adding more threads solved my problem.

Thanks!

My final script is:

#!/bin/bash
#SBATCH -p xeon-g6-volta
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=40
#SBATCH --export=ALL
#SBATCH --gres=gpu:volta:2
#SBATCH -t 3-24

source /etc/profile
module load anaconda/2023a
source activate yank
set -e

# Run the simulation with verbose output:
echo "Running simulation via MPI..."
export PREFIX="p-xylene-explicit"
# add more threads
export OPENMM_NUM_THREADS=16
build_mpirun_configfile --hostfilepath $PREFIX.hostfile --configfilepath $PREFIX.configfile "yank script --yaml=$PREFIX.yaml"
mpiexec.hydra -f $PREFIX.hostfile -configfile $PREFIX.configfile
date

mikemhenry commented 1 year ago

Awesome! Also, thanks for posting the fix; future searchers will thank you :open_hands:

I wonder if there were not enough threads and things got tied up and communication timed out.