choderalab / openmmtools

A batteries-included toolkit for the GPU-accelerated OpenMM molecular simulation engine.
http://openmmtools.readthedocs.io
MIT License

MPI bug when multiple GPUs are used per calculation #449

Closed: andrrizzi closed this issue 2 years ago

andrrizzi commented 4 years ago

I wanted to create an issue about this in openmmtools as well, since we have completed transferring the multistate code from YANK.

We're still experiencing mixing problems with the latest version of MPI when multiple GPUs are available (see choderalab/yank#1130 and #407). I've added a test to test_sampling checking for this in #407, but I still haven't figured out the cause of the bug.

I'm sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before the multistate module was added), so the next step would probably be to run a binary search over the YANK versions after that to identify where the problem was introduced.

jchodera commented 4 years ago

I'm sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before the multistate module was added), so the next step would probably be to run a binary search over the YANK versions after that to identify where the problem was introduced.

That sounds like the best thing to try now that you have a working test! You could use git bisect run to automate this process. Since the samplers moved from YANK to openmmtools, it could be a simple matter of testing the last version of YANK that included the multistate samplers (which presumably still have the bug) and then bisecting between that and 0.20.1.

mjw99 commented 4 years ago

Hi, some comments.

First, looking at the environment file in the bug report here, the reporter was using YANK 0.23.7, so that narrows the range to between 0.20.1 and 0.23.7.

Second, I have seen this problem with various explicit-solvent YANK calculations that I have run myself across two GPUs. More interestingly, if I run the hydration-phenol example with default_number_of_iterations: 20000 and use the count_states.py script (with a minor modification to use openmmtools packages) from the above bug report, I see:

Solvent1 (water)
# replica_number, total_number_states_visited
0   19 
1   15 
2   19 
3   16 
4   19 
5   15 
6   19 
7   16 
8   19 
9   16 
10   19 
11   17 
12   19 
13   17 
14   19 
15   15 
16   19 
17   6 
18   19 

and

Solvent2 (vacuum)
# replica_number, total_number_states_visited
0   19 
1   19 
2   19 
3   19 
4   19 
5   19 
6   19 
7   19 
8   19 
9   19 
10   19 
11   19 
12   19 
13   19 
14   19 
15   19 
16   19 
17   19 
18   19 

What is interesting here is that for the vacuum phase there is no odd/even difference between the replicas.

I am now speculating: is this something specific to NPT simulations? Could there be an issue with box_vectors and MPI in ReplicaExchangeSampler or its parent class? Do you know if this is seen with the NVT cucurbit[7]uril example over multiple GPUs?

mpvenkatesh commented 3 years ago

Has this been resolved? I am trying to set up a replica exchange sampler on a multi-GPU node, so I could use the command line and other set-up information needed to reproduce the bug discussed here with openmmtools alone (i.e., without YANK).
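
For reference, a minimal sketch of the kind of replica exchange setup discussed in this thread. The test system, temperature ladder, MCMC move settings, and file names below are illustrative assumptions, not the scripts used by the posters here; the script is launched once per MPI rank (e.g., via the hostfile/configfile produced by clusterutils' build_mpirun_configfile, as shown later in this thread), with each rank bound to one GPU:

from openmm import unit
from openmmtools import mcmc, multistate, states, testsystems

# Illustrative solvated test system and temperature ladder.
testsystem = testsystems.AlanineDipeptideExplicit()
n_replicas = 12
temperatures = [(300.0 + 5.0 * i) * unit.kelvin for i in range(n_replicas)]
thermodynamic_states = [
    states.ThermodynamicState(system=testsystem.system, temperature=t)
    for t in temperatures
]
sampler_state = states.SamplerState(
    positions=testsystem.positions,
    box_vectors=testsystem.system.getDefaultPeriodicBoxVectors(),
)

# Langevin propagation between swap attempts; 'swap-all' mixing as in the runs above.
move = mcmc.LangevinDynamicsMove(
    timestep=2.0 * unit.femtoseconds,
    collision_rate=1.0 / unit.picoseconds,
    n_steps=500,
)
sampler = multistate.ReplicaExchangeSampler(
    mcmc_moves=move,
    number_of_iterations=500,
    replica_mixing_scheme='swap-all',
)
reporter = multistate.MultiStateReporter('repex.nc', checkpoint_interval=50)
sampler.create(
    thermodynamic_states=thermodynamic_states,
    sampler_states=sampler_state,
    storage=reporter,
)
sampler.run()

Run serially this uses a single GPU; run under mpiexec with several ranks, mpiplus distributes replica propagation across the ranks, which is the multi-GPU mode where the mixing problem shows up.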

zhang-ivy commented 2 years ago

I ran some repex simulations with multiple GPUs recently and was not able to reproduce this bug.

I ran ~80 iterations of h-repex (with 12 replicas, swap-all) on 2 GPUs and am not able to reproduce the problem:

# replica_number, total_number_states_visited
0 12
1 10
2 12
3 12
4 12
5 11
6 12
7 12
8 12
9 12
10 12
11 11

With 6 GPUs (500 iterations):

0 12
1 12
2 12
3 12
4 12
5 12
6 12
7 12
8 12
9 12
10 12
11 12

Using these MPI-related packages:

mpi                       1.0                       mpich    conda-forge
mpi4py                    3.1.1            py39h6438238_0    conda-forge
mpich                     3.4.2              h846660c_100    conda-forge
mpiplus                   v0.0.1          py39hde42818_1002    conda-forge

We should be able to close this issue unless someone else is still able to reproduce.

jchodera commented 2 years ago

Thanks, @zhang-ivy !

Please re-open if this issue reappears.

zhang-ivy commented 2 years ago

I'm re-opening this issue because I am seeing the same problematic behavior when running h-repex (36 replicas) on barnase:barstar with 2 GPUs:

# replica_number, total_number_states_visited
0 36
1 10
2 36
3 11
4 36
5 13
6 36
7 14
8 36
9 15
10 36
11 14
12 36
13 14
14 36
15 14
16 36
17 14
18 36
19 14
20 36
21 15
22 36
23 12
24 36
25 13
26 36
27 12
28 36
29 12
30 36
31 12
32 36
33 8
34 36
35 6

vs. with 1 GPU:

0 36
1 36
2 36
3 36
4 36
5 36
6 36
7 36
8 36
9 36
10 36
11 36
12 36
13 36
14 36
15 36
16 36
17 36
18 36
19 36
20 36
21 36
22 36
23 36
24 36
25 36
26 36
27 36
28 36
29 36
30 36
31 36
32 36
33 36
34 36
35 36

~I tried running the same experiment on alanine dipeptide in vacuum and am not able to reproduce the problem.~ Edit: I actually think the problem is there with alanine dipeptide in vacuum as well, it's just more subtle.

I'm also not sure why I wasn't seeing the same problem in my previous experiments.

Here is the code to generate the number of states visited per replica, given a .nc file:

import openmmtools
import os

# Read the thermodynamic state index occupied by each replica at every iteration;
# the returned array has shape (n_iterations, n_replicas).
reporter = openmmtools.multistate.MultiStateReporter(os.path.join("0_complex.nc"), 'r')
states = reporter.read_replica_thermodynamic_states()

# For each replica, count how many distinct thermodynamic states it visited.
for i in range(reporter.n_states):
    print(i, len(set(states[:,i])))
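
A slightly finer-grained check (a sketch along the same lines, not part of the attached scripts) is to histogram how often each replica occupies each state, which also exposes the subtler form of the problem where a replica touches every state but spends most of its time in only a few:

import numpy as np
import openmmtools

# Same reporter and array as above; rows are iterations, columns are replicas.
reporter = openmmtools.multistate.MultiStateReporter("0_complex.nc", 'r')
states = reporter.read_replica_thermodynamic_states()

# For each replica, count how many iterations it spent in each thermodynamic state.
for replica_index in range(states.shape[1]):
    visits = np.bincount(states[:, replica_index], minlength=reporter.n_states)
    print(replica_index, visits)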

Attached is the pickled barnase:barstar htf, bash script, and python script for running repex. replica_mixing_issue.zip

cc: @jchodera @ijpulidos

jchodera commented 2 years ago

Thanks for providing a test system, @zhang-ivy!

Which version of clusterutils are you using? Also, when you ran it, did you see this?

(openmm-dev) [chodera@lilac:replica_mixing_issue]$ build_mpirun_configfile --configfilepath configfile_apo --hostfilepath hostfile_apo "python /home/zhangi/choderalab/perses_benchmark/perses_protein_mutations/code/31_rest_over_protocol/run_rest_over_protocol.py $outdir $phase $n_states $n_cycles $T_max"
Detected MPICH version 4! 
Your host and configfiles creation will still be attempted, but you may have problems as build_mpirun_configfile only builds MPICH3 compatible files.

zhang-ivy commented 2 years ago

No, I did not see that warning.

clusterutils              0.3.1              pyhd8ed1ab_1    conda-forge

I'm attaching my environment here: openmm-dev.yml.zip

zhang-ivy commented 2 years ago

Note that when I run my 1 GPU version, everything in the bash script is the same except that I changed the number of GPUs to 1, i.e., I still use the clusterutils commands in the bash script for 1 GPU.

ijpulidos commented 2 years ago

Thanks a lot for all the work and effort from @zhang-ivy, especially for pointing me to the differences between YANK versions 0.20.1 and 0.21.0, which is when this issue first appeared. I think I managed to come up with a solution.

I made a PR with a probable solution. As far as I could tell, the _replica_thermodynamic_states attribute was not getting broadcast to the other MPI ranks. More details are in the PR.
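
For context, a schematic sketch of the mpiplus pattern involved (the general idea only, not the actual diff in the PR; _mix_replicas_stub is a placeholder): the replica mixing runs only on the root MPI rank, so the updated replica-to-state assignment has to be broadcast back to the other ranks, e.g. with the broadcast_result flag of mpiplus.run_single_node. If it isn't, the non-root ranks keep propagating replicas under stale state assignments, which would be consistent with the mixing behavior reported above.

import mpiplus

def _mix_replicas_stub(replica_states):
    # Placeholder for the real mixing step: propose and accept/reject swaps of
    # thermodynamic state indices between replicas, then return the new assignment.
    return list(reversed(replica_states))

replica_states = [0, 1, 2, 3]

# Run the mixing only on rank 0 and broadcast the result to every MPI rank.
# With broadcast_result=False (the default), only rank 0 would see the update.
replica_states = mpiplus.run_single_node(
    0, _mix_replicas_stub, replica_states, broadcast_result=True
)
print(replica_states)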

@zhang-ivy if you can confirm that this solves it with all your examples and systems, that would be a really nice validation. You just need to install the fix-mpi-replica-mix branch with something like pip install "git+https://github.com/choderalab/openmmtools.git@fix-mpi-replica-mix"

jchodera commented 2 years ago

@ijpulidos : That was the bug! Thanks for catching it after all this time!

zhang-ivy commented 2 years ago

@ijpulidos : Great work! I can confirm that your PR fixes the mixing problem for alanine dipeptide in solvent (t-repex) and apo barstar (h-repex).