Closed by andrrizzi 2 years ago
I'm sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before adding the multistate module) so the next step would probably be trying binary search on the YANK versions after that to identify where the problem was introduced.
That sounds like the best thing to try now that you have a working test! You could use git bisect run to automate this process. Since the samplers moved from YANK to openmmtools, it could be a simple matter of testing the last version of YANK that included the multistate samplers (which presumably still has the bug) and then bisecting between that and 0.20.1.
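To make the suggestion concrete, here is a hypothetical, self-contained illustration of the git bisect run workflow on a throwaway toy repository (the repository, commit messages, and "bug" are all invented for the demo). For the real case, the run script would instead install the YANK commit under test and run the mixing test, exiting non-zero when replicas fail to mix.

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    # Run a git command, merging stderr so bisect's progress messages are captured too.
    return subprocess.run(("git",) + args, cwd=cwd, check=True,
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True).stdout

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "demo@example.com", cwd=repo)
git("config", "user.name", "demo", cwd=repo)

# Ten commits; the toy "regression" lands at commit 7.
for i in range(1, 11):
    with open(os.path.join(repo, "status.txt"), "w") as f:
        f.write("FAIL\n" if i >= 7 else "ok\n")
    git("add", ".", cwd=repo)
    git("commit", "-qm", f"commit {i}", cwd=repo)

# Stand-in for the real mixing test: exit 0 when good, non-zero when bad.
test_script = os.path.join(repo, "is_good.sh")
with open(test_script, "w") as f:
    f.write("#!/bin/sh\ngrep -qx ok status.txt\n")
os.chmod(test_script, 0o755)

first = git("rev-list", "--max-parents=0", "HEAD", cwd=repo).strip()
git("bisect", "start", "HEAD", first, cwd=repo)  # HEAD is bad, first commit is good
out = git("bisect", "run", "./is_good.sh", cwd=repo)
print(out.splitlines()[-1])
```

git drives the binary search itself, so only log2(number of commits) versions ever need to be tested; the output ends by naming commit 7 as the first bad commit.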
Hi, some comments.
First, looking at the environment file in the bug report here, the reporter was using YANK 0.23.7, which narrows the range to between 0.20.1 and 0.23.7.
Secondly, I have seen this problem with various explicit-solvent YANK calculations that I have run myself over two GPUs. More interestingly, if I run the hydration-phenol example with default_number_of_iterations: 20000 and use the count_states.py script from the above bug report (with a minor modification to use openmmtools packages), I see:
Solvent1 (water)
0 19
1 15
2 19
3 16
4 19
5 15
6 19
7 16
8 19
9 16
10 19
11 17
12 19
13 17
14 19
15 15
16 19
17 6
18 19
and
Solvent2 (vacuum)
0 19
1 19
2 19
3 19
4 19
5 19
6 19
7 19
8 19
9 19
10 19
11 19
12 19
13 19
14 19
15 19
16 19
17 19
18 19
What is interesting here is that, for the vacuum phase, there is no odd/even difference between replicas.
I am now speculating: is this something specific to NPT simulations? Could there be an issue with box_vectors and MPI in ReplicaExchangeSampler or its parent class? Do you know if this is seen with the NVT cucurbit[7]uril example over multiple GPUs?
Has this been resolved? I am trying to set up a replica exchange sampler on a multi-GPU node, so I could use the command line and other setup information needed to reproduce the bug discussed here with openmmtools (i.e., without YANK).
I ran some repex simulations with multiple GPUs recently and was not able to reproduce this bug. With 2 GPUs, I ran ~80 iterations of h-repex (12 replicas, swap-all):
# replica_number, total_number_states_visited
0 12
1 10
2 12
3 12
4 12
5 11
6 12
7 12
8 12
9 12
10 12
11 11
With 6 GPUs (500 iterations):
0 12
1 12
2 12
3 12
4 12
5 12
6 12
7 12
8 12
9 12
10 12
11 12
Using these MPI-related packages:
mpi 1.0 mpich conda-forge
mpi4py 3.1.1 py39h6438238_0 conda-forge
mpich 3.4.2 h846660c_100 conda-forge
mpiplus v0.0.1 py39hde42818_1002 conda-forge
We should be able to close this issue unless someone else is still able to reproduce.
Thanks, @zhang-ivy !
Please re-open if this issue reappears.
I'm re-opening this issue because I am seeing the same problematic behavior when running h-repex (36 replicas) on barnase:barstar with 2 GPUs:
# replica_number, total_number_states_visited
0 36
1 10
2 36
3 11
4 36
5 13
6 36
7 14
8 36
9 15
10 36
11 14
12 36
13 14
14 36
15 14
16 36
17 14
18 36
19 14
20 36
21 15
22 36
23 12
24 36
25 13
26 36
27 12
28 36
29 12
30 36
31 12
32 36
33 8
34 36
35 6
vs. with 1 GPU:
0 36
1 36
2 36
3 36
4 36
5 36
6 36
7 36
8 36
9 36
10 36
11 36
12 36
13 36
14 36
15 36
16 36
17 36
18 36
19 36
20 36
21 36
22 36
23 36
24 36
25 36
26 36
27 36
28 36
29 36
30 36
31 36
32 36
33 36
34 36
35 36
~I tried running the same experiment on alanine dipeptide in vacuum and am not able to reproduce the problem.~ Edit: I actually think the problem is there with ala dipeptide in vacuum as well; it's just more subtle.
I'm also not sure why I wasn't seeing the same problem in my previous experiments.
Here is the code to generate the number of states visited per replica, given a .nc file:

import openmmtools

# Read the replica -> state trajectory recorded by the repex reporter.
reporter = openmmtools.multistate.MultiStateReporter("0_complex.nc", "r")
states = reporter.read_replica_thermodynamic_states()  # shape (n_iterations, n_replicas)

# Count the distinct thermodynamic states each replica visited.
for i in range(reporter.n_states):
    print(i, len(set(states[:, i])))
Attached is the pickled barnase:barstar htf, bash script, and python script for running repex. replica_mixing_issue.zip
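As a sanity check of the diagnostic itself, here is a hypothetical, self-contained version on synthetic data (no .nc file or reporter needed; the array shapes mirror read_replica_thermodynamic_states, but the numbers are made up). A well-mixed run should show every replica visiting all states, while the buggy pattern above shows alternating replicas stuck in a handful of states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_iterations, n_states = 1000, 12

# states[i, j] = thermodynamic state occupied by replica j at iteration i,
# the same layout returned by reporter.read_replica_thermodynamic_states().
mixed = np.array([rng.permutation(n_states) for _ in range(n_iterations)])

# Broken case mimicking the bug report: odd replicas only ever see 3 states.
stuck = mixed.copy()
stuck[:, 1::2] %= 3

def states_visited(states):
    # Distinct thermodynamic states visited by each replica.
    return [len(set(states[:, j])) for j in range(states.shape[1])]

print(states_visited(mixed))  # all 12s for a healthy run
print(states_visited(stuck))  # odd replicas capped at 3
```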
cc: @jchodera @ijpulidos
Thanks for providing a test system, @zhang-ivy!
Which version of clusterutils are you using?
Also, when you ran it, did you see this?
(openmm-dev) [chodera@lilac:replica_mixing_issue]$ build_mpirun_configfile --configfilepath configfile_apo --hostfilepath hostfile_apo "python /home/zhangi/choderalab/perses_benchmark/perses_protein_mutations/code/31_rest_over_protocol/run_rest_over_protocol.py $outdir $phase $n_states $n_cycles $T_max"
Detected MPICH version 4!
Your host and configfiles creation will still be attempted, but you may have problems as build_mpirun_configfile only builds MPICH3 compatible files.
No, I did not see that warning.
clusterutils 0.3.1 pyhd8ed1ab_1 conda-forge
I'm attaching my environment here: openmm-dev.yml.zip
Note that when I run my 1-GPU version, everything is the same in the bash script except that I changed the number of GPUs to 1; i.e., I still use the clusterutils commands in the bash script for 1 GPU.
Thanks a lot for all the work and effort from @zhang-ivy, especially for pointing me to the differences between YANK versions 0.20.1 and 0.21.0, which was when this issue first appeared. I think I managed to come up with a solution.
I made a PR with a probable solution. As far as I could tell, the _replica_thermodynamic_states attribute was not getting broadcast to the MPI context. More details in the PR.
@zhang-ivy, if you can confirm that this solves it with all your examples and systems, that would be a really nice validation. You just need to install the fix-mpi-replica-mix branch with something like pip install "git+https://github.com/choderalab/openmmtools.git@fix-mpi-replica-mix"
@ijpulidos : That was the bug! Thanks for catching it after all this time!
@ijpulidos : Great work! Can confirm that your PR fixes the mixing problem for ala dipeptide in solvent t-repex and apo barstar h-repex.
I wanted to create an issue about this in openmmtools as well, since we completed the transfer of the multistate code from YANK. We're still experiencing the mixing problem with the latest version of MPI when multiple GPUs are available (see choderalab/yank#1130 and #407). I've added a test to test_sampling checking this in #407, but I still haven't figured out the reason for the bug. I'm sure this was working correctly in YANK during the SAMPLing challenge (i.e., YANK 0.20.1, right before adding the multistate module), so the next step would probably be trying binary search on the YANK versions after that to identify where the problem was introduced.