LangevinSplittingDynamicsMove sometimes nans

@dominicrufa and I were trying to run repex on the endstates of a capped amino acid transformation (ALA->THR) (using the new RepartitionedHybridTopologyFactory in perses) with softened electrostatics, sterics, and torsion. We observed that when we use the LangevinSplittingDynamicsMove, the simulation runs fine at the ALA endstate, but nans immediately for the THR endstate. When we use GHMCMove instead, the simulation runs slower, but doesn't nan.

In the example code below, I show that the nans occur when I apply the LangevinSplittingDynamicsMove (i.e. without even intializing the repex sampler). Note that when I ran 1000 steps of MD at each of the 11 thermodynamic states using the LangevinIntegrator, no nans occurred.

Here is the code:

from perses.tests.test_topology_proposal import generate_atp, generate_dipeptide_top_pos_sys
from openmmtools.alchemy import AbsoluteAlchemicalFactory, AlchemicalRegion, AlchemicalState
from openmmtools import states, cache
from simtk import openmm, unit
import copy
from perses.dispersed import feptasks
from openmmtools.mcmc import LangevinSplittingDynamicsMove, GHMCMove
from openmmtools.multistate import ReplicaExchangeSampler, MultiStateReporter

# Generate htf at lambda = 1
atp, sys_gen = generate_atp()
htf_1 = generate_dipeptide_top_pos_sys(atp.topology, 
                                         new_res = 'THR', 
                                         system = atp.system, 
                                         positions = atp.positions,
                                         system_generator = sys_gen, 
                                         conduct_htf_prop=True,
                                         repartitioned=True, endstate=1, validate_endstate_energy=False)

# Alchemify the system
atoms_to_alchemify = list(htf_1._atom_classes['unique_new_atoms']) + list(htf_1._atom_classes['unique_old_atoms'])
alch_factory = AbsoluteAlchemicalFactory(consistent_exceptions=False)
alchemical_region = AlchemicalRegion(alchemical_atoms=list(atoms_to_alchemify), alchemical_torsions=True)
alchemical_system = alch_factory.create_alchemical_system(htf_1.hybrid_system, alchemical_region)

# Initialize compound thermodynamic states at different temperatures and alchemical states.
protocol = {'temperature': [300]*unit.kelvin*11,
            'lambda_electrostatics': [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
            'lambda_sterics': [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0],
           'lambda_torsions': [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]}
alchemical_state = AlchemicalState.from_system(alchemical_system)
compound_states = states.create_thermodynamic_state_protocol(alchemical_system, 
                                                             protocol=protocol, 
                                                             composable_states=[alchemical_state])

# Minimize w.r.t each thermodynamic state
context_cache = cache.ContextCache()
sampler_state_list = []
sampler_state = states.SamplerState(htf_1.hybrid_positions, box_vectors=htf_1.hybrid_system.getDefaultPeriodicBoxVectors())
for i, compound_thermodynamic_state in enumerate(compound_states):
    print(i)
     # now generating a sampler_state for each thermodyanmic state, with relaxed positions
    context, context_integrator = context_cache.get_context(compound_thermodynamic_state)
    feptasks.minimize(compound_thermodynamic_state, sampler_state)
    sampler_state_list.append(copy.deepcopy(sampler_state))

# Set up the LangevinnSplittingDynamicsMove
n_steps = 500 # 1 ps
n_iterations = 1000 # 1 ns
langevin_move = LangevinSplittingDynamicsMove(timestep=2.0*unit.femtosecond, n_steps=n_steps)

# Apply move
for i, (sampler_state, thermostate) in enumerate(zip(sampler_state_list, compound_states)):
    print(i)
    langevin_move.apply(thermostate, sampler_state)

Here is the output from the for loop in the last code snippet:

1
2
3
4
5
WARNING:openmmtools.mcmc:Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
WARNING:openmmtools.mcmc:Potential energy is NaN after 2 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
ERROR:openmmtools.mcmc:Potential energy is NaN after 3 attempts of integration with move LangevinSplittingDynamicsMove Trying to reinitialize Context as a last-resort restart attempt...
ERROR:openmmtools.mcmc:Potential energy is NaN after 4 attempts of integration with move LangevinSplittingDynamicsMove
---------------------------------------------------------------------------
IntegratorMoveError                       Traceback (most recent call last)
<ipython-input-11-4b2538f4dd1b> in <module>
      2 for i, (sampler_state, thermostate) in enumerate(zip(sampler_state_list, compound_states)):
      3     print(i)
----> 4     langevin_move.apply(thermostate, sampler_state)
      5 

~/miniconda3/envs/perses-sims/lib/python3.7/site-packages/openmmtools/mcmc.py in apply(self, thermodynamic_state, sampler_state)
   1112         """
   1113         # Explicitly implemented just to have more specific docstring.
-> 1114         super(LangevinDynamicsMove, self).apply(thermodynamic_state, sampler_state)
   1115 
   1116     def __getstate__(self):

~/miniconda3/envs/perses-sims/lib/python3.7/site-packages/openmmtools/mcmc.py in apply(self, thermodynamic_state, sampler_state)
    705                     sampler_state.apply_to_context(context)
    706                     logger.error(err_msg)
--> 707                     raise IntegratorMoveError(err_msg, self, context)
    708                 else:
    709                     logger.warning(err_msg + ' Attempting a restart...')

IntegratorMoveError: Potential energy is NaN after 4 attempts of integration with move LangevinSplittingDynamicsMove

I've been seeing the exact same error, using a modified version of the ReplicaExchange class from this tutorial. I tried increasing n_restart_attempts to 100 but it didn't help at all!

Here's my (pretty bad) code, almost entirely copied from the tutorial:

import math, yaml, sys, cctk, os
from datetime import datetime
import numpy as np
from random import random, randint

import mdtraj
from simtk import openmm, unit
from simtk.openmm.app import AmberPrmtopFile, AmberInpcrdFile, PME
from openmmtools import testsystems, integrators, forces, alchemy, states, cache, mcmc
import cctk.helper_functions as helper

# usage: python run_rest2.py config.yaml

class ReplicaExchange:
    def __init__(self, thermodynamic_states, sampler_states, mcmc_move, save_interval=1000, swap_interval=1000, total_steps=10000, logfile="rest2.log", traj_prefix="rest2"):
        self._thermodynamic_states = thermodynamic_states
        self._replicas_sampler_states = sampler_states
        self._mcmc_move = mcmc_move

        self.save_interval = save_interval
        self.swap_interval = swap_interval
        self.total_steps = total_steps
        self.current_step = 0

        self.logfile = logfile
        self.traj_prefix = traj_prefix

        # initial minimization
        self.log("beginning energy minimization")
        for thermo_state, sampler_state in zip(self._thermodynamic_states, self._replicas_sampler_states):
            context, integrator = cache.global_context_cache.get_context(thermo_state)
            sampler_state.apply_to_context(context)
            openmm.LocalEnergyMinimizer.minimize(context)
            sampler_state.update_from_context(context)

    def log(self, text):
        with open(self.logfile, "a") as f:
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            f.write(f"{time}\t{text}\n")

    def run(self):
        self.log("beginning replica exchange simulation")
        for iteration in range(int(self.total_steps/self.save_interval)):
            self._propagate_replicas()
            self.current_step += self.save_interval
            with open(self.logfile, "a") as f:
                time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                f.write(f"{time}\t{self.current_step} steps completed\n")

            self.save_positions()
            if self.current_step % self.swap_interval == 0:
                self._mix_replicas()

    def _propagate_replicas(self):
        for thermo_state, sampler_state in zip(self._thermodynamic_states, self._replicas_sampler_states):
            self._mcmc_move.apply(thermo_state, sampler_state)

    def _mix_replicas(self, n_attempts=5):
        for i in range(len(self._thermodynamic_states) - 1):
            j = i+1
            sampler_state_i, sampler_state_j = (self._replicas_sampler_states[k] for k in [i, j])
            thermo_state_i, thermo_state_j = (self._thermodynamic_states[k] for k in [i, j])

            energy_ii = self._compute_reduced_potential(sampler_state_i, thermo_state_i)
            energy_jj = self._compute_reduced_potential(sampler_state_j, thermo_state_j)
            energy_ij = self._compute_reduced_potential(sampler_state_i, thermo_state_j)
            energy_ji = self._compute_reduced_potential(sampler_state_j, thermo_state_i)

            log_p_accept = - (energy_ij + energy_ji) + energy_ii + energy_jj
            if log_p_accept >= 0.0 or random() < math.exp(log_p_accept):
                self._thermodynamic_states[i] = thermo_state_j
                self._thermodynamic_states[j] = thermo_state_i

                with open(self.logfile, "a") as f:
                    f.write(f"\tswapping {i} and {j} ({log_p_accept:.3f})\n")

    def _compute_reduced_potential(self, sampler_state, thermo_state):
        context, integrator = cache.global_context_cache.get_context(thermo_state)
        sampler_state.apply_to_context(context)
        return thermo_state.reduced_potential(context)

    def save_positions(self):
       ... # omitted for space

config_file = sys.argv[1]
with open(config_file, "r") as f:
    config = yaml.safe_load(f)

assert config["swap_interval"] % config["save_interval"] == 0, "swap_interval must be a multiple of save_interval"
num_replicas = config["num_replicas"]
prmtop = AmberPrmtopFile(config["prmtop"])
inpcrd = AmberInpcrdFile(config["inpcrd"])

system = prmtop.createSystem(nonbondedMethod=PME, nonbondedCutoff=config["nonbonded_cutoff"] * unit.angstrom, constraints=None)
mdtop = mdtraj.Topology.from_openmm(prmtop.topology)
high_atoms = mdtop.select(f"resname {config['high_resname']}")

min_L = config["min_lambda"]
protocol = {
    'temperature'           : [config["temperature"]] * num_replicas * unit.kelvin,
    'lambda_sterics'        : list(np.geomspace(1.0, min_L, num_replicas)),
    'lambda_electrostatics' : list(np.geomspace(1.0, min_L, num_replicas)),
    'lambda_angles'         : list(np.geomspace(1.0, min_L, num_replicas)),
    'lambda_torsions'       : list(np.geomspace(1.0, min_L, num_replicas)),
}

alchemical_region = alchemy.AlchemicalRegion(alchemical_atoms=high_atoms, alchemical_angles=True, alchemical_torsions=True)
factory = alchemy.AbsoluteAlchemicalFactory()
alchemical_system = factory.create_alchemical_system(system, alchemical_region)

alchemical_state = alchemy.AlchemicalState.from_system(alchemical_system)
compound_states = states.create_thermodynamic_state_protocol(alchemical_system, protocol=protocol, composable_states=[alchemical_state])
sampler_states = [states.SamplerState(positions=inpcrd.getPositions(), box_vectors=inpcrd.getBoxVectors()) for _ in compound_states]
langevin_move = mcmc.LangevinSplittingDynamicsMove(timestep=0.5*unit.femtosecond, n_steps=config["swap_interval"])
langevin_move.n_restart_attempts = 100

parallel_tempering = ReplicaExchange(
    compound_states,
    sampler_states,
    langevin_move,
    total_steps=config["total_steps"],
    swap_interval=config["swap_interval"],
    save_interval=config["save_interval"],
    logfile=config["logfile"],
    traj_prefix=config["traj_prefix"],
)

parallel_tempering.run()

Using the following input file, I only make it through one LangevinSplittingDynamicsMove per replica before dying. This is surprising because the LocalEnergyMinimizer runs without dying, so it doesn't seem like the system is just bogus/contains massive conflicts. Visually, everything looks fine (inputs generated using antechamber/parmchk/PACKMOL/tleap).

config.yaml:

prmtop: solvated_13precomplex.prmtop
inpcrd: solvated_13precomplex.inpcrd
logfile: 13precomplex.log
traj_prefix: replica

nonbonded_cutoff: 8 # Å
temperature: 298
high_resname: MOL

total_steps: 5000000
swap_interval: 2000
save_interval: 500

num_replicas: 24
min_lambda: 0.66

Output:

Potential energy is NaN after 0 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
Potential energy is NaN after 1 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
Potential energy is NaN after 2 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
Potential energy is NaN after 3 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
...
Potential energy is NaN after 98 attempts of integration with move LangevinSplittingDynamicsMove Attempting a restart...
Potential energy is NaN after 99 attempts of integration with move LangevinSplittingDynamicsMove Trying to reinitialize Context as a last-resort restart attempt...
Potential energy is NaN after 100 attempts of integration with move LangevinSplittingDynamicsMove
Traceback (most recent call last):
  File "run_rest2.py", line 142, in <module>
    parallel_tempering.run()
  File "run_rest2.py", line 44, in run
    self._propagate_replicas()
  File "run_rest2.py", line 56, in _propagate_replicas
    self._mcmc_move.apply(thermo_state, sampler_state)
  File "/n/home03/cwagen/.conda/envs/openmm/lib/python3.7/site-packages/openmmtools/mcmc.py", line 1114, in apply
    super(LangevinDynamicsMove, self).apply(thermodynamic_state, sampler_state)
  File "/n/home03/cwagen/.conda/envs/openmm/lib/python3.7/site-packages/openmmtools/mcmc.py", line 707, in apply
    raise IntegratorMoveError(err_msg, self, context)
openmmtools.mcmc.IntegratorMoveError: Potential energy is NaN after 100 attempts of integration with move LangevinSplittingDynamicsMove

Any advice would be greatly appreciated!

Are you both still seeing this with OpenMM 7.7.0 and OpenMMtools 0.21.0? I am seeing something similar with an internal test system and want to debug this more.

Just so that we are all on the same page, would it be possible to generate a minimal, self-contained example (perhaps using openmmtools.testsystems) that fails, that we can all run (and agree on)? I cannot run zhang-ivy's example because I do not have a OpenEye license and I cannot run the second example since I do not have solvated_13precomplex.{prmtop, inpcrd}.

I posted an issue (#642) similar to this.

I am using openmm 7.7.0 and openmmtools 0.21.5. I also get the same issue with openmm 7.6.0.

I have tried to use ReplicaExchangeSampler with 3 different systems, thinking it might be stemming from a poorly equilibrated system. I'm fairly certain now that the issue is not with the input structure's state.

my lambdas: [1.0, 0.9839948356327152, 0.9682458365518544, 0.9527489027899027, 0.9375]

If I ReplicaExchangeSamper.minimize() and then use a 2 femtosecond timestep with the above minor spread of lambdas, I get the NaN error apparently on the attempt to propagate the 5th (final) system SimulationNaNError: Propagating replica 4 at state 4 resulted in a NaN! The state of the system and integrator before the error were saved in replex_output/nan-error-logs

I believe this has occurred on the second replica too.

If I reduce the timestep to 0.01 femtoseconds, ReplicaExchangeSamper.equilibrate(300) will complete but one or more of the system(s) are in the process of blowing up. Results seem to be different depending on LangevinDynamicsMove or LangevinSplittingDynamicsMove with only the final system looking exploded for the former and all of them for the latter.

positions, topology = PDBFile(eq_structure).positions, PDBFile(eq_structure).topology
with open(f'output/1zip_system.xml') as infile:
        system = XmlSerializer.deserialize(infile.read())

# if applying lambda to torsions then AlchemicalRegion(alchemical_torsions=True)
alchemical_region = alchemy.AlchemicalRegion(alchemical_atoms=guest_atoms,alchemical_torsions=True)
factory           = alchemy.AbsoluteAlchemicalFactory()
alchemical_system = factory.create_alchemical_system(system, alchemical_region)
alchemical_state  = alchemy.AlchemicalState.from_system(alchemical_system)

# Initialize compound thermodynamic states at different temperatures and alchemical states.
# for the Hamiltonian approach - same temperature but different lambdas.
protocol = {'temperature': [300 for i in range(n_replicas)] * kelvin,
            'lambda_electrostatics': lambdas,
            'lambda_sterics': lambdas,
            'lambda_torsions': lambdas}

thermodynamic_states = states.create_thermodynamic_state_protocol(alchemical_system,
                                                                  protocol=protocol,
                                                                  composable_states=[alchemical_state])

box_vectors = system.getDefaultPeriodicBoxVectors()

sampler_states = [states.SamplerState(positions=positions, box_vectors=box_vectors) 
                  for i,_ in enumerate(thermodynamic_states)]

friction = 1.0 / openmm.unit.picosecond

move = mcmc.LangevinSplittingDynamicsMove(timestep=0.01*femtosecond,
                                                   n_steps=1000,
                                          collision_rate=friction
                                                   )

# Different results with different move
'''
move = mcmc.LangevinDynamicsMove(timestep=.01*femtosecond,
                                                   n_steps=1000,
                                    collision_rate=friction
                                                   )
'''
exchange_attempts = 5

simulation = ReplicaExchangeSampler(
        mcmc_moves=move,
        number_of_iterations=exchange_attempts,
        replica_mixing_scheme='swap-neighbors',
        )

reporter = MultiStateReporter('replex_output/1zip_hremd_test.nc', checkpoint_interval=1)
simulation.create(thermodynamic_states, sampler_states, reporter)
simulation.minimize()
simulation.equilibrate(300)

I've attached a notebook with input files.

Thank you for any help.

Dan omm_replex_sandbox.zip

Well, it appears that it was an equilibration issue. I reduced the ReplicaExchangeSampler timestep to 0.01 femtoseconds and also increased the collision rate to 100/ps for a short equilibration, then used those sampler states to start a new HREMD simulation with a 2 femtosecond timestep and 20/ps collision rate and things seem to be working.

The 1/ps collision rate in the above was too low, but even when I increased it, the 2 femtosecond timestep was too long even after the additional minimization. Doing 100 iterations of equilibration at the high collision rate and short timestep apparently solved my problem.

@mjw99, maybe the system will still be useful.

Thank you.

I have to backtrack on my previous comment. I still get the nan error.

I am applying lambdas to electrostatics, torsions, and sterics.

If I run a LangevinSplittingDynamicsMove where each replica has lambda = 1, the simulation doesn't crash.

If I run it with lambdas below ~0.9, it crashes with nan potential energy.

Importantly, if I do not apply the lambdas to sterics, the simulation does not crash.

Like @zhang-ivy, my simulation will run with lambda applied to the three terms if I use GHMCMove.

My goal is to run Rest2 with all protein atoms enhanced with the same scheme used by the Plumed implementation where electrostatics have sqrt(lambda) and sterics and torsions have the full lambda value applied.

choderalab / openmmtools

LangevinSplittingDynamicsMove sometimes nans #484