choderalab / perses

Experiments with expanded ensembles to explore chemical space
http://perses.readthedocs.io
MIT License

`LangevinSplittingDynamicsMove` much slower than `LangevinIntegrator` #631

Closed hannahbrucemacdonald closed 4 years ago

hannahbrucemacdonald commented 4 years ago

For a protein mutation of SER to CYS, LangevinSplittingDynamicsMove is much slower than LangevinIntegrator in vacuum

hannahbrucemacdonald commented 4 years ago

The integrator currently used in our workflow is in setup_relative_calculation.py

mcmc.LangevinSplittingDynamicsMove(timestep=timestep,
    collision_rate=5.0 / unit.picosecond,
    n_steps=n_steps_per_move_application,
    reassign_velocities=False,
    n_restart_attempts=20,
    splitting="V R R R O R R R V",
    constraint_tolerance=1e-06)
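For context (not from the thread): the per-step cost of a splitting integrator is driven largely by how many force (V) evaluations each step performs. A purely illustrative helper, not part of perses or openmmtools, that counts the operations in a splitting string:

```python
# Count substep operations in an openmmtools-style splitting string.
# Illustrative only; not part of perses or openmmtools.
from collections import Counter

def count_operations(splitting):
    """Return a Counter of V (force), R (position), and O (stochastic) substeps."""
    return Counter(splitting.split())

perses_splitting = count_operations("V R R R O R R R V")
default_splitting = count_operations("V R O R V")

# Both splittings perform two force evaluations (V) per step; only the
# cheap R substeps differ, so the splitting string alone should not
# explain a 2-3x slowdown.
print(perses_splitting["V"], default_splitting["V"])  # 2 2
```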
jchodera commented 4 years ago

Thanks for posting this!

There are a number of reasons why this could be happening. It would help to post {system,state}.xml files for a nontrivial system (e.g. from the JACS set) so I can benchmark them and drill down on the cause.

Potential causes:

- Peter Eastman was able to narrow down the speed difference a bit with a BAOABLangevinIntegrator: see https://github.com/openmm/openmm/issues/2396#issuecomment-546547267 --- it's roughly a 10% speed difference. We may make this a leapfrog integrator, though: see https://github.com/openmm/openmm/issues/2532

If you can post {system,state}.xml I can easily investigate!

dominicrufa commented 4 years ago

small_molecule_speed.tar.gz

dominicrufa commented 4 years ago

^ I'm solvating a small molecule transformation (CCCO --> CCCS) and demonstrating that the splitting dynamics move is ~2-3x slower than the LangevinIntegrator. This occurs on both CPU and GPU platforms. The whole pipeline is run, and the hybrid system lives in htf._hybrid_system.

dominicrufa commented 4 years ago

hybrid_system.tar.gz this is the tarball of the hybrid system xml

dominicrufa commented 4 years ago

I haven't been able to solve the speed issue, but the .ipynb experiments in https://github.com/choderalab/perses/issues/629 suggest that it is a constraint issue. I am going to take another look at this.

dominicrufa commented 4 years ago

> ^ I'm solvating a small molecule transformation (CCCO --> CCCS) and demonstrating that the splitting dynamics move is ~2-3x slower than the LangevinIntegrator. This occurs on both CPU and GPU platforms.

Ah, and I'm seeing this in the master PR.

jchodera commented 4 years ago

Please file an issue on the OpenMM tracker if you can put together a simple example to reproduce the constraints issue. See also https://github.com/openmm/openmm/issues/2380 and maybe post there?

dominicrufa commented 4 years ago

To clarify, the splitting integrator runs ~2-3x slower than LangevinIntegrator even when the constraints in the system are deleted. A minimal example demonstrating the speed difference is here: https://github.com/choderalab/perses/files/4183059/small_molecule_speed.tar.gz

jchodera commented 4 years ago

@dominicrufa : This is a super small system of ~1000 atoms. Can you also include a nontrivial example of ~25K atoms or more? Our understanding was that the speed difference between simtk.openmm.LangevinIntegrator and openmmtools.integrators.LangevinIntegrator may be large for small systems because of kernel launch overhead, but this difference should be small for solvated systems of reasonable size.
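The scaling argument above can be sketched numerically. This is a toy model with made-up numbers (the overhead and per-atom costs are illustrative, not measurements): a fixed per-step launch latency dominates for a ~1K-atom system but is amortized away at ~25K atoms.

```python
# Toy model of per-step cost: a fixed kernel-launch overhead plus a
# per-atom force-evaluation cost. All numbers are illustrative only.
def relative_slowdown(n_atoms, overhead_us=50.0, cost_per_atom_us=0.01):
    """Ratio of (overhead + compute) to compute alone for one MD step."""
    compute_us = cost_per_atom_us * n_atoms
    return (overhead_us + compute_us) / compute_us

small = relative_slowdown(1_000)    # ~1K atoms: overhead dominates
large = relative_slowdown(25_000)   # ~25K atoms: overhead amortized

print(f"1K atoms:  {small:.2f}x")   # 6.00x
print(f"25K atoms: {large:.2f}x")   # 1.20x
```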

jchodera commented 4 years ago

If you can post {system,state}.xml I can easily investigate!

@dominicrufa : All I need here is a system.xml, and a state.xml. Could you include both? None of your tarballs include state.xml.

jchodera commented 4 years ago

This may not be fully debugged yet, but this is the sort of test harness that would be useful for reporting to the OpenMM issue tracker (along with {system,state}.xml):

from simtk import unit, openmm
import openmmtools
# pip install contexttimer
import contexttimer

def deserialize(filename):
    with open(filename, 'r') as infile:
        o = openmm.XmlSerializer.deserialize(infile.read())
    return o

system = deserialize('system.xml')
state = deserialize('state.xml')

def simulate(integrator_class):
    temperature = 300 * unit.kelvin
    collision_rate = 1.0 / unit.picoseconds
    timestep = 2.0 * unit.femtoseconds
    nsteps = 1000

    integrator = integrator_class(temperature, collision_rate, timestep)
    platform = openmm.Platform.getPlatformByName('CUDA')
    platform.setPropertyDefaultValue('Precision', 'mixed')
    context = openmm.Context(system, integrator, platform)
    context.setPositions(state.getPositions())
    # Warm up (kernel compilation, GPU initialization) before timing
    integrator.step(nsteps)

    with contexttimer.Timer() as t:
        integrator.step(nsteps)
    print(f'integration of {nsteps} steps with {integrator.__class__.__name__} took {t.elapsed} seconds')

    del context, integrator

simulate(openmm.LangevinIntegrator)
simulate(openmmtools.integrators.LangevinIntegrator)
simulate(openmmtools.integrators.BAOABIntegrator)
simulate(lambda T, g, dt: openmmtools.integrators.LangevinIntegrator(T, g, dt, splitting="V R R R O R R R V"))
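If contexttimer is not installed, a minimal stdlib stand-in for the Timer context manager used above can be written with time.perf_counter (a sketch; contexttimer's real API has more options):

```python
import time

class Timer:
    """Minimal stand-in for contexttimer.Timer: records wall-clock elapsed time."""
    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.elapsed = time.perf_counter() - self._start
        return False  # do not suppress exceptions

# Usage: time an arbitrary workload.
with Timer() as t:
    sum(i * i for i in range(100_000))
print(f"took {t.elapsed:.4f} seconds")
```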
jchodera commented 4 years ago

Also, isn't this completely outdated?!

molecule_setup = RelativeFEPSetup(ligand_input = 'test.smi', 
                                  old_ligand_index = 3, 
                                  new_ligand_index = 4, 
                                  forcefield_files = ['gaff.xml', 'amber14/protein.ff14SB.xml', 'amber14/tip3p.xml'], 
...

We no longer use gaff.xml in force field files since we switched to openmmforcefields.

dominicrufa commented 4 years ago

I'm using an older PR. I'll try to update the example with a state and system .xml by the end of the day.

jchodera commented 4 years ago

> I'm using an older PR. I'll try to update the example with a state and system .xml by the end of the day.

Thanks! I can look at these tonight if you can send along a {system,state}.xml I can test with!

jchodera commented 4 years ago

@dominicrufa @hannahbrucemacdonald : Just a reminder to send along a {system,state}.xml so I can look into this!

dominicrufa commented 4 years ago

It turns out there is no issue here. There is, however, overhead in calling the LangevinSplittingDynamicsMove.apply() method; this is what was causing the slowdown in my test case. Since every move application uses ~250 MD steps, there is no slowdown as it is implemented in perses.
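The amortization claim can be made concrete with a toy calculation (the overhead and step costs below are hypothetical numbers, not measurements): a fixed per-apply() overhead spread over ~250 MD steps contributes little to each step.

```python
# Toy amortization of a fixed per-apply() overhead over the MD steps
# taken in each move application. Numbers are illustrative only.
def overhead_fraction(apply_overhead_ms, step_cost_ms, n_steps):
    """Fraction of total wall time spent on per-apply overhead."""
    total = apply_overhead_ms + step_cost_ms * n_steps
    return apply_overhead_ms / total

# With ~250 steps per move application, a 5 ms overhead on 1 ms steps
# is ~2% of the total; with 1 step per application it would be ~83%.
print(f"{overhead_fraction(5.0, 1.0, 250):.3f}")  # 0.020
print(f"{overhead_fraction(5.0, 1.0, 1):.3f}")    # 0.833
```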