FireMinimizationIntegrator fails on CPU platforms

MKCarter commented 1 year ago

Hey,

I have been playing around with the tutorial notebook - but I see that it calls openeye_wrapper.pyfor molecule to SMILES conversion. As a non oechem license holder, I was just wondering if you plan to add an rdkit version to the code any time soon?

Thanks, Mike

richardjgowers commented 1 year ago

Hi Mike

We don't actually use OEChem for this (or anywhere in our code base). Everything we're doing is based on rdkit, I think the only oechem things we have are to_oechem() methods to generate an OEMol.

Regardings SMILES, if you've got a Component, you can just call .smiles and it will use Chem.MolToSmiles (from here: https://github.com/OpenFreeEnergy/gufe/blob/main/gufe/components/explicitmoleculecomponent.py#L80)

dwhswenson commented 1 year ago

We've now had another report of this issue (by email): @MKCarter, could you tell us the versions of openmmforcefields and of openff-toolkit that you have installed? It seems to be coming from one of those.

dwhswenson commented 1 year ago

This is a bug in openff-toolkit 0.11.2. It should be fixed by upgrading to openff-toolkit 0.11.4 (actually, I think 0.11.3 will fix as well). Issue described here: https://github.com/openforcefield/openff-toolkit/issues/1435; fixed here: https://github.com/openforcefield/openff-toolkit/pull/1436

richardjgowers commented 1 year ago

@MKCarter we've pushed some new conda-forge packages that resolve this. Running conda update -c conda-forge openfe should resolve this

MKCarter commented 1 year ago

@richardjgowers - That did fix the issue. However, I now see another issue with locating the openff-2.0.0.off.xml file see the error below:

`---------------------------------------------------------------------------
DefinitionSyntaxError                     Traceback (most recent call last)
File ~/.conda/envs/openfe/lib/python3.10/site-packages/openmmforcefields/generators/template_generators.py:1278, in SMIRNOFFTemplateGenerator.__init__(self, molecules, cache, forcefield)
   1277         filename += '.offxml'
-> 1278     self._smirnoff_forcefield = openff.toolkit.typing.engines.smirnoff.ForceField(filename)
   1279 except Exception as e:

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openff/toolkit/typing/engines/smirnoff/forcefield.py:307, in ForceField.__init__(self, aromaticity_model, parameter_handler_classes, parameter_io_handler_classes, disable_version_check, allow_cosmetic_attributes, load_plugins, *sources)
    306 # Parse all sources containing SMIRNOFF parameter definitions
--> 307 self.parse_sources(sources, allow_cosmetic_attributes=allow_cosmetic_attributes)

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openff/toolkit/typing/engines/smirnoff/forcefield.py:768, in ForceField.parse_sources(self, sources, allow_cosmetic_attributes)
    767 smirnoff_data = self.parse_smirnoff_from_source(source)
--> 768 self._load_smirnoff_data(
    769     smirnoff_data, allow_cosmetic_attributes=allow_cosmetic_attributes
    770 )

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openff/toolkit/typing/engines/smirnoff/forcefield.py:876, in ForceField._load_smirnoff_data(self, smirnoff_data, allow_cosmetic_attributes)
    875 # Go through the whole SMIRNOFF data structure, trying to convert all strings to Quantity
--> 876 smirnoff_data = convert_all_strings_to_quantity(smirnoff_data)
    878 # Go through the subsections, delegating each to the proper ParameterHandler
    879 
    880 # Define keys which are expected from the spec, but are not parameter sections

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openff/toolkit/utils/utils.py:233, in convert_all_strings_to_quantity(smirnoff_data)
    232 for key, value in smirnoff_data.items():
--> 233     smirnoff_data[key] = convert_all_strings_to_quantity(value)
    234 obj_to_return = smirnoff_data

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openff/toolkit/utils/utils.py:233, in convert_all_strings_to_quantity(smirnoff_data)
    232 for key, value in smirnoff_data.items():
--> 233     smirnoff_data[key] = convert_all_strings_to_quantity(value)
    234 obj_to_return = smirnoff_data

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openff/toolkit/utils/utils.py:233, in convert_all_strings_to_quantity(smirnoff_data)
    232 for key, value in smirnoff_data.items():
--> 233     smirnoff_data[key] = convert_all_strings_to_quantity(value)
    234 obj_to_return = smirnoff_data

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openff/toolkit/utils/utils.py:238, in convert_all_strings_to_quantity(smirnoff_data)
    237 for index, item in enumerate(smirnoff_data):
--> 238     smirnoff_data[index] = convert_all_strings_to_quantity(item)
    239 obj_to_return = smirnoff_data

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openff/toolkit/utils/utils.py:233, in convert_all_strings_to_quantity(smirnoff_data)
    232 for key, value in smirnoff_data.items():
--> 233     smirnoff_data[key] = convert_all_strings_to_quantity(value)
    234 obj_to_return = smirnoff_data

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openff/toolkit/utils/utils.py:246, in convert_all_strings_to_quantity(smirnoff_data)
    245 try:
--> 246     obj_to_return = object_to_quantity(smirnoff_data)
    247 except (AttributeError, TypeError, SyntaxError):

File ~/.conda/envs/openfe/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openff/toolkit/utils/utils.py:318, in _(obj)
    317 try:
--> 318     return string_to_quantity(obj)
    319 except pint.errors.UndefinedUnitError:

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openff/toolkit/utils/utils.py:200, in string_to_quantity(quantity_string)
    199 try:
--> 200     quantity = unit.Quantity(quantity_string)
    201 except (TokenError, UndefinedUnitError):

File ~/.conda/envs/openfe/lib/python3.10/site-packages/pint/facets/plain/quantity.py:209, in PlainQuantity.__new__(cls, value, units)
    208 ureg = SharedRegistryObject.__new__(cls)._REGISTRY
--> 209 inst = ureg.parse_expression(value)
    210 return cls.__new__(cls, inst)

File ~/.conda/envs/openfe/lib/python3.10/site-packages/pint/facets/plain/registry.py:1255, in PlainRegistry.parse_expression(self, input_string, case_sensitive, use_decimal, **values)
   1253 gen = tokenizer(input_string)
-> 1255 return build_eval_tree(gen).evaluate(
   1256     lambda x: self._eval_token(x, case_sensitive=case_sensitive, **values)
   1257 )

File ~/.conda/envs/openfe/lib/python3.10/site-packages/pint/pint_eval.py:114, in EvalTreeNode.evaluate(self, define_op, bin_op, un_op)
    113     raise DefinitionSyntaxError('missing binary operator "%s"' % op_text)
--> 114 left = self.left.evaluate(define_op, bin_op, un_op)
    115 return bin_op[op_text](left, self.right.evaluate(define_op, bin_op, un_op))

File ~/.conda/envs/openfe/lib/python3.10/site-packages/pint/pint_eval.py:114, in EvalTreeNode.evaluate(self, define_op, bin_op, un_op)
    113     raise DefinitionSyntaxError('missing binary operator "%s"' % op_text)
--> 114 left = self.left.evaluate(define_op, bin_op, un_op)
    115 return bin_op[op_text](left, self.right.evaluate(define_op, bin_op, un_op))

    [... skipping similar frames: EvalTreeNode.evaluate at line 114 (4 times)]

File ~/.conda/envs/openfe/lib/python3.10/site-packages/pint/pint_eval.py:114, in EvalTreeNode.evaluate(self, define_op, bin_op, un_op)
    113     raise DefinitionSyntaxError('missing binary operator "%s"' % op_text)
--> 114 left = self.left.evaluate(define_op, bin_op, un_op)
    115 return bin_op[op_text](left, self.right.evaluate(define_op, bin_op, un_op))

File ~/.conda/envs/openfe/lib/python3.10/site-packages/pint/pint_eval.py:120, in EvalTreeNode.evaluate(self, define_op, bin_op, un_op)
    119 if op_text not in un_op:
--> 120     raise DefinitionSyntaxError('missing unary operator "%s"' % op_text)
    121 return un_op[op_text](self.left.evaluate(define_op, bin_op, un_op))

DefinitionSyntaxError: missing unary operator "*"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[36], line 4
      1 # solvent dry-run
      2 solvent_unit = list(solvent_dag.protocol_units)[0]
----> 4 solvent_unit.run(dry=True, verbose=True)

File ~/DD_tools/openfe/openfe/protocols/openmm_rbfe/equil_rbfe_methods.py:840, in RelativeLigandTransformUnit.run(self, dry, verbose, basepath)
    837 stateB_openff_ligand = stateB['ligand'].to_openff()
    839 #  1. Get smirnoff template generators
--> 840 smirnoff_stateA = SMIRNOFFTemplateGenerator(
    841     forcefield=settings.topology_settings.forcefield['ligand'],
    842     molecules=[stateA_openff_ligand],
    843 )
    845 smirnoff_stateB = SMIRNOFFTemplateGenerator(
    846     forcefield=settings.topology_settings.forcefield['ligand'],
    847     molecules=[stateB_openff_ligand],
    848 )
    850 # 2. Create forece fields and register them
    851 #  a. state A

File ~/.conda/envs/openfe/lib/python3.10/site-packages/openmmforcefields/generators/template_generators.py:1281, in SMIRNOFFTemplateGenerator.__init__(self, molecules, cache, forcefield)
   1279 except Exception as e:
   1280     _logger.error(e)
-> 1281     raise ValueError(f"Can't find specified SMIRNOFF force field ({forcefield}) in install paths")
   1283 # Delete constraints, if present
   1284 if 'Constraints' in self._smirnoff_forcefield._parameter_handlers:

ValueError: Can't find specified SMIRNOFF force field (openff-2.0.0.offxml) in install paths`

Any suggestions on how to remedy this?

richardjgowers commented 1 year ago

@MKCarter this might be because you haven't installed them, can you check the debug here? https://github.com/openmm/openmmforcefields#using-the-open-force-field-initiative-smirnoff-small-molecule-force-fields

Or more generally, how have you installed your environment so far?

MKCarter commented 1 year ago

Hi @richardjgowers

So I have installed openfe using conda - conda install -c conda-forge openfe

I have checked and the libraries are definitely installed as part of the package: print(SMIRNOFFTemplateGenerator.INSTALLED_FORCEFIELDS)

Gives me this output:

['smirnoff99Frosst-1.0.2', 'smirnoff99Frosst-1.0.0', 'smirnoff99Frosst-1.1.0', 'smirnoff99Frosst-1.0.4', 'smirnoff99Frosst-1.0.8', 'smirnoff99Frosst-1.0.6', 'smirnoff99Frosst-1.0.3', 'smirnoff99Frosst-1.0.1', 'smirnoff99Frosst-1.0.5', 'smirnoff99Frosst-1.0.9', 'smirnoff99Frosst-1.0.7', 'ff14sb_off_impropers_0.0.2', 'ff14sb_off_impropers_0.0.1', 'ff14sb_off_impropers_0.0.3', 'openff-1.0.1', 'openff-1.1.1', 'openff-1.0.0-RC1', 'openff-1.2.0', 'openff-1.3.0', 'openff-2.0.0-rc.2', 'openff-2.0.0', 'openff-1.1.0', 'openff-1.0.0', 'openff-1.0.0-RC2', 'openff-1.3.1', 'openff-1.2.1', 'openff-1.3.1-alpha.1', 'openff-2.0.0-rc.1']

So it looks like it just can't find the installed ff's - is there anywhere that I need to specify the path? I know where the path is to the files - on my setup it is located conda/envs/openfe/lib/python3.10/site-packages/openmmforcefields/ffxml/offxml

Interestingly if I try to load the openff forcefield directly it seems to find it OK.

forcefield = ForceField('offxml/openff-2.0.0.offxml')

Runs and loads the forcefield, but running this causes the same error message as above: smirnoff = SMIRNOFFTemplateGenerator(molecules=molecule)

Many thanks

IAlibay commented 1 year ago

Sorry about the long delay for responses here, could you provide a full list of packages installed in your environment @MKCarter ? This definitely shouldn't be happening.

IAlibay commented 1 year ago

Hi @MKCarter - Just wanted to catch up here. We've done lots of changes with the v0.7, could you maybe retry with a fresh install?

MKCarter commented 1 year ago

Hi @IAlibay -

I can confirm that updating to 0.7.1 has fixed the problem. I can now get to the stage of running the rbfe simulations. However, I keep getting OpenMMException: Particle coordinate is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan when running any of the examples (e.g. T4-lysosyme/JNK1)

Below is the full results.json from the JNK1 simualtion: {'estimate': None, 'uncertainty': None, 'protocol_result': {'data': {'nc_files': []}, '__qualname__': 'RelativeHybridTopologyProtocolResult', '__module__': 'openfe.protocols.openmm_rfe.equil_rfe_methods'}, 'unit_results': {'ProtocolUnitFailure-645f6f50722848e088b24de882a5051b': {'name': '18629-1 18634-1 repeat 0 generation 0', '_key': 'ProtocolUnitFailure-645f6f50722848e088b24de882a5051b', 'source_key': 'RelativeHybridTopologyProtocolUnit-da356b04c1ee4325a412553c9d7e28be', 'inputs': {'stateA': {':gufe-key:': 'ChemicalSystem-fff42b335217ea56b408a90cc5070c6e'}, 'stateB': {':gufe-key:': 'ChemicalSystem-25cee4b7c38c376455f436949b091dd0'}, 'ligandmapping': {':gufe-key:': 'LigandAtomMapping-d1c9f5028d101469bf0b4b7abd0a4036'}, 'settings': RelativeHybridTopologyProtocolSettings(forcefield_settings=OpenMMSystemGeneratorFFSettings(constraints='hbonds', rigid_water=True, remove_com=False, hydrogen_mass=4.0, forcefields=['amber/ff14SB.xml', 'amber/tip3p_standard.xml', 'amber/tip3p_HFE_multivalent.xml', 'amber/phosaa10.xml'], small_molecule_forcefield='openff-2.0.0'), thermo_settings=ThermoSettings(temperature=<Quantity(298.15, 'kelvin')>, pressure=<Quantity(0.986923267, 'standard_atmosphere')>, ph=None, redox_potential=None), system_settings=SystemSettings(nonbonded_method='PME', nonbonded_cutoff=<Quantity(1.0, 'nanometer')>), solvation_settings=SolvationSettings(solvent_model='tip3p', solvent_padding=<Quantity(1.2, 'nanometer')>), alchemical_settings=AlchemicalSettings(lambda_functions='default', lambda_windows=11, unsampled_endstates=False, use_dispersion_correction=False, softcore_LJ_v2=True, softcore_electrostatics=True, softcore_alpha=0.85, softcore_electrostatics_alpha=0.3, softcore_sigma_Q=1.0, interpolate_old_and_new_14s=False, flatten_torsions=False), alchemical_sampler_settings=AlchemicalSamplerSettings(online_analysis_interval=None, n_repeats=1, sampler_method='repex', online_analysis_target_error=<Quantity(0.1, 'boltzmann_constant * kelvin')>, online_analysis_minimum_iterations=500, flatness_criteria='logZ-flatness', gamma0=1.0, n_replicas=11), engine_settings=OpenMMEngineSettings(compute_platform=None), integrator_settings=IntegratorSettings(timestep=<Quantity(4, 'femtosecond')>, collision_rate=<Quantity(1.0, '1 / picosecond')>, n_steps=<Quantity(250, 'timestep')>, reassign_velocities=False, splitting='V R O R V', n_restart_attempts=20, constraint_tolerance=1e-06, barostat_frequency=<Quantity(25, 'timestep')>), simulation_settings=SimulationSettings(equilibration_length=<Quantity(6, 'picosecond')>, production_length=<Quantity(6, 'picosecond')>, forcefield_cache=None, minimization_steps=5000, output_filename='simulation.nc', output_indices='all', checkpoint_interval=<Quantity(250, 'timestep')>, checkpoint_storage='checkpoint.nc')), 'repeat_id': 0, 'generation': 0}, 'outputs': {}, 'exception': ['OpenMMException', ['Particle coordinate is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan']], 'traceback': 'Traceback (most recent call last):\n File "/Users/michaelcarter/opt/miniconda3/envs/openfe/lib/python3.10/site-packages/gufe/protocols/protocolunit.py", line 286, in execute\n outputs = self._execute(context, **inputs)\n File "/Users/michaelcarter/opt/miniconda3/envs/openfe/lib/python3.10/site-packages/openfe/protocols/openmm_rfe/equil_rfe_methods.py", line 725, in _execute\n outputs = self.run(basepath=ctx.shared)\n File "/Users/michaelcarter/opt/miniconda3/envs/openfe/lib/python3.10/site-packages/openfe/protocols/openmm_rfe/equil_rfe_methods.py", line 688, in run\n sampler.minimize(max_iterations=sim_settings.minimization_steps)\n File "/Users/michaelcarter/opt/miniconda3/envs/openfe/lib/python3.10/site-packages/openmmtools/utils/utils.py", line 95, in _wrapper\n return func(*args, **kwargs)\n File "/Users/michaelcarter/opt/miniconda3/envs/openfe/lib/python3.10/site-packages/openmmtools/multistate/multistatesampler.py", line 633, in minimize\n minimized_positions, sampler_state_ids = mpiplus.distribute(self._minimize_replica, range(self.n_replicas),\n File "/Users/michaelcarter/opt/miniconda3/envs/openfe/lib/python3.10/site-packages/mpiplus/mpiplus.py", line 512, in distribute\n all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]\n File "/Users/michaelcarter/opt/miniconda3/envs/openfe/lib/python3.10/site-packages/mpiplus/mpiplus.py", line 512, in <listcomp>\n all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]\n File "/Users/michaelcarter/opt/miniconda3/envs/openfe/lib/python3.10/site-packages/openmmtools/multistate/multistatesampler.py", line 1387, in _minimize_replica\n raise e\n File "/Users/michaelcarter/opt/miniconda3/envs/openfe/lib/python3.10/site-packages/openmmtools/multistate/multistatesampler.py", line 1380, in _minimize_replica\n integrator.step(max_iterations)\n File "/Users/michaelcarter/opt/miniconda3/envs/openfe/lib/python3.10/site-packages/openmm/openmm.py", line 13400, in step\n return _openmm.CustomIntegrator_step(self, steps)\nopenmm.OpenMMException: Particle coordinate is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan\n', '__qualname__': 'ProtocolUnitFailure', '__module__': 'gufe.protocols.protocolunit'}}}

Is it possible that the system need to be minimised? Or could this be related to the timesteps being too large? Cheers, Mike

IAlibay commented 1 year ago

@MKCarter that is indeed strange, is this a publicly available version of the JNK1 set? I can try to get it running on my end.

The protocol should be sufficiently minimizing the system (both with a short CPU run to avoid NaNs and then GPU if you're using it). If the input is really bad it won't resolve that, but hopefully JNK1 should be a well defined input already.

MKCarter commented 1 year ago

Hi @IAlibay - Yeah, this is the JNK1 example that is found on the openfreenergy binder example notebooks. I also obtain the same NaN errors for the T4-Lysosyme example. The errors appear running via the example jupyter-notebooks on binder and also locally on my machine. I agree, this is a bit odd.

IAlibay commented 1 year ago

Thanks for confirming @MKCarter - I'll try to see if I can replicate the issue on my end. Are you able to let us know what type of machine you're running this on? Specifically any GPU type you're using might be of help here (e.g. we've really not validated this on OpenCL).

MKCarter commented 1 year ago

@IAlibay - I am running this on my 2021 M1 Mac - I have no GPU's so I am only using CPU.

MacBook Pro (16-inch, 2021) Chip Apple M1 Pro Memory 16GB

I see the code does look for an nvidia GPU at the start of the dry runs and the real simulations themselves, but it looks like it triues to run on CPU after failing to find an nvidia GPU. See below for the full output of the code when the Nan is produced:

`# complex dry-run
complex_unit = list(complex_dag.protocol_units)[0]

complex_unit.run(dry=True, verbose=True)
/bin/sh: 1: nvidia-smi: not found
Please cite the following:

        Friedrichs MS, Eastman P, Vaidyanathan V, Houston M, LeGrand S, Beberg AL, Ensign DL, Bruns CM, and Pande VS. Accelerating molecular dynamic simulations on graphics processing unit. J. Comput. Chem. 30:864, 2009. DOI: 10.1002/jcc.21209
        Eastman P and Pande VS. OpenMM: A hardware-independent framework for molecular simulations. Comput. Sci. Eng. 12:34, 2010. DOI: 10.1109/MCSE.2010.27
        Eastman P and Pande VS. Efficient nonbonded interactions for molecular dynamics on a graphics processing unit. J. Comput. Chem. 31:1268, 2010. DOI: 10.1002/jcc.21413
        Eastman P and Pande VS. Constant constraint matrix approximation: A robust, parallelizable constraint method for molecular simulations. J. Chem. Theor. Comput. 6:434, 2010. DOI: 10.1021/ct900463w
        Chodera JD and Shirts MR. Replica exchange and expanded ensemble simulations as Gibbs multistate: Simple improvements for enhanced mixing. J. Chem. Phys., 135:194110, 2011. DOI:10.1063/1.3660669

---------------------------------------------------------------------------
OpenMMException                           Traceback (most recent call last)
Cell In[35], line 6
      3 complex_path.mkdir()
      5 # First the complex transformation
----> 6 complex_dag_results = execute_DAG(complex_dag, scratch_basedir=complex_path, shared_basedir=complex_path)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/gufe/protocols/protocoldag.py:415, in execute_DAG(protocoldag, shared_basedir, scratch_basedir, keep_shared, keep_scratch, raise_error)
    411 context = Context(shared=shared,
    412                   scratch=scratch)
    414 # execute
--> 415 result = unit.execute(
    416         context=context,
    417         raise_error=raise_error,
    418         **inputs)
    420 if not keep_scratch:
    421     shutil.rmtree(scratch)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/gufe/protocols/protocolunit.py:286, in ProtocolUnit.execute(self, context, raise_error, **inputs)
    283 result: Union[ProtocolUnitResult, ProtocolUnitFailure]
    285 try:
--> 286     outputs = self._execute(context, **inputs)
    287     result = ProtocolUnitResult(
    288         name=self.name, source_key=self.key, inputs=inputs, outputs=outputs
    289     )
    291 except Exception as e:

File /srv/conda/envs/notebook/lib/python3.9/site-packages/openfe/protocols/openmm_rfe/equil_rfe_methods.py:725, in RelativeHybridTopologyProtocolUnit._execute(self, ctx, **kwargs)
    722 def _execute(
    723     self, ctx: gufe.Context, **kwargs,
    724 ) -> dict[str, Any]:
--> 725     outputs = self.run(basepath=ctx.shared)
    727     return {
    728         'repeat_id': self._inputs['repeat_id'],
    729         'generation': self._inputs['generation'],
    730         **outputs
    731     }

File /srv/conda/envs/notebook/lib/python3.9/site-packages/openfe/protocols/openmm_rfe/equil_rfe_methods.py:688, in RelativeHybridTopologyProtocolUnit.run(self, dry, verbose, basepath)
    685 if verbose:
    686     logger.info("minimizing systems")
--> 688 sampler.minimize(max_iterations=sim_settings.minimization_steps)
    690 # equilibrate
    691 if verbose:

File /srv/conda/envs/notebook/lib/python3.9/site-packages/openmmtools/utils.py:94, in with_timer.<locals>._with_timer.<locals>._wrapper(*args, **kwargs)
     91 @functools.wraps(func)
     92 def _wrapper(*args, **kwargs):
     93     with time_it(task_name):
---> 94         return func(*args, **kwargs)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/openmmtools/multistate/multistatesampler.py:634, in MultiStateSampler.minimize(self, tolerance, max_iterations)
    629 logger.debug("Minimizing all replicas...")
    631 # Distribute minimization across nodes. Only node 0 will get all positions.
    632 # The other nodes, only need the positions that they use for propagation and
    633 # computation of the energy matrix entries.
--> 634 minimized_positions, sampler_state_ids = mpiplus.distribute(self._minimize_replica, range(self.n_replicas),
    635                                                             tolerance, max_iterations,
    636                                                             send_results_to=0)
    638 # Update all sampler states. For non-0 nodes, this will update only the
    639 # sampler states associated to the replicas propagated by this node.
    640 for sampler_state_id, minimized_pos in zip(sampler_state_ids, minimized_positions):

File /srv/conda/envs/notebook/lib/python3.9/site-packages/mpiplus/mpiplus.py:512, in distribute(task, distributed_args, send_results_to, propagate_exceptions_to, sync_nodes, group_size, *other_args, **kwargs)
    510 if get_mpicomm() is None:
    511     logger.debug('Running {} serially.'.format(task.__name__))
--> 512     all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
    513     if send_results_to == 'all':
    514         return all_results

File /srv/conda/envs/notebook/lib/python3.9/site-packages/mpiplus/mpiplus.py:512, in <listcomp>(.0)
    510 if get_mpicomm() is None:
    511     logger.debug('Running {} serially.'.format(task.__name__))
--> 512     all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
    513     if send_results_to == 'all':
    514         return all_results

File /srv/conda/envs/notebook/lib/python3.9/site-packages/openmmtools/multistate/multistatesampler.py:1390, in MultiStateSampler._minimize_replica(self, replica_id, tolerance, max_iterations)
   1388         openmm.LocalEnergyMinimizer.minimize(context, tolerance, max_iterations)
   1389     else:
-> 1390         raise e
   1392 # Get the minimized positions.
   1393 sampler_state.update_from_context(context)

File /srv/conda/envs/notebook/lib/python3.9/site-packages/openmmtools/multistate/multistatesampler.py:1383, in MultiStateSampler._minimize_replica(self, replica_id, tolerance, max_iterations)
   1381     else:
   1382         logger.debug('Using FIRE: tolerance {} max_iterations {}'.format(tolerance, max_iterations))
-> 1383         integrator.step(max_iterations)
   1384 except Exception as e:
   1385     if str(e) == 'Particle coordinate is nan':

File /srv/conda/envs/notebook/lib/python3.9/site-packages/openmm/openmm.py:13583, in CustomIntegrator.step(self, steps)
  13573 def step(self, steps):
  13574     r"""
  13575     step(self, steps)
  13576     Advance a simulation through time by taking a series of time steps.
   (...)
  13581         the number of time steps to take
  13582     """
> 13583     return _openmm.CustomIntegrator_step(self, steps)

OpenMMException: Particle coordinate is NaN.  For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan`

Thanks

IAlibay commented 1 year ago

Ah I wonder the M1 might be the issue - @richardjgowers or @dwhswenson can you reproduce this on your machines?

dwhswenson commented 1 year ago

It's also happening in Binder, so the problem is not an issue of M1. It may be an issue of not having CUDA around, though.

Working on the CLI tutorial (RHFE) in Binder I get a similar behavior (not sure why this is such a different code path, but same error on minimization):

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/gufe/protocols/protocolunit.py", line 286, in execute
    outputs = self._execute(context, **inputs)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/openfe/protocols/openmm_rfe/equil_rfe_methods.py", line 725, in _execute
    outputs = self.run(basepath=ctx.shared)
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/openfe/protocols/openmm_rfe/equil_rfe_methods.py", line 671, in run
    sampler.setup(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/openfe/protocols/openmm_rfe/_rfe_utils/multistate.py", line 123, in setup
    minimize(compound_thermostate_copy, sampler_state,
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/openfe/protocols/openmm_rfe/_rfe_utils/multistate.py", line 293, in minimize
    openmm.LocalEnergyMinimizer.minimize(
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/openmm/openmm.py", line 4388, in minimize
    return _openmm.LocalEnergyMinimizer_minimize(context, tolerance, maxIterations)
openmm.OpenMMException: Particle coordinate is NaN.  For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan

Also I think @MKCarter's trace may be from the showcase notebook. I'm seeing something similar there.

dwhswenson commented 1 year ago

Got it: Problem is OpenMMTools and OpenMM aren't communicating well. The exception reads: Particle coordinate is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan

OpenMMTools is checking against a specific string (which is no longer valid) to handle this error case:

        except Exception as e:
            if str(e) == 'Particle coordinate is nan':
                logger.debug('NaN encountered in FIRE minimizer; falling back to L-BFGS after resetting positions')
                sampler_state.apply_to_context(context)
                openmm.LocalEnergyMinimizer.minimize(context, tolerance, max_iterations)
            else:
                raise e

(https://github.com/choderalab/openmmtools/blob/47807bba462a7ac91a7da3075587629bb647fc6a/openmmtools/multistate/multistatesampler.py#L1381-L1387)

Opened a PR in OpenMMTools to fix. Not sure what else we can do from our end? (Pin OpenMM version to whenever the text changed?)

IAlibay commented 1 year ago

We do a pre minimisation ahead of using the FIRE minimizer, if we're hitting a NaN there something went wrong

MKCarter commented 1 year ago

@dwhswenson - yeah, these examples are all from the showcase notebooks. I can confirm these errors are reproducible on my local M1 Mac and binder. Great to hear you have the root cause identified! Thank you both for looking into this!

IAlibay commented 1 year ago

To follow-up here, @dwhswenson's observation is correct, although it seems like we were relying on that error catching to hide a deeper issue with the OpenMMTools FIREMinimizationIntegrator (see: https://github.com/choderalab/openmmtools/issues/686).

Looks like a new release of OpenMMTools will be out soon that will include @dwhswenson's fix. For now that should be ok (since the default is to just use LocalEnergyMinimizer). Longer term it looks like we'll be ditching the FIRE minimizer altogether.

MKCarter commented 1 year ago

@IAlibay - I can confirm that updating openmmtools and openfe appears to have solved the issue. Many thanks for all your help with this.

IAlibay commented 1 year ago

That's amazing to hear @MKCarter !

I'm going to close this issue, but do please keep us updated on any further issues (or positive outcomes if you're willing to share!). We're very keen to get this from development code to fully working in folk's hands.

MKCarter commented 1 year ago

Of course, anything that I can share, I will. This is a really cool initiative and I am looking forward to see how it develops 👍 - if I can help in anyway that is a bonus.

OpenFreeEnergy / openfe

FireMinimizationIntegrator fails on CPU platforms #253