Looks like we have a mix of failure modes, some of which are related to the changes in the openff toolkit. It does look like we are getting some NaNs; I will work on fixing some of the other errors first to reduce the noise.
@mikemhenry As far as I can see, we shouldn't be pulling `openff-toolkit>=0.11` with the perses conda-forge packages.
Oh, I see, this is actually a problem in the dependencies of `openmmforcefields=0.11.2`, which is the version being pulled; it probably needs a more recent version of `openff-units`. Should we open an issue about this in the `openmmforcefields` repo?
As far as I can see, the problem of having the wrong `openff-units` version only appears when we build the environment with `openmm=7.6`; I don't know why this pulls the wrong version. Do we want to keep supporting this version of openmm?
1) I'm fine with dropping openmm 7.6; we are not using it, and it will trim our test matrix down and make debugging easier.
2) We added functionality in `openff-units` to make the toolkit more backwards compatible, but the edge case is that you need the newer version of that package to use the older version of the toolkit in a backwards-compatible way. I will try pulling in a newer version of `openff-units` and see if that helps fix things. If not, this will be the time to make perses work with the new toolkit.
@ijpulidos that worked, now on nightly:
```
FAILED perses/tests/test_relative_setup.py::test_relative_setup_charge_change - AssertionError: charge diff is -1 but should be zero.
assert False
 +  where False = <function isclose at 0x7f2fa0e79430>(-1, 0)
 +  where <function isclose at 0x7f2fa0e79430> = <module 'numpy' from '/usr/share/miniconda/envs/test/lib/python3.8/site-packages/numpy/__init__.py'>.isclose
```
and on the openmm 8 beta:
```
FAILED perses/tests/test_cli.py::test_dummy_cli_with_override - AssertionError: assert 1 == 0
 +  where 1 = <Result OpenMMException('Particle coordinate is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')>.exit_code
FAILED perses/tests/test_relative.py::test_RESTCapableHybridTopologyFactory_energies - AssertionError
FAILED perses/tests/test_relative.py::test_unsampled_endstate_energies - AssertionError
FAILED perses/tests/test_cli.py::test_s3_yaml_read - AssertionError: assert 1 == 0
 +  where 1 = <Result OpenMMException('Particle coordinate is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')>.exit_code
FAILED perses/tests/test_relative_point_mutation_setup.py::test_PointMutationExecutor_endstate_validation - AssertionError
FAILED perses/tests/test_relative_setup.py::test_relative_setup_charge_change - AssertionError: charge diff is -1 but should be zero.
assert False
 +  where False = <function isclose at 0x7f0e384ae680>(-1, 0)
 +  where <function isclose at 0x7f0e384ae680> = <module 'numpy' from '/usr/share/miniconda/envs/test/lib/python3.10/site-packages/numpy/__init__.py'>.isclose
```
So everything is okay on openmm 7.7; on nightly we have the charge failure, and on beta we have the charge failure + some NaNs + `assert np.isclose([components_hybrid['NonbondedForce']], np.sum(nonbonded_other_values))` errors.
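For context, the invariant behind that charge-diff failure looks roughly like the following (a minimal sketch, not the actual perses test; `old_system` and `new_system` are placeholder names):

```python
import numpy as np
from openmm import NonbondedForce, unit

def total_charge(system):
    # Sum particle charges (in units of e) over the system's NonbondedForce.
    for force in system.getForces():
        if isinstance(force, NonbondedForce):
            return sum(
                force.getParticleParameters(i)[0].value_in_unit(unit.elementary_charge)
                for i in range(force.getNumParticles())
            )
    return 0.0

# charge_diff = total_charge(new_system) - total_charge(old_system)
# assert np.isclose(charge_diff, 0), f"charge diff is {charge_diff} but should be zero."
```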
Wow, how did that bug not get caught before?
Oh, and let's allow codecov to fail again; it's been flaky for a few different repos I manage.
I realized that the other failing tests, namely `test_cli.py::test_dummy_cli_with_override`, `test_relative.py::test_RESTCapableHybridTopologyFactory_energies`, and `test_relative_point_mutation_setup.py::test_PointMutationExecutor_endstate_validation`, are all failing exclusively when we use the CPU platform. That is, if I use the CUDA platform they run fine locally. We need to dig into these further.
Differences in the test for nonbonded forces are as follows:
```
CPU platform:
Nonbondeds -- og: -52.1291542657432, hybrid: 69.5571606313417

CUDA platform:
Nonbondeds -- og: -50.75874077259308, hybrid: -50.758743334235376
```
That is really weird, do you see the same difference with openmm 7.7?
Another check is what the Reference platform says. It is normally slow, but I think for just evaluating forces on a small system it isn't bad, and it will tell us which result is correct... What script are you using to generate that output, @ijpulidos?
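Something like this minimal sketch would do for a single-point comparison across platforms (assuming `system` and `positions` are already built; this is not the perses test code):

```python
import openmm
from openmm import unit

def potential_energy(system, positions, platform_name):
    # Single-point energy evaluation of the same system/positions on one platform.
    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    platform = openmm.Platform.getPlatformByName(platform_name)
    context = openmm.Context(system, integrator, platform)
    context.setPositions(positions)
    return context.getState(getEnergy=True).getPotentialEnergy()

# for name in ("Reference", "CPU", "CUDA"):
#     print(name, potential_energy(system, positions, name))
```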
Hmm, that's odd... I agree with Mike's suggestions, though. Can you check if you are able to reproduce the problem on CPU (and not GPU) with OpenMM 7.7? If not, then we'll need to work with Peter to figure out what changed in OpenMM.
@mikemhenry @zhang-ivy These results come from our own perses test output. We don't see the same behavior with OpenMM 7.7, since tests are passing for that configuration. However, there are two things to be addressed here: one is the difference in the energies for nonbondeds (the test itself), and the other is the difference between platforms (which is not covered by the tests).
In any case, the results for OpenMM 7.7 on different platforms are as follows (for `test_relative.py::test_RESTCapableHybridTopologyFactory_energies`):
```
CPU Platform:
Nonbondeds -- og: -50.85930587590415, hybrid: -50.859318121238914

CUDA Platform:
Nonbondeds -- og: -50.75468720638197, hybrid: -50.75468719701384
```
So, as expected, not only does the test pass, but it also gives consistent results between platforms for OpenMM 7.7.0.
I think the difference across platforms is expected, because PME is computed differently on CPU than on GPU -- right, @jchodera?
If the tests are passing for 7.7 and not for 8 beta, we should loop Peter in to help us determine what's changed in openmm
> If the tests are passing for 7.7 and not for 8 beta, we should loop Peter in to help us determine what's changed in openmm
Yes, I agree, but we would probably like to figure out a simple script that reproduces the problem using only openmm. @zhang-ivy Do you think it is worth trying to extract the relevant parts from `test_relative.py::test_RESTCapableHybridTopologyFactory_energies` to achieve that?
Oh actually I just realized that the CPU tests that you are running use ala dipeptide in vacuum, so there shouldn't be a problem with PME
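A quick way to confirm that would be a check like the following (a small sketch, assuming access to the test's `System` object):

```python
from openmm import NonbondedForce

def uses_pme(system):
    # True if any NonbondedForce in the system is configured for PME.
    return any(
        isinstance(force, NonbondedForce)
        and force.getNonbondedMethod() == NonbondedForce.PME
        for force in system.getForces()
    )
```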
Is the assertion error being thrown during this call or is this call passing? https://github.com/choderalab/perses/blob/a95bf8d8c0be3b1a5cc5ff57a68365912fc086ff/perses/tests/test_relative.py#L1051
Also, if it is the above call that is failing, is it failing for both endstates? We check both endstates here: https://github.com/choderalab/perses/blob/e1db2061eec75cceba9f29fa3e3d0a26a8156ed2/perses/tests/test_relative.py#L1016-L1017
> Do you think it is worth trying to extract the relevant parts from `test_relative.py::test_RESTCapableHybridTopologyFactory_energies` to achieve that?
I'm not sure it's going to be possible to extract out parts of the code such that it only relies on openmm, since the whole point of the test is that it checks whether the old system energy matches the hybrid system energy at lambda = 0 and whether the new system energy matches the hybrid system energy at lambda = 1.
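In sketch form, that endstate check amounts to something like this (hedged; `lambda_global` is an illustrative global-parameter name, not the actual perses alchemical parameter):

```python
import numpy as np
from openmm import unit

def check_endstate(hybrid_context, reference_energy, lam):
    # At lam = 0 the hybrid system should reproduce the old system's energy,
    # and at lam = 1 the new system's energy.
    hybrid_context.setParameter("lambda_global", lam)  # illustrative name
    hybrid_energy = hybrid_context.getState(getEnergy=True).getPotentialEnergy()
    assert np.isclose(
        hybrid_energy.value_in_unit(unit.kilojoules_per_mole),
        reference_energy.value_in_unit(unit.kilojoules_per_mole),
    ), f"hybrid energy at lambda={lam} does not match the reference system"
```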
I tried to create a stripped-down, reproducible example of the energy validation issue that we are seeing on CPU, and I am seeing strange behavior related to `from perses.app.relative_point_mutation_setup import PointMutationExecutor`.

When I run the attached script (`test.py`) without the `from perses.app.relative_point_mutation_setup import PointMutationExecutor` import, I see that the total potential energy and the energy of each force are equivalent across platforms:
```
# CPU platform
-10.396224412218231 kJ/mol
conducting subsequent work with the following platform: CPU
{'CustomBondForce': 0.4843042375448105, 'CustomAngleForce': 7.903113289767519, 'CustomTorsionForce': 26.631596091842702, 'CustomNonbondedForce_electrostatics': -134.40419159130172, 'CustomNonbondedForce_sterics': 4.716823990818269, 'CustomBondForce_exceptions': 90.50042600262788, 'NonbondedForce_reciprocal': 0.0, 'NonbondedForce_sterics': 0.0}

# Reference platform
-10.396223075365185 kJ/mol
conducting subsequent work with the following platform: Reference
{'CustomBondForce': 0.4843042375448105, 'CustomAngleForce': 7.903113289767519, 'CustomTorsionForce': 26.631596091842702, 'CustomNonbondedForce_electrostatics': -134.4041878467019, 'CustomNonbondedForce_sterics': 4.716820782173288, 'CustomBondForce_exceptions': 90.50042600262788, 'NonbondedForce_reciprocal': 0.0, 'NonbondedForce_sterics': 0.0}
```
When I run the attached script (`test.py`) with the `from perses.app.relative_point_mutation_setup import PointMutationExecutor` import, I see that the total potential energy differs across platforms, and when inspecting the energy components, the discrepancy seems to come from `CustomNonbondedForce_electrostatics`:
```
# CPU platform
324.8533636004771 kJ/mol
conducting subsequent work with the following platform: CPU
{'CustomBondForce': 0.4843042375448105, 'CustomAngleForce': 7.903113289767519, 'CustomTorsionForce': 26.631596091842702, 'CustomNonbondedForce_electrostatics': 0.0, 'CustomNonbondedForce_sterics': 4.716823990818269, 'CustomBondForce_exceptions': 90.50042600262788, 'NonbondedForce_reciprocal': 0.0, 'NonbondedForce_sterics': 0.0}

# Reference platform
-10.396223075365185 kJ/mol
conducting subsequent work with the following platform: Reference
{'CustomBondForce': 0.4843042375448105, 'CustomAngleForce': 7.903113289767519, 'CustomTorsionForce': 26.631596091842702, 'CustomNonbondedForce_electrostatics': -134.4041878467019, 'CustomNonbondedForce_sterics': 4.716820782173288, 'CustomBondForce_exceptions': 90.50042600262788, 'NonbondedForce_reciprocal': 0.0, 'NonbondedForce_sterics': 0.0}
```
For some reason, when we include the `PointMutationExecutor` import, the `CustomNonbondedForce_electrostatics` energy is zero on CPU. @ijpulidos Could you debug this further and try to figure out what's causing this?

Note that the energies are in kT for the energy components, but in kJ/mol for the total potential energy.
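For reference, per-force components like the ones above can be obtained by putting each force in its own force group and querying them one at a time; a minimal sketch (not the perses implementation, keyed here by index and force class name to avoid collisions):

```python
import openmm
from openmm import unit

def energy_components(system, positions, platform_name):
    # Give every force its own group so each energy can be queried separately
    # (OpenMM supports at most 32 force groups).
    for index, force in enumerate(system.getForces()):
        force.setForceGroup(index)
    integrator = openmm.VerletIntegrator(1.0 * unit.femtoseconds)
    platform = openmm.Platform.getPlatformByName(platform_name)
    context = openmm.Context(system, integrator, platform)
    context.setPositions(positions)
    return {
        f"{index}:{force.__class__.__name__}":
            context.getState(getEnergy=True, groups={index}).getPotentialEnergy()
        for index, force in enumerate(system.getForces())
    }
```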
@zhang-ivy Thanks so much for the streamlined testing example. I'll take a look into this!
The problem seems to have been solved with the changes in the `openmm8.0.0dev2` package under the `conda-forge/label/openmm_dev` label.
Now the question is which package we actually want to test against, since this PR uses the `openmm_rc` label instead of `openmm_dev` (which I think has to be manually built).
We need a reliable way of building and testing against OpenMM release candidates, since these issues are not reproducible across operating systems (as far as I could tell, the problem appeared only on `linux` and not `macos`, which might point towards build issues for specific platforms).
I can't reproduce the above error even with 8.0.0beta. After uncommenting the import, I get this output:
```
-10.396224412218231 kJ/mol
conducting subsequent work with the following platform: CPU
{'CustomBondForce': 0.4843042375448105, 'CustomAngleForce': 7.903113289767519, 'CustomTorsionForce': 26.631596091842702, 'CustomNonbondedForce_electrostatics': -134.40419159130172, 'CustomNonbondedForce_sterics': 4.716823990818268, 'CustomBondForce_exceptions': 90.50042600262788, 'NonbondedForce_reciprocal': 0.0, 'NonbondedForce_sterics': 0.0}

-10.396223075365185 kJ/mol
conducting subsequent work with the following platform: Reference
{'CustomBondForce': 0.4843042375448105, 'CustomAngleForce': 7.903113289767519, 'CustomTorsionForce': 26.631596091842702, 'CustomNonbondedForce_electrostatics': -134.4041878467019, 'CustomNonbondedForce_sterics': 4.716820782173288, 'CustomBondForce_exceptions': 90.50042600262788, 'NonbondedForce_reciprocal': 0.0, 'NonbondedForce_sterics': 0.0}
```
This is on Ubuntu 20.04. Here's my environment.
I wonder if we are running into this `-ffast-math` issue (see also this twitter thread).
It seems unlikely. Subnormals are a really esoteric part of the floating point spec. Nothing in OpenMM should depend on how they're treated.
From the dev sync: we want to be testing with the `openmm8.0.0dev2` packages (under the `openmm_dev` label, instead of `openmm_rc`).
> I can't reproduce the above error even with 8.0.0beta. After uncommenting the import, I get this output:
@peastman Can you please share the spec file for your environment? It can be obtained by running `conda list --explicit > spec-file.txt`. I'm still getting weird results, and I would like to check whether it's something with the builds/versions of the packages in the environment. Thanks!
I just created a new environment with these commands.
```
mamba env create mmh/openmm-8-beta-linux
conda activate openmm-8-beta-linux
mamba install -c conda-forge perses cudatoolkit=11.6
```
The test works correctly. Here is the environment.
As far as I can tell, the error is triggered when we build the environment with `conda` (which is what we use in CI) instead of `mamba`. Creating the environment with mamba doesn't trigger the issue. I don't know why this is the case; using a different package manager/solver shouldn't change the behavior of OpenMM.
The explicit environment that shows the issue is as follows
So are you saying with the same explicit environment file (which means no solving, just downloading), using conda triggers the issue, but using mamba does not?
> So are you saying with the same explicit environment file (which means no solving, just downloading), using conda triggers the issue, but using mamba does not?
There's no direct way to test this, because as far as I can tell installing openmm from specific labels doesn't really work with `mamba`. What I mean is comparing creating the environment with mamba as follows (as per Peter's instructions):

```
mamba env create mmh/openmm-8-beta-linux
conda activate openmm-8-beta-linux
mamba install -c conda-forge perses cudatoolkit=11.6
```
(Doesn't show the issue)
compared to installing/creating it with `conda`:
```
conda env create -n perses-openmm-8-beta-conda -f devtools/conda-envs/test_env.yaml
conda activate perses-openmm-8-beta-conda
conda install -c conda-forge/label/openmm_rc openmm
conda install -c conda-forge -c openeye perses openeye-toolkits
```
(shows the issue)
Can you run `conda list` for each of those environments, then look at the diff to see what is different?
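One quick way to do that diff is on the explicit spec files rather than eyeballing `conda list` (a small sketch; the file names are illustrative):

```python
def env_diff(spec_a, spec_b):
    # Compare two `conda list --explicit` spec files; returns the package URLs
    # unique to each environment, ignoring comment and @EXPLICIT header lines.
    with open(spec_a) as fa, open(spec_b) as fb:
        a = {line.strip() for line in fa if not line.startswith(("#", "@"))}
        b = {line.strip() for line in fb if not line.startswith(("#", "@"))}
    return sorted(a - b), sorted(b - a)

# only_in_conda, only_in_mamba = env_diff("conda-env-spec.txt", "mamba-env-spec.txt")
```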
What is the difference in the resulting environments?

Also, `conda env create mmh/openmm-8-beta-linux` should work as well.
> Can you run `conda list` for each of those environments, then look at the diff to see what is different?
Yes, I did that and tried the different builds and versions of openmm, openff, and similar packages, but the issue still showed for the combinations that could be solved (sometimes the environment couldn't be solved). Another difference is that the environment shared by @peastman uses `mkl`, whereas we don't use it in our CI environment; that changes a lot of things, starting from the low-level BLAS implementation (which is probably why the environment cannot be easily solved when going from one to the other).
> What is the difference in the resulting environments?
>
> Also, `conda env create mmh/openmm-8-beta-linux` should work as well.
Could this be used to create our CI environment? My main objective here is testing the CI workflow, rather than local tests on my workstation or on lilac.
## Description

This PR adds the openmm 8 beta to our testing matrix.

## Motivation and context

We want to make sure the new beta doesn't break anything!

## How has this been tested?

Tested right here :)

## Change log

I don't think we need one, but if we want one