Arkane Pdep Functional Test Fails Often

amarkpayne commented 5 years ago

Bug Description

The latest build of #1561 failed with the following error:

======================================================================

FAIL: A general test for a PDep job in Arkane

----------------------------------------------------------------------

Traceback (most recent call last):

  File "/home/travis/build/ReactionMechanismGenerator/RMG-Py/arkane/pdepTest.py", line 104, in testPDepJob

    self.assertAlmostEquals(rxn.kinetics.getRateCoefficient(1000.0, 1.0), 88.88253229631246)

AssertionError: 88.8819863725482 != 88.88253229631246 within 7 places

This PR does not change any part of Arkane or RMG's pdep code though, and in fact this error only showed when I recently rebased this branch on current master. I notice that this test fails often when I run the functional tests locally, and it appears that the number for the rate coefficient appears to drift slightly for reasons unknown (it appears to happen whenever we add a new Arkane example or functional test, for example I believe this was why the most recent change was made to this unit test here.)

I also wonder if it is somehow related to the error with the most recent Chlorine DB branch build here, where the order of the reactants has changed when the functional tests are run in this build.

A going theory that I have is that this is related to Arkane's use of global and module variables, which might cause problems when we run all of the Arkane examples at once, but I am not sure.

In the meantime, we might want to consider re-writing this test. Thoughts?

How To Reproduce

Run the RMG functional tests locally. Check out the errorCancelingReactions branch if need be.

mjohnson541 commented 5 years ago

That's incorrect, the change in the multi-D torsions PR was made because we switched from calculating 1D HRs semi-classically to quantum mechanically. This impacts pressure dependent rate calculations because they are dependent through TST and thermochemistry on the 1D HR partition functions.

amarkpayne commented 5 years ago

That's incorrect, the change in the multi-D torsions PR was made because we switched from calculating 1D HRs semi-classically to quantum mechanically. This impacts pressure dependent rate calculations because they are dependent through TST and thermochemistry on the 1D HR partition functions.

I see, that makes sense. Any ideas what might be going on here? Should we rethink this line in this test i.e. should this test only check that the number is within the right ballpark (one or two decimal places? 1%?) if it will fail whenever we make improvements to Arkane?

mjohnson541 commented 5 years ago

As far as I can tell the test passes consistently on travis (or at least did while I was testing the nDTorsions stuff) and I really don't think we should be getting rid of or changing failing tests because we don't know why they fail (the blanket over the head approach to bugs). The way I see it some change in your code is causing Arkane to estimate pdep rate coefficients differently, so you should test at different commits and figure out what specific commit results in the estimation change and use that to figure out why this is happening.

amarkpayne commented 5 years ago

Update: the commit in question is this one here, as this commit fails with this error, but the one before it does not.

Again, I think this is interesting because I do not change any Arkane files except the two of my own that I have created in this PR, and at this point in the PR I had not yet integrated these files with the rest of Arkane.

In this commit though I do add a few new dependencies, so maybe that is the problem? It would seem strange--I will keep looking into this

mjohnson541 commented 5 years ago

k(T, P, # rmg dependencies)? XD

Break the commit up, find the smallest change that causes it...don't worry about test failures not in the pdep test.

amarkpayne commented 5 years ago

I am trying a few things like that, I'll let you know what happens. My thought with the dependency thing was what if the new dependencies require us to use other versions of other dependencies that pdep needs but these versions are not compatible with pdep.

amarkpayne commented 5 years ago

Okay, now we are getting somewhere: I created a branch that just added in the new dependencies, but otherwise has no new code additions and I get this failure (as a side not I also get the error with the unit test that is plaguing the Chlorine DB builds). You can check out the branch here.

Here is the diff of the conda list between master and this branch:

diff master.txt ecr_deps.txt 
6a7,8
> appdirs                   1.4.3            py27h28b3542_0
> 
19c21
< blas                      1.0                         mkl
---
> blas                      2.11                   openblas    conda-forge
42a45,46
> coincbc                   2.10.2               hab63836_0    rmg
> 
88a93,94
> glpk                      4.65                 h3ceedfd_2
> 
140a147,148
> libblas                   3.8.0               11_openblas    conda-forge
> 
142a151,152
> libcblas                  3.8.0               11_openblas    conda-forge
> 
152a163,168
> liblapack                 3.8.0               11_openblas    conda-forge
> 
> liblapacke                3.8.0               11_openblas    conda-forge
> 
> libopenblas               0.3.6                h5a2b251_1
> 
189,192d204
< mkl_fft                   1.0.12           py27ha843d7b_0
< 
< mkl_random                1.0.2            py27hd81dba3_0
< 
215c227
< numpy                     1.15.4           py27h7e9f1db_0
---
> numpy                     1.15.4           py27h99e49ec_0
217c229
< numpy-base                1.15.4           py27hde5b4d6_0
---
> numpy-base                1.15.4           py27h2f8d375_0
246a259,260
> ply                       3.11                     py27_0
> 
272a287,288
> pyomo                     5.6.6                    py27_0    rmg
> 
290a307,308
> pyutilib                  5.7.1                    py27_0    conda-forge
> 
309c327
< scikit-learn              0.20.3           py27hd81dba3_0
---
> scikit-learn              0.20.3           py27h22eb022_0
311c329
< scipy                     1.2.1            py27h7c811a0_0
---
> scipy                     1.2.1            py27he2b7bc3_0

mliu49 commented 5 years ago

I'm getting this error with the py3 testing branch and new py3 environment.

# packages in environment at /home/mjliu/anaconda2/envs/test:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_tflow_select             2.3.0                       mkl  
absl-py                   0.7.1                    py37_0  
algopy                    0.5.7                      py_0    rmg
astor                     0.8.0                    py37_0  
attrs                     19.1.0                   py37_1  
backcall                  0.1.0                    py37_0  
biopython                 1.74             py37h7b6447c_0  
blas                      1.0                         mkl  
bleach                    3.1.0                    py37_0  
bzip2                     1.0.8                h7b6447c_0  
c-ares                    1.15.0            h7b6447c_1001  
ca-certificates           2019.5.15                     1  
cairo                     1.14.12              h8948797_3  
cairocffi                 0.9.0                    py37_0    rmg
cantera                   2.4.0               np115py37_2    cantera
cclib                     1.6.1.rmg        py37h39e3cac_0    rmg
certifi                   2019.6.16                py37_1  
cffi                      1.12.3           py37h2e261b9_0  
chemprop                  0.0.1                      py_0    rmg
click                     7.0                      py37_0  
coolprop                  6.2.1            py37hf484d3e_0    rmg
coverage                  4.5.3            py37h7b6447c_0  
cycler                    0.10.0                   py37_0  
cython                    0.29.13          py37he6710b0_0  
dbus                      1.13.6               h746ee38_0  
decorator                 4.4.0                    py37_1  
defusedxml                0.6.0                      py_0  
descriptastorus           2.0.0                      py_0    rmg
entrypoints               0.3                      py37_0  
expat                     2.2.6                he6710b0_0  
ffmpeg                    4.0                  hcdf2ecd_0  
flask                     1.1.1                      py_0  
fontconfig                2.13.0               h9420a91_0  
freetype                  2.9.1                h8a8886c_1  
fribidi                   1.0.5                h7b6447c_0  
future                    0.17.1                   py37_0  
gast                      0.2.2                    py37_0  
glib                      2.56.2               hd408876_0  
gmp                       6.1.2                h6c8ec71_1  
google-pasta              0.1.7                      py_0  
gprof2dot                 2017.9.19                  py_0    rmg
graphite2                 1.3.13               h23475e2_0  
graphviz                  2.40.1               h21bd128_2  
grpcio                    1.16.1           py37hf8bcb03_1  
gst-plugins-base          1.14.0               hbbd80ab_1  
gstreamer                 1.14.0               hb453b48_1  
gunicorn                  19.9.0                   py37_0  
h5py                      2.9.0            py37h7918eee_0  
harfbuzz                  1.8.8                hffaf4a1_0  
hdf5                      1.10.4               hb1b8bf9_0  
hyperopt                  0.1.2                      py_0    conda-forge
icu                       58.2                 h9c2bf20_1  
intel-openmp              2019.4                      243  
ipykernel                 5.1.2            py37h39e3cac_0  
ipython                   7.7.0            py37h39e3cac_0  
ipython_genutils          0.2.0                    py37_0  
ipywidgets                7.5.1                      py_0  
itsdangerous              1.1.0                    py37_0  
jedi                      0.15.1                   py37_0  
jinja2                    2.10.1                   py37_0  
joblib                    0.13.2                   py37_0  
jpeg                      9b                   h024ee3a_2  
jsonschema                3.0.2                    py37_0  
jupyter                   1.0.0                    py37_7  
jupyter_client            5.3.1                      py_0  
jupyter_console           6.0.0                    py37_0  
jupyter_core              4.5.0                      py_0  
keras-applications        1.0.8                      py_0  
keras-preprocessing       1.1.0                      py_1  
kiwisolver                1.1.0            py37he6710b0_0  
libboost                  1.67.0               h46d08c1_4  
libedit                   3.1.20181209         hc058e9b_0  
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_0  
libopus                   1.3                  h7b6447c_0  
libpng                    1.6.37               hbc83047_0  
libprotobuf               3.8.0                hd408876_0  
libsodium                 1.0.16               h1bed415_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
libtiff                   4.0.10               h2733197_2  
libuuid                   1.0.3                h1bed415_2  
libvpx                    1.7.0                h439df22_0  
libxcb                    1.13                 h1bed415_1  
libxml2                   2.9.9                hea5a465_1  
lpsolve                   5.5.2.5                       2    rmg
lpsolve55                 5.5              py37h39e3cac_2    rmg
markdown                  3.1.1                    py37_0  
markupsafe                1.1.1            py37h7b6447c_0  
matplotlib                3.1.0            py37h5429711_0  
mistune                   0.8.4            py37h7b6447c_0  
mkl                       2019.4                      243  
mkl-service               2.0.2            py37h7b6447c_0  
mkl_fft                   1.0.14           py37ha843d7b_0  
mkl_random                1.0.2            py37hd81dba3_0  
mock                      3.0.5                    py37_0  
mopac                     2017                          1    rmg
mpmath                    1.1.0                    py37_0  
nbconvert                 5.5.0                      py_0  
nbformat                  4.4.0                    py37_0  
ncurses                   6.1                  he6710b0_1  
networkx                  2.3                        py_0  
ninja                     1.9.0            py37hfd86e86_0  
nose                      1.3.7                    py37_2  
notebook                  6.0.0                    py37_0  
numdifftools              0.9.39                     py_0    rmg
numpy                     1.15.4           py37h7e9f1db_0  
numpy-base                1.15.4           py37hde5b4d6_0  
olefile                   0.46                     py37_0  
openbabel                 2.4.1            py37hf484d3e_7    rmg
openssl                   1.1.1c               h7b6447c_1  
pandas                    0.25.0           py37he6710b0_0  
pandoc                    2.2.3.2                       0  
pandocfilters             1.4.2                    py37_1  
pango                     1.42.4               h049681c_0  
parso                     0.5.1                      py_0  
patsy                     0.5.1                    py37_0  
pcre                      8.43                 he6710b0_0  
pexpect                   4.7.0                    py37_0  
pickleshare               0.7.5                    py37_0  
pillow                    6.1.0            py37h34e0f95_0  
pip                       19.2.2                   py37_0  
pixman                    0.38.0               h7b6447c_0  
prometheus_client         0.7.1                      py_0  
prompt_toolkit            2.0.9                    py37_0  
protobuf                  3.8.0            py37he6710b0_0  
psutil                    5.6.3            py37h7b6447c_0  
ptyprocess                0.6.0                    py37_0  
py-boost                  1.67.0           py37h04863e7_4  
pycparser                 2.19                     py37_0  
pydas                     1.0.2            py37h1c1f16f_4    rmg
pydot                     1.4.1                    py37_0  
pydqed                    1.0.1            py37h1c1f16f_3    rmg
pygments                  2.4.2                      py_0  
pymongo                   3.8.0            py37he6710b0_1  
pyparsing                 2.4.2                      py_0  
pyqt                      5.9.2            py37h05f1152_2  
pyrdl                     1.1.2            py37h14c3975_0    rmg
pyrsistent                0.14.11          py37h7b6447c_0  
python                    3.7.4                h265db76_1  
python-dateutil           2.8.0                    py37_0  
pytorch-cpu               1.1.0               py3.7_cpu_0    pytorch
pytz                      2019.2                     py_0  
pyyaml                    5.1.2            py37h7b6447c_0  
pyzmq                     18.1.0           py37he6710b0_0  
qt                        5.9.7                h5867ecd_1  
qtconsole                 4.5.4                      py_0  
quantities                0.12.3             pyh24bf2e0_0    rmg
rdkit                     2019.03.4.0      py37hc20afe1_1    rdkit
readline                  7.0                  h7b6447c_5  
scikit-learn              0.21.2           py37hd81dba3_0  
scipy                     1.3.1            py37h7c811a0_0  
send2trash                1.5.0                    py37_0  
setuptools                41.0.1                   py37_0  
sip                       4.19.8           py37hf484d3e_0  
six                       1.12.0                   py37_0  
sqlite                    3.29.0               h7b6447c_0  
statsmodels               0.10.1           py37hdd07704_0  
symmetry                  1.0.1                         0    rmg
tensorboard               1.14.0           py37hf484d3e_0  
tensorboardx              1.8                        py_0    conda-forge
tensorflow                1.14.0          mkl_py37h45c423b_0  
tensorflow-base           1.14.0          mkl_py37h7ce6ba3_0  
tensorflow-estimator      1.14.0                     py_0  
termcolor                 1.1.0                    py37_1  
terminado                 0.8.2                    py37_0  
testpath                  0.4.2                    py37_0  
tk                        8.6.8                hbc83047_0  
torchvision-cpu           0.3.0             py37_cuNone_1    pytorch
tornado                   6.0.3            py37h7b6447c_0  
tqdm                      4.32.1                     py_0  
traitlets                 4.3.2                    py37_0  
wcwidth                   0.1.7                    py37_0  
webencodings              0.5.1                    py37_1  
werkzeug                  0.15.5                     py_0  
wheel                     0.33.4                   py37_0  
widgetsnbextension        3.5.1                    py37_0  
wrapt                     1.11.2           py37h7b6447c_0  
xlwt                      1.3.0                    py37_0  
xz                        5.2.4                h14c3975_4  
yaml                      0.1.7                had09818_2  
zeromq                    4.3.1                he6710b0_3  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.3.7                h0b5b093_0

amarkpayne commented 5 years ago

@mliu49 are you getting 88.8819863725482 exactly as well or a different number? Interestingly, we first thought that this might be a BLAS issue, but the version hasn't changed in your environment.

mliu49 commented 5 years ago

No, I'm getting a larger value...

======================================================================
FAIL: A general test for a PDep job in Arkane
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mjliu/Code/RMG-Py/arkane/pdepTest.py", line 108, in testPDepJob
    self.assertAlmostEquals(rxn.kinetics.getRateCoefficient(1000.0, 1.0), 88.88253229631246)
AssertionError: 88.89036973288468 != 88.88253229631246 within 7 places (0.007837436572216916 difference)

mliu49 commented 4 years ago

As evidenced by the references in PRs, this error is appearing again, though we never figured out what was causing it the first time around.

In the case of #1840, the push build failed while the pr build passed. In comparing the two logs, the conda environments were identical, and the main difference (besides the push vs. pr) seemed to be the Travis worker version. 🤦‍♂

In the failing build, the worker version is 6.2.6, while in the in passing build, it's 6.2.5. In the failing push build for ReactionMechanismGenerator/RMG-database#367, the failing build is also on version 6.2.6. If this is an accurate correlation, then the ongoing build for #1829 (using version 6.2.6) should fail.

Differences in Travis worker versions affecting build success is not unprecedented, but the mechanism by which this particular test could be affected is not clear.

mliu49 commented 4 years ago

As additional data points, the builds for #1832 and the two recent builds of master ran on Travis worker 6.2.5 and all passed.

As predicted, the restarted build for #1829 failed...

amarkpayne commented 4 years ago

I am running on my local computer running just this test using pytest in the PyCharm debugger on

RMG-Py: b691b45dd65963a2ccbdf8b256b38b34814f1e18 RMG-database: e409c4

It fails as follows:

        # Test the generated network reaction
        dictionary = {'hydroperoxylvinoxy': Species().from_smiles('[CH2]C(=O)OO'),
                      'acetylperoxy': Species().from_smiles('CC(=O)O[O]')}
        with open(os.path.join(self.directory, 'chem.inp'), 'r') as chem:
            reaction_list = read_reactions_block(chem, dictionary)
        rxn = reaction_list[0]
        self.assertIsInstance(rxn.kinetics, Chebyshev)
>       self.assertAlmostEquals(rxn.kinetics.get_rate_coefficient(1000.0, 1.0), 88.88253229631246)
E       AssertionError: 88.8819863725482 != 88.88253229631246 within 7 places (0.0005459237642639891 difference)

This should help us debug by comparison

amarkpayne commented 4 years ago

As an update, I am working on comparing the failing branch above with the branch v2.5.0 (RMG-Py: 068ec113a7e0e6211e91bd0928de68d2b502ea0a) using the same version of the database and working in separate instances of PyCharm on my local computer. Note that the v2.5.0 is a python 2 branch, and those has different versions of our dependencies. However, this branch currently passes the tests.

With help from @mliu49, we have determined that job.K (the rate data for the pdep job) is different between these two branches. The difference is noticeable by sometimes as few as 4 significant figures in, causing enough of a difference that our chebyshev fits down the line are different.

amarkpayne commented 4 years ago

Update: The issue goes further back than the call to self.K = self.network.calculate_rate_coefficients(self.Tlist.value_si, self.Plist.value_si, self.method) in the PressureDependenceJob execute method. I noticed that Mcoll was different before this point, and upon further digging found that the density and sum of states for the species in the pdep network differ as well. This means that the error occurs at least as far back as line 270 in pdep.py self.initialize(), where we initialize the PressureDependenceJob. It could be further back than this, I am not sure.

Inside this initialize call though, we make a call to self.network.initialize(..), and it is here where we calculate the density of states. Digging into this a bit further, we end up in calculate_density_of_states on line 223 of configuration.pyx. Here I have confirmed that q_data differs between these two branches right before the call to log_q = scipy.interpolate.InterpolatedUnivariateSpline(t_data, np.log(q_data)). Before this, q_data is initialized as ones, and then modified from calls to mode.get_partition_function(t).

I made the following change to the code (because debugging cython is hard) to see if these calls differed. Here is the modification to lines 340 of configuration.pyx on current master:

for mode in modes:
    print('Mode partition function: {0}'.format(mode.get_partition_function(t)))
    q_data[i] = q_data[i] * mode.get_partition_function(t)
raise Exception

The excetpion is just so that it only prints the results from one loop. Here are the results:

Current py3 master:

Current v2.5.0:

Therefore, it appears that mode.get_partition_function returns different results. Again, this could go a lot further back (maybe the inputs to all of this are different e.g. something from the log files didn't get transferred over properly), but it at least starts to differ here

amarkpayne commented 4 years ago

I think I have finally isolated the source of the problem (although I have no idea what is going on). In torsion.pyx, we have a method HinderedRotor.solve_schrodinger_equation that makes a call E = scipy.linalg.eig_banded(H, lower=True, eigvals_only=True, overwrite_a_band=True) that appears to be the source of our problem. I first got tipped off onto this because self.energies differed between the two branches. However, I have found that the hamiltonian, H, is exactly the same on both branches, which is the only variable thing before printing the energies inside of this method shows that they differ.

I confirmed that the hamiltonians were the same by dumping the matrices as a list to a yaml file and diffing the yaml files produced by both branches. Interestingly, when I load in this hamiltonian from the yaml file into a jupyter notebook, I get the same energies regardless of which python environment (2 or 3) I use, but the energies are different from either of the energies RMG gets on either branch (no idea why).

The differences are not really close at all. Here is a brief summary of self.energies

python 3 RMG-Py (current master) from pytest in PyCharm: ...

python 2 RMG-Py (v2.5.0) from pytest in PyCharm: ...

python 3 rmg_env from a jupyter notebook, loading in the hamiltonian from yaml ...

python 2 rmg_env from a jupyter notebook, loading in the hamiltonian from yaml ...

mjohnson541 commented 4 years ago

Does setting overwrite_a_band=False impact the results?

amarkpayne commented 4 years ago

No, at least for the PyCharm cases I got the same answers as before for each branch

amarkpayne commented 4 years ago

This is the FORTRAN code that is ultimately being called. Given your past experience with FORTRAN, do you see anything interesting? Note that I am not sure which version of the code we use, so you might need to look elsewhere for the source of the correct version number. http://www.netlib.org/lapack/explore-html/d5/db8/chbevd_8f_source.html

mjohnson541 commented 4 years ago

I mean there is this bit:

 *
 *     Get machine constants.
 *
       safmin = slamch( 'Safe minimum' )
       eps = slamch( 'Precision' )
       smlnum = safmin / eps
       bignum = one / smlnum
       rmin = sqrt( smlnum )
       rmax = sqrt( bignum )

But I would think these would be the same for all modern machines.

amarkpayne commented 4 years ago

Okay we are getting somewhere now. setting overwrite_a_band=False didn't make a change, but requesting eigvals_only=False (and taking care with what is being returned) DOES change the results. Both branches in PyCharm now give different answers from before, but they still don't agree.

python 3 RMG-Py (current master) from pytest in PyCharm and requesting eigvals_only=False: ...

python 2 RMG-Py (v2.5.0) from pytest in PyCharm and requesting eigvals_only=False: ...

Note that this isn't too surprising, the LAPACK code says that it uses a different algorithm if you need to find the eigenvectors

mjohnson541 commented 4 years ago

Wait, Mark can you check the condition number on the Hamiltonian?

amarkpayne commented 4 years ago

That would indeed be the problem: np.linalg.cond(H) = 7.025714299685296e+20

Something to talk about in 10.34 :)

mjohnson541 commented 4 years ago

XD...well now we know...

amarkpayne commented 4 years ago

Nevermind. As we discussed offline, I forgot that H was formatted as a banded matrix. I created a sparse matrix from these diags (thanks @mjohnson541 for the suggestion) and got a condition number of 32239.525070193977, which is not too high. To get this, I used the following code (please verify that it makes sense).

diagonals = [H[0, :], H[1, :-1], H[2, :-2], H[3, :-3], H[4, :-4], H[5, :-5]]
A = scipy.sparse.diags(diagonals, [0, -1, -2, -3, -4, -5])
norm_A = scipy.sparse.linalg.norm(A)
norm_invA = scipy.sparse.linalg.norm(scipy.sparse.linalg.inv(A))
cond = norm_A*norm_invA

Note that H is the hamiltonian loaded from the YAML file, which is a 6x401 matrix formatted as a lower banded matrix

mjohnson541 commented 4 years ago

You should use numpy.linalg.cond instead. If the condition number is high the matrix inverse cannot be calculated accurately so its norm will be inaccurate. Check the eigenvalues it gives you after you construct it (I think you can calculate them without it taking too long, but I might be wrong) and compare with the ones you expect based on your past experiments that should tell us if we have the same matrix at least.

amarkpayne commented 4 years ago

I tried numpy.linalg.cond, but it did not like dealing with sparse matrices. My computer may be able to handle a 401x401 matrix, let me see

amarkpayne commented 4 years ago

Converting to a dense matrix and running np.linalg.cond yielded 1417.03

ReactionMechanismGenerator / RMG-Py

Arkane Pdep Functional Test Fails Often #1682

Bug Description

How To Reproduce