*Closed: spadavec closed this issue 7 years ago.*
Yikes. We've been tracking down some issues on our end where some recent changes caused huge slowdowns, so it's good to have another system where this is happening.
@lnaden is working through these performance issues one by one.
@lnaden: Can you dig into this and see if there is anything in addition to the performance issues you are already dealing with? It would also help to create individual tickets for the performance issues we've already identified so we can both better prioritize and gather more information from users.
Sure. I'll start marking each issue as I find it
I think I figured out the main problem, which is related to the `ContextCache` configuration. `YamlBuilder` leaves the default time to live (50 accesses) and sets the capacity of the cache to 3 (1 for the alchemical states and 2 for the expanded states), but during minimization we create a 4th `Context` with the default integrator (gBAOAB), which is different than the one used to propagate (Langevin). Two things happen:
1) When the number of states is big, the two `Context`s cached for the expanded cutoff states expire and are deleted during propagation.
2) When choosing a context for displacement and rotation moves (which don't require any specific integrator), `ContextCache` alternately picks the `Context` with the Langevin integrator and the one with the gBAOAB integrator created for the minimization. This keeps the latter `Context` in memory, which sometimes causes the deletion of useful `Context`s.
We need to 1) increase the `time_to_live` of the context cache, and 2) fix the `ContextCache` to prefer non-default `Context`s (I'll shortly open an issue on openmmtools about this).
Hopefully this will fix it.
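To make the failure mode concrete, here is a minimal stand-alone sketch of an access-count LRU cache behaving as described above (capacity 3, entries expiring after 50 accesses without use). The class and names are illustrative only, not the actual `openmmtools` `ContextCache` implementation:

```python
from collections import OrderedDict

class AccessCountLRUCache:
    """Toy LRU cache where an entry expires after `time_to_live` total
    cache accesses without being used. Illustrative sketch only, not
    the openmmtools ContextCache implementation."""

    def __init__(self, capacity=3, time_to_live=50):
        self.capacity = capacity
        self.time_to_live = time_to_live
        self._n_accesses = 0
        self._data = OrderedDict()  # key -> (value, last_access)

    def get(self, key, factory):
        self._n_accesses += 1
        if key in self._data:
            value, _ = self._data.pop(key)
        else:
            value = factory()
        self._data[key] = (value, self._n_accesses)  # mark most recently used
        # Expire entries not used in the last `time_to_live` accesses.
        expired = [k for k, (_, last) in self._data.items()
                   if self._n_accesses - last > self.time_to_live]
        for k in expired:
            del self._data[k]
        # Evict least recently used entries above capacity.
        while len(self._data) > self.capacity:
            self._data.popitem(last=False)
        return value

cache = AccessCountLRUCache(capacity=3, time_to_live=50)
cache.get('langevin', lambda: 'propagation context')
cache.get('gbaoab', lambda: 'minimization context')
for _ in range(60):                       # propagation uses only one context
    cache.get('langevin', lambda: 'propagation context')
print('gbaoab' in cache._data)  # False: unused for >50 accesses, so expired
```

With many thermodynamic states the same expiry mechanism removes the cached expanded-cutoff contexts mid-propagation, which is the slowdown described above.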
3 seems like a small number for the default context cache size.
What about defaulting to no limit on size until we run into a GPU out of memory error?
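One possible shape for that policy, sketched in plain Python under the assumption that a failed GPU allocation surfaces as a catchable exception (`MemoryError` stands in for it here; `create_context` and `evict_one` are hypothetical helpers, not openmmtools API):

```python
def get_or_create(cache, key, create_context, evict_one):
    """Cache contexts with no size limit; only when creating a new
    context fails (e.g. GPU out of memory), evict cached entries and
    retry. Sketch only: the callbacks are hypothetical stand-ins."""
    if key in cache:
        return cache[key]
    while True:
        try:
            context = create_context()
            break
        except MemoryError:      # stand-in for a GPU out-of-memory error
            if not cache:
                raise            # nothing left to evict; give up
            evict_one(cache)     # free the least useful cached context
    cache[key] = context
    return context
```

The appeal is that the effective capacity becomes whatever the GPU can actually hold, instead of a hard-coded 3; the catch (mentioned below) is that a real GPU OOM may segfault rather than raise cleanly.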
When an experiment is built by `YamlBuilder`, 3 should be the exact size we need. You need a higher number of `Context`s only when you modify the experiments through the API before running them, and at that point you can change the configuration of the global context cache.
Handling the segfaults caused by out-of-memory errors and automatic sizing would be nice though! I'll open an issue in openmmtools, but for now I think we're stuck with trying to prevent errors.
> When an experiment is built by `YamlBuilder`, 3 should be the exact size we need.
The `ContextCache` is going to be used much more broadly than just YANK. My guess is that someone who would want to go through the trouble of using an `LRUCache` would probably want it to cache a bit more than three contexts, or to have it figure out how many can fit on the GPU automatically (i.e., keep caching until it fails).
> The `ContextCache` is going to be used much more broadly than just YANK.
Sure! I'm not saying that the default capacity should be 3, but that `YamlBuilder` sets it to 3 before running the experiment because it doesn't need more (see here).
That number is right now application-dependent, and the global cache should always be configured. I agree though that we want to handle that automatically. Only, that's an extra feature that is not implemented right now. I'll open an issue in openmmtools.
> Sure! I'm not saying that the default capacity should be 3, but that `YamlBuilder` sets it to 3 before running the experiment because it doesn't need more (see here).
Aha! Thanks for the clarification!
@andrrizzi : Do you think https://github.com/choderalab/openmmtools/pull/235 addressed this speed issue?
It should! At least a big chunk of it.
@spadavec We will be cutting a release for YANK and OpenMMTools on Friday (7/7/17) with the fixes to hopefully resolve this issue. If you don't want to wait, you can pull down the current GitHub versions and start testing today.
@Lnaden I should be free tomorrow to try the new release and will post results as soon as I get them. Thanks for the update!
@spadavec We have just published the packages; within the hour they should be available through `conda`. Feel free to download them and test when you can!
Thanks for the patience as we have gotten these fixes in and let us know if this gets your simulations back up to speed!
@Lnaden updated to yank `0.16.2` and openmm to `7.2.0` using `conda`, and it looks like I'm getting roughly the same runtime: 13d 21h (`0.16.2`) vs. 14d 5h (`0.16.1`). (Worth noting that I do very much appreciate the ~9x speedup in replica minimization, though!)
Can you double check the conda update also updated openmmtools to 0.11.2?
openmmtools is 0.11.2
```
yanker@hopper:~/yank_files/kras$ conda list
# packages in environment at /home/yanker/miniconda2:
#
alabaster 0.7.9 py27_0
alchemy 1.2.3 py27_0 omnia
ambermini 16.16.0 6 omnia
babel 2.3.4 py27_0
cairo 1.14.8 0
cffi 1.9.1 py27_0
click 6.7 <pip>
clusterutils 0.1.2 py27_0 omnia
conda 4.3.22 py27_0
conda-env 2.6.0 0
cryptography 1.7.1 py27_0
curl 7.49.0 1
cycler 0.10.0 py27_0
cython 0.25.2 py27_0
dbus 1.10.10 0
docopt 0.6.2 py27_1 omnia
docutils 0.12 py27_2
enum34 1.1.6 py27_0
expat 2.1.0 0
fftw3f 3.3.4 2 omnia
Flask 0.12 <pip>
fontconfig 2.12.1 3
freetype 2.5.5 2
functools32 3.2.3.2 py27_0
glib 2.50.2 1
gst-plugins-base 1.8.0 0
gstreamer 1.8.0 0
hdf4 4.2.12 0
hdf5 1.8.17 1
icu 54.1 0
idna 2.1 py27_0
imagesize 0.7.1 py27_0
ipaddress 1.0.17 py27_0
itsdangerous 0.24 <pip>
jinja2 2.8.1 py27_0
jpeg 8d 2
latexcodec 1.0.1 py27_0 omnia
libffi 3.2.1 1
libgcc 5.2.0 0
libgfortran 3.0.0 1
libiconv 1.14 0
libnetcdf 4.4.1 0
libpng 1.6.27 0
libxcb 1.12 1
libxml2 2.9.4 0
markupsafe 0.23 py27_2
matplotlib 2.0.0 np111py27_0
mdtraj 1.8.0 np111py27_1 omnia
mkl 2017.0.1 0
mpi4py 2.0.0 py27_2
mpich2 1.4.1p1 0
netcdf4 1.2.4 np111py27_0
nose 1.3.7 py27_1
numexpr 2.6.1 np111py27_2
numpy 1.11.3 py27_0
numpydoc 0.6.0 py27_0
openmm 7.2.0 py27_0 omnia/label/dev
openmmtools 0.11.2 py27_0 omnia
openmoltools 0.7.4 py27_0 omnia
openssl 1.0.2j 0
oset 0.1.3 py27_1 omnia
pandas 0.19.2 np111py27_1
parmed 2.7.1 py27_0 omnia
pcre 8.39 1
pip 8.1.2 py27_0
pixman 0.34.0 0
pyasn1 0.1.9 py27_0
pybtex 0.18 py27_0 omnia
pybtex-docutils 0.2.1 py27_1 omnia
pycairo 1.10.0 py27_0
pycosat 0.6.1 py27_1
pycparser 2.17 py27_0
pycrypto 2.6.1 py27_4
pygments 2.1.3 py27_0
pymbar 3.0.0.beta2 np111py27_0 omnia
pyopenssl 16.2.0 py27_0
pyparsing 2.1.4 py27_0
pyqt 5.6.0 py27_2
pytables 3.3.0 np111py27_0
python 2.7.12 1
python-dateutil 2.6.0 py27_0
pytz 2016.10 py27_0
pyyaml 3.12 py27_0
qt 5.6.2 2
readline 6.2 2
redis 2.10.5 <pip>
requests 2.12.4 py27_0
rq 0.7.1 <pip>
ruamel_yaml 0.11.14 py27_0
schema 0.6.2 py27_0 omnia
scipy 0.18.1 np111py27_1
setuptools 27.2.0 py27_0
sip 4.18 py27_0
six 1.10.0 py27_0
snowballstemmer 1.2.1 py27_0
sphinx 1.5.1 py27_0
sphinxcontrib-bibtex 0.3.2 py27_0 omnia
sqlite 3.13.0 0
subprocess32 3.2.6 py27_0 omnia
tk 8.5.18 0
Werkzeug 0.11.15 <pip>
wheel 0.29.0 py27_0
yaml 0.1.6 0
yank 0.16.2 py27_0 omnia
zlib 1.2.8 3
```
@andrrizzi @lnaden Can you prioritize tracking down this runtime speed regression next week? This is really concerning.
This is weird. Could you post here the log file that should be in `output/experiments/experiments.log`? If you see multiple `*.log` files, the one without numbers will suffice (every parallel process keeps a log).
@andrrizzi attached here. I killed it pretty early (during the 3rd iteration), but I can run it more if you need me to.
whoops! Sorry about that!
From the log:
```
2017-07-07 16:41:58,936 - DEBUG - openmmtools.utils - Computing energy matrix took 1055.912s
```
@andrrizzi : Could something funny be going on here?
Thanks for the logs! That is very suspicious! I'll investigate.
Ah sorry, I didn't notice that the calculation was done serially in this case. That's 1055 s / (40×42) ≈ 0.63 s per energy evaluation. It seems pretty OK to me for a protein in PME solvent. I'll keep looking at the log.
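For reference, the arithmetic behind that per-evaluation estimate:

```python
# 40 replicas x 42 states evaluated serially in one energy-matrix pass.
n_replicas, n_states = 40, 42
wall_time_s = 1055.912  # from the log line above
per_evaluation = wall_time_s / (n_replicas * n_states)
print(f"{per_evaluation:.2f} s per energy evaluation")  # ~0.63 s
```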
@andrrizzi is this a change from `0.15`? I can use the same yaml file in previous versions and it distributes properly.
No that shouldn't have changed much, but if you had the log file and/or the yaml file that you used in the previous calculation, that would be awesome.
Attached the first ~6 iterations' worth of the output for the `0.15` run. It used the same YAML file as the `0.16` run, which is attached in the .zip file in the original post.
@andrrizzi I'm looking at the logs as well. Energy computation wall clock on the old version was 90 s; on the new one it was 1000+ s. That seems very odd, almost like it's thrashing the contexts each time. At least that is the only thing I can think of off the top of my head, since both logs were from serial runs on the same hardware. We'll keep looking into it.
Ah! I think I've figured it out. I think it's an inefficiency in `ThermodynamicState`. I'll come up with a solution and push the fix over the weekend. Thanks for the log file @spadavec, that was really helpful!
Thanks, everyone!
Sorry, that was totally a red herring. There are things that can still be optimized in openmmtools, but I ran a few `timeit` benchmarks and it looks like they can give us at most a few extra seconds per iteration.
I believe most of this slowdown comes from the fact that you are using `openmm` from the `dev` label. When I switched to that, the time to compute the energy matrix increased 6-7 times. More specifically, it seems to take ~10 times longer to recompute the analytical long-range dispersion correction when we change `lambda_sterics` (probably some compiler flag deactivated?).
However, there was a change going from 0.15 to 0.16 in the way we compute the energy matrix. In 0.15 we computed the matrix by columns (we set the lambda parameters in the context and keep setting positions and box vectors); in 0.16 we compute it by rows (we set positions and box vectors once and keep changing the lambda parameters). So, with the new way, we recompute the correction for all the states that involve a change in `lambda_sterics`. This is why with `openmm-dev` the problem got so accentuated.
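The effect of the traversal order can be illustrated with a toy count of how many consecutive energy evaluations see a different `lambda_sterics` (and hence trigger a dispersion-correction recomputation). This is pure Python with a made-up protocol of roughly the shape of this 40-replica, 42-state setup, not the actual YANK code path:

```python
def count_dispersion_updates(order, n_replicas, lambdas):
    """Count how often lambda_sterics changes between consecutive
    energy evaluations when traversing the (replica x state) energy
    matrix in the given order. Toy model of the cost described above."""
    if order == 'columns':   # 0.15: fix the state, sweep the replicas
        sequence = [lam for lam in lambdas for _ in range(n_replicas)]
    else:                    # 0.16 'rows': fix the replica, sweep the states
        sequence = [lam for _ in range(n_replicas) for lam in lambdas]
    return sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)

# Hypothetical protocol: lambda_sterics constant over the first states,
# then decreasing (an assumed stand-in, not the actual YAML protocol).
lambdas = [1.0] * 17 + [round(1.0 - i / 25, 2) for i in range(25)]
print(count_dispersion_updates('columns', 40, lambdas))  # 24
print(count_dispersion_updates('rows', 40, lambdas))     # 999
```

By columns, `lambda_sterics` changes only once per column boundary; by rows, it changes on nearly every evaluation within each replica's row, so the recomputation count scales roughly with the full matrix size.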
I used your input files to run under a few conditions. Here are the timings for computing the energy matrix:
1. dev openmm 7.2, calculations by rows: ~2100s/iteration
2. main openmm 7.1.1, calculations by rows: ~320s/iteration
3. main openmm 7.1.1, calculations by columns (old method): ~230s/iteration
4. main openmm 7.1.1, calculations by columns (old method) with analytical correction disabled: ~225s/iteration
5. dev openmm 7.2, calculations by rows with analytical correction disabled: ~25s/iteration!!
I think we should consider estimating the loss of statistical efficiency without analytical correction when reweighting to the expanded cutoff state. If it's not too bad, it looks like disabling it should give us massive speed ups.
Good catch, @andrrizzi!
I'm surprised at the difference between `openmm` 7.2 and 7.1.1 builds, but as @andrrizzi suggests, it's very possible a compiler flag accidentally got switched off. I can certainly get this fixed in a day or so.
We only recently noticed how long the recomputation of dispersion correction contributions takes; findings are summarized in this OpenMM issue: https://github.com/pandegroup/openmm/issues/1832
Essentially, `CustomNonbondedForce` takes an extremely long time to update the dispersion correction when the parameters change because it has to carry out a bunch of numerical quadrature computations, while `NonbondedForce` takes much less time because it can compute the integrals analytically. This computation is all currently done on the CPU, so it's not fast. We have a number of `CustomNonbondedForce`s in our alchemically-modified systems, which could amplify the problem.
The simplest fix, as @andrrizzi suggests, is to disable the long-range correction just for the alchemical region. For small molecules, this should be fine, and we can do it by just setting the `disable_alchemical_dispersion_correction` flag to `True`, though this may still incur the long-range dispersion correction overhead for `NonbondedForce` if OpenMM is not clever about figuring out which `Force` objects need dispersion corrections updated---we will have to look into the code to see. Disabling the entire dispersion correction would be problematic because it would result in significantly perturbed densities and a huge loss of alchemical overlap.
The scheme we suggested, where we compute only the `Force` groups that have alchemical interactions, would also speed things up significantly with `disable_alchemical_dispersion_correction=True`. It may be worth seeing how hard it would be to do this, since this could offer significant speed boosts over simply disabling the alchemical dispersion correction, because we would avoid computing most interactions then.
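An untested sketch of that force-group idea, using the standard OpenMM `setForceGroup` / `getState(groups=...)` API (the grouping policy here is an assumption for illustration, not existing YANK code):

```python
# Untested sketch: isolate the alchemical forces in their own force group
# so only that group is re-evaluated when lambda parameters change.
from simtk import openmm  # OpenMM 7.x import path

ALCHEMICAL_GROUP = 1

def split_force_groups(system):
    """Put CustomNonbondedForce (alchemically modified) in group 1 and
    everything else in group 0. A real implementation would need a more
    careful test for which forces depend on lambda parameters."""
    for force in system.getForces():
        if isinstance(force, openmm.CustomNonbondedForce):
            force.setForceGroup(ALCHEMICAL_GROUP)
        else:
            force.setForceGroup(0)

def alchemical_energy(context):
    """Evaluate only the alchemical force group."""
    state = context.getState(getEnergy=True, groups={ALCHEMICAL_GROUP})
    return state.getPotentialEnergy()
```

The non-alchemical group's energy is unchanged across states with the same configuration, so it would only need to be computed once per row of the energy matrix.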
Finally, some other ideas about how to speed things up:
- Keep a `Context` for each alchemical state. Modern GPUs almost certainly have enough memory for systems of practical size.

Notably, this problem likely affects the propagation times too, though it's less noticeable because there are only O(`nstates`) recomputations involved rather than O(`nstates^2`) for energy computations.
Hm, there are a few minor path differences between the `conda-recipes` OpenMM `build.sh` and the `conda-dev-recipes` OpenMM `build.sh` now, but I think both were recently updated (since the OpenMM 7.1.1 release) to switch from the `omnia-build-box` to the `omnia-linux-anvil` docker images, which could be the source of the performance regression.
I'm roping in @Lnaden to help look into the OpenMM performance issues if simply equalizing the `build.sh` files doesn't fix the performance regression in recomputing dispersion forces between 7.1.1 and 7.2. We can likely come up with a very simple test case that just times how long it takes to compute energies for different `lambda_sterics` values for an alchemical system, without needing to run YANK to test.
> if OpenMM is not clever about figuring out which `Force` objects need dispersion corrections updated
I think OpenMM handles it well, because the first 17 states (where `lambda_sterics` is constant) are always computed very quickly (w/ dev openmm it takes about ~0.05 s for constant `lambda_sterics` against ~2.6 s when `lambda_sterics` changes). So in practice there should be only a single `CustomNonbondedForce` that gets recomputed.
Keep in mind that these numbers come from hal, where CPU operations are quite slow (at least for jobs in the gpu queue). Those numbers on lilac with the same GPU are literally less than half, so the difference between computing the matrix by rows or columns might be much smaller (if not reversed). I can test this during the week.
Great! I think we should try the `disable_alchemical_dispersion_correction=True` solution first, then. That might be sufficient for now.
In the meantime, it looks like the safe thing (for me) is to roll back openmm to `7.1.1`, openmmtools to `0.9.3`, and yank to `0.15`. Is that correct?
I think so! @andrrizzi can confirm.
That would work! I think even only rolling back to the openmm in the main channel should bring the total runtime to something close to 3-3.5 days. Alternatively, you can go into the code and substitute this line with

```python
alchemical_factory = mmtools.alchemy.AbsoluteAlchemicalFactory(disable_alchemical_dispersion_correction=True)
```

This should be even faster than 0.15 and bring the calculation to less than 2 days.
I did check the overlap between the expanded cutoff states and the alchemical state with and without the analytical correction for the alchemical atoms, and there was very little difference. The std of the difference of the energies was ~0.015 kT (even less for the solvent phase), so I don't think disabling the analytical correction will affect our free energies much.
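A quick sanity check on why a ~0.015 kT standard deviation is harmless: for a Gaussian perturbation with mean zero, the exponential-reweighting free energy shift is -σ²/2 in kT, which is of order 10⁻⁴ here. A small numerical check with synthetic samples (not the actual simulation data):

```python
import math
import random

sigma = 0.015  # std of the energy differences, in kT (number quoted above)

# For Delta_u ~ N(0, sigma^2), dF = -ln<exp(-Delta_u)> = -sigma^2 / 2.
analytic_shift = -sigma ** 2 / 2

# Numerical check with synthetic Gaussian samples.
random.seed(1)
samples = [random.gauss(0.0, sigma) for _ in range(500_000)]
numeric_shift = -math.log(sum(math.exp(-u) for u in samples) / len(samples))

print(analytic_shift)  # ~ -1.1e-4 kT: negligible on the scale of typical dF
```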
An alternative cause of the openmm version differences is this PR, which introduced changes to how the `CustomNonbondedForce` dispersion correction is computed in an attempt to speed things up:
Just made the change @andrrizzi suggested (kept everything @ `0.16`) and it indeed got the total runtime down to <2 days (I'd put a jeff_goldblum_says_wow.gif here if I had one).
Glad to hear, @spadavec!
Based on the tests I ran, simply disabling the correction should not give us problems. Nevertheless, we could think of modifying `AbsoluteAlchemicalFactory` to always disable the openmm correction, and include it in the energy expression of the `CustomNonbondedForce` as a pre-computed additive term that depends on the `lambda_sterics` global parameter. Maybe I'm wrong, but I don't think the expression will be too hard to derive.
Although now that I think about it, getting `lambda_sterics` out of the integral from the effective softcore radius term may not be trivial.
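Roughly, the obstacle is the following (the softcore form below follows the usual openmmtools-style functional form, so the exact expression is an assumption):

```latex
% Long-range dispersion correction beyond the cutoff r_c:
E_{\mathrm{lr}}(\lambda) \;=\; \frac{2\pi N^2}{V}
    \int_{r_c}^{\infty} u(r;\lambda)\, r^2\, dr,
\qquad
u(r;\lambda) \;=\; 4\epsilon\,\lambda \left( x^2 - x \right),
\quad
x \;=\; \frac{\sigma^6}{\alpha (1-\lambda)\, \sigma^6 + r^6}.
```

The prefactor `lambda` factors out of the integral, but `lambda` also appears inside `x` through the effective softcore radius, so the integral itself is a nontrivial function of `lambda` and cannot be pre-computed as a simple additive term without redoing the quadrature for every `lambda` value.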
> Based on the tests I ran, simply disabling the correction should not give us problems. Nevertheless, we could think of modifying `AbsoluteAlchemicalFactory` to always disable the openmm correction, and include it in the energy expression of the `CustomNonbondedForce` as a pre-computed additive term that depends on the `lambda_sterics` global parameter. Maybe I'm wrong, but I don't think the expression will be too hard to derive.
I think we'll be fine with `disable_alchemical_dispersion_correction=True` for now!
Ok! I'll open a PR soon with the change.
Running the same calculation (e.g. same inputs and yaml files) using yank `0.15.2.dev0` vs `0.16.1` causes the total runtime to increase from ~2.5 days to ~14 days. I've looked through the yaml documentation, and I don't see anything that could have changed from version to version (with the exception of how MPI runs are conducted, but this calculation was done in serial), but I have a feeling I'm wrong there. I've attached my input files here; am I missing something very obvious, or are the estimated times-to-completion in `0.16` not accurate?

Note: Within the zipfile, the `nohup.out` is the (limited) output from the `0.16` run, and I'm using the same hardware for both runs (1080 gtx, i7, ~50GB memory).

4LUC.zip.tar.gz