*Closed: spadavec closed this issue 7 years ago.*
Yikes. We've been tracking down some issues on our end where some recent changes caused huge slowdowns, so it's good to have another system where this is happening.
@lnaden is working through these performance issues one by one.
@lnaden: Can you dig into this and see if there is anything in addition to the performance issues you are already dealing with? It would also help to create individual tickets for the performance issues we've already identified so we can both better prioritize and gather more information from users.
Sure. I'll start marking each issue as I find it
I think I figured out the main problem, which is related to the `ContextCache` configuration. `YamlBuilder` leaves the default time to live (50 accesses) and sets the capacity of the cache to 3 (1 for the alchemical states and 2 for the expanded states), but during minimization we create a 4th `Context` with the default integrator (gBAOAB), which is different than the one used to propagate (Langevin). Two things happen:
1) When the number of states is big, the two `Context`s cached for the expanded cutoff states expire and are deleted during propagation.
2) When choosing a context for displacement and rotation moves (which don't require any specific integrator), `ContextCache` alternately picks the `Context` with the Langevin integrator and the one with the gBAOAB integrator created for the minimization. This keeps the latter `Context` in memory, which sometimes causes the deletion of useful `Context`s.
We need to 1) increase the `time_to_live` of the context cache, and 2) fix the `ContextCache` to prefer non-default `Context`s (I'll shortly open an issue on openmmtools about this).
Hopefully this will fix it.
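To make the failure mode concrete, here is a minimal stand-alone sketch of an access-count LRU cache behaving as described above (capacity 3, entries expiring after 50 accesses without use). The class and names are illustrative only, not the actual `openmmtools` `ContextCache` implementation:

```python
from collections import OrderedDict

class AccessCountLRUCache:
    """Toy LRU cache where an entry expires after `time_to_live` total
    cache accesses without being used. Illustrative sketch only, not
    the openmmtools ContextCache implementation."""

    def __init__(self, capacity=3, time_to_live=50):
        self.capacity = capacity
        self.time_to_live = time_to_live
        self._n_accesses = 0
        self._data = OrderedDict()  # key -> (value, last_access)

    def get(self, key, factory):
        self._n_accesses += 1
        if key in self._data:
            value, _ = self._data.pop(key)
        else:
            value = factory()
        self._data[key] = (value, self._n_accesses)  # mark most recently used
        # Expire entries not used in the last `time_to_live` accesses.
        expired = [k for k, (_, last) in self._data.items()
                   if self._n_accesses - last > self.time_to_live]
        for k in expired:
            del self._data[k]
        # Evict least recently used entries above capacity.
        while len(self._data) > self.capacity:
            self._data.popitem(last=False)
        return value

cache = AccessCountLRUCache(capacity=3, time_to_live=50)
cache.get('langevin', lambda: 'propagation context')
cache.get('gbaoab', lambda: 'minimization context')
for _ in range(60):                       # propagation uses only one context
    cache.get('langevin', lambda: 'propagation context')
print('gbaoab' in cache._data)  # False: unused for >50 accesses, so expired
```

With many thermodynamic states the same expiry mechanism removes the cached expanded-cutoff contexts mid-propagation, which is the slowdown described above.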
3 seems like a small number for the default context cache size.
What about defaulting to no limit on size until we run into a GPU out of memory error?
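One possible shape for that policy, sketched in plain Python under the assumption that a failed GPU allocation surfaces as a catchable exception (`MemoryError` stands in for it here; `create_context` and `evict_one` are hypothetical helpers, not openmmtools API):

```python
def get_or_create(cache, key, create_context, evict_one):
    """Cache contexts with no size limit; only when creating a new
    context fails (e.g. GPU out of memory), evict cached entries and
    retry. Sketch only: the callbacks are hypothetical stand-ins."""
    if key in cache:
        return cache[key]
    while True:
        try:
            context = create_context()
            break
        except MemoryError:      # stand-in for a GPU out-of-memory error
            if not cache:
                raise            # nothing left to evict; give up
            evict_one(cache)     # free the least useful cached context
    cache[key] = context
    return context
```

The appeal is that the effective capacity becomes whatever the GPU can actually hold, instead of a hard-coded 3; the catch (mentioned below) is that a real GPU OOM may segfault rather than raise cleanly.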
When an experiment is built by `YamlBuilder`, 3 should be the exact size we need. You need a higher number of `Context`s only when you modify the experiments through the API before running them, and at that point you can change the configuration of the global context cache.
Handling the segfaults caused by out-of-memory errors and automatic sizing would be nice though! I'll open an issue in openmmtools, but for now I think we're stuck with trying to prevent errors.
> When an experiment is built by `YamlBuilder`, 3 should be the exact size we need.
The `ContextCache` is going to be used much more broadly than just YANK. My guess is that someone who would want to go through the trouble of using an `LRUCache` would probably want it to cache a bit more than three contexts, or to have it figure out how many can fit on the GPU automatically (i.e., keep caching until it fails).
> The `ContextCache` is going to be used much more broadly than just YANK.
Sure! I'm not saying that the default capacity should be 3, but that `YamlBuilder` sets it to 3 before running the experiment because it doesn't need more (see here).
That number is right now application-dependent, and the global cache should always be configured. I agree though that we want to handle that automatically. Only, that's an extra feature that is not implemented right now. I'll open an issue in openmmtools.
> Sure! I'm not saying that the default capacity should be 3, but that `YamlBuilder` sets it to 3 before running the experiment because it doesn't need more (see here).
Aha! Thanks for the clarification!
@andrrizzi : Do you think https://github.com/choderalab/openmmtools/pull/235 addressed this speed issue?
It should! At least a big chunk of it.
@spadavec We will be cutting a release for YANK and OpenMMTools on Friday (7/7/17) with the fixes to hopefully resolve this issue. If you don't want to wait, you can pull down the current GitHub versions and start testing today.
@Lnaden I should be free tomorrow to try the new release and will post results as soon as I get them. Thanks for the update!
@spadavec We have just published the packages; within the hour they should be available through `conda`. Feel free to download them and test when you can!
Thanks for the patience as we have gotten these fixes in and let us know if this gets your simulations back up to speed!
@Lnaden updated to yank `0.16.2` and openmm to `7.2.0` using `conda`, and it looks like I'm getting roughly the same runtime: 13d 21h (`0.16.2`) vs. 14d 5h (`0.16.1`). (Worth noting that I do very much appreciate the ~9x speedup in replica minimization, though!)
Can you double check the conda update also updated openmmtools to 0.11.2?
openmmtools is 0.11.2
```
yanker@hopper:~/yank_files/kras$ conda list
# packages in environment at /home/yanker/miniconda2:
#
alabaster 0.7.9 py27_0
alchemy 1.2.3 py27_0 omnia
ambermini 16.16.0 6 omnia
babel 2.3.4 py27_0
cairo 1.14.8 0
cffi 1.9.1 py27_0
click 6.7 <pip>
clusterutils 0.1.2 py27_0 omnia
conda 4.3.22 py27_0
conda-env 2.6.0 0
cryptography 1.7.1 py27_0
curl 7.49.0 1
cycler 0.10.0 py27_0
cython 0.25.2 py27_0
dbus 1.10.10 0
docopt 0.6.2 py27_1 omnia
docutils 0.12 py27_2
enum34 1.1.6 py27_0
expat 2.1.0 0
fftw3f 3.3.4 2 omnia
Flask 0.12 <pip>
fontconfig 2.12.1 3
freetype 2.5.5 2
functools32 3.2.3.2 py27_0
glib 2.50.2 1
gst-plugins-base 1.8.0 0
gstreamer 1.8.0 0
hdf4 4.2.12 0
hdf5 1.8.17 1
icu 54.1 0
idna 2.1 py27_0
imagesize 0.7.1 py27_0
ipaddress 1.0.17 py27_0
itsdangerous 0.24 <pip>
jinja2 2.8.1 py27_0
jpeg 8d 2
latexcodec 1.0.1 py27_0 omnia
libffi 3.2.1 1
libgcc 5.2.0 0
libgfortran 3.0.0 1
libiconv 1.14 0
libnetcdf 4.4.1 0
libpng 1.6.27 0
libxcb 1.12 1
libxml2 2.9.4 0
markupsafe 0.23 py27_2
matplotlib 2.0.0 np111py27_0
mdtraj 1.8.0 np111py27_1 omnia
mkl 2017.0.1 0
mpi4py 2.0.0 py27_2
mpich2 1.4.1p1 0
netcdf4 1.2.4 np111py27_0
nose 1.3.7 py27_1
numexpr 2.6.1 np111py27_2
numpy 1.11.3 py27_0
numpydoc 0.6.0 py27_0
openmm 7.2.0 py27_0 omnia/label/dev
openmmtools 0.11.2 py27_0 omnia
openmoltools 0.7.4 py27_0 omnia
openssl 1.0.2j 0
oset 0.1.3 py27_1 omnia
pandas 0.19.2 np111py27_1
parmed 2.7.1 py27_0 omnia
pcre 8.39 1
pip 8.1.2 py27_0
pixman 0.34.0 0
pyasn1 0.1.9 py27_0
pybtex 0.18 py27_0 omnia
pybtex-docutils 0.2.1 py27_1 omnia
pycairo 1.10.0 py27_0
pycosat 0.6.1 py27_1
pycparser 2.17 py27_0
pycrypto 2.6.1 py27_4
pygments 2.1.3 py27_0
pymbar 3.0.0.beta2 np111py27_0 omnia
pyopenssl 16.2.0 py27_0
pyparsing 2.1.4 py27_0
pyqt 5.6.0 py27_2
pytables 3.3.0 np111py27_0
python 2.7.12 1
python-dateutil 2.6.0 py27_0
pytz 2016.10 py27_0
pyyaml 3.12 py27_0
qt 5.6.2 2
readline 6.2 2
redis 2.10.5 <pip>
requests 2.12.4 py27_0
rq 0.7.1 <pip>
ruamel_yaml 0.11.14 py27_0
schema 0.6.2 py27_0 omnia
scipy 0.18.1 np111py27_1
setuptools 27.2.0 py27_0
sip 4.18 py27_0
six 1.10.0 py27_0
snowballstemmer 1.2.1 py27_0
sphinx 1.5.1 py27_0
sphinxcontrib-bibtex 0.3.2 py27_0 omnia
sqlite 3.13.0 0
subprocess32 3.2.6 py27_0 omnia
tk 8.5.18 0
Werkzeug 0.11.15 <pip>
wheel 0.29.0 py27_0
yaml 0.1.6 0
yank 0.16.2 py27_0 omnia
zlib 1.2.8 3
```
@andrrizzi @lnaden Can you prioritize tracking down this runtime speed regression next week? This is really concerning.
This is weird. Could you post here the log file that should be in `output/experiments/experiments.log`? If you see multiple `*.log` files, the one without numbers will suffice (every parallel process keeps a log).
@andrrizzi attached here. I killed it pretty early (during the 3rd iteration), but I can run it more if you need me to.
whoops! Sorry about that!
From the log:
```
2017-07-07 16:41:58,936 - DEBUG - openmmtools.utils - Computing energy matrix took 1055.912s
```
@andrrizzi : Could something funny be going on here?
Thanks for the logs! That is very suspicious! I'll investigate.
Ah sorry, I didn't notice that the calculation was done serially in this case. That's 1055 s / (40×42) ≈ 0.63 s per energy evaluation. It seems pretty OK to me for a protein in PME solvent. I'll keep looking at the log.
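For reference, the arithmetic behind that per-evaluation estimate:

```python
# 40 replicas x 42 states evaluated serially in one energy-matrix pass.
n_replicas, n_states = 40, 42
wall_time_s = 1055.912  # from the log line above
per_evaluation = wall_time_s / (n_replicas * n_states)
print(f"{per_evaluation:.2f} s per energy evaluation")  # ~0.63 s
```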
@andrrizzi is this a change from `0.15`? I can use the same yaml file in previous versions and it distributes properly.
No that shouldn't have changed much, but if you had the log file and/or the yaml file that you used in the previous calculation, that would be awesome.
Attached the first ~6 iterations' worth of the output for the `0.15` run. It used the same YAML file as the `0.16` run, which is attached in the .zip file in the original post.
@andrrizzi I'm looking at the logs as well. Energy computation wall clock on the old version was 90 s; on the new one it was 1000+ s. That seems very odd, almost like it's thrashing the contexts each time. At least that is the only thing I can think of off the top of my head, since both logs were from serial runs on the same hardware. We'll keep looking into it.
Ah! I think I've figured it out. I think it's an inefficiency in `ThermodynamicState`. I'll come up with a solution and push the fix over the weekend. Thanks for the log file @spadavec, that was really helpful!
Thanks, everyone!
Sorry, that was totally a red herring. There are things that can still be optimized in openmmtools, but I ran a few `timeit` benchmarks and it looks like they can give us at most a few extra seconds per iteration.
I believe most of this slowdown comes from the fact that you are using `openmm` from the `dev` label. When I switched to that, the time to compute the energy matrix increased 6-7 times. More specifically, it seems to take ~10 times longer to recompute the analytical long-range dispersion correction when we change `lambda_sterics` (probably some compiler flag deactivated?).
However, there was a change going from 0.15 to 0.16 in the way we compute the energy matrix. In 0.15 we computed the matrix by columns (we set the lambda parameters in the context and keep setting positions and box vectors); in 0.16 we compute it by rows (we set positions and box vectors once and keep changing the lambda parameters). So, with the new way, we recompute the correction for all the states that involve a change in `lambda_sterics`. This is why with `openmm-dev` the problem got so accentuated.
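The effect of the traversal order can be illustrated with a toy count of how many consecutive energy evaluations see a different `lambda_sterics` (and hence trigger a dispersion-correction recomputation). This is pure Python with a made-up protocol of roughly the shape of this 40-replica, 42-state setup, not the actual YANK code path:

```python
def count_dispersion_updates(order, n_replicas, lambdas):
    """Count how often lambda_sterics changes between consecutive
    energy evaluations when traversing the (replica x state) energy
    matrix in the given order. Toy model of the cost described above."""
    if order == 'columns':   # 0.15: fix the state, sweep the replicas
        sequence = [lam for lam in lambdas for _ in range(n_replicas)]
    else:                    # 0.16 'rows': fix the replica, sweep the states
        sequence = [lam for _ in range(n_replicas) for lam in lambdas]
    return sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)

# Hypothetical protocol: lambda_sterics constant over the first states,
# then decreasing (an assumed stand-in, not the actual YAML protocol).
lambdas = [1.0] * 17 + [round(1.0 - i / 25, 2) for i in range(25)]
print(count_dispersion_updates('columns', 40, lambdas))  # 24
print(count_dispersion_updates('rows', 40, lambdas))     # 999
```

By columns, `lambda_sterics` changes only once per column boundary; by rows, it changes on nearly every evaluation within each replica's row, so the recomputation count scales roughly with the full matrix size.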
I used your input files to run under a few conditions. Here are the timings for computing the energy matrix:
1. dev openmm 7.2, calculations by rows: ~2100s/iteration
2. main openmm 7.1.1, calculations by rows: ~320s/iteration
3. main openmm 7.1.1, calculations by columns (old method): ~230s/iteration
4. main openmm 7.1.1, calculations by columns (old method) with analytical correction disabled: ~225s/iteration
5. dev openmm 7.2, calculations by rows with analytical correction disabled: ~25s/iteration!!
I think we should consider estimating the loss of statistical efficiency without analytical correction when reweighting to the expanded cutoff state. If it's not too bad, it looks like disabling it should give us massive speed ups.
Good catch, @andrrizzi!
I'm surprised at the difference between `openmm` 7.2 and 7.1.1 builds, but as @andrrizzi suggests, it's very possible a compiler flag accidentally got switched off. I can certainly get this fixed in a day or so.
We only recently noticed how long the recomputation of dispersion correction contributions takes; findings are summarized in this OpenMM issue: https://github.com/pandegroup/openmm/issues/1832
Essentially, `CustomNonbondedForce` takes an extremely long time to update the dispersion correction when the parameters change because it has to carry out a bunch of numerical quadrature computations, while `NonbondedForce` takes much less time because it can compute the integrals analytically. This computation is all currently done on the CPU, so it's not fast. We have a number of `CustomNonbondedForce`s in our alchemically-modified systems, which could amplify the problem.
The simplest fix, as @andrrizzi suggests, is to disable the long-range correction just for the alchemical region. For small molecules, this should be fine, and we can do it by just setting the `disable_alchemical_dispersion_correction` flag to `True`, though this may still incur the long-range dispersion correction overhead for `NonbondedForce` if OpenMM is not clever about figuring out which `Force` objects need dispersion corrections updated---we will have to look into the code to see. Disabling the entire dispersion correction would be problematic because it would result in significantly perturbed densities and a huge loss of alchemical overlap.
The scheme we suggested, where we compute only the `Force` groups that have alchemical interactions, would also speed things up significantly with `disable_alchemical_dispersion_correction=True`. It may be worth seeing how hard it would be to do this, since this could offer significant speed boosts over simply disabling the alchemical dispersion correction, because we would avoid computing most interactions then.
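An untested sketch of that force-group idea, using the standard OpenMM `setForceGroup` / `getState(groups=...)` API (the grouping policy here is an assumption for illustration, not existing YANK code):

```python
# Untested sketch: isolate the alchemical forces in their own force group
# so only that group is re-evaluated when lambda parameters change.
from simtk import openmm  # OpenMM 7.x import path

ALCHEMICAL_GROUP = 1

def split_force_groups(system):
    """Put CustomNonbondedForce (alchemically modified) in group 1 and
    everything else in group 0. A real implementation would need a more
    careful test for which forces depend on lambda parameters."""
    for force in system.getForces():
        if isinstance(force, openmm.CustomNonbondedForce):
            force.setForceGroup(ALCHEMICAL_GROUP)
        else:
            force.setForceGroup(0)

def alchemical_energy(context):
    """Evaluate only the alchemical force group."""
    state = context.getState(getEnergy=True, groups={ALCHEMICAL_GROUP})
    return state.getPotentialEnergy()
```

The non-alchemical group's energy is unchanged across states with the same configuration, so it would only need to be computed once per row of the energy matrix.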
Finally, some other ideas about how to speed things up:
- Keep a `Context` for each alchemical state. Modern GPUs almost certainly have enough memory for systems of practical size.

Notably, this problem likely affects the propagation times too, though it's less noticeable because there are only O(`nstates`) recomputations involved rather than O(`nstates^2`) for energy computations.
Hm, there are a few minor path differences between the `conda-recipes` OpenMM `build.sh` and the `conda-dev-recipes` OpenMM `build.sh` now, but I think both were recently updated (since the OpenMM 7.1.1 release) to switch from the `omnia-build-box` to the `omnia-linux-anvil` docker images, which could be the source of the performance regression.
I'm roping in @Lnaden to help look into the OpenMM performance issues if simply equalizing the `build.sh` files doesn't fix the performance regression in recomputing dispersion forces between 7.1.1 and 7.2. We can likely come up with a very simple test case that just times how long it takes to compute energies for different `lambda_sterics` values for an alchemical system, without needing to run YANK to test.
> if OpenMM is not clever about figuring out which `Force` objects need dispersion corrections updated
I think OpenMM handles it well, because the first 17 states (where `lambda_sterics` is constant) are always computed very quickly (w/ dev openmm it takes about ~0.05 s for constant `lambda_sterics` against ~2.6 s when `lambda_sterics` changes). So in practice there should be only a single `CustomNonbondedForce` that gets recomputed.
Keep in mind that these numbers come from hal, where CPU operations are quite slow (at least for jobs in the gpu queue). Those numbers on lilac with the same GPU are literally less than half, so the difference between computing the matrix by rows or columns might be much smaller (if not reversed). I can test this during the week.
Great! I think we should try the `disable_alchemical_dispersion_correction=True` solution first, then. That might be sufficient for now.
In the meantime, it looks like the safe thing (for me) is to roll back openmm to `7.1.1`, openmmtools to `0.9.3`, and yank to `0.15`. Is that correct?
I think so! @andrrizzi can confirm.
That would work! I think even only rolling back to the openmm in the main channel should bring the total runtime to something close to 3-3.5 days. Alternatively, you can go into the code and substitute this line with

```python
alchemical_factory = mmtools.alchemy.AbsoluteAlchemicalFactory(disable_alchemical_dispersion_correction=True)
```

This should be even faster than 0.15 and bring the calculation to less than 2 days.
I did check the overlap between the expanded cutoff states and the alchemical state with and without the analytical correction for the alchemical atoms, and there was very little difference. The std of the difference of the energies was ~0.015 kT (even less for the solvent phase), so I don't think disabling the analytical correction will affect our free energies much.
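A quick sanity check on why a ~0.015 kT standard deviation is harmless: for a Gaussian perturbation with mean zero, the exponential-reweighting free energy shift is -σ²/2 in kT, which is of order 10⁻⁴ here. A small numerical check with synthetic samples (not the actual simulation data):

```python
import math
import random

sigma = 0.015  # std of the energy differences, in kT (number quoted above)

# For Delta_u ~ N(0, sigma^2), dF = -ln<exp(-Delta_u)> = -sigma^2 / 2.
analytic_shift = -sigma ** 2 / 2

# Numerical check with synthetic Gaussian samples.
random.seed(1)
samples = [random.gauss(0.0, sigma) for _ in range(500_000)]
numeric_shift = -math.log(sum(math.exp(-u) for u in samples) / len(samples))

print(analytic_shift)  # ~ -1.1e-4 kT: negligible on the scale of typical dF
```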
An alternative cause of the openmm version differences is this PR, which introduced changes to how the `CustomNonbondedForce` dispersion correction is computed in an attempt to speed things up:
Just made the change @andrrizzi suggested (kept everything @ `0.16`) and it indeed got the total runtime down to <2 days (I'd put a jeff_goldblum_says_wow.gif here if I had one).
Glad to hear, @spadavec!
Based on the tests I ran, simply disabling the correction should not give us problems. Nevertheless, we could think of modifying `AbsoluteAlchemicalFactory` to always disable the openmm correction, and include it in the energy expression of the `CustomNonbondedForce` as a pre-computed additive term that depends on the `lambda_sterics` global parameter. Maybe I'm wrong, but I don't think the expression will be too hard to derive.
Although now that I think about it, getting `lambda_sterics` out of the integral from the effective softcore radius term may not be trivial.
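Roughly, the obstacle is the following (the softcore form below follows the usual openmmtools-style functional form, so the exact expression is an assumption):

```latex
% Long-range dispersion correction beyond the cutoff r_c:
E_{\mathrm{lr}}(\lambda) \;=\; \frac{2\pi N^2}{V}
    \int_{r_c}^{\infty} u(r;\lambda)\, r^2\, dr,
\qquad
u(r;\lambda) \;=\; 4\epsilon\,\lambda \left( x^2 - x \right),
\quad
x \;=\; \frac{\sigma^6}{\alpha (1-\lambda)\, \sigma^6 + r^6}.
```

The prefactor `lambda` factors out of the integral, but `lambda` also appears inside `x` through the effective softcore radius, so the integral itself is a nontrivial function of `lambda` and cannot be pre-computed as a simple additive term without redoing the quadrature for every `lambda` value.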
> Based on the tests I ran, simply disabling the correction should not give us problems. Nevertheless, we could think of modifying `AbsoluteAlchemicalFactory` to always disable the openmm correction, and include it in the energy expression of the `CustomNonbondedForce` as a pre-computed additive term that depends on the `lambda_sterics` global parameter. Maybe I'm wrong, but I don't think the expression will be too hard to derive.
I think we'll be fine with `disable_alchemical_dispersion_correction=True` for now!
Ok! I'll open a PR soon with the change.
Running the same calculation (e.g. same inputs and yaml files) using yank `0.15.2.dev0` vs `0.16.1` causes the total runtime to increase from ~2.5 days to ~14 days. I've looked through the yaml documentation, and I don't see anything that could have changed from version to version (with the exception of how MPI runs are conducted, but this calculation was done in serial), but I have a feeling I'm wrong there. I've attached my input files here; am I missing something very obvious, or are the estimated times-to-completion in `0.16` not accurate?

Note: Within the zipfile, the `nohup.out` is the (limited) output from the `0.16` run, and I'm using the same hardware for both runs (1080 gtx, i7, ~50GB memory).

4LUC.zip.tar.gz