choderalab / TrustButVerify

Unit Tests for Protein Force Fields
GNU General Public License v2.0

Liquid mixture discussion: Simulation procedures #7

Open jchodera opened 10 years ago

jchodera commented 10 years ago

It would be useful to collect some discussion about the appropriate simulation procedures for the liquid mixtures project here and then do some tweaking of the code afterwards.

I'm looking through the current production code in use by @juliebehr: https://github.com/juliebehr/TrustButVerify/blob/liquids/trustbutverify/mixture_system.py

The current simulation protocol uses the following definitions:

N_STEPS_MIXTURES = 25000000 # 50 ns
N_EQUIL_STEPS_MIXTURES = 5000000 # 5ns
OUTPUT_FREQUENCY_MIXTURES = 499 #500
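
(For orientation, a rough unit check of these settings, assuming the 2 fs timestep implied by the "25000000 steps = 50 ns" comment; if a different timestep is in use, adjust accordingly:)

TIMESTEP = 0.002                           # ps (2 fs), inferred from "25000000 steps = 50 ns"
production_length = 25000000 * TIMESTEP    # = 50,000 ps = 50 ns
equilibration_length = 5000000 * TIMESTEP  # = 10,000 ps = 10 ns (not 5 ns, if the timestep really is 2 fs)
output_interval = 499 * TIMESTEP           # = 0.998 ps, just shy of the intended 1 ps (see below)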

I'll break my comments into sections.

Total simulation length. A total simulation length of 50 ns may be overkill, but it really depends on the correlation time for the box volume and the standard deviation of the instantaneous density. This is something we can get a feel for later on.

Equilibration length. We should never, never use a fixed equilibration time and assume it is sufficient for all cases. This is a dangerous practice, since some systems may relax more quickly than others, or some preparation steps may require longer to relax to the typical equilibrium state than others. A better approach is to simply store all the data and examine it after the fact to automatically detect the region we discard to "equilibration". We should use my detectEquilibration() function from the pymbar.timeseries module to do this: https://github.com/choderalab/pymbar/blob/master/pymbar/timeseries.py#L724

I hate to be persnickety about this, but if we deviate from this recommendation, it essentially undermines my campaign to get other important simulation labs to embrace "best practices" behaviors here. I know this scheme isn't yet published, but we can put out a quick paper describing it over the next week or two.

Output frequency. Shouldn't the output frequency be 500, rather than 499, for a 1 ps output interval? Also, we really should have different output frequencies for different types of data:

There are other issues we may need to look into eventually:

kyleabeauchamp commented 10 years ago

So I agree that we want to avoid using the fixed equilibration length, but to really make your scheme widely used we need to embed it in the simulation software. It really needs to be fully automated; otherwise people won't bother. It doesn't have to be hard: we could probably implement it in OpenMM via some sort of EquilibrationReporter object that accumulates various 1D observables and runs equilibration detection on them. We could include this reporter in PyMBAR and have it conda-installable, so it wouldn't be burdensome for people to use, at least in OpenMM.
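
Roughly, I'm imagining something like the minimal sketch below, assuming the standard OpenMM custom-reporter interface (describeNextReport/report) and pymbar's detectEquilibration(); the EquilibrationReporter name and its details are hypothetical:

import numpy as np
from simtk import unit
from pymbar import timeseries

class EquilibrationReporter(object):
    """Hypothetical sketch: accumulate a 1D observable (here the instantaneous
    density) during the simulation and run equilibration detection afterwards."""

    def __init__(self, reportInterval):
        self._reportInterval = reportInterval
        self._densities = []  # instantaneous densities in g/cm^3

    def describeNextReport(self, simulation):
        steps = self._reportInterval - simulation.currentStep % self._reportInterval
        # (steps until next report, positions, velocities, forces, energies)
        return (steps, False, False, False, False)

    def report(self, simulation, state):
        # Total mass (g) over instantaneous box volume (cm^3); the mass could be cached.
        system = simulation.system
        mass_in_daltons = sum(system.getParticleMass(i).value_in_unit(unit.dalton)
                              for i in range(system.getNumParticles()))
        volume_in_nm3 = state.getPeriodicBoxVolume().value_in_unit(unit.nanometer**3)
        # 1 dalton = 1.660539e-24 g; 1 nm^3 = 1e-21 cm^3
        self._densities.append(mass_in_daltons * 1.660539e-24 / (volume_in_nm3 * 1e-21))

    def detect_equilibration(self):
        # Post-processing step: returns (t0, g, Neff) from the accumulated timeseries.
        return timeseries.detectEquilibration(np.array(self._densities))

The same pattern would work for any scalar observable; the only OpenMM-specific pieces are the reporter interface and the unit bookkeeping.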

kyleabeauchamp commented 10 years ago

Regarding the output frequency, we have already done the statistical inefficiency calculation for a few systems. A reasonable spacing between uncorrelated samples would be ~3 ps; 10 ps is probably wasting cycles. Obviously this was a preliminary calculation that we did long ago and will need to be revisited.

kyleabeauchamp commented 10 years ago

To clarify, my 3 ps guess is based on the inefficiency of the density, not the potential, so that will need to be repeated as well. I think we might as well go to 5 ps or so for now to avoid clogging our disk space.
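
(For reference, the calculation behind that guess is just something like the sketch below, using pymbar's statisticalInefficiency(); the file name and output interval are placeholders:)

import numpy
from pymbar import timeseries

# Hypothetical: instantaneous densities recorded every `interval_ps` picoseconds.
density_timeseries = numpy.loadtxt('densities.txt')
interval_ps = 0.5
g = timeseries.statisticalInefficiency(density_timeseries)  # statistical inefficiency, in samples
uncorrelated_spacing_ps = g * interval_ps                   # effective spacing between uncorrelated samples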

kyleabeauchamp commented 10 years ago

Regarding the length of the production simulations: if we implement the automated equilibration detection, we should also have a crude estimate of how long we need our production runs to be, right?

kyleabeauchamp commented 10 years ago

Why do we need a density reporter if we already have a StateReporter that can be used to calculate the density?

kyleabeauchamp commented 10 years ago

Regarding precision modes: the default is Mixed, correct? IMHO we should not even be considering single, right?

kyleabeauchamp commented 10 years ago

I think it makes sense to triage this list in order of how quickly each task can be implemented. Below is my interpretation:

  1. Change the output frequency of coordinates to, say, 5 ps (a compromise)
  2. Benchmark the speed of the density reporter to optimize the density output frequency without hurting performance (transferring the State from the GPU can be slow)
  3. Evaluate PME settings. Via reweighting?
  4. Evaluate precision modes. I think
  5. Implement a Reporter-based equilibration detection module for OpenMM?

jchodera commented 10 years ago

So I agree that we want to avoid using the fixed equilibration length, but to really make your scheme widely used we need to embed it in the simulation software. It really needs to be fully automated; otherwise people won't bother. It doesn't have to be hard: we could probably implement it in OpenMM via some sort of EquilibrationReporter object that accumulates various 1D observables and runs equilibration detection on them. We could include this reporter in PyMBAR and have it conda-installable, so it wouldn't be burdensome for people to use, at least in OpenMM.

This isn't quite appropriate or practical, since the equilibration detection is a purely post-processing method. If we're interested in the density, we simply store the entire density timeseries during the simulation at some convenient sampling interval and then use the pymbar.timeseries.detectEquilibration() method to extract the amount of data to discard to equilibration and the statistical inefficiency of the remainder. Usage is like this:

import numpy
from pymbar import timeseries
# Detect equilibrated region and compute statistical inefficiency of equilibrated region.
[t0, g, Neff] = timeseries.detectEquilibration(density_timeseries)
# Discard initial data to equilibration.
density_timeseries = density_timeseries[t0:]
# Compute expectation and statistical error estimate.
density_mean = density_timeseries.mean()
density_mean_stderr = density_timeseries.std() / numpy.sqrt(Neff)

We are already conda-packaging pymbar and including it as part of Omnia. Unless we also construct an automated analysis facility for the OpenMM app, there's no tighter way to integrate it, other than perhaps to create a single function that implements the small code block above.
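
Something like this hypothetical helper would be all it takes (the function name is just a placeholder):

import numpy
from pymbar import timeseries

def estimate_expectation(data):
    """Sketch of a single-call analysis on a 1D numpy array: discard the
    automatically detected equilibration region, then return (mean, stderr, t0, g, Neff)."""
    [t0, g, Neff] = timeseries.detectEquilibration(data)
    production = data[t0:]
    return production.mean(), production.std() / numpy.sqrt(Neff), t0, g, Neff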

Separately, I am lobbying people like Schrödinger to do this automatically.

And writing a short 2-4 page paper soon would be very useful as well.

jchodera commented 10 years ago

Regarding the output frequency, we have already done the statistical inefficiency calculation for a few systems. A reasonable spacing between uncorrelated samples would be ~3 ps; 10 ps is probably wasting cycles. Obviously this was a preliminary calculation that we did long ago and will need to be revisited.

The proper way to do this would be to use a frequent sampling interval for densities (e.g. 0.1 ps), since the storage requirements are minimal, and then compute the correlation times and statistical inefficiencies of this property for all mixtures in our set when reporting them in the first paper.

We don't even really need to record positions for this first paper characterizing deficiencies in mixture densities, but storing some (e.g. at 10 ps intervals) could be useful as preliminary work for a second paper focused on using reweighting to optimize parameters in a Bayesian framework. But we don't need to generate TB of data unnecessarily at this stage.
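
As a sketch of what that could look like with the app's reporters (assuming a 2 fs timestep, so 50 steps = 0.1 ps and 5000 steps = 10 ps, and an existing app.Simulation object; file names are placeholders):

from simtk.openmm import app

# Instantaneous density (plus time) every 0.1 ps; this is just one number per report.
simulation.reporters.append(app.StateDataReporter('density.csv', 50, time=True, density=True))

# Positions only every 10 ps, and only if we decide we want them at all.
simulation.reporters.append(app.DCDReporter('trajectory.dcd', 5000))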

jchodera commented 10 years ago

Regarding precision modes: the default is Mixed, correct? IMHO we should not even be considering single, right?

I'm not certain. We should check this. I'm not convinced this works properly on all hardware, however, and I have found the precision improvement provided to be unexpectedly poor or perhaps even broken. See, for example: https://github.com/SimTk/openmm/issues/441

jchodera commented 10 years ago

Why do we need a density reporter if we already have a StateReporter that can be used to calculate the density?

Which reporter is this? Can you point me to the code?

kyleabeauchamp commented 10 years ago

https://github.com/SimTk/openmm/blob/master/wrappers/python/simtk/openmm/app/statedatareporter.py#L71

jchodera commented 10 years ago

Regarding the length of the production simulations: if we implement the automated equilibration detection, we should also have a crude estimate of how long we need our production runs to be, right?

Not quite. It's more complex than this.

Our simulation needs to be T_equil + T_prod, where T_equil is long enough to reach the "typical equilibrium region" and T_prod depends on the statistical inefficiency once in the equilibrium region, the variance of the property of interest (in this case the instantaneous density), and the desired target statistical error for the estimate.

What are we aiming for in terms of target statistical error? Since densities are ~1 g/cm3, I imagine we want better than 0.1% error, perhaps 0.01% error (0.0001 g/cm3)?

kyleabeauchamp commented 10 years ago

I think we can keep a loose target (0.3%?) until we fix all the other bugs in our pipeline. I imagine we want to do things iteratively: run the entire benchmark, look for bugs, fix bugs, lower our tolerance, repeat...

jchodera commented 10 years ago

To clarify, my 3 ps guess is based on the inefficiency of the density, not the potential, so that will need to be repeated as well. I think we might as well go to 5 ps or so for now to avoid clogging our disk space.

Note that if we're computing average densities, the statistical inefficiency of the density, not the potential energy, is the relevant quantity.

Since the instantaneous density is just a single number per snapshot, there is no reason not to write it out frequently (e.g. every 0.1 or 0.25 ps) if doing so does not slow down the simulation.

kyleabeauchamp commented 10 years ago

Right, but if we're seeking to do reweighting based on the potential energy (or its components), then those would be relevant as well. Regardless, that's a future battle, so we can ignore it for now.

jchodera commented 10 years ago

I think it makes sense to triage this list in order of how quickly each task can be implemented.

Here's what I think we should do:

Simple, quick changes:

import numpy
from pymbar import timeseries
# Detect equilibrated region and compute statistical inefficiency of equilibrated region.
[t0, g, Neff] = timeseries.detectEquilibration(density_timeseries)
# Discard initial data to equilibration.
density_timeseries = density_timeseries[t0:]
# Compute expectation and statistical error estimate.
density_mean = density_timeseries.mean()
density_stddev = density_timeseries.std()
density_mean_stderr = density_timeseries.std() / numpy.sqrt(Neff)

After this, but before first paper:

Longer term:

jchodera commented 10 years ago

I think we can keep a loose target (0.3%?) until we fix all the other bugs in our pipeline. I imagine we want to do things iteratively: run the entire benchmark, look for bugs, fix bugs, lower our tolerance, repeat...

The DM40 density meter has an accuracy of 0.0001 g/cm3. I imagine we'd want to take a small but diverse set and see what it takes to get comparable statistical errors. From the data we collect on equilibration times, statistical inefficiencies, and standard deviations of the instantaneous density, we will have a good idea of the simulation lengths required. Then we can scale up.

Without data on these properties (equilibration times, statistical inefficiencies, and standard deviations of the instantaneous density), we have no way of estimating how long we need to run. So let's get this info first.

jchodera commented 10 years ago

For reference, here's how we would estimate required simulation times:

T = Tequil + Tprod
Tequil is the expected equilibration time required
Tprod = g * (stddev / stderr)^2

where stddev is the standard deviation of the property being averaged (the instantaneous density), stderr is the desired statistical precision, and g is the statistical inefficiency of that property in time units. This follows from stderr = stddev / sqrt(Neff) with Neff = Tprod / g effectively uncorrelated samples.
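
As a purely illustrative example with made-up numbers:

# Hypothetical values for one mixture, just to show the arithmetic:
stddev = 0.01     # g/cm3, standard deviation of the instantaneous density
g = 3.0           # ps, statistical inefficiency of the density
stderr = 0.0001   # g/cm3, target statistical error (DM40-level accuracy)
Tprod = g * (stddev / stderr)**2   # = 30,000 ps = 30 ns of production data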