choderalab / pymbar

Python implementation of the multistate Bennett acceptance ratio (MBAR)
http://pymbar.readthedocs.io
MIT License

Question about free energy uncertainties #304

Open harlor opened 6 years ago

harlor commented 6 years ago

I noticed that some people (for example: https://github.com/MobleyLab/alchemical-analysis/blob/37b21ebb6d0495990c43595d9ade10c779c3ab08/alchemical_analysis/alchemical_analysis.py#L650) just combine the uncertainties of the free energy differences between intermediate states in quadrature (the square root of the sum of squares) to obtain the uncertainty of the total free energy difference. This requires the free energy differences to be uncorrelated, correct?

But is this really the case? We know that the covariance matrix can have nonzero off-diagonal elements (Eq. 12 of https://aip.scitation.org/doi/10.1063/1.2978177).

In my random example this makes a huge difference in the uncertainty of the free energy difference:

using Eq. 12, np.sqrt(Theta[0, 0] - 2*Theta[0, -1] + Theta[-1, -1]): err: 0.072118

using the sum of squares (as in alchemical-analysis): err: 0.032283
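
For concreteness, here is a minimal, self-contained sketch of how Theta can be obtained and the two estimates compared. It uses toy Gaussian data and the pymbar 3 call signature, so treat it as illustrative rather than as my actual setup:

import numpy as np
from pymbar import MBAR

# Toy data: K Gaussian states that differ only in their offsets.
# u_kn[k, n] is the reduced potential of sample n evaluated in state k.
rng = np.random.default_rng(1)
K, n_per_state = 5, 1000
offsets = np.linspace(0.0, 2.0, K)
x = np.concatenate([rng.normal(mu, 1.0, n_per_state) for mu in offsets])
u_kn = 0.5 * (x[None, :] - offsets[:, None]) ** 2
N_k = np.full(K, n_per_state)

mbar = MBAR(u_kn, N_k)
# return_theta=True also returns the covariance matrix Theta (pymbar 3 API)
Deltaf_ij, dDeltaf_ij, Theta = mbar.getFreeEnergyDifferences(return_theta=True)

# Eq. 12: full-covariance uncertainty of the end-to-end difference
err_full = np.sqrt(Theta[0, 0] - 2 * Theta[0, -1] + Theta[-1, -1])
# Sum of squares over sequential pairs, as in alchemical-analysis
err_pairs = np.sqrt(sum(Theta[i, i] - 2 * Theta[i, i + 1] + Theta[i + 1, i + 1]
                        for i in range(K - 1)))
print(err_full, err_pairs, dDeltaf_ij[0, -1])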

harlor commented 6 years ago

Btw, I can reproduce the alchemical-analysis value using:

# Sum of squared uncertainties over sequential pairs of states.
# Theta is the MBAR covariance matrix, K the number of states,
# and B a conversion factor defined elsewhere in my script.
var = 0.0
for i in range(K - 1):
    var += Theta[i, i] - 2 * Theta[i, i + 1] + Theta[i + 1, i + 1]
print('err2: %f' % (np.sqrt(var) / B))

err2: 0.032283

jchodera commented 6 years ago

You are correct that using the sum of squares of uncertainties between sequential pairs of intermediate states requires the data used to estimate those differences to be uncorrelated! Since practitioners typically use the data from an intermediate state i to estimate both the (i-1, i) and (i, i+1) free energy differences, this condition is generally violated, and the uncertainty can be underestimated.

Pymbar should give a superior estimate of the uncertainty in this case because, as you point out, it accounts for the covariance not only between sequential pairs of free energy differences, but among the free energy estimates obtained from all of the data jointly.
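
To make the dropped terms explicit: the end-to-end difference telescopes as f_{K-1} - f_0 = sum_i (f_{i+1} - f_i), so its variance is the sum of the sequential-pair variances plus twice the cross-covariances between pairs, and the sum-of-squares recipe silently discards the latter. A small sketch (with a random symmetric positive-definite matrix standing in for Theta, so it is purely illustrative):

import numpy as np

# Stand-in for Theta, the MBAR covariance matrix of the free energies f_i
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Theta = A @ A.T
K = Theta.shape[0]

def cov_pair(i, j):
    # Cov(f_{i+1} - f_i, f_{j+1} - f_j) read off from Theta
    return Theta[i + 1, j + 1] - Theta[i + 1, j] - Theta[i, j + 1] + Theta[i, j]

# What the sum-of-squares recipe keeps: variances of sequential differences
var_pairs = sum(cov_pair(i, i) for i in range(K - 1))
# What it drops: cross-covariances between different sequential differences
cross = 2.0 * sum(cov_pair(i, j)
                  for i in range(K - 1) for j in range(i + 1, K - 1))
# Together they recover the full end-to-end variance from Eq. 12
assert np.isclose(var_pairs + cross,
                  Theta[0, 0] - 2 * Theta[0, -1] + Theta[-1, -1])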

harlor commented 6 years ago

Thanks for your fast reply! You convinced me that it is a good idea to change this.

SmithUoG commented 5 years ago

If you have the computational resources available, I think a preferable way to calculate the uncertainty (precision) of a particular quantity z obtained using an approximate algorithm is to treat the algorithm the same way that experimentalists treat their experimental apparatus. This means performing replicate independent "experiments" and taking the final result to be the mean value, with its standard deviation as the measure of the uncertainty/precision. An "independent run" is one that, for example, uses different random number seeds wherever they are invoked in the algorithm.

So I would rather calculate uncertainties this way (for whatever method I'm using, whether it be BAR, MBAR, TI, ...) and report that value; a sketch of the calculation follows below. In our lab, we typically perform 10 such independent runs for a given quantity and report the uncertainty (the value 10 is both convenient and makes the t distribution a reasonable stand-in for the normal distribution). This applies to the calculation of free energy quantities, as well as to PVT properties (as a replacement for block averaging).
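
As a minimal sketch of the arithmetic (the replicate values below are hypothetical, purely to illustrate the procedure):

import numpy as np
from scipy import stats

# Hypothetical Delta G values (kJ/mol) from 10 independent replicate runs,
# each started with a different random number seed.
dG = np.array([-10.12, -10.31, -10.05, -10.22, -10.18,
               -10.40, -10.09, -10.27, -10.15, -10.33])
n = len(dG)
mean = dG.mean()
sem = dG.std(ddof=1) / np.sqrt(n)  # standard error of the mean
# 95% confidence half-width from the t distribution with n - 1 = 9 dof
half_width = stats.t.ppf(0.975, df=n - 1) * sem
print('Delta G = %.2f +/- %.2f kJ/mol (95%% CI, n = %d)' % (mean, half_width, n))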

As an example, we calculated (at 1 bar and 298.15 K) 10 independent replicate Delta G(hyd) values for MEA in TIP3P water using the BAR method, and obtained an uncertainty of 0.09 kJ/mol. The BAR-reported uncertainties of the individual runs ranged from 0.04 to 0.10 kJ/mol.

For some discussion of this and related issues, see A. Nicholls, J Comput Aided Mol Des (2014) 28:887–918, and (2016) 30:103–126.

William R. Smith, Un. of Guelph