choderalab / pymbar

Python implementation of the multistate Bennett acceptance ratio (MBAR)
http://pymbar.readthedocs.io
MIT License

Should Subsampling be Recommended? #545


fjclark commented 3 days ago

My understanding is: subsampling is recommended so that Equation 4.2 of Kong et al., 2003 , which is derived for uncorrelated samples, can be used to estimate the variance. However, subsampling increases the variance. It seems unintuitive to increase the variance so that it can be better estimated. Would it not be better to minimise the variance by retaining all samples, and use a variance estimator which directly accounts for autocorrelation?
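For a single observable, a minimal sketch of the kind of estimator I mean, which inflates the variance of the mean by the statistical inefficiency $g$ instead of discarding samples (this assumes the pymbar 4 names `timeseries.statistical_inefficiency` and `testsystems.correlated_timeseries_example`; in pymbar 3 the former is `statisticalInefficiency`):

```python
import numpy as np
from pymbar import timeseries
from pymbar.testsystems import correlated_timeseries_example

# Correlated Gaussian series with autocorrelation ~ exp(-t / tau).
n, tau = 100_000, 10.0
a_n = correlated_timeseries_example(N=n, tau=tau, seed=1)

g = timeseries.statistical_inefficiency(a_n)      # g = 1 + 2 tau_int, ~ 2 tau here
sem_naive = a_n.std(ddof=1) / np.sqrt(n)          # pretends samples are uncorrelated
sem_corrected = a_n.std(ddof=1) / np.sqrt(n / g)  # uses N_eff = N / g instead
print(f"g = {g:.1f}, naive SEM = {sem_naive:.2e}, corrected SEM = {sem_corrected:.2e}")
```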

The Issue: Subsampling Increases the Variance

Geyer, 1992 (Section 3.6) discusses subsampling. He points out that subsampling a chain can only increase the variance of the resulting estimates, unless the cost of using the samples is comparable to the cost of generating them. I'm assuming that the cost of using samples is generally negligible compared to the cost of generating them.

The increase in variance caused by subsampling seems to be shown, for example, in Table III of Tan, 2012, where the variances of the MBAR/UWHAM estimates increase after subsampling (the variances without subsampling are calculated using block bootstrapping).

Possible Solutions: Directly Accounting for Autocorrelation in the Variance Estimates

To account for autocorrelation in the variance estimates without subsampling, block bootstrapping could be used, with the block size selected according to the procedure of Politis and White, 2004 (and correction), for example. However, I understand that fast analytical estimates may be preferred to avoid repeated MBAR evaluations. Could the analytical estimates from Geyer, 1994, or Li et al., 2023 be used?
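For concreteness, a moving-block bootstrap over MBAR could look something like the sketch below. `block_bootstrap_mbar` is a hypothetical helper, the pymbar 4 API (`MBAR`, `compute_free_energy_differences`) is assumed, and the block size is taken as given rather than selected by the Politis and White procedure:

```python
# Sketch of a moving-block bootstrap SD for an MBAR free-energy difference.
# u_kn holds reduced potentials with samples grouped by origin state,
# N_k the number of samples from each state; assumes block_size <= min(N_k).
import numpy as np
from pymbar import MBAR

def block_bootstrap_mbar(u_kn, N_k, block_size, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    starts = np.concatenate(([0], np.cumsum(N_k)[:-1]))  # first column of each state
    estimates = []
    for _ in range(n_boot):
        cols = []
        for start, n in zip(starts, N_k):
            n_blocks = int(np.ceil(n / block_size))
            # Resample contiguous blocks within this state's segment of columns.
            block_starts = rng.integers(0, n - block_size + 1, size=n_blocks)
            idx = (block_starts[:, None] + np.arange(block_size)).ravel()[:n]
            cols.append(start + idx)
        mbar = MBAR(u_kn[:, np.concatenate(cols)], N_k)
        estimates.append(mbar.compute_free_energy_differences()["Delta_f"][0, -1])
    return np.std(estimates, ddof=1)  # bootstrap SD of Delta_f between end states
```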

Why This May Be Irrelevant

I'm biased by the fact that I work with ABFE calculations and regularly feed MBAR highly correlated data that are aggressively subsampled, sometimes producing unreliable estimates (which are reasonable without subsampling). I understand that for most applications relatively few samples will be discarded and any increase in uncertainty may be small.

It would be great to hear some thoughts on this, or to be corrected if I am misunderstanding.

Thanks!

mrshirts commented 3 days ago

I am extremely interested in this and think we need to examine it. I suspect block bootstrapping is going to be the best way to do this. I have seen issues at small sample numbers with most analytical estimates, but I would like to investigate the two you list a bit more. I am going to be tied up for the next 10 days or so (and then digging out afterwards), but this is very interesting to me, and I would love to talk more.

I'd love to get a quantitative estimate of the additional errors introduced by subsampling. I would say in MBAR the biggest issue is poor estimation of the correlation time leading to too-aggressive subsampling (another thing we have talked about), but the increase in variance remains even if the autocorrelation time is estimated correctly. We should think of the right experiments to show this (perhaps artificial data generated with an autoregressive model, so that we know the autocorrelation time exactly).
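For reference, an AR(1) process gives exactly that: its autocorrelation function is known in closed form, so the target statistical inefficiency is known exactly. A minimal sketch (my notation, not pymbar code):

```python
# AR(1): x_t = phi * x_{t-1} + sqrt(1 - phi**2) * eps_t, eps_t ~ N(0, 1),
# has autocorrelation rho(t) = phi**t. Choosing phi = exp(-1/tau) gives
# rho(t) = exp(-t/tau), so the exact statistical inefficiency is known:
import numpy as np

tau = 10.0
phi = np.exp(-1.0 / tau)
g_exact = (1.0 + phi) / (1.0 - phi)  # g = 1 + 2 * sum_t phi**t, ~ 2 tau for large tau
print(g_exact)                       # any estimator of g can be checked against this
```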

fjclark commented 1 day ago

Thanks for your interest.

I've played with a toy model: 4 harmonic oscillators with the same force constant and equal spacing, so that all free energy changes are 0 for simplicity. I've modified HarmonicOscillatorsTestCase so that correlated time series are generated for each oscillator, with exponentially decaying correlation of specified $\tau$ introduced using the correlated_timeseries_example function.
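Roughly, the sampling looks like the sketch below; `sample_oscillators` is a hypothetical stand-in for the modified test case, and I'm assuming `correlated_timeseries_example` returns a zero-mean, unit-variance Gaussian series with autocorrelation $e^{-t/\tau}$:

```python
# Sketch: K equally spaced harmonic oscillators with identical force
# constants (all true free energy differences are zero), each sampled as a
# correlated time series. sample_oscillators is a hypothetical helper.
import numpy as np
from pymbar.testsystems import correlated_timeseries_example

def sample_oscillators(K=4, N=1000, tau=5.0, spacing=1.0, beta=1.0, k_force=1.0):
    sigma = 1.0 / np.sqrt(beta * k_force)  # stddev of each oscillator
    mu = spacing * np.arange(K)            # equally spaced minima
    x_kn = np.empty((K, N))
    for k in range(K):
        z = correlated_timeseries_example(N=N, tau=tau)  # correlated N(0, 1) draws
        x_kn[k] = mu[k] + sigma * z
    # Reduced potential of every sample evaluated in every state, with
    # samples ordered by origin state as pymbar expects.
    x_n = x_kn.ravel()
    u_kn = 0.5 * beta * k_force * (x_n[None, :] - mu[:, None]) ** 2
    return u_kn, np.full(K, N)
```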

For varying $\tau$, I have tested estimating the free energy changes and uncertainties both with and without subsampling.

For each $\tau$ and each protocol, I've run 1000 independent repeats, sampling time series of length 1000 from each of the oscillators. I've computed the "ensemble SD" as the standard deviation of the free energy estimates over the independent repeats, and the "mean estimated SD" as the mean of the analytical uncertainty estimates:

[Figure: ensemble SD and mean estimated SD against $\tau$, with and without subsampling]
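Each point on these curves could be computed along the following lines (a sketch reusing the hypothetical `sample_oscillators` helper above and the pymbar 4 API):

```python
# "Ensemble SD": SD of the free-energy estimates across independent repeats.
# "Mean estimated SD": mean of MBAR's analytical uncertainties over repeats.
import numpy as np
from pymbar import MBAR

def ensemble_vs_estimated_sd(tau, n_repeats=1000, n_samples=1000):
    df, ddf = [], []
    for _ in range(n_repeats):
        u_kn, N_k = sample_oscillators(K=4, N=n_samples, tau=tau)
        results = MBAR(u_kn, N_k).compute_free_energy_differences()
        df.append(results["Delta_f"][0, -1])    # estimate of a true-zero free energy
        ddf.append(results["dDelta_f"][0, -1])  # its analytical uncertainty
    return np.std(df, ddof=1), np.mean(ddf)     # (ensemble SD, mean estimated SD)
```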

The figure shows:

  1. As expected, the mean estimated SD with no subsampling does not change with increasing autocorrelation, only agrees with the true ensemble SD when $\tau \approx 0$, and becomes a large underestimate as $\tau$ increases.
  2. The true ensemble SD with subsampling is consistently higher than without subsampling. I would expect the relationship between the subsampled and unsubsampled variances to be:

$$\frac{\sigma^2_{\mathrm{Sub}}}{\sigma^2} = \frac{N_{\mathrm{Eff}}}{N_{\mathrm{Eff,Sub}}} = \frac{N / g}{N_{\mathrm{Sub}} / g_{\mathrm{Sub}}},$$

where $\mathrm{Eff}$ denotes an effective sample size, $g$ is a statistical inefficiency, and $\mathrm{Sub}$ denotes a quantity related to the subsampled data. $g$ is worked out from Eqn 45 in Janke, 2002. This relationship nicely predicts the empirical results:

[Figure: empirical subsampled/unsubsampled variance ratio against $\tau$, compared with the predicted ratio]
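To see where the figure in the next paragraph (~30 %) comes from, here is a sketch of the predicted ratio for exponentially decaying autocorrelation; the geometric-sum closed form $g = \coth(1/(2\tau))$ and the stride convention $s = \lceil g \rceil$ are my assumptions:

```python
# Predicted Var_sub / Var for rho(t) = exp(-t / tau).
# g = 1 + 2 * sum_{t>=1} exp(-t / tau) = coth(1 / (2 tau))  (geometric sum).
# After subsampling with stride s, rho_sub(t) = exp(-t s / tau), so
# g_sub = coth(s / (2 tau)) and Var_sub / Var = (N / g) / ((N / s) / g_sub).
import numpy as np

for tau in [2.0, 5.0, 10.0, 50.0]:
    g = 1.0 / np.tanh(1.0 / (2.0 * tau))    # statistical inefficiency
    s = np.ceil(g)                          # assumed subsampling stride
    g_sub = 1.0 / np.tanh(s / (2.0 * tau))  # residual inefficiency after subsampling
    ratio = (s / g) * g_sub                 # = N_Eff / N_Eff,Sub
    print(f"tau = {tau:5.1f}: predicted Var_sub / Var = {ratio:.2f}")
# For large tau, s -> g -> 2 tau and the ratio -> coth(1) ~ 1.31: a ~30 % increase.
```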

An increase in the variance by ~ 30 % isn't a huge price to pay for subsampling, but:

  1. The mean estimated subsampled SD is an underestimate of the true ensemble subsampled SD. This makes sense because the number of samples remaining after subsampling, $N_{\mathrm{subsampled}}$, is equivalent to the effective number of samples before subsampling, $N_{\mathrm{eff}}$, but is greater than the effective number of samples after subsampling, $N_{\mathrm{eff,subsampled}}$: $N_{\mathrm{subsampled}} = N_{\mathrm{eff}} > N_{\mathrm{eff,subsampled}}$. In other words, because subsampling does not completely remove correlation, the analytical SDs for the subsampled series underestimate the true SDs of the subsampled series.
  2. While the mean estimated subsampled SD underestimates the true ensemble subsampled SD, it reliably estimates the true ensemble non-subsampled SD. This makes sense because $N_{\mathrm{subsampled}} = N_{\mathrm{eff}}$. So, a simple default for well-behaved systems with enough samples may be: estimate the free energy change from the unsubsampled input, then estimate the uncertainty from the subsampled input using the analytical estimate (a sketch of this protocol follows this list). This avoids subsampling adding noise to the free energy estimate; the price of simplicity is that subsampling adds some noise to the uncertainty estimate. As a sanity check, I checked that the mean estimated SD of the double-subsampled inputs agreed with the true ensemble subsampled SD (see purple dashed line). The second round of subsampling was done after adjusting $\tau$ for the first round of subsampling.
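A minimal sketch of that default, assuming the pymbar 4 API (`MBAR`, `compute_free_energy_differences`, `timeseries.subsample_correlated_data`); `estimate_with_split_protocol` is a hypothetical helper, and using each state's own reduced potential as the observable for estimating $g$ is one choice among several:

```python
# Free energies from all samples; uncertainties from subsampled samples.
import numpy as np
from pymbar import MBAR, timeseries

def estimate_with_split_protocol(u_kn, N_k):
    # Free-energy estimate from the full, unsubsampled data.
    delta_f = MBAR(u_kn, N_k).compute_free_energy_differences()["Delta_f"]

    # Uncertainty estimate from subsampled data.
    starts = np.concatenate(([0], np.cumsum(N_k)[:-1]))
    keep, new_N_k = [], []
    for k, (start, n) in enumerate(zip(starts, N_k)):
        series = u_kn[k, start:start + n]  # observable used to estimate g for state k
        idx = timeseries.subsample_correlated_data(series)
        keep.append(start + np.asarray(idx))
        new_N_k.append(len(idx))
    u_sub = u_kn[:, np.concatenate(keep)]
    results = MBAR(u_sub, np.array(new_N_k)).compute_free_energy_differences()
    return delta_f, results["dDelta_f"]  # full-data Delta_f, subsampled dDelta_f
```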

Code

All code and the conda environment used: gh_subsampling_issue.tar.gz