Options to do bootstrapping

mrshirts commented 6 years ago

Laying out the issues for including bootstrapping in MBAR. There's essentially thee

One relatively easy thing to do is collect bootstrap samples within init. This can generate a set of free energies that can then be processed by the individual routines to generate various statistics (standard deviations, confidence intervals, etc). Storing Nboots x K (where K is the number of nonzero states) should not generally be a big issue.

However, this just generates the distribution of free energies alternate free energies. We also want to be able to bootstrap expectations and potentials of mean forces, as well as perturbed free energies.

We could just run the bootstrapping every single time we are interested in new expectations, but that seems wasteful and expensive if we generate a lot of them.

If they are observables with NEW samples, we should just bootstrap everything over again. But I don't think we support that currently, so we can assume that the samples are fixed between calls to init (aside: do we actually enforce that?)

If they are observables with no new sampled states, we should not need to regenerate the bootstrapped free energies, as they will not be affected. But we would need to save the random numbers, so that we know which energies are in the samples.

However, that is memory intensive, as we need Nboots x Ntotal number of samples, which could be quite large, frequently larger that the u_kn matrix itself (Whenever nboots > K).

Option for making it repeatable is to save the state of the random number generator right before generating the bootstrap samples; then one can regenerate the same indices later, and redo the same bootstrapping for the perturbed free energies and expectations.

This approach seems like it might work, then:

Initialize bootstrap free energes in init. Save the random state.
When computing statistics, used the ensemble of bootstraps (size KxNboots, shouldn't be too big) to compute the statistics in the associated call to MBAR.
When computing expectations, regenerate the indices that specify the ensemble of bootstrap samples, and bootstrap the expectations with the corresponding free energies.

Does this sound reasonable?

mrshirts commented 6 years ago

Starting to work on this in #302

jchodera commented 6 years ago

This sounds like a reasonable approach, except instead of generating the bootstrap samples within __init__, I'd use lazy instantiation and generate them only when needed. The reason for this is that there are use cases where we just want to get a rapid estimate of the f_k for some other purpose, and may not need the uncertainties at that time.

You can create some private methods to retrieve either the bootstrapped f_k or sample indices as @property decorated methods, such as self._bootstrap_f_k and self._bootstrap_indices. These can call a bootstrap method like self._generate_bootstrap_indices() if they haven't already been generated and cached.

mrshirts commented 6 years ago

Sounds good. I'll get this working on expectations first, and then try this. Note I don't think we want to store the bootstrap indices, since they are Nxnbootstraps, which is likely usually bigger than u_kn. But if can store the random seed state, then we can regenerate them very quickly.

jchodera commented 6 years ago

Good point. The main idea was just too use lazy instantiation so bootstrapping isn't done if it isn't needed.

One other concern might be serialization: if the random number sequence isn't reproducible on different hardware or OS versions, we want to regenerate the bootstrap samples if the object is unpickled, or just not cache the bootstrap samples when pickled. I think that means you may need to remove some items from the object dictionary when getstate is called.

mrshirts commented 6 years ago

One other concern might be serialization: if the random number sequence isn't reproducible on different hardware or OS versions

Correct, this could be a problem. MBAR doesn't currently pickle anything, that I'm aware of, so we could just document that to warn people in the future.

1) Don't actually call bootstrapping on init, wait until it is actually requested. 2) Save the free energy ensemble and the random number state into the MBAR object, which should make it possible to calculate additional observables w/o resolving all the free energies. 3) any serialization should unflag the bootstrapping from being done. So that it would get regenerated (though it's not clear where).

jchodera commented 6 years ago

For (3), you'd just need something like this:

    def __getstate__(self):
        # Copy the object's state from self.__dict__ which contains
        # all our instance attributes. Always use the dict.copy()
        # method to avoid modifying the original state.
        state = self.__dict__.copy()
        # Remove the bootstrap cache (or whatever it's called)
        del state['_bootstrap_cache']
        return state

mrshirts commented 6 years ago

OK, looks like I was wrong. At certain points, we have to have the bootstrapped log weights for each sample, which involves either 1) storing copies of the log weights (nstates x nsamples) or 2) storing copies of the random integers generated (also nstates x nsamples), since we need the random integers to select out the correct copies of the u_kn matrix

If that's the case (and I don't discover any other issues), I probably will try to go with the storing the log weights themselves, since it's a bit faster and simpler to store that extra data rather than storing the integers and recomputing. If we use a standard recommendation of 200 bootstraps, that would result in needing to store 200 x nstates x nsamples numbers, which might in some unusual but not crazy situations needed 200 x 200 states x 10,000 samples/state = 400,000,000 doubles, which is 3.2 GB memory. Which could cause problems sometimes for large problems. Storing the random ints would save 50% space (32 bit vs 64 bit), but would still fail at problems the same order of size.

What we could do is make caching the intermediate calculations the default, so it only needed to be done once, with an option to switch to regeneration for each calculation, which would be much less memory intensive (though more computationally intensive), and have recommendations / documentation about this.

mrshirts commented 6 years ago

One other thought about bootstrap: there are some calculations for which this type of bootstrapping won't work. For example, if someone is doing BAR consecutively among states (1->2, 2->3, 3->4) then one would need to bootstrap the ENTIRE process, not the individual steps (because the calculations are correlated). There is no possible way (that I can think of) to enforce this in a calculation, as people are free to do however complicated a process as possible. That can only really be fought with good documentation and examples, or through other tools, like alchemlyb that use MBAR.

choderalab / pymbar

Options to do bootstrapping #297