alchemistry / alchemlyb

the simple alchemistry library
https://alchemlyb.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Bootstrapping: should it be at the alchemlyb level, or pymbar level? #80

Closed. mrshirts closed this issue 1 year ago.

mrshirts commented 5 years ago

Should bootstrapping be implemented at the alchemlyb level, or pymbar level? For MBAR, it would be better at the pymbar level, since it can be easily encapsulated (user doesn't have to worry about it), and one can request either uncertainty estimate.

For BAR over several states, the bootstrapping needs to be done at the level ABOVE the BAR call, since we need to bootstrap all of the data simultaneously before feeding it into BAR. The same holds for EXP applied to a string of states.

Thoughts?

orbeckst commented 5 years ago

If there's a common function that we can use for "everything" (and just plug in the estimator), then alchemlyb would be a good place, I think, since then we only have to write and test the code once.

I generally like building blocks that I can freely combine. Something like

def bootstrapped(data, estimator):
    ...
    return mean, error

Alternatively, we could hide the machinery in the Estimator.fit() function: enable it with fit(..., ..., bootstrap=True), do whatever needs to be done under the hood, and set the appropriate output attributes. In this case, one might use a pymbar bootstrap directly if available, or roll our own for other estimators.
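A minimal sketch of that second option, with a toy estimator whose point estimate is just a sample mean; the class, the _fit() helper, and the attribute names here are illustrative stand-ins, not the actual alchemlyb API:

import numpy as np
import pandas as pd

class BootstrapEstimator:
    """Toy estimator: the point estimate is just the sample mean."""

    def _fit(self, data):
        # stand-in for the real estimation step
        return float(data.mean())

    def fit(self, data, bootstrap=False, n_boot=100, seed=None):
        self.delta_f_ = self._fit(data)  # ordinary point estimate
        if bootstrap:
            rng = np.random.default_rng(seed)
            replicas = []
            for _ in range(n_boot):
                # resample rows with replacement and refit
                rows = rng.integers(0, len(data), size=len(data))
                replicas.append(self._fit(data.iloc[rows]))
            # bootstrap standard deviation as the uncertainty estimate
            self.d_delta_f_ = float(np.std(replicas, ddof=1))
        return self

est = BootstrapEstimator().fit(pd.Series(np.random.normal(size=500)),
                               bootstrap=True, seed=42)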

One advantage of doing it at the alchemlyb level is that it might not be difficult to run alchemlyb with dask (essentially, use a dask DataFrame), and then the bootstrapping can be parallelized without effort. A while ago @dotsdl played around with alchemlyb and dask; I can't quite remember how much would need to be changed.
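To illustrate why bootstrap replicates parallelize so easily, a sketch using dask.delayed; the data and estimate here are placeholder stand-ins, not alchemlyb objects:

import numpy as np
import dask

data = np.random.normal(size=1000)  # stand-in samples
estimate = np.mean                   # stand-in estimator

def one_replicate(seed):
    # each replicate resamples independently, so dask can schedule them freely
    rng = np.random.default_rng(seed)
    return estimate(data[rng.integers(0, len(data), size=len(data))])

tasks = [dask.delayed(one_replicate)(seed) for seed in range(100)]
replicas = dask.compute(*tasks)      # run replicates in parallel
error = np.std(replicas, ddof=1)     # bootstrap standard deviation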

mrshirts commented 5 years ago

def bootstrapped(data, estimator):

That might be a good idea. It could return a dictionary of all the bootstrapped results, along with the uncertainty estimate. I'll think about how to organize this.

One issue is that the data will look different for each estimator, thus requiring fairly different conditionals inside the bootstrapped function. Also, if one were analyzing K states, calculating the free energy with BAR executed pairwise, one would want to bootstrap over the entire data set of K states; i.e., you would need to bootstrap the entire procedure, not a single estimator.
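To make that concrete, a rough sketch of bootstrapping the whole pairwise chain; bar_estimate is a hypothetical function returning the free-energy difference between two adjacent states, not pymbar's actual interface:

import numpy as np

def bootstrap_pairwise(samples, bar_estimate, n_boot=200, seed=None):
    """samples: a list of K arrays, one per lambda state."""
    rng = np.random.default_rng(seed)
    totals = []
    for _ in range(n_boot):
        # jointly resample ALL K states before any analysis
        boot = [s[rng.integers(0, len(s), size=len(s))] for s in samples]
        # then run the full pairwise chain on the resampled data
        totals.append(sum(bar_estimate(boot[k], boot[k + 1])
                          for k in range(len(boot) - 1)))
    return np.mean(totals), np.std(totals, ddof=1)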

dotsdl commented 5 years ago

@mrshirts do you have a paper or writeup you can point to for this approach? I'd be happy to prototype something. We may be able to steal design inspiration from scikit-learn, such as its GridSearchCV and Pipeline objects.

mrshirts commented 5 years ago

So, I don't really have a good simple paper. http://www.alchemistry.org/wiki/Analyzing_Simulation_Results#Bootstrap_Sampling is a good summary.

I agree that something like the sklearn approach, which carries out some algorithm given an input function, would work here. In the case of bootstrapping, the metamethod takes some function of N arrays, each of which is a set of samples from an independent distribution (or an independent draw of the same distribution). Say, for example, there are N sets of samples, one at each lambda state; for each bootstrap iteration you would want to resample all N sets (i.e., always bootstrap over ALL input data before any analysis).

After the bootstrap sampling with replacement, everything else is pretty trivial. You calculate your function on each of the bootstrapped data sets. You then have a set of results (which could be a multivalued return), and you can simply return a list of all the answers. You can optionally return various statistical measures of this list for each of the results: mean, standard deviation, confidence intervals.

One could make decorrelation of the data sets part of the algorithm, but it would perhaps be more modular to do the decorrelation as a separate step.
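A sketch of that metamethod as just described (names illustrative): jointly resample all N input arrays each iteration, collect the per-iteration answers, and report summary statistics alongside the raw list.

import numpy as np

def bootstrap(func, arrays, n_boot=200, seed=None):
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_boot):
        # always resample EVERY input array before any analysis
        boot = [a[rng.integers(0, len(a), size=len(a))] for a in arrays]
        results.append(func(*boot))
    results = np.asarray(results)
    return {
        "results": results,                      # all bootstrap answers
        "mean": results.mean(axis=0),
        "std": results.std(axis=0, ddof=1),      # bootstrap uncertainty
        "ci95": np.percentile(results, [2.5, 97.5], axis=0),
    }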

wildromi commented 4 years ago

Dear alchemlyb team! It is quite awesome to have this library. We will use it in a publication I am currently preparing. For this it is crucial to have an error measure, and a built-in function that carries out bootstrapping and returns the bootstrap standard deviation would be best. Maybe one could also access the bootstrap data sets used to calculate the standard deviation, but that is not too important. I cannot code, but I would gladly test anything you might have. Thank you so much! Romina

dotsdl commented 4 years ago

Hey all, after discussions with @wildromi, I've committed to working on this issue over the next two weeks. I expect the first iteration to be usable but probably not the approach we end up with. I'll post a WIP PR as soon as I can.

mrshirts commented 4 years ago

Hi, David- I'd love to talk some more about this, as I've been dealing with similar setups for a while. Shoot me an email at my CU address and we can strategize some more. A key point is bootstrapping simultaneously over multiple time series, for example.

dotsdl commented 4 years ago

@mrshirts sent! I'm looking forward to leveraging your experience to jumpstart the approach.

mrshirts commented 4 years ago

@dotsdl: take a look at https://github.com/choderalab/pymbar/blob/pmf/pymbar/pmf.py, lines 590 to 615, to get a sense of how bootstrapping works in a complicated case (here, calculating a potential of mean force).

dotsdl commented 4 years ago

I met with @mrshirts yesterday, and we aligned on an approach. I have started a WIP PR in #94. There is still a list of things to do, but we have the start of our implementation. You can check out how things work so far in this gist. Comments welcome!

Please don't use this in production work until we have tests ensuring that the bootstrap is working as expected. It appears to work from my ad hoc tests, so at most give it a try on the feature-bootstrapping branch.

dotsdl commented 4 years ago

The gist for #94 has been updated; it requires components of #98, which can be played with on this branch.