choderalab / pymbar

Python implementation of the multistate Bennett acceptance ratio (MBAR)
http://pymbar.readthedocs.io
MIT License

Uncertainty computation methods are not documented anywhere #220

Open jchodera opened 8 years ago

jchodera commented 8 years ago

The docstring for getFreeEnergyDifferences() says of uncertainty_method:

    uncertainty_method : string, optional
        Choice of method used to compute asymptotic covariance method,
        or None to use default.  See help for computeAsymptoticCovarianceMatrix()
        for more information on various methods. (default: svd)

Several other functions say the same thing.

Unfortunately, it seems that computeAsymptoticCovarianceMatrix() has been moved to the private API, so this isn't actually documented anywhere.

I think we should stop returning Theta_ij and once again make this function part of the public API.
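
For context, here is a minimal sketch of how this argument is exercised from user code. The array shapes and the exact return convention have changed across pymbar versions, so treat the details as illustrative rather than as the library's definitive interface:

    import numpy as np
    from pymbar import MBAR

    # Toy input, for illustration only: u_kn[k, n] is the reduced potential of
    # sample n evaluated in thermodynamic state k; N_k[k] counts samples from state k.
    K, samples_per_state = 3, 50
    N_k = np.full(K, samples_per_state)
    u_kn = np.random.rand(K, N_k.sum())

    mbar = MBAR(u_kn, N_k)

    # uncertainty_method selects how the asymptotic covariance matrix (Theta_ij)
    # is estimated -- 'svd' is the default per the quoted docstring -- but the
    # accepted strings are only described in the now-private
    # computeAsymptoticCovarianceMatrix(), which is the gap this issue is about.
    results = mbar.getFreeEnergyDifferences(uncertainty_method='svd')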

mrshirts commented 8 years ago

Methods should definitely be documented better.

We can probably adjust things so that one can fetch Theta_ij using another function instead of the main computeAsymptoticCovarianceMatrix() function. I will try to take a look in the next couple of days and make some recommendations (things are piling up because of basement flooding this weekend).

jchodera commented 8 years ago

Maybe we should just start making a wishlist for pymbar 4.0 at this point? For example, we should put in the new stochastic and local schemes from Zhiqiang Tan:

Once we decide what we want, we can plow into this and get a release out in the course of a month over the summer. This will also be great training for students who are involved in that project.

mrshirts commented 8 years ago

I've implemented the stochastic method already in my GitHub clone (still playing around, so I haven't contributed it yet). The crossover point where it gets faster than the standard approach is a VERY large problem; basically, larger than can be stored in memory on a normal computer, i.e. if you can store u_kln, then the standard approach is faster than the stochastic one. The stochastic scheme paper really could have explained that more clearly, i.e. scaled up and shown explicitly the crossover point for a typical problem.

Stochastic is only better for problems where you calculate the energies at the same time as you solve the problem, so it will be hard to put it in the same framework.
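
To make the scale concrete, here is a back-of-the-envelope memory estimate (the problem sizes below are hypothetical, not taken from any benchmark in this thread): a dense u_kln array in double precision costs K*K*N_max*8 bytes, so the regime where the stochastic solver wins is one where that array simply cannot be held in memory.

    # Illustrative memory estimate for a dense u_kln array stored as float64.
    # These sizes are made up for the example; the point is how quickly K*K*N grows.
    def u_kln_gigabytes(n_states, n_samples_per_state):
        return n_states * n_states * n_samples_per_state * 8 / 1e9

    print(u_kln_gigabytes(100, 10_000))    # ~0.8 GB: fits in RAM, standard solver wins
    print(u_kln_gigabytes(1_000, 10_000))  # ~80 GB: beyond most workstations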

This will also be great training for students who are involved in that project.

I don't have anyone assigned to that right now -- Levi is helping to maintain the code, but he's working on other projects right now.

mrshirts commented 8 years ago

Maybe we should just start making a wishlist for pymbar 4.0 at this point?

Though I do think a wishlist for 4.0 would be a good idea. What I would imagine is 1) a discussion of efficiency and robustness of solutions (the paper Kyle started that is close-ish to being done), which will involve no more functionality than is in there now (maybe the stochastic method), but some code cleaning, released as 3.0, and 2) a 4.0 version released along with a general guide to reweighting (the review paper we've been planning for a while -- I'm clearing room once the semester finishes to finish that), where the list of what to add is wide open.

How does that sound?

jchodera commented 8 years ago

Can we bring in Bryce? Since forcefield parameterization is going to rely so heavily on MBAR, it's important that we have the forcefield folks up to speed on the current API, and spending a few weeks with us helping clean it up is a good way to do that.

jchodera commented 8 years ago

Though I do think a wishlist for 4.0 would be a good idea. What I would imagine is 1) a discussion of efficiency and robustness of solutions (the paper Kyle started and is close-ish to be done),

I don't see that going anywhere. Without @kyleabeauchamp pushing to get that done, and with the robustness issues we've run into, I favor rolling back to robust algorithms we know work (like the adaptive scheme) even if they are a few percent slower, keeping the u_nk mechanism, adding the stochastic solver, and cleaning up the API.

Just now I was trying to estimate free energy differences for a small problem and got a bunch of zeros out for the covariance matrix, indicating it failed to produce something numerically useful and fell back to a matrix of all zeros instead...

mrshirts commented 8 years ago

I think @bmanubay might be a good fit longer term, though I think that might be later this spring/summer, just in terms of getting up to speed. "helping clean it up" does require a bit of perspective of the past of the code, which takes some time.

mrshirts commented 8 years ago

I don't see that going anywhere. Without @kyleabeauchamp pushing to get that done, and with the robustness issues we've run into, I favor rolling back to robust algorithms we know work (like the adaptive scheme) even if they are a few percent slower, keeping the u_nk mechanism, adding the stochastic solver, and cleaning up the API.

For larger systems, the old methods are MUCH slower, not just a few percent. Levi, could you comment here? You've tried both of them.

Just now I was trying to estimate free energy differences for a small problem and got a bunch of zeros out for the covariance matrix, indicating it failed to produce something numerically useful and fell back to a matrix of all zeros instead...

Levi can definitely take a look at robustness issues; he knows the existing code quite well. Can you post an example?

The stochastic solver requires a separate code path, as I mentioned, since if you can write it as a u_nk matrix (or u_kln matrix), the standard approach is significantly faster. Stochastic is only better for problems that are so large that you have to calculate the K states on the fly.

Now, it SHOULD be included eventually, I'm just pointing out that it can't use the same API as the standard method.

mrshirts commented 8 years ago

I do need to boost this all up higher on the todo list.

jchodera commented 8 years ago

I think we're still looking for a June-ish timeframe to tackle all of this---maybe right after the June 5 NIH deadline.

mrshirts commented 8 years ago

I will try to get cleaning of existing code done well before. Extensions, yeah, will probably be after June 5.

Lnaden commented 8 years ago

For larger systems, the old methods are MUCH slower, not just a few percent. Levi, could you comment here? You've tried both of them.

I'd have to create a concrete example for a quantitative number, but from experience I can say the SciPy solver felt faster than the self-consistent iteration/Newton-Raphson approach. This was on data sets of K > 100. IIRC a self-consistent step was slower than an NR step in the adaptive method, and once the solver flipped to NR it finished rather quickly. The old adaptive method didn't cause odd numerical errors such as #198 and #212, although I think those can be fixed with #202.

Let me set up a test with a data set that has >200 states to get a quantitative number for you.
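
For readers who haven't seen it, the self-consistent step being compared above is the standard MBAR fixed-point update. Below is a simplified sketch of plain self-consistent iteration (my own rendition, not pymbar's actual solver, which also interleaves Newton-Raphson steps and keeps whichever makes more progress):

    import numpy as np
    from scipy.special import logsumexp

    def self_consistent_update(f_k, u_kn, N_k):
        # One MBAR fixed-point iteration:
        #   f_i <- -ln sum_n [ exp(-u_in) / sum_k N_k exp(f_k - u_kn) ]
        log_denom_n = logsumexp(f_k[:, None] - u_kn, b=N_k[:, None], axis=0)
        f_new = -logsumexp(-u_kn - log_denom_n, axis=1)
        return f_new - f_new[0]  # pin the arbitrary additive constant

    def solve_self_consistently(u_kn, N_k, tol=1e-10, max_iter=100_000):
        # Plain self-consistent iteration: robust but slow for large K, which is
        # why the adaptive scheme switches to Newton-Raphson steps once those
        # start making faster progress.
        f_k = np.zeros(len(N_k))
        for _ in range(max_iter):
            f_next = self_consistent_update(f_k, u_kn, N_k)
            if np.max(np.abs(f_next - f_k)) < tol:
                return f_next
            f_k = f_next
        return f_k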

Lnaden commented 8 years ago

So here is the timing for old MBAR (adaptive self-consistent iteration/NR) vs. new MBAR (SciPy solver with MKL acceleration) as a function of the number of states. I tried to get more data for old MBAR overnight, but it only made it about another 3-5 data points out of the 20 or so leading up to 200 states, so I stopped it and just went with this image, where the old MBAR timings stop at 85 states. There are about 560 samples per state.

[Image: mbartiming — plot of old vs. new MBAR solver time as a function of number of states]