Open jchodera opened 8 years ago
Methods should definitely be documented better.
We can probably adjust things so that one can fetch Theta_ij using another function instead of the main computeAsymptoticCovarianceMatrix function. I will try to take a look in the next couple of days and make some recommendations (things are piling up because of basement flooding this weekend).
Maybe we should just start making a wishlist for pymbar 4.0 at this point? For example, we should put in the new stochastic and local schemes from Zhiqiang Tan:
Once we decide what we want, we can plow into this and get a release out in the course of a month over the summer. This will also be great training for students who are involved in that project.
I've implemented the stochastic method already in my GitHub clone (still playing around, so I haven't contributed it yet). The crossover point where it gets faster than the standard approach is a VERY large problem; basically, larger than can be stored in memory on a normal computer. That is, if you can store u_kln, then the standard approach is faster than the stochastic one. The stochastic scheme paper really could have explained that more clearly, i.e. scaled up and shown explicitly the crossover point for a typical problem.
The stochastic scheme is only better for problems where you calculate the energies at the same time as you solve the problem, so it will be hard to put it in the same framework.
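For a sense of where that crossover sits, here is a back-of-the-envelope memory estimate (illustrative numbers of my own, not from the thread): the standard solver has to hold the full u_kln matrix, so the stochastic scheme can only win once that matrix stops fitting in RAM.

```python
def u_kln_bytes(K: int, n_per_state: int, itemsize: int = 8) -> int:
    """Memory for a dense K x K x N float64 u_kln array (K states,
    n_per_state samples per state)."""
    return K * K * n_per_state * itemsize

# A mid-sized alchemical problem: 100 states, 1000 samples each
print(u_kln_bytes(100, 1000) / 1e9)    # 0.08 GB -- easily fits in RAM
# Only at much larger scales does the matrix stop fitting:
print(u_kln_bytes(1000, 10000) / 1e9)  # 80.0 GB -- standard storage breaks down
```

If the smaller case is typical of your workload, the standard solver stays the right default.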
This will also be great training for students who are involved in that project.
I don't have anyone assigned to that right now -- Levi is helping to maintain the code, but he's working on other projects right now.
Maybe we should just start making a wishlist for pymbar 4.0 at this point?
Though I do think a wishlist for 4.0 would be a good idea. What I would imagine is 1) a discussion of efficiency and robustness of solutions (the paper Kyle started, which is close-ish to being done), involving no more functionality than is in there now (maybe the stochastic method) but some code cleaning, released as 3.0, and 2) a 4.0 version released along with a general guide to reweighting (the review paper we've been planning for a while -- I'm clearing room once the semester finishes to finish that), where the list of what to add is wide open.
How does that sound?
Can we bring in Bryce? Since forcefield parameterization is going to rely so heavily on MBAR, it's important that we have the forcefield folks up to speed on the current API, and spending a few weeks with us helping clean it up is a good way to do that.
Though I do think a wishlist for 4.0 would be a good idea. What I would imagine is 1) a discussion of efficiency and robustness of solutions (the paper Kyle started, which is close-ish to being done),
I don't see that going anywhere. Without @kyleabeauchamp pushing to get that done, and with the robustness issues we've run into, I favor rolling back to robust algorithms we know work (like the adaptive scheme) even if they are a few percent slower, keeping the u_nk mechanism, adding the stochastic solver, and cleaning up the API.
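As a reference point for the "robust algorithms we know work", here is a minimal sketch of plain self-consistent iteration for the MBAR equations; this is my own simplified version, not pymbar's implementation (pymbar's adaptive scheme alternates steps like this with Newton-Raphson steps, taking whichever converges faster).

```python
import numpy as np

def _logsumexp(a, axis):
    # numerically stable log(sum(exp(a))) along an axis
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def self_consistent_iteration(u_kn, N_k, tol=1e-10, max_iter=10000):
    """Self-consistent MBAR iteration (the robust fallback discussed above).

    u_kn[k, n] = reduced potential of sample n evaluated in state k.
    N_k[k]     = number of samples drawn from state k.
    Returns estimated dimensionless free energies f_k, with f_k[0] = 0.
    """
    K, N = u_kn.shape
    f_k = np.zeros(K)
    log_N_k = np.log(np.asarray(N_k, dtype=float))
    for _ in range(max_iter):
        # log of the mixture denominator sum_k N_k exp(f_k - u_kn[k, n])
        log_denom = _logsumexp(f_k[:, None] + log_N_k[:, None] - u_kn, axis=0)
        # MBAR self-consistency: f_i = -log sum_n exp(-u_in) / denom_n
        f_new = -_logsumexp(-u_kn - log_denom[None, :], axis=1)
        f_new -= f_new[0]  # fix the arbitrary additive constant
        if np.max(np.abs(f_new - f_k)) < tol:
            return f_new
        f_k = f_new
    return f_k
```

Each iteration is O(KN); the point here is not speed but that this update cannot overshoot the way a Newton step can, which is why it makes a safe fallback.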
Just now I was trying to estimate free energy differences for a small problem and got a bunch of zeros out for the covariance matrix, indicating it failed to produce something numerically useful and fell back to a matrix of all zeros instead...
I think @bmanubay might be a good fit longer term, though that might be later this spring/summer, just in terms of getting up to speed. "Helping clean it up" does require some perspective on the history of the code, which takes time to acquire.
I don't see that going anywhere. Without @kyleabeauchamp pushing to get that done, and with the robustness issues we've run into, I favor rolling back to robust algorithms we know work (like the adaptive scheme) even if they are a few percent slower, keeping the u_nk mechanism, adding the stochastic solver, and cleaning up the API.
For larger systems, the old methods are MUCH slower, not just a few percent. Levi, could you comment here? You've tried both of them.
Just now I was trying to estimate free energy differences for a small problem and got a bunch of zeros out for the covariance matrix, indicating it failed to produce something numerically useful and fell back to a matrix of all zeros instead...
Levi can definitely take a look at robustness issues; he knows the existing code quite well. Can you post an example?
The stochastic solver requires a separate code path, as I mentioned: if you can write it as a u_nk matrix (or u_kln matrix), the standard approach is significantly faster. Stochastic is only better for problems that are so large that you have to calculate the energies at the K states on the fly.
Now, it SHOULD be included eventually; I'm just pointing out that it can't use the same API as the standard method.
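To make the separate-code-path point concrete, here is a Robbins-Monro-style sketch of a stochastic solver. The callable name and the exact update rule are my own illustration, not Tan's published scheme, but it shows the structural difference: energies arrive one sample at a time and no u_kln matrix is ever stored, so the API must accept an energy callback rather than a matrix.

```python
import numpy as np

def stochastic_mbar(sample_energies, N_k, n_steps=20000, seed=0):
    """Stochastic-approximation sketch of an MBAR solver (hypothetical API).

    sample_energies(rng) must return the length-K vector of reduced energies,
    computed on the fly, of one configuration drawn from the sample mixture.
    Uses the fact that at the MBAR fixed point, the expected mixture
    responsibility of state k equals its sample fraction pi_k = N_k / N.
    """
    rng = np.random.default_rng(seed)
    pi = np.asarray(N_k, dtype=float) / np.sum(N_k)  # mixture weights
    f = np.zeros(len(pi))
    for t in range(1, n_steps + 1):
        u = sample_energies(rng)            # K energies, evaluated on the fly
        logp = np.log(pi) + f - u
        p = np.exp(logp - np.max(logp))
        p /= p.sum()                        # responsibilities under current f
        f -= (1.0 / t) * (p - pi)           # drive E[p] toward pi
        f -= f[0]                           # pin the additive constant
    return f
```

The decreasing 1/t step size is the standard Robbins-Monro choice; the published scheme's update and step schedule differ in detail.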
I do need to boost this all up higher on the todo list.
I think we're still looking for a June-ish timeframe to tackle all of this---maybe right after the June 5 NIH deadline.
I will try to get cleaning of existing code done well before. Extensions, yeah, will probably be after June 5.
For larger systems, the old methods are MUCH slower, not just a few percent. Levi, could you comment here? You've tried both of them.
I'd have to create a concrete example to get a quantitative number, but from experience I can say the SciPy solver felt faster than the self-consistent iteration/Newton-Raphson approach. This was on data sets with K > 100. IIRC a self-consistent step was slower than an NR step in the adaptive method, and once the solver flipped to NR it finished rather quickly. The old adaptive method didn't cause odd numerical errors such as #198 and #212, although I think those can be fixed with #202.
Let me set up a test with a data set that has >200 states to get a quantitative number for you.
So here is the timing from old MBAR (adaptive self-consistent iteration/NR) vs. new MBAR (SciPy solver with MKL acceleration) as a function of the number of states. I tried to get more data for old MBAR overnight, but it only made it through another 3-5 data points out of the 20 or so leading up to 200 states, so I stopped it and went with this image, where I stopped timing the old MBAR at more than 85 states. There are about 560 samples per state.
The docstring for getFreeEnergyDifferences() says of uncertainty_method: … Several other functions say the same thing. Unfortunately, it seems that computeAsymptoticCovarianceMatrix() has been moved to the private API, so this isn't actually documented anywhere. I think we should stop returning Theta_ij and once again make this function part of the public API.
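For reference, here is a minimal sketch of the asymptotic covariance estimator behind Theta_ij, in one published form, Theta = W^T (I - W diag(N_k) W^T)^+ W. This is my own illustration, not pymbar's computeAsymptoticCovarianceMatrix, which uses equivalent but more memory-efficient formulations; the N x N pseudoinverse below is only practical for small sample counts.

```python
import numpy as np

def asymptotic_covariance(u_kn, N_k, f_k):
    """Sketch of the MBAR asymptotic covariance matrix Theta.

    u_kn[k, n] = reduced potential of sample n in state k.
    f_k        = converged dimensionless free energies.
    W[n, k] is the normalized MBAR weight of sample n in state k.
    Plain exp() for clarity; a real implementation should work in log space.
    """
    N_k = np.asarray(N_k, dtype=float)
    a = np.exp(f_k[:, None] - u_kn)        # unnormalized weights, K x N
    W = (a / (N_k @ a)).T                  # N x K; each column sums to 1
    N = u_kn.shape[1]
    inner = np.eye(N) - W @ np.diag(N_k) @ W.T
    return W.T @ np.linalg.pinv(inner) @ W  # K x K Theta matrix

# Uncertainty in a free energy difference from Theta:
#   var(f_i - f_j) = Theta[i, i] + Theta[j, j] - 2 * Theta[i, j]
```

Exposing something like this directly (rather than returning Theta_ij from the difference functions) is the kind of public entry point being argued for above.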