mrshirts opened this issue 10 years ago
Do we even need speed here? For example, the SVD algorithm involves an `svd()`, a `pinv()`, and a couple of small matrix multiplies. In my experience, these take under one second for many choices of N and K. IMHO there are two factors here: robustness and avoiding O(N^2) memory usage.

I propose that we modify the SVD-based code to avoid O(N^2) memory usage and to use `np.linalg.pinv()`. I've done this in #105, so we just need to explore whether this works and come up with some tests. The code (below) is a trivial update of what we were already using.
```python
elif method == 'svdk':
    # Use the singular value decomposition based approach given in the
    # supplementary material to efficiently compute the uncertainty.
    # See Appendix D.1, Eq. D4 in [1].

    # Construct matrices.
    Ndiag = np.matrix(np.diag(N_k), dtype=np.float64)
    W = np.matrix(W, dtype=np.float64)
    I = np.identity(K, dtype=np.float64)

    # Compute the SVD of W. full_matrices=False avoids O(N^2) memory
    # allocation by only calculating the active subspace of U.
    [U, S, Vt] = np.linalg.svd(W, full_matrices=False)
    Sigma = np.matrix(np.diag(S))
    V = np.matrix(Vt).T

    # Compute the covariance matrix.
    Theta = V * Sigma * np.linalg.pinv(
        I - Sigma * V.T * Ndiag * V * Sigma) * Sigma * V.T
```
Also, I strongly advise having exactly one implementation, unless someone has a strong and well-documented (in code and literature) justification for several.
I favor keeping multiple implementations, for several reasons: [...] `g_mbar` someday) and this can serve as a reference implementation.

I'm still strongly in favor of finding a single method that we can agree on. Reasons below.
In the current code, it's even unclear what the recommended method is. If you follow the code, sometimes the docstrings say the default is "svd", while other times they say "svd-ev". In fact, this is a typo, because "svd-ev" isn't even one of the choices--the correct value would be "svd-ew".
> K = 1000 is a non-issue with linear algebra on modern machines.

Sure, but give K = 10000 a try.
My point is that we should, if possible, find a single implementation that works for all cases. I'm fine to support several implementations if that's impossible.
Perhaps we should be focusing our efforts on the `svd-ew` method rather than the `svd` method--this avoids storing the NxK matrix U of left singular values.
> The literature references are partially undocumented.

We could certainly fix this.
> We currently don't even know which methods are expected to work. This isn't even about testing yet, this is about deprecating code that has possibly unknown failure cases. In my experience, the different methods sometimes give different answers.

They definitely sometimes give different answers because they have different numerical behaviors. I don't know if it's even clear which ones are numerically most robust.
> In the current code, it's even unclear what the recommended method is. If you follow the code, sometimes the docstrings say the default is "svd", while other times they say "svd-ev". In fact, this is a typo, because "svd-ev" isn't even one of the choices--the correct value would be "svd-ew".

I'm not sure which one is supposed to be recommended now. Originally, it was the `svd-ew` version I had come up with in the appendix of the MBAR paper, but I thought we had been using another default in most of our work since then.
I agree this isn't an ideal scenario and that we should move toward simpler and more maintainable code. I'd love it if we could make use of one method for everything---and perhaps we can? It may just be a step backwards to get rid of everything but one method if we find we need more than that.
> My point is that we should, if possible, find a single implementation that works for all cases.

Actually, if we can find one that works for all cases, I guess I'm all for that.
OK, so the next question is for you and Michael to recall some of the "difficult" examples, so we can convert them into regression tests.
Do we have any 3D umbrella sampling synthetic datasets in there? I think that was one of the more difficult large-K cases.
I see a 1D US example in pymbar-examples, but no 3D.
Remind me to transfer a copy of the 'large datasets from others that we don't have permission to share' when I get back to NYC.
I think we can design a synthetic dataset where we generate umbrella sampling data from a large origin-centered harmonic potential U0(x) = (K0/2) (x^2 + y^2 + z^2), with local umbrella biasing potentials on a 3D grid, Uijk(x) = (Kbias/2) [(x-i)^2 + (y-j)^2 + (z-k)^2], with the ijk indices running from -5 to +5. This will give us a K = 1331 state system, with the overlap between states adjustable by choice of Kbias.

I believe the total potential U(x) = U0(x) + Uijk(x) is also harmonic, and it should be possible to generate uncorrelated samples from each state using normal random variates.
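Here's a minimal sketch of such a generator (the values of K0, Kbias, and the per-state sample count are illustrative assumptions, not choices fixed in this thread). The key point is that U0 + Uijk is harmonic with spring constant K0 + Kbias and mean Kbias (i,j,k) / (K0 + Kbias), so in reduced units (kT = 1) each state can be sampled exactly with normal random variates:

```python
import itertools
import numpy as np

K0 = 0.1        # global spring constant (illustrative)
K_bias = 1.0    # umbrella spring constant (illustrative)
n_samples = 10  # uncorrelated samples per state (illustrative)

# 3D grid of umbrella centers with i, j, k in [-5, 5] -> K = 1331 states.
centers = np.array(list(itertools.product(range(-5, 6), repeat=3)), dtype=float)
K = len(centers)

# U0 + Uijk is harmonic: spring constant K0 + K_bias, mean pulled toward the
# umbrella center, and per-dimension variance 1 / (K0 + K_bias) at kT = 1.
K_tot = K0 + K_bias
means = K_bias * centers / K_tot
sigma = np.sqrt(1.0 / K_tot)

rng = np.random.default_rng(0)
x_kn = means[:, None, :] + sigma * rng.standard_normal((K, n_samples, 3))

# Reduced potential of every sample evaluated in every state.
x_flat = x_kn.reshape(K * n_samples, 3)
u0 = 0.5 * K0 * np.sum(x_flat**2, axis=1)
u_kn = np.empty((K, K * n_samples))
for l, c in enumerate(centers):
    u_kn[l] = u0 + 0.5 * K_bias * np.sum((x_flat - c) ** 2, axis=1)
N_k = np.full(K, n_samples)
```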
OK, I also just added an `svd-ew-kab` method to my pull request, which is just the `svd-ew`-style calculation using `np.linalg.pinv` instead of the custom pseudoinverse. From the docstrings, it seems like the `svd-ew` approach is the current recommended approach. It also has a low memory profile, avoiding the O(KN) matrix of left singular vectors. If we can make this approach work generally, I think that would be ideal.
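For concreteness, here's a minimal sketch of the idea (my reading of `svd-ew` plus `np.linalg.pinv`, not the exact pymbar code): the singular values and right singular vectors of W are recovered from an eigendecomposition of the small KxK matrix W'W, so the NxK matrix of left singular vectors is never formed, and the covariance then follows Eq. D4 just as in the block above.

```python
import numpy as np

def covariance_svd_ew(W, N_k):
    """Estimate Theta via Eq. D4, using eigh(W^T W) in place of svd(W)."""
    K = W.shape[1]
    # W^T W = V diag(S^2) V^T; only K x K objects are ever materialized.
    S2, V = np.linalg.eigh(W.T @ W)
    S2 = np.clip(S2, 0.0, None)          # guard tiny negative eigenvalues
    Sigma = np.diag(np.sqrt(S2))
    Ndiag = np.diag(np.asarray(N_k, dtype=np.float64))
    inner = np.eye(K) - Sigma @ V.T @ Ndiag @ V @ Sigma
    return V @ Sigma @ np.linalg.pinv(inner) @ Sigma @ V.T
```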
Did we ever implement the method in Appendix D.2 of the MBAR paper (specifically Eq. D8)? I liked this form, since it only involves manipulation and pseudoinversion of KxK matrices. I believe it requires that all K states be unique, however...
Hi, all-
Just to chime in -- first, Kyle, thanks for taking a look at this, since this was always one of the more annoying parts of the code.
> Do we even need speed here?

Yes, speed has been an issue in the past if we are doing large scans. Though I think a chunk of the speed problem was actually doing K^2 operations on Theta after the matrix decomposition, which has mostly been cleaned up.
> In the current code, it's even unclear what the recommended method is.

Yeah, that's what we definitely need to figure out and also include in a "numerical details of MBAR" paper.
I don't think there will necessarily be a single method that works for all purposes - in some cases we might prioritize memory over speed, in others speed over memory. One could imagine actually wanting a superfast, somewhat less accurate estimate -- though all of the "less accurate" methods that we have in there are too approximate. However, all methods should have a clear purpose. If there's no reason to include one, we should toss it.
Certainly if there is a single method that is better in all aspects, then we should use it, but we would have to show that first.
> the svd-ew version

This is certainly the default in the code now.
Speed is definitely an issue.

> K = 1000 is a non-issue with linear algebra on modern machines.

We'd definitely like to be able to handle larger matrices if we can. Ideally, if we can handle a given number of states for the free energies, we'd like to be able to compute the covariance matrix for that many states as well.
> The literature references are partially undocumented.

Then this would be something to make sure that we have covered in a paper.
> We currently don't even know which methods are expected to work. This isn't even about testing yet, this is about deprecating code that has possibly unknown failure cases.

They are in the code because they are expected to work. All new methods have possible unknown failure cases -- that's why we need to complete the rigorous testing that previously didn't get finished because it was low priority.
> In my experience, the different methods sometimes give different answers.

Yes, there are a few methods that are supposed to be fast and approximate. However, all of the approximate methods that are currently in there have ended up failing in the poor-sampling regime (which is exactly where we need good estimators), or aren't numerically robust. I can try to dig up the emails discussing this, but basically it involved running the `confidence_interval` code.
> Most of the implementations can be expressed as linear algebra, so IMHO there's not a huge difference in implementation complexity.

But there are always a bunch of different ways to do the same linear algebra, using different algorithms with different memory and computational requirements -- which is basically what we were exploring (but left in a state that leaves much to be desired, and which you're now cleaning up!).
> I think we can design a synthetic dataset where we generate umbrella sampling data from a large origin-centered harmonic potential U0(x) = (K0/2) (x^2 + y^2 + z^2), with local umbrella biasing potentials on a 3D grid, Uijk(x) = (Kbias/2) [(x-i)^2 + (y-j)^2 + (z-k)^2], with the ijk indices running from -5 to +5. This will give us a K = 1331 state system, with the overlap between states adjustable by choice of Kbias.
>
> I believe the total potential U(x) = U0(x) + Uijk(x) is also harmonic, and it should be possible to generate uncorrelated samples from each state using normal random variates.

I thought I had code for this somewhere -- let me try to dig it out. This test should work decently well for testing purposes. I think we'd like to keep pumping up the number of states until this fails, so we know the large-K limits for both speed and memory usage.
> Did we ever implement the method in Appendix D.2 of the MBAR paper (specifically Eq. D8)? I liked this form, since it only involves manipulation and pseudoinversion of KxK matrices. I believe it requires that all K states be unique, however...

It's supposedly in as method "inverse". I haven't tested performance recently, though.
Here's a speed benchmark for the `svd-ew-kab` method:

```
    name                   K     N       time (s)
0   25x100 exponentials    25    2500    0.00436
1   100x100 exponentials   100   10000   0.0322
2   250x250 exponentials   250   62500   0.485
3   25x100 oscillators     25    2500    0.00337
4   100x100 oscillators    100   10000   0.0288
5   250x250 oscillators    250   62500   0.579
6   500x100 oscillators    500   50000   1.31
7   1000x50 oscillators    1000  50000   3.48
8   2000x20 oscillators    2000  40000   12.2
9   4000x10 oscillators    4000  40000   80.1
10  500x100 exponentials   500   50000   0.967
11  1000x50 exponentials   1000  50000   3.03
12  2000x20 exponentials   2000  40000   13.5
13  4000x10 oscillators    4000  40000   71.8
14  gas-properties         155   240048  1.4
15  8proteins              28    112000  0.107
```
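(For reference, a minimal sketch of how timings like these can be collected. The `MBAR` constructor and the `uncertainty_method` keyword follow the pymbar 2/3-style API and may differ between versions, so treat this as illustrative:)

```python
import time
import pymbar

def time_uncertainty(u_kn, N_k, method='svd-ew-kab'):
    """Time only the uncertainty step for a given covariance method."""
    mbar = pymbar.MBAR(u_kn, N_k)  # free energies are solved up front
    t0 = time.perf_counter()
    mbar.getFreeEnergyDifferences(uncertainty_method=method)
    return time.perf_counter() - t0
```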
For the record, here are some representative timings for the various linear algebra operations on a 5000x5000 matrix. As expected, eigenvalues are cheap, SVD / pinv are expensive.
```
In [3]: %time l,v = np.linalg.eigh(x)
CPU times: user 1min 26s, sys: 1.02 s, total: 1min 27s
Wall time: 23.4 s

In [5]: %time xinv = np.linalg.pinv(x)
CPU times: user 4min 46s, sys: 2.5 s, total: 4min 48s
Wall time: 1min 15s

In [6]: %time u,d,v = np.linalg.svd(x)
CPU times: user 4min 27s, sys: 2.32 s, total: 4min 30s
Wall time: 1min 9s

In [7]: %time xinv = np.linalg.inv(x)
CPU times: user 27.3 s, sys: 373 ms, total: 27.7 s
Wall time: 10.5 s
```
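(A self-contained sketch for reproducing this comparison; note that `np.linalg.eigh` assumes a symmetric input, so the random test matrix is symmetrized first. The 5000x5000 size matches the timings quoted above:)

```python
import time
import numpy as np

n = 5000
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
x = (a + a.T) / 2  # symmetrize so eigh applies

for label, op in [('eigh', np.linalg.eigh), ('pinv', np.linalg.pinv),
                  ('svd', np.linalg.svd), ('inv', np.linalg.inv)]:
    t0 = time.perf_counter()
    op(x)
    print(label, time.perf_counter() - t0)
```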
Here are the covariance timings with the `inverse` method. Note that this is with the fix to #106, as otherwise these systems would blow up from memory usage.

```
    name                   K     N       time (s)
0   25x100 exponentials    25    2500    0.0175
1   100x100 exponentials   100   10000   0.0347
2   250x250 exponentials   250   62500   0.798
3   25x100 oscillators     25    2500    0.00301
4   100x100 oscillators    100   10000   0.0358
5   250x250 oscillators    250   62500   0.76
6   500x100 oscillators    500   50000   1.71
7   1000x50 oscillators    1000  50000   5.46
8   2000x20 oscillators    2000  40000   16.3
9   4000x10 oscillators    4000  40000   55.5
10  500x100 exponentials   500   50000   1.68
11  1000x50 exponentials   1000  50000   5.63
12  2000x20 exponentials   2000  40000   15.1
13  4000x10 oscillators    4000  40000   45.7
14  gas-properties         155   240048  1.9
15  8proteins              28    112000  0.143
```
Conclusion: the more general `svd-ew-kab` method is at most about 2X slower than the method that relies on the numerical inverse (and requires unique states), so there's not much lost by defaulting to the fully general method.

Also, with the `inverse` method, these systems all give warnings about singularity, despite having unique states:

```
Warning: W'W appears to be singular, yet 'inverse' method of uncertainty estimation requires W contain no duplicate states.
```
Things are looking good for the `svd-ew-kab` method! I guess you may be right that we can get by with one approach.

If we ditch the other methods, we could always go back and pull them back in if we find new tests start failing. But I like the "one method" approach more and more.
How much of a departure is this scheme from Appendix D.1?
It's actually discussed in that appendix, equation D5.
Thanks for making these measurements!
`svd-ew-kab` is just the old `svd-ew` with the numpy `linalg.pinv` instead of the homebrew pseudoinverse, correct?

How does `inverse` scale in terms of memory usage? That's the one place where it seems like it might have an advantage if N is large.

How similar are the results of `inverse` and `svd-ew-kab`? Are we applying an overly sensitive test of singularity, or are the results actually affected significantly by numerical error?
Yes, `kab` is just the `svd-ew` + `np.linalg.pinv`.
`inverse` and `svd-ew` should have the same memory footprint--both operate entirely on KxK matrices.
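(To put rough numbers on that: a quick back-of-the-envelope check, where K = 1331 comes from the 3D umbrella example above and 50 samples per state is an illustrative assumption:)

```python
# K x K vs N x K storage at float64 (8 bytes per entry).
K = 1331                       # states, from the 3D umbrella example above
N = K * 50                     # assumed 50 samples per state (illustrative)
print(K * K * 8 / 1e6, "MB")   # K x K matrix: ~14 MB
print(N * K * 8 / 1e9, "GB")   # N x K matrix: ~0.7 GB
```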
I can't comment on the accuracy / singularity yet, haven't looked into it.
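(When we do look into it, one possible starting point is to inspect the conditioning of W'W directly, since that's the matrix the singularity warning refers to. A hypothetical diagnostic, assuming the weight matrix `W` is at hand:)

```python
import numpy as np

wtw = W.T @ W  # the K x K matrix whose near-singularity triggers the warning
print("condition number:", np.linalg.cond(wtw))
print("rank:", np.linalg.matrix_rank(wtw), "of", wtw.shape[0])
```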
For example: `method == 'tan-HGH'` doesn't seem to be working that well. Is it a bug, or just not a very good method for estimating variance?
There probably should be ~2 methods -- the generally fastest one, and the generally most robust one.