mrshirts opened 4 years ago
Started going through memory profiling. Using mprofile, I identified approximately where the issues were, and then stepped through with PDB to find the exact commands.
Fundamentally, there are large NxK data structures that are very hard to avoid. We need to know the u_kn matrix to do anything. We need to keep track of the W_kn weights.
For the u_kn matrices, fundamentally, we need the data. However, we might want to have the option to have u_kn be a callable function. Exactly how to implement it will depend on the situation. For example, if the k states are K temperatures, then it's very easy to implement such a function. But if you need to evaluate an alchemical potential it could be complicated. And accelerating such a function within the context of the mbar evaluations would be nontrivial.
The formula for the weights is w_kn = exp(f_k - u_k(x_n)) / \sum_l N_l exp(f_l - u_l(x_n)). It seems like it might be possible to store just the Nx1 denominators d(x_n), and reconstruct the NxK entries of the matrix when needed in the calculation, without ever creating the full matrix. We should probably store the log weights for underflow reasons, in which case \log w_kn = f_k - u_k(x_n) - \log d(x_n), and we can easily keep the simple sum, which with accelerated computations should be cheap. However, rewriting this touches some of the most compute time-sensitive parts of the code, so it should probably be done in conjunction with computational acceleration.
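A minimal numpy sketch of the reconstruct-on-demand idea (the function names here are hypothetical, not pymbar API): store only the N log-denominators, and rebuild any row of log w_kn from them when needed.

```python
import numpy as np
from scipy.special import logsumexp

def log_denominators(u_kn, f_k, N_k):
    # log d(x_n) = log sum_l N_l exp(f_l - u_l(x_n)); an N-vector,
    # the only K x N reduction we actually need to keep.
    return logsumexp(f_k[:, None] - u_kn, b=N_k[:, None], axis=0)

def log_weights_row(u_kn, f_k, log_d_n, k):
    # log w_kn = f_k - u_k(x_n) - log d(x_n), rebuilt one row at a time
    # instead of materializing the full K x N log-weight matrix.
    return f_k[k] - u_kn[k] - log_d_n
```

This trades the O(NK) weight matrix for an O(N) vector plus a cheap per-row reconstruction, at the cost of recomputing rows inside hot loops.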
It does seem that at minimum we should not have more than one u_kn and one w_kn matrix stored, and maybe just one u_kn. So we should see if we could do some calculations in place. Any thoughts appreciated!
However, there are a number of things that could potentially be memory optimized beyond "keep one w_kn and one u_kn". I'll start keeping track of those in extra comments.
OK, so one thing that surprises me that takes extra memory: in precondition_u_kn(u_kn, N_k, f_k)
, the command:
u_kn = u_kn - u_kn.min(0)
Increases the memory by quite a bit. In this case, exactly the size of the u_kn
array (which for 500 x 500000 is 1.91 GB). So it seems to be creating a completely new array (checked by calling u_kn.__array__
before and after; the address changes).
If I call it 2X, the memory increases permanently by the same 1.91 GB increment. If I KEEP calling it after that, the memory increases temporarily by the same amount, but then drops back down to what it was before. So the permanent increase is only 2x, but the address keeps changing. Beats me!
Just testing: this same behavior occurs if I subtract a constant from the 2D array instead of a 1D vector. It should probably be min(axis=0) for clarity, but that doesn't affect the memory usage.
If I use u_kn = np.add(u_kn, -u_kn.min(axis=0))
, then the 2x memory increase only occurs during the 2nd command, and the permanent increase is only 1x.
Yeah, I have no idea why.
Ah, potential solution: np.add(u_kn,-u_kn.min(axis=0), out=u_kn)
. Apparently, the output will be a fresh array if you do not specify out, even if you are overwriting the same variable.
This seems to preserve the array location (it sends the output to the same array), AND it is faster (the speed issue seems to be memory allocation).
After this change, the entire process seems to run at lower memory, so it seems to be a permanent fix.
I will investigate other places this happens.
Actually:
u_kn -= u_kn.min(axis=0)
works fine without extra allocation (it must be equivalent to out=u_kn). Just not:
u_kn = u_kn - u_kn.min(axis=0)
Which does create a whole new array. Who knew?
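The three spellings above can be compared directly; np.shares_memory confirms which ones reuse the original buffer (a small standalone demo, not pymbar code):

```python
import numpy as np

u_kn = np.random.rand(5, 10)
buf = u_kn  # second reference to the original buffer

# Binary op with assignment: allocates a brand-new array.
u_kn = u_kn - u_kn.min(axis=0)
print(np.shares_memory(u_kn, buf))  # False

# Augmented assignment: operates in place on the original buffer.
u_kn = buf
u_kn -= u_kn.min(axis=0)
print(np.shares_memory(u_kn, buf))  # True

# Explicit out=: also in place, equivalent to the augmented form.
np.add(u_kn, -u_kn.min(axis=0), out=u_kn)
print(np.shares_memory(u_kn, buf))  # True
```

Note that `-u_kn.min(axis=0)` still allocates a small 1D temporary in every variant; the difference is only whether the full 2D result gets a fresh buffer.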
OK, another initial memory occupant is the function call:
f_k_nonzero, all_results = solve_mbar(u_kn[states_with_samples], N_k[states_with_samples], f_k[states_with_samples], solver_protocol=solver_protocol)
Specifically the creation of the array u_kn[states_with_samples]
, which is a different array than u_kn
. Because all of the states have samples in this example, it's simply copying the array into a new one (which is 1.9 GB). It could be that in cases where all states have samples, one should simply pass the u_kn
array, rather than create a new one. Not sure if it would be worth doing in the general case.
Another memory hog:
Line ~200 of mbar.__init__
does:
self.u_kn = np.array(u_kn, dtype=np.float64)
Which creates a new array u_kn
.
I will change this to:
self.u_kn = np.array(u_kn, dtype=np.float64, copy=False)
Which avoids the copy when the input is already a float64 array (a copy is still made only if the dtype or layout must change).
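As a side note, np.asarray expresses the same "copy only if needed" intent, and is the safer spelling going forward: on NumPy 2.x, np.array(..., copy=False) raises instead of silently copying when a copy is unavoidable. A quick check of the behavior:

```python
import numpy as np

# Already float64: asarray returns the input object itself, no duplicate.
u_f64 = np.zeros((3, 4), dtype=np.float64)
same = np.asarray(u_f64, dtype=np.float64)
print(same is u_f64)  # True

# float32 input: a cast is required, so exactly one new array is made.
u_f32 = np.zeros((3, 4), dtype=np.float32)
converted = np.asarray(u_f32, dtype=np.float64)
print(converted is u_f32)  # False
```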
in mbar_gradient
, we have:
log_denominator_n = logsumexp(f_k - u_kn.T, b=N_k, axis=1)
log_numerator_k = logsumexp(-log_denominator_n - u_kn, axis=1)
The logsumexp
calls temporarily raise the memory by 2x size(u_kn), but that is gone by the end of the function. Similar behavior occurs in the mbar_hessian
function and in self_consistent_iteration
. Since these are going to be reoptimized, it's not clear it makes sense to clean up memory for now.
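If we did want to cap those temporaries, one option (a sketch, not something in pymbar) is to evaluate the logsumexp over blocks of samples, so the intermediate f_k - u_kn.T array never exists at full K x N size:

```python
import numpy as np
from scipy.special import logsumexp

def log_denominator_blocked(u_kn, f_k, N_k, block=1024):
    # Computes the same N-vector as
    #   logsumexp(f_k - u_kn.T, b=N_k, axis=1)
    # but the temporary is K x block instead of K x N.
    K, N = u_kn.shape
    out = np.empty(N)
    for start in range(0, N, block):
        stop = min(start + block, N)
        out[start:stop] = logsumexp(
            f_k[:, None] - u_kn[:, start:stop], b=N_k[:, None], axis=0
        )
    return out
```

The block size trades peak memory against Python-loop overhead; with acceleration (JIT/GPU) the same blocking falls out naturally.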
OK, after some more testing, I've concluded that it's not really possible to free up memory now.
There are two places where the input u_kn energy data is duplicated and then manipulated to make various calculations more numerically stable (basically, ones that compute ratios of exponentials of energies, where you need to subtract off the minimum to prevent over/underflow). If the data were entirely internal from that point forth, it would not need to be duplicated; it could just be overwritten with the manipulated data. However, in many cases this data is used again after the call, and you don't want the input data to be altered during the simulation run!
The other possibility is that the numerical stabilization could be run DURING the calculations that require it (i.e., computing expectations), instead of beforehand. However, it wouldn't really make sense to do this, because of the computational overhead (and increase in complexity), except after the calculations are accelerated.
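A sketch of what "stabilize during the calculation" could look like (hypothetical function, not pymbar code): since ratios of exponentials are invariant to subtracting any per-sample shift, the shift can be applied to a local temporary instead of mutating or duplicating the caller's u_kn.

```python
import numpy as np

def stabilized_exp_ratio(u_kn, f_k, N_k):
    # Per-sample shift; subtracting it cancels in the ratio below,
    # so the stored u_kn is never modified or permanently duplicated.
    shift = u_kn.min(axis=0)                        # N-vector
    num = np.exp(f_k[:, None] - (u_kn - shift))     # K x N local temporary
    return num / (N_k[:, None] * num).sum(axis=0)   # shift cancels here
```

A K x N temporary still appears transiently inside the call, which is exactly the overhead the comment above refers to; the win is only that no second persistent copy of u_kn survives the call.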
To see more details check closed branches #359 and #383.
When running biggish systems (500 states, 2000 samples/state), it takes between ~11 and 20 GB or so (on OS X), but only about 4 minutes. Once it exceeds the memory of the computer, it slows down by a factor of 4-5. So for the largest systems on a standard-memory computer, the limiting factor becomes memory swapping, not computation. So we should spend some time figuring out how to minimize memory usage.
Note there may still be value in solving the <10 GB problem in seconds rather than minutes, so acceleration could still be very good!