choderalab / pymbar

Python implementation of the multistate Bennett acceptance ratio (MBAR)
http://pymbar.readthedocs.io
MIT License

Convert to N x K storage for U_kln matrix #3

Closed mrshirts closed 10 years ago

mrshirts commented 11 years ago

For data sets where we have different numbers of samples at different states, rather than having the energies stored in a K x K x N_each matrix, it would be more efficient to store the data as a K x N_tot matrix, where N_tot is the total number of samples collected and K is the number of states at which we evaluate the energies. This will take a bit of extra work, however.

kyleabeauchamp commented 11 years ago

So interestingly, we recently ran into a possibly similar issue in MSMBuilder. We previously stored the state assignments as a (num_trajectories, max_length) numpy array. This led to over-allocation of huge blocks of memory when we had data sets where some trajectories were very long, but others were very short (e.g. combining FAH and ANTON data).

Our proposed solution (hasn't been merged yet, but probably will eventually be) was to use Pandas to store the data in a different format (https://github.com/SimTk/msmbuilder/pull/206 and https://github.com/SimTk/msmbuilder/issues/205). Instead of a 2D array, we use a Pandas Dataframe that has fields for "trajectory ID", "timestep", and "conformational state".

I suspect a similar thing could be done in pymbar--we just create a DataFrame object that has entries for "sample ID", "sampled state", etc.
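
For illustration, here is a minimal sketch of that layout (the field names and the tiny 2-state/3-state toy numbers are illustrative, not an agreed-upon schema): one row per (sample, evaluated state) entry, with a dense matrix recoverable by a pivot only when needed.

    import numpy as np
    import pandas as pd

    rng = np.random.RandomState(0)
    records = []
    for sample_id in range(5):                        # 5 samples total, unevenly split across states
        sampled_state = 0 if sample_id < 3 else 1     # which state the sample was drawn from
        for evaluated_state in range(3):              # evaluate every sample at all 3 states
            records.append({
                "sample_id": sample_id,
                "sampled_state": sampled_state,
                "evaluated_state": evaluated_state,
                "reduced_potential": rng.normal(),    # placeholder energies
            })

    df = pd.DataFrame(records)

    # Recover a dense N_tot x K matrix only when it is actually needed:
    u_nk = df.pivot(index="sample_id", columns="evaluated_state",
                    values="reduced_potential").values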

kyleabeauchamp commented 11 years ago

Obviously that requires an extra prerequisite, but it could be an elegant solution if this problem is a real efficiency breaker.

jchodera commented 11 years ago

I like this idea---it's effectively a sparse representation.

Optimally, I think we would implement a simple linear storage system that doesn't require pandas (all N samples are concatenated and evaluated at all K states) and a pandas-interoperable interface, so that either form could be passed in.

The big question is whether we can deprecate the old interface where a K x K x Nmax matrix u_kln is passed in. I imagine deprecating it would be a bad idea, so we can simply support both as input (but use N x K internally) and determine which one is being passed from its dimensionality.

mrshirts commented 11 years ago

I think the first step is to convert to an N x K representation, using the N_k data to identify where we transition between samples from state 1, state 2, etc. I have some ideas on this, but not sure when I'll have time to do too much.
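
As a rough illustration of that bookkeeping, here is a hedged sketch (not pymbar's converter; the function name is hypothetical, and it assumes the usual convention that u_kln[k, l, n] is sample n drawn from state k evaluated at state l):

    import numpy as np

    def kln_to_kn(u_kln, N_k):
        """Collapse a K x K x N_max u_kln array into a K x N_tot u_kn array."""
        K = len(N_k)
        N_tot = int(np.sum(N_k))
        u_kn = np.zeros((K, N_tot))
        start = 0
        for k in range(K):
            stop = start + int(N_k[k])
            # Columns start:stop hold the N_k[k] samples drawn from state k,
            # evaluated at every state l (the first axis of the slice).
            u_kn[:, start:stop] = u_kln[k, :, :int(N_k[k])]
            start = stop
        return u_kn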


jchodera commented 11 years ago

This should be straightforward, but will require some time investment to complete and test. We should only tackle this once we have a sufficient number of tests in place to ensure we haven't screwed anything major up.

mrshirts commented 11 years ago

Agreed on all counts mentioned.


kyleabeauchamp commented 11 years ago

I agree. Once tests are in place, it would be easy to hammer this out.

kyleabeauchamp commented 11 years ago

Below, $q_{kj}$ refers to the unnormalized probability of conformation j in state k. (e.g. N x K storage)

So I've been playing with this a bit and have some initial thoughts regarding acceleration:

  1. If we switch from log-space to exponential space (e.g. from $f_i$ to $c_i$) (Eqns. C1, C2), the self-consistent equations of MBAR become essentially just two matrix multiplications.
  2. We can still handle overflow by dividing $q_{kj}$ by the largest element in each row (e.g. the most probable state for each sample).
  3. We can't handle underflow and still use fast BLAS / matrix multiplication--underflow requires Kahan summation, which is generally not a BLAS accelerated operation. Note that I don't think the current code actually handles underflow--just overflow. Correct me if I'm wrong here.
  4. My implementation of these ideas (with self-consistent iteration) gives a 100X speedup over the current C++ Newton code. Also, this requires only like 10 lines of code.
  5. If 100X is fast enough, do we still need to have the Newton code and c++ code? Do we want to run calculations for n_states > 10000? In such cases, we can't use NR because of the memory footprint of the Hessian--right? One possible solution is to transform the self-consistent equations into a nonlinear optimization problem (e.g. possibly using L2 error) and apply LBFGS. I'm not sure if this is worth the extra code required, though.

I can give more quantitative results / PR later.
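
For concreteness, here is a minimal sketch of the exponential-space update described in points 1 and 2 above (illustrative code, not the actual proposed implementation; it assumes K x N_tot storage and uses the per-sample max-scaling, which cancels exactly in the update):

    import numpy as np

    def self_consistent_mbar(u_kn, N_k, tol=1e-10, max_iter=100000):
        """u_kn[k, n] = reduced potential of sample n evaluated at state k."""
        N_k = np.asarray(N_k, dtype=float)
        # Shift each sample by its minimum reduced potential; this is equivalent to
        # dividing q_{kn} by its largest element per sample and only prevents overflow.
        Q = np.exp(-(u_kn - u_kn.min(axis=0)))       # q_{kn}, at most 1 in every column
        c = np.ones(u_kn.shape[0])                   # c_k = exp(f_k), defined up to a constant
        for _ in range(max_iter):
            denom = (N_k * c) @ Q                    # d_n = sum_k N_k c_k q_{kn}
            c_new = 1.0 / (Q @ (1.0 / denom))        # from exp(-f_i) = sum_n q_{in} / d_n
            c_new /= c_new[0]                        # pin f_0 = 0
            if np.max(np.abs(np.log(c_new / c))) < tol:
                return np.log(c_new)                 # f_k = ln c_k
            c = c_new
        return np.log(c)

Each iteration is just the two matrix-vector products.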

mrshirts commented 11 years ago

Hi, all-

Perhaps you can check code in as a branch so we can look at it directly?

So I've been playing with this a bit and have some initial thoughts regarding acceleration:

Great!

  1. If we switch from log-space to exponential space (e.g. from $f_i$ to $c_i$) (Eqns. C1, C2), the self-consistent equations of MBAR become essentially just two matrix multiplications.

There are some places where the log-space representation really is necessary for precision. Most of the code was originally in exponential space, and more and more of it was shifted to log space over time. I'd have to dig in a bit more. Is it possible to shift to exponential space just before the time-consuming calculations?

  2. We can still handle overflow by dividing $q_{kj}$ by the largest element in each row (e.g. the most probable state for each sample).

I don't THINK this is sufficient to avoid the problem -- but we'd have to check.

  3. We can't handle underflow and still use fast BLAS / matrix multiplication--underflow requires Kahan summation, which is generally not a BLAS accelerated operation. Note that I don't think the current code actually handles underflow--just overflow. Correct me if I'm wrong here.

I would have to look a little more carefully. I think we don't have to worry about underflow.

  4. My implementation of these ideas (with self-consistent iteration) gives a 100X speedup over the current C++ Newton code. Also, this requires only like 10 lines of code.

Great!

  5. If 100X is fast enough, do we still need to have the Newton code and c++ code?

NR is MUCH faster than self-consistent iteration near the answer. It might take 8 steps where self-consistent iteration still hasn't converged after 8000. Of particular worry are cases where the delta F is below the specified limit -- but each dF is correlated, so you could still be 0.01 away from the solution even though the increment has dropped below 0.00001, for example. So NR is still good for now. Don't remove NR, even for a factor of 100; it's worth more than that in many cases. If you can replace the old self-consistent version without removing NR, then more power to you, of course.

  Do we want to run calculations for n_states > 10000? In such cases, we can't use NR because of the memory footprint of the Hessian--right?

Longer term, we do want to handle more states. However, we really only need this for states for which there are samples; states for which there are no samples do not require a Hessian matrix to be generated. The maximum size is K_sampled x K_sampled.

The likely cases where we want to handle many states are when we are exploring a highly multidimensional space and n_states >> n_samples, in which case we would be limited by the number of samples instead. I think we want to think about these a bit differently.

  One possible solution is to transform the self-consistent equations into a nonlinear optimization problem (e.g. possibly using L2 error) and apply LBFGS. I'm not sure if this is worth the extra code required, though.

I think this is a great idea, but I think it can wait until we've stabilized the current changes a bit more.

mrshirts commented 11 years ago

One possible solution is to transform the self-consistent equations into a nonlinear optimization problem (e.g. possibly using L2 error) and apply LBFGS. I'm not sure if this is worth the extra code required, though.

Note that there is an objective function that, when minimized, gives the free energies. See the function _objectiveF. It could certainly be written more efficiently.
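
As a hedged sketch of that route (illustrative code, not the _objectiveF implementation; the objective below is the standard convex MBAR form, whose gradient is the negative of the g_i quoted later in this thread):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import logsumexp

    def mbar_objective_and_gradient(f_k, u_kn, N_k):
        """F(f) = sum_n log sum_k N_k exp(f_k - u_kn) - sum_k N_k f_k (convex in f)."""
        log_denom_n = logsumexp(f_k[:, None] - u_kn, b=N_k[:, None], axis=0)
        W_kn = np.exp(f_k[:, None] - u_kn - log_denom_n)     # W_kn[i, n] = W_ni
        objective = log_denom_n.sum() - np.dot(N_k, f_k)
        gradient = N_k * W_kn.sum(axis=1) - N_k              # = -(N_i - sum_n N_i W_ni)
        return objective, gradient

    def solve_mbar_lbfgs(u_kn, N_k, f_guess=None):
        N_k = np.asarray(N_k, dtype=float)
        f0 = np.zeros(u_kn.shape[0]) if f_guess is None else np.asarray(f_guess, dtype=float)
        result = minimize(mbar_objective_and_gradient, f0, args=(u_kn, N_k),
                          jac=True, method="L-BFGS-B")
        return result.x - result.x[0]                        # free energies are only defined up to a constant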


jchodera commented 11 years ago

If we switch from log-space to exponential space (e.g. from $f_i$ to $c_i$) (Eqns. C1, C2), the self-consistent equations of MBAR become essentially just two matrix multiplications. We can still handle overflow by dividing $q_{kj}$ by the largest element in each row (e.g. the most probable state for each sample). There are some places where the log space representation really is necessary for precision. Most of the code was originally in exponential space, and more and more of it was shifted to log space over time. I'd have to dig in a bit more. Is it possible to shift to exponential space just before the time-consuming calculations?

Note that Kyle says he can handle overflow by scaling each row. That could fix nearly all of the issues that caused us to go to log space. I think this might be worth exploring to see if underflow ends up being unproblematic.

To test this, I think we'd need to have the harmonic oscillator test be automated with a variety of different overlap conditions (good, medium, poor, very bad).

jchodera commented 11 years ago

My implementation of these ideas (with self-consistent iteration) gives a 100X speedup over the current C++ Newton code. Also, this requires only like 10 lines of code. If 100X is fast enough, do we still need to have the Newton code and c++ code? Do we want to run calculations for n_states > 10000? In such cases, we can't use NR because of the memory footprint of the Hessian--right? One possible solution is to transform the self-consistent equations into a nonlinear optimization problem (e.g. possibly using L2 error) and apply LBFGS. I'm not sure if this is worth the extra code required, though.

I'd fully support this, since it would make for code that is more easily understood, still fast, and easier to maintain.

For the really big problems, maybe we could use a scheme that doesn't require explicit construction of the full matrix, or computes the elements on the fly as needed, possibly in a pyopencl/pycuda implementation (which would be compact and fast).

mrshirts commented 11 years ago

Note that Kyle says he can handle overflow by scaling each row. That could fix nearly all of the issues that caused us to go to log space. I think this might be worth exploring to see if underflow ends up being unproblematic.

Sure, just remember that almost everything in there is there for a reason. Before we make a major switch like this, I'll need to go through the check-in dates for these changes, check the emails at that time, and make sure there isn't some nasty test case that we're forgetting something important for. Limiting ourselves to the exponential case limits us to a dynamic range of 700x (log 10^300 - log 10^-300). I believe the specific problem was looking at heat capacities for protein aggregation.

To test this, I think we'd need to have the harmonic oscillator test be automated with a variety of different overlap conditions (good, medium, poor, very bad).

Unfortunately, harmonic oscillators are "nice" in a way that a lot of other problems are not. So just because something works with harmonic oscillators doesn't mean it will work with other problems. If it breaks harmonic oscillators, it will certainly break other problems.

The solution, of course, is to dig those problems out of the email so they can be tested as well. But given my schedule in Sep/Oct, I'm not going to be able to do that.

In conclusion, the todos as I see them:

  1. Fix the K x K x N representation.

This will make formulas simpler, because we won't need to do nearly as much index juggling, and will improve speed as well as saving memory -- in most cases where memory fails, it fails because of K x K x N issues, not because of K x K issues.

  2. Make our test set include all the problems that failed in the past (this will require some data archeology in the emails and subversion code), so that when we have good ideas, we can tell whether they will break things.
  3. Then experiment with things like only representing exponential spaces, highly sparse spaces, and removing methods that are working pretty well. If it ain't broke, be very careful about fixing it.

mrshirts commented 11 years ago

OK, I found the example that caused me to harden the code. Basically, free energy differences greater than ~710 kBT cannot be handled in the exponential regime alone. Note that this has nothing to do with overlap -- if the energy differences alone produce free energy differences that large, it will still fail.

I'll try to figure out where to upload this.

kyleabeauchamp commented 11 years ago

In principle, I agree that we don't want to change things that aren't broken. However, if it leads to major code simplification and maintainability, such changes might be worthwhile.

jchodera commented 11 years ago

In principle, I agree that we don't want to change things that aren't broken. However, if it leads to major code simplification and maintainability, such changes might be worthwhile.

I agree completely.

OK, I found the example that caused me to harden the code. Basically, free energy differences greater than ~710 KbT cannot be handled in the exponential regime alone. Note that this has nothing to do with overlap -- if the energy differences alone cause the free energy differences, it will still fail.

We should create a special tests/ directory for difficult cases, provided they are small enough to include with the standard package. Otherwise, they should go into pymbar-examples or pymbar-datasets. Or maybe we can create a pymbar-tests?

kyleabeauchamp commented 11 years ago

So here's my proposed organization:

  1. Create a directory of test systems, each of which is a class that generates the necessary data
  2. Have wrappers in our nosetests that run all the tests (some might be disabled for speed reasons, if necessary); a sketch of such a test-system class is below
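
Here is a hedged sketch of what such a test-system class and wrapper might look like (class, function, and parameter choices are illustrative, not the actual pymbar test layout):

    import numpy as np

    class HarmonicOscillatorsTestSystem:
        """1-D harmonic oscillators with u_k(x) = 0.5 * K_k * (x - O_k)**2 in reduced
        units, so the analytical free energies are f_k = 0.5 * ln(K_k / (2 * pi))."""

        def __init__(self, O_k, K_k):
            self.O_k = np.asarray(O_k, dtype=float)    # offsets
            self.K_k = np.asarray(K_k, dtype=float)    # spring constants

        def analytical_free_energies(self):
            f_k = 0.5 * np.log(self.K_k / (2.0 * np.pi))
            return f_k - f_k[0]

        def sample(self, N_k, seed=0):
            rng = np.random.RandomState(seed)
            # Samples from state k are Gaussian with mean O_k and variance 1/K_k.
            x_n = np.concatenate([
                rng.normal(self.O_k[k], self.K_k[k] ** -0.5, size=n)
                for k, n in enumerate(N_k)])
            # K x N_tot reduced potentials: every sample evaluated at every state.
            u_kn = 0.5 * self.K_k[:, None] * (x_n[None, :] - self.O_k[:, None]) ** 2
            return u_kn, np.asarray(N_k)

    def test_harmonic_oscillators_good_overlap():
        system = HarmonicOscillatorsTestSystem(O_k=[0.0, 0.5, 1.0], K_k=[1.0, 1.5, 2.0])
        u_kn, N_k = system.sample([1000, 1000, 1000])
        # ...run the estimator on (u_kn, N_k) and compare against
        # system.analytical_free_energies() to within a few standard errors...
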
kyleabeauchamp commented 11 years ago

Regarding the ~700 kBT underflow limit, while working in log space is one solution, another solution would be to switch to float128 in situations where more precision is needed.

mrshirts commented 11 years ago

That only doubles the range, to 1400 kBT. There are things that are fundamentally better to do logarithmically, and it's surprisingly easy to bump up against them. Two harmonic oscillators with equal spring constants but energy offsets of 2000 kBT can still break parts of pymbar if everything is done solely in exponential space.
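
To make the range limit concrete, a small illustration (not pymbar code): in double precision, exp() overflows near 710, so a purely exponential-space reduction over such energy gaps returns inf, while the equivalent log-space reduction is fine.

    import numpy as np
    from scipy.special import logsumexp

    du = np.array([0.0, 800.0])             # reduced-energy gaps well past exp()'s range

    with np.errstate(over="ignore"):
        naive = np.log(np.sum(np.exp(du)))  # exp(800) -> inf, so the result is inf
    stable = logsumexp(du)                  # max(du) + log(sum(exp(du - max(du)))) ~= 800.0

    print(naive, stable)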

mrshirts commented 11 years ago

So here's my proposed organization:

Create a directory of test systems, each of which is a class that generates the necessary data

Can you clarify this? Sometimes, there will be data sets where the data cannot be generated on the fly.

Have wrappers in our nosetests that run all the tests (some might be disabled for speed reasons, if necessary)

Yes, there should be standard tests that always run.

I'm thinking that there are really three types of data sets and examples:

  1. tests (currently existing): This is supposed to provide the automated tests. For example, tests/harmonic-oscillators/harmonic-oscillators.py runs through all the functionality in pymbar and checks analytical error estimates by making sure biases are within 1-2 standard deviations. It could probably be integrated much better. The tests/timeseries/timeseries/test-statistical-efficiency code checks timeseries.py. The tests/harmonic-oscillators/harmonic-oscillators.py script is supposed to test the distribution of errors, which it basically does, but it doesn't test all functionality.
  2. examples (currently existing): Pedagogical examples; these should be part of the main download. They are for users and may involve small test sets, but the main goal is to help people write code using these examples. Data sets should be small enough to fit in the main download.
  3. large-datasets (kind of existing): This kind of already exists, and should be for larger-scale testing that is 1) too expensive to automate and 2) a database of all problems we've encountered, so we can make sure that new code doesn't break weird but important cases. Stress testing (like the energy range problem) should go here.

kyleabeauchamp commented 11 years ago

Thanks. Here's another question: do we need everything in log space, or just $f_i$? For example, can we pre-compute $Q_{ij} = \exp(-u_{ij})$, rather than computing it over and over again during the self-consistent iteration? That's probably a large chunk of the slowdown.

mrshirts commented 11 years ago

In principle, I agree that we don't want to change things that aren't broken. However, if it leads to major code simplification and maintainability, such changes might be worthwhile.

Certainly! I'm just pointing out that it is currently easy to break the code, because we don't have all of the cases explicitly laid out -- they are all in emails between John and me that need to be added as test cases in pymbar. So one should not assume that, just because harmonic-oscillators.py results are unchanged, a new change has not broken the code. The situation does need to change longer term (which, for me, might mean a month or two).

mrshirts commented 11 years ago

Thanks. Here's another question: do we need everything in log space, or just $f_i$? For example, can we pre-compute $Q_{ij} = \exp(-u_{ij})$, rather than computing it over and over again during the self-consistent iteration? That's probably a large chunk of the slowdown.

That's a really good question that I would need to play around with. I THINK the answer is no, but I can't remember.

My first reaction is caution, so I'd still advocate taking the current code, changing everything to N x K, and then all subsequent proposed modifications (like this one) will be much easier to make. I'd like to see how much of the problem is all the annoying is-sampled-state index dancing, and then work from there. Getting rid of that will make the code much more readable without changing any convergence properties, and then we can separately address optimizations that may result in different convergence.

I'm sorry I'm being a pain here. Clearly, the specific problems in numerical stability that resulted in the switch to log form need to be documented better. I'm just trying to make sure we minimize the pain of switching back, as I'm almost certain that will be needed :) Though I could be wrong.

One of the reasons we haven't been worrying as much about the self-consistent speed is that NR is so much more robust as an optimization method than self-consistent iteration when it can be used (which is when the number of states with samples is less than ~10,000 or so).

kyleabeauchamp commented 11 years ago

So note that in the benchmark, I think my proposed self-consistent code is actually 100X faster than the NR code--it's a better-than-apples-to-apples comparison...

kyleabeauchamp commented 11 years ago

I agree that speed is lower priority than accuracy and readability, though.

kyleabeauchamp commented 11 years ago

    git clone https://github.com/kyleabeauchamp/pymbar.git
    cd pymbar
    git checkout refactor

kyleabeauchamp commented 11 years ago

Oops, wrong thread.

mrshirts commented 11 years ago

So note that in the benchmark, I think my proposed self-consistent code is actually 100X faster than the NR code--it's a better-than-apples-to-apples comparison...

The problem with self-consistent iteration is that relative_tolerance means something very different. For NR, given quadratic convergence, df_{i+1} - df_i is essentially df_{i+1} - df_converged. For self-consistent iteration, df_{i+1} - df_i != df_{i+1} - df_converged, since convergence is sub-linear. These are issues that only appear with nasty (but still common) cases.

jchodera commented 11 years ago

The problem with self-consistent iteration is that relative_tolerance means something very different. For NR, given quadratic convergence, df_{i+1} - df_i is essentially df_{i+1} - df_converged. For self-consistent iteration, df_{i+1} - df_i != df_{i+1} - df_converged, since convergence is sub-linear. These are issues that only appear with nasty (but still common) cases.

Zhiqiang Tan had a great idea for looking at the relative tolerance for both of these schemes in a consistent fashion: Compute the norm of the gradient of the related minimization problem. We compute this gradient in our adaptive scheme to decide whether self-consistent iteration or Newton-Raphson is getting "closer" to the converged result:

      # gradient (defined by Eq. C6 of [1])
      # g_i(theta) = N_i - \sum_n N_i W_ni
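
A hedged sketch of that check (illustrative code, not the adaptive scheme itself), computing g and its norm from the current f_k estimates:

    import numpy as np
    from scipy.special import logsumexp

    def gradient_norm(u_kn, N_k, f_k):
        """|g| with g_i = N_i - sum_n N_i W_ni, where W_ni are the MBAR weights."""
        N_k = np.asarray(N_k, dtype=float)
        log_w_kn = f_k[:, None] - u_kn                            # log of unnormalized weights
        log_denom_n = logsumexp(log_w_kn, b=N_k[:, None], axis=0)
        W_kn = np.exp(log_w_kn - log_denom_n)                     # W_kn[i, n] = W_ni
        g_k = N_k - N_k * W_kn.sum(axis=1)
        return np.linalg.norm(g_k)

Driving this norm below a tolerance then means the same thing whether the update being used is self-consistent iteration or NR.
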
mrshirts commented 11 years ago

This does lose some of the advantages of self-consistent iteration if you have to do the gradient calculation at each step. Though perhaps we can play around with the variables so that not much extra work is required at each step.

Please don't get me wrong in all of this. I think it's great Kyle is finding these tricks. We just need to get the robustness tests in so that we don't have to rely on me remembering what things broke the code in the past, leading to the somewhat-more-convoluted-than-ideal version that we have now.


mrshirts commented 11 years ago

Note also that even if the self-consistent iteration and NR use the same test for convergence, that does not change the self-consistent iteration performance. It just becomes clearer that it takes more steps.


kyleabeauchamp commented 10 years ago

So NxK is done. There are some useful discussions here regarding precision and performance, but I think I've refreshed my memory of the key points.