choderalab / pymbar-datasets


How to use these datasets? #3

Open kyleabeauchamp opened 10 years ago

kyleabeauchamp commented 10 years ago

So it seems like for most of these datasets there's no "right" answer, at least not one we can compare against an analytical result. That raises the question of how we can use these datasets in an automated test framework.

The second issue I'm seeing is that these tests essentially involve running Python scripts with ~1000 lines of IO, preprocessing, analysis, and output. Scripts like that will not be easy to integrate into an automated test framework.

kyleabeauchamp commented 10 years ago

I guess the first thing we should do is figure out how to port the scripts to pymbar 2.0. The easiest way may be for me to write a pymbar 1.0 compatibility object that exactly reproduces the pymbar 1.0 API but calls pymbar 2.0 code under the hood.
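
A minimal sketch of what such a shim might look like, assuming a 2.0-style `pymbar.MBAR(u_kn, N_k)` entry point; the class name and the forwarded method here are illustrative, not the actual pymbar API:

```python
import numpy as np
import pymbar  # assumes a 2.0-style pymbar.MBAR(u_kn, N_k) constructor


class MBAR10Shim:
    """Hypothetical adapter: accepts pymbar 1.0-style u_kln input and
    delegates to a pymbar 2.0-style MBAR object under the hood."""

    def __init__(self, u_kln, N_k):
        K = u_kln.shape[0]
        # pool samples from all K states along a single sample axis
        u_kn = np.concatenate(
            [u_kln[k, :, :N_k[k]] for k in range(K)], axis=1
        )
        self._mbar = pymbar.MBAR(u_kn, np.asarray(N_k))

    def getFreeEnergyDifferences(self, *args, **kwargs):
        # forward a 1.0-era method name to the underlying 2.0 object
        return self._mbar.getFreeEnergyDifferences(*args, **kwargs)
```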

kyleabeauchamp commented 10 years ago

For example, there's the issue of U_kln versus U_kn. It would take considerable time to rewrite all the scripts here to produce data in the new layout, so a compatibility layer might be key.
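
For concreteness, a sketch of that conversion, assuming the 1.0 convention that u_kln[k, l, n] is the reduced potential of sample n drawn from state k and evaluated at state l (the helper name is mine):

```python
import numpy as np


def kln_to_kn(u_kln, N_k):
    """Convert a 1.0-style u_kln array of shape (K, K, N_max) into a
    2.0-style u_kn array of shape (K, N_total), pooling the samples
    from all states along the second axis."""
    K = u_kln.shape[0]
    u_kn = np.zeros((K, int(np.sum(N_k))))
    start = 0
    for k in range(K):
        stop = start + int(N_k[k])
        # samples drawn from state k, evaluated at every state l
        u_kn[:, start:stop] = u_kln[k, :, :N_k[k]]
        start = stop
    return u_kn
```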

kyleabeauchamp commented 10 years ago

I also think we should consider looking for simpler test cases with unambiguous right answers, either analytical or high-precision numerical.
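
For instance, a set of harmonic oscillators has exact reduced free energies f_k = -ln sqrt(2*pi/K_k), so MBAR estimates can be checked against a closed form. A sketch of such a test, assuming a 2.0-style MBAR(u_kn, N_k) constructor (the free-energy method name varies across pymbar versions):

```python
import numpy as np
from pymbar import MBAR  # assumes a 2.0-style MBAR(u_kn, N_k) constructor


def test_harmonic_oscillators():
    rng = np.random.default_rng(0)
    K_k = np.array([1.0, 2.0, 4.0])   # spring constants (reduced units)
    mu_k = np.array([0.0, 0.5, 1.0])  # equilibrium positions
    N_k = np.array([5000, 5000, 5000])

    # exact reduced free energies: f_k = -ln Z_k with Z_k = sqrt(2*pi/K_k)
    f_exact = -0.5 * np.log(2.0 * np.pi / K_k)

    # draw samples from each oscillator's Gaussian equilibrium distribution
    x_n = np.concatenate(
        [rng.normal(mu_k[k], 1.0 / np.sqrt(K_k[k]), N_k[k]) for k in range(3)]
    )
    # reduced potential of every pooled sample evaluated in every state
    u_kn = 0.5 * K_k[:, None] * (x_n[None, :] - mu_k[:, None]) ** 2

    mbar = MBAR(u_kn, N_k)
    df_est = mbar.getFreeEnergyDifferences()[0][0, :]  # name varies by version
    assert np.allclose(df_est, f_exact - f_exact[0], atol=0.05)
```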

jchodera commented 10 years ago

I would prefer our approach to be to test against systems with known analytical or high-precision numerical answers.

As a minimal alternative, we can just make sure the code runs on these datasets, but that is a very low bar.

jchodera commented 10 years ago

Is @mrshirts subscribed here?

kyleabeauchamp commented 10 years ago

Yes

kyleabeauchamp commented 10 years ago

I agree with the synthetic dataset stuff. IMHO I'm just overwhelmed by the idea of us maintaining thousands of lines of user-contributed code as part of our testing protocol.

jchodera commented 10 years ago

On Nov 26, 2013, at 5:05 PM, kyleabeauchamp notifications@github.com wrote:

I agree with the synthetic dataset stuff, though. IMHO I'm just overwhelmed by the idea of us maintaining thousands of lines of user-contributed code as part of our testing protocol.

I agree completely. There's no way we can possibly do that.

There may still be a few large datasets that we would like the code to work on, or at least give consistent answers on, such as the large trypsin datasets that Michael has generated. But that seems like a lower priority than testing systems with analytical results.

I still need to code up some analytically tractable systems for binding affinity calculations. Those could be included in our tests as well if we feel we need more diversity than just harmonic oscillators.

John

mrshirts commented 10 years ago

Hi, all-

Busy all day with classes and meetings! I'm adding these datasets because they represent hard cases and/or interesting applications that use a lot of data.

In all cases, there is a currently working script that can be run to produce output. So at a high level, one just needs a wrapper that calls those scripts and inspects the output -- the only customizable things are the input and output filenames. These are not going to be used in nightly regression tests, or even downloaded by most users.

I don't think we want or need to maintain these things, other than perhaps altering the call to pymbar (and I'm happy to do that as long as they are working). They do represent hard problems that we'd like to handle. For example, the gas-properties case is a memory hog, and we'd love to reduce that. The 8proteins case is one where the range of free energies requires that the weights be stored in the log domain, because otherwise you get exp(large negative number) * exp(large positive number) = 0, since exp(large negative number) = 0 to machine precision.
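
To see that failure mode in miniature (the numbers are illustrative):

```python
import numpy as np

a, b = -800.0, 700.0           # log-domain quantities with a huge spread
naive = np.exp(a) * np.exp(b)  # 0.0: exp(-800) underflows to zero first
stable = np.exp(a + b)         # exp(-100) ~ 3.7e-44, the correct value
```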

Going back to a question that Kyle asked earlier: I suspect that in the iterative cases, we can probably do the solution in the exponential domain and then store it in the log domain (though this needs to be tested). So when computing an expectation we would do:

A = \sum_n exp(log W_n + log A_n),

where W_n is the mixture-distribution weight of sample n.

This would incur the cost of the exponentials at each evaluation, but at least it's not an iterative cost.
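
In code this is a single logsumexp over the stored log-domain quantities, exponentiated once at the end; a sketch (the helper name is mine):

```python
import numpy as np
from scipy.special import logsumexp


def expectation_from_logs(log_W_n, log_A_n):
    # A = \sum_n exp(log W_n + log A_n): stay in the log domain through
    # the sum and leave it only once, at the final exponentiation
    return np.exp(logsumexp(log_W_n + log_A_n))
```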

If both the log and exponential versions are stored, one could test which version to use wherever that check is fast enough. I've defaulted to storing only the log version, but storing both may not be that costly.

Free energies of unsampled states would be

f_new = -log \sum_n exp(log W_n - u_new(n)),

where u_new(n) is the reduced potential of sample n in the new state. (In the expectation above, A_n has been transformed to always be greater than 1, so that log A_n is well defined.)
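
The same logsumexp trick evaluates this stably; a sketch (the helper name is mine):

```python
import numpy as np
from scipy.special import logsumexp


def f_unsampled(log_W_n, u_new_n):
    # f_new = -log \sum_n exp(log W_n - u_new(n))
    return -logsumexp(log_W_n - u_new_n)
```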

Note that if we keep a legacy routine (of any flavor) that does everything in the log domain, we can always test new extreme cases easily.
