alchemistry / soft-benchmarks

Soft benchmark sets for assessing the accuracy of alchemical free energy calculations on biomolecular systems
MIT License

Make a README which describes what we intend here #1

Open · davidlmobley opened this issue 7 years ago

davidlmobley commented 7 years ago

This is intended for validation/accuracy assessment of alchemical free energy calculations; the focus is much less on "Can this method converge to the known correct answer for this given system and parameters?" (as in github.com/mobleylab/benchmarksets) and much more on "How well can we do compared to experiment on this set?"

I/we need to add a README.md that clearly explains what this is for, what data we are after, etc.
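
A README along these lines might benefit from a small illustration of the intended comparison. Below is a minimal sketch; the file name, column names, and metric choices are hypothetical and purely illustrative, not a prescribed protocol:

```python
# Minimal sketch of the kind of accuracy assessment this repo is meant to support:
# comparing calculated binding free energies against experimental values and
# summarizing the error. The file name and column layout below are hypothetical.

import numpy as np
from scipy import stats

# Hypothetical input: one row per ligand, calculated and experimental dG in kcal/mol.
data = np.genfromtxt("example_set.csv", delimiter=",", names=True)
calc = data["calc_dG"]
expt = data["expt_dG"]

# Common summary statistics for accuracy relative to experiment.
mue = np.mean(np.abs(calc - expt))            # mean unsigned error
rmse = np.sqrt(np.mean((calc - expt) ** 2))   # root-mean-square error
r, _ = stats.pearsonr(calc, expt)             # linear correlation
tau, _ = stats.kendalltau(calc, expt)         # rank-ordering agreement

print(f"MUE = {mue:.2f} kcal/mol, RMSE = {rmse:.2f} kcal/mol, R = {r:.2f}, tau = {tau:.2f}")
```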

jchodera commented 7 years ago

I think you have the terminology backwards here and have additionally conflated "is my algorithm correct?" with "how rapidly can I reach the correct answer for this given system and parameters?". Let's have a call here soon, since I fear we may also be duplicating effort with the YANK validation/accuracy assessment/benchmark sets.

davidlmobley commented 7 years ago

I'm not compiling anything, @jchodera -- Woody was mentioning today that they are starting to pull things together for large-scale validation/assessment of accuracy on Schrodinger-like sets, and I mentioned how much good it would do for him to do it publicly, so we suggested getting it up on a repo. I told him I'd create one here so he could start putting things up and talking about what he's trying to do.

Terminology is open to adjustment, but you don't need to coordinate with me on it as I'm not doing anything on it at present. There are, however, a lot of OTHER people aside from your group who are starting to work on this (Silicon Therapeutics, OpenEye, and Thomas Evangelidis of the Alchemistry Slack, among others) so you may want to work on coordinating with them.

jchodera commented 7 years ago

It would behoove us to agree on terminology, made clear in your benchmark paper, for the three kinds of datasets we need to assess free energy methods.

I think amending your paper with Gilson to clarify these approaches and lay out three distinct datasets is the best way to do this.

jchodera commented 7 years ago

For the YANK paper, we are focusing on the validation and accuracy sets, and @andrrizzi has already compiled what we believe to be a tractable set with significant utility. The original goal was to amend your paper to include this division and the corresponding datasets, but if you're not amenable to that, we can simply write our own, I suppose.

davidlmobley commented 7 years ago

@jchodera - I can comment more on this later, but one thing I'm unclear on is how one can be sure one is benchmarking accuracy (in an accuracy benchmark) without convincingly demonstrating convergence.
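
To make the convergence concern concrete, here is a minimal sketch of a forward/reverse consistency check. The estimator and time series below are synthetic stand-ins (a real study would apply TI/BAR/MBAR to actual simulation output), so this only illustrates the idea of looking for agreement and a plateau:

```python
# Minimal sketch of a forward/reverse convergence check: re-estimate the free
# energy from growing fractions of the time series, from the start and from the
# end, and look for agreement/plateau. The estimator here is a stand-in (a
# simple mean over a synthetic series); a real study would use TI/BAR/MBAR on
# the actual simulation data.

import numpy as np

rng = np.random.default_rng(0)
series = rng.normal(loc=-5.0, scale=2.0, size=10_000)  # hypothetical per-sample estimates (kcal/mol)

def estimate(block):
    """Stand-in free energy estimator: mean of the per-sample series."""
    return np.mean(block)

fractions = np.linspace(0.1, 1.0, 10)
forward = [estimate(series[: int(f * len(series))]) for f in fractions]
reverse = [estimate(series[-int(f * len(series)):]) for f in fractions]

for f, fw, rv in zip(fractions, forward, reverse):
    print(f"{f:.1f} of data: forward {fw:.2f}, reverse {rv:.2f} kcal/mol")
# Convergence is plausible if forward and reverse estimates agree within their
# uncertainties well before all the data are used.
```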

In the "benchmarksets" paper we dealt with these issues by noting three categories of "hard" (or you might say, "quantitative") benchmarks on systems where you can convincingly demonstrate convergence:

a) Systems to test software implementations and usage -- validating correctness
b) Systems to check sampling completeness and efficiency -- assessing performance
c) Systems to assess force field accuracy

These correspond almost exactly to the three distinctions you laid out.

However, we also distinguished a whole other category of benchmarks, "soft benchmarks", where one might want to look at, say, how well one does at sampling certain effects, or how accurate results appear relative to experiment, WITHOUT convincingly demonstrating convergence. One could ATTEMPT to look at the same types of issues as above, but it won't necessarily work since one won't necessarily know if the results have converged. For example, the "accuracy" test might yield results that depend partly on the force field, partly on the simulation protocol/method, partly on the random number seed used, etc., and may not be reproducible, especially if significant perturbations are made to the system (a significantly different starting structure for example).
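
To illustrate that last point with entirely synthetic numbers (a hypothetical setup, not anyone's actual data), one can see how the apparent error versus experiment shifts between independent replicates when results are not converged:

```python
# Minimal sketch of one symptom of a "soft" accuracy benchmark: the apparent
# error vs. experiment can shift between independent replicates (different
# random seeds / starting structures) if the calculations are not converged.
# All numbers below are synthetic placeholders.

import numpy as np

rng = np.random.default_rng(2024)
expt = rng.normal(-8.0, 1.5, size=20)   # hypothetical experimental dG values (kcal/mol)

mues = []
for seed in range(3):
    rep_rng = np.random.default_rng(seed)
    # Hypothetical calculated values: the "true" signal plus replicate-dependent noise.
    calc = expt + rep_rng.normal(0.0, 1.0, size=expt.size)
    mues.append(np.mean(np.abs(calc - expt)))

print("Per-replicate MUE (kcal/mol):", [f"{m:.2f}" for m in mues])
# If the spread across replicates is comparable to the differences between the
# methods being compared, the apparent "accuracy" ranking is not yet meaningful.
```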

Presumably, many of the comparisons people are now considering fall in this latter category -- for example, the Schrodinger set gives reasonable results in some hands/some tests, but we are not aware of any very careful studies that have examined convergence of the results on that set, so we don't actually know where the apparent accuracy comes from.

Anyway, the point of this repo was just that people are already starting to compile much larger datasets ATTEMPTING to look at accuracy in a way which is beyond the scope of the "benchmarksets" paper (more exploratory) so it seemed good to have a separate repo which provides a playground for that and potentially feeds into the benchmark sets paper or elsewhere. I'm ambivalent about what we call such testing and open to suggestions.

> For the YANK paper, we are focusing on the validation and accuracy sets, and @andrrizzi has already compiled what we believe to be a tractable set with significant utility. The original goal was to amend your paper to include this division and the corresponding datasets, but if you're not amenable to that, we can simply write our own, I suppose.

Yes, I am amenable to that.

jchodera commented 7 years ago

> Anyway, the point of this repo was just that people are already starting to compile much larger datasets ATTEMPTING to look at accuracy in a way which is beyond the scope of the "benchmarksets" paper (more exploratory) so it seemed good to have a separate repo which provides a playground for that and potentially feeds into the benchmark sets paper or elsewhere. I'm ambivalent about what we call such testing and open to suggestions.

How about naming this repository soft-benchmarks, following your own terminology?

I think we'll want to reserve the term "validation" for one of the three categories of hard benchmarks that we are focusing on in the near term (at least for YANK). I do think we'll want to come up with some clear terminology there.

"Validation" is commonly used in software as a term interchangeable with "correctness", so it would be good to avoid that for a soft benchmark of accuracy.

davidlmobley commented 7 years ago

@jchodera - I like those proposals. I think I'm with you that "correctness" might be even better than "validation", partly because it's clearer what we mean by "correctness". (I originally called this "validation" because that was the term Woody was using for it; I hadn't yet attached a clear meaning to it in the review paper and didn't quite know what it should mean.)

I will change the name of this repo to soft-benchmarks.