agrossfield opened this issue 7 years ago
Nature-2017-statistics-discussion.pdf
@agrossfield I was just reading through this valuable comment. You may find the attached paper of interest.
Broadly, you are raising the issue of hypothesis testing, which we don't address at all currently - perhaps because none of us is an expert in this. (I suppose we address the issue a bit implicitly by emphasizing confidence intervals.) We could consider bringing in another author who could address common tests and the issue of p-hacking.
My own uneducated worry about bashing p-hacking is that it is difficult to know up front what observables will be important ... so how is it possible to avoid something that looks like p-hacking?
@dwsideriusNIST and @mangiapasta what do you think? Should we try to bring in someone to address these issues? Suggestions/names?
I'm not sure we should necessarily bring someone new into the fold on this issue specifically.
It seems that part of the problem is that the same data are used to determine the significance of a hypothesis, without an independent test or a new set of data to check that the conclusion is meaningful.
Another way to state the problem is that we haven't tested whether the conclusion is reproducible. To address this, folks often do a "leave-one-out" analysis: they use a portion of the data to assess, e.g., the significance of a prediction or the difference between two models, then use the omitted portion to check that the earlier conclusion actually predicts what the held-out data show. Often this is done iteratively, leaving out a different part of the data set each time. I can find references for this if need be. My explanation probably isn't super clear.
This often goes by the name of cross-validation.
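To make that concrete, here is a rough sketch of the leave-one-out idea in this setting. Everything below is invented for illustration: the "observables" are just random numbers standing in for block averages from two simulations. The point is only the structure: draw the conclusion on part of the data, then check it on the part that was left out.

```python
# Minimal leave-one-out sketch (all data here are made up for illustration).
# Idea: decide "simulation B is larger than simulation A in this observable"
# using all blocks except one, then check whether the held-out block agrees.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for block-averaged values of one observable from two simulations
# (one value per independent, decorrelated trajectory block).
obs_a = rng.normal(loc=1.0, scale=0.5, size=10)
obs_b = rng.normal(loc=1.3, scale=0.5, size=10)

n_blocks = len(obs_a)
agree = 0
for i in range(n_blocks):
    # "Training" data: every block except block i
    train_a = np.delete(obs_a, i)
    train_b = np.delete(obs_b, i)
    # Conclusion drawn from the training data: the sign of the difference
    train_sign = np.sign(train_b.mean() - train_a.mean())
    # Crude check: does the held-out pair of blocks point the same way?
    test_sign = np.sign(obs_b[i] - obs_a[i])
    agree += int(train_sign == test_sign)

print(f"Held-out blocks agreeing with the training-set conclusion: {agree}/{n_blocks}")
```

The "conclusion" being checked could be anything (a significance call, a model ranking, a fitted parameter); the structure is the same, and iterating over which block is held out is what distinguishes cross-validation from a single train/test split.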
Perhaps the 'living article' and GitHub platform/authorship model offer a way to punt. For now, there doesn't seem to be sufficient consensus on whether or how to address p-hacking. If that changes, we can update our article. That is, we should leave this issue open.
I had an interesting conversation at lunch today with Dave Mathews that I thought might deserve discussion here. We were talking about the way analysis of MD simulations is done iteratively, and he pointed out that this is exactly how p-hacking occurs.
Imagine we have two related simulations (or sets of simulations) we're comparing: we try one observable, and the p-value says they're not significantly different, so we try another, and another, and in the end we focus on the quantities that say the two simulations are different. How is that not p-hacking? I know there are ways to handle this correctly (I think you essentially have to lower the p-value threshold for significance with each successive test, though I don't know the theory), but I've never heard of anyone actually doing this in the context of MD.
Does someone with a stronger stats background than mine have a suggestion for the best way to handle this?
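For reference, the "lower the threshold with each successive test" idea is what multiple-comparison corrections such as Holm-Bonferroni formalize. Here is a minimal sketch with invented observable names and p-values, just to show the mechanics; in practice the p-values would come from whatever test was run on each quantity.

```python
# Sketch of the Holm-Bonferroni correction for the "try many observables,
# keep the ones that look significant" situation described above.
# The observable names and p-values are placeholders, not real results.
p_values = {
    "rmsd":          0.04,
    "rg":            0.20,
    "helicity":      0.008,
    "contact_count": 0.03,
}
alpha = 0.05

# Sort by p-value; the k-th smallest is compared against alpha / (m - k),
# so the threshold tightens with every additional test that was run.
items = sorted(p_values.items(), key=lambda kv: kv[1])
m = len(items)
for k, (name, p) in enumerate(items):
    threshold = alpha / (m - k)
    if p <= threshold:
        print(f"{name:14s} p={p:.3f}  threshold={threshold:.4f}  significant")
    else:
        # Holm's procedure stops at the first failure; all remaining
        # (larger) p-values are declared not significant as well.
        print(f"{name:14s} p={p:.3f}  threshold={threshold:.4f}  not significant")
        print("Remaining observables are not significant under Holm's rule.")
        break
```

With these made-up numbers, an uncorrected p < 0.05 cutoff would flag three of the four observables, while the corrected procedure keeps only one. I believe statsmodels.stats.multitest.multipletests with method='holm' implements the same procedure if you'd rather not hand-roll it.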