dmzuckerman / Sampling-Uncertainty

Best Practices article intended for LiveCoMS

Specific observables section #6

Closed dmzuckerman closed 6 years ago

dmzuckerman commented 7 years ago

@mangiapasta and @ajschult - you guys got a great draft going for the specific observables section. I have some questions/comments that hopefully you could address in the next few days:

As I said, this is off to a great start. I'm nitpicking so it can be even better.

ajschult commented 7 years ago

Pathologically bad individual blocks aren't a big problem. If the blocks are small enough, then individual bad blocks are inevitable, but the approach works fine so long as the Taylor series approximation holds. I would assert that propagation of uncertainty does work well for free energy so long as the calculation as a whole is working. I have spent a lot of time trying to figure out how to estimate free energy uncertainties where propagation fails, but in the end, I abandoned those free energy methods in favor of alternatives that yielded estimates that were precise enough for propagation to work. In practice, having a nonlinear example gives an opportunity to point out the limitations.
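
For concreteness, here is a minimal sketch (not from the manuscript) of first-order Taylor propagation for a free-energy-like quantity F = -ln<x>/beta with x = exp(-beta*deltaU); the block averages, beta value, and numbers below are purely illustrative:

```python
import numpy as np

def linear_propagation(block_means, beta=1.0):
    """First-order (Taylor) propagation: sigma_F ~ |dF/dx| * sigma_x
    for F = -ln(<x>)/beta, where x = exp(-beta*deltaU)."""
    x = np.asarray(block_means, dtype=float)
    xbar = x.mean()
    sem_x = x.std(ddof=1) / np.sqrt(len(x))   # standard error of <x> from the blocks
    F = -np.log(xbar) / beta
    sigma_F = sem_x / (beta * xbar)           # |dF/dx| = 1/(beta*x), evaluated at xbar
    return F, sigma_F

# illustrative, uncorrelated block averages of exp(-beta*deltaU)
rng = np.random.default_rng(0)
print(linear_propagation(rng.normal(loc=2.0, scale=0.2, size=20)))
```

The approximation is only as good as the linearization; when sigma_x is not small compared to xbar, this is exactly the nonlinear regime discussed above.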

Bootstrapping -- yes the samples need to be uncorrelated. I thought I mentioned that, but it looks like not! I've added a bit.

Synthetic data actually seems a lot like bootstrapping, specifically parametric bootstrapping. In fact, based on the description, I'd have trouble saying how it is different from parametric bootstrapping. Paul, it'd help if you could describe how it is different (or perhaps it is a specific way to bootstrap).

agrossfield commented 7 years ago

Re bootstrapping samples needing to be uncorrelated: You can still do bootstrapping for correlated data. However, you need to reduce the number of samples you draw to reflect this fact. If you have 10,000 frames in your trajectory, but your correlation analysis indicates you really only have ~100 independent samples, you can estimate uncertainties by drawing sets of 100 data points. It's not especially rigorous, but frankly nothing about bootstrapping is rigorous, and in my experience the estimates are sane. I used this approach in my WHAM code.
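
A rough sketch of that recipe, with made-up trajectory data and an assumed number of independent samples (in practice, n_independent would come from a separate correlation-time analysis):

```python
import numpy as np

def bootstrap_sem(data, n_independent, n_boot=1000, seed=0):
    """Bootstrap the uncertainty of the mean of a correlated time series by
    resampling only n_independent points per replicate."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    boot_means = np.array(
        [rng.choice(data, size=n_independent, replace=True).mean()
         for _ in range(n_boot)]
    )
    # spread of the bootstrap means estimates the uncertainty of the mean
    return boot_means.std(ddof=1)

# e.g. 10,000 correlated frames, but a correlation analysis says ~100 independent samples
traj = np.random.default_rng(1).normal(size=10_000)
print(bootstrap_sem(traj, n_independent=100))
```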

dmzuckerman commented 7 years ago

@mangiapasta and @ajschult - Andrew, thanks for the further explanation. I think the fact that the linear/Taylor propagation of error required some discussion among us (the experts!) suggests it could use a little more clarification. Also, since you are skimming the surface a bit, it would be good to cite references where readers could go for more.

@mangiapasta Paul, I just wanted to point you to Andrew's question above, in case you missed it, about the differences between bootstrapping and synthetic data analysis.

@agrossfield also implicitly raised the issue that the correlation time is really a fundamental thing for all to understand. I will add that to our key definitions.

mangiapasta commented 7 years ago

I think that the synthetic data approach is in fact parametric bootstrapping. We just always called it the synthetic data approach.

The main reason why I prefer the parametric approach over others is that one can perform the analysis on correlated data without needing to generate smaller synthetic sets, provided one can model the correlations. So perhaps it makes sense to merge that section with bootstrapping and emphasize the differences?
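
One possible sketch of that synthetic-data / parametric-bootstrap idea, assuming (purely for illustration) that a simple AR(1) model is adequate to capture the correlations; the model choice, names, and numbers are not prescriptive:

```python
import numpy as np

def parametric_bootstrap_sem(data, n_synth=500, seed=0):
    """Fit a crude correlated-noise model (AR(1)) to the data, generate synthetic
    series from it, and use the spread of their means as the uncertainty of the mean."""
    rng = np.random.default_rng(seed)
    x = np.asarray(data, dtype=float)
    mu, var = x.mean(), x.var(ddof=1)
    phi = np.corrcoef(x[:-1], x[1:])[0, 1]        # lag-1 autocorrelation of the data
    noise_sd = np.sqrt(var * (1.0 - phi ** 2))    # innovation width for the AR(1) model
    means = np.empty(n_synth)
    for i in range(n_synth):
        y = np.empty(x.size)
        y[0] = rng.normal(mu, np.sqrt(var))
        for t in range(1, x.size):                # synthetic correlated trajectory
            y[t] = mu + phi * (y[t - 1] - mu) + rng.normal(0.0, noise_sd)
        means[i] = y.mean()
    return means.std(ddof=1)

# usage: parametric_bootstrap_sem(trajectory_of_observable)
```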



dmzuckerman commented 7 years ago

@mangiapasta Yes, please merge and clarify then.

ajschult commented 7 years ago

@mangiapasta we were also doing parametric bootstrapping before we knew the method already existed. I think a merged section will work fine.

mangiapasta commented 7 years ago

Just to emphasize that the parametric approach can handle correlated data, cf. appendix in this paper


dmzuckerman commented 7 years ago

@mangiapasta Remember, if there's anything you need to clarify to us, it certainly needs to be clarified/referenced in the manuscript.

mangiapasta commented 7 years ago

@dmzuckerman @ajschult Sounds good. If okay with you two, I don't mind taking a first crack at merging / clarifying the bootstrap section. I still need to read over the other sections, so I may have a few more comments. (Also, apologies for being a bit slow to respond; we're in the midst of an internal proposal exercise that is [thankfully] over today)

ajschult commented 7 years ago

@mangiapasta yes, please go ahead with the merge.

dmzuckerman commented 7 years ago

@mangiapasta yes, please go ahead. I would like you and @ajschult to be happy with the section before we turn it over for editing. I have some ideas for intro material for the section, but I'd like the experts to get the technical stuff straightened out.

mangiapasta commented 7 years ago

@dmzuckerman Just to reply to some of your original comments:

"The first subsection addresses 'quality of data', which really means 'overall qualitative behavior of data', I think."

Yours is a better way to paraphrase the content of that section. I'll edit a bit for clarity in light of your comment.

@dmzuckerman @ajschult Dan's comment, "Synthetic data: looks like this should go right after propagation of error since they discuss the same issue if I understand correctly"

I'm not sure how to handle this comment in light of the previous discussion about the connection between synthetic data and the bootstrap. Does this highlight a difference between the two, or can bootstrapping also be used for uncertainty propagation? I've never used resampling methods for this purpose, and generally in my circle we use synthetic data exclusively for the purpose of propagation.

Regarding Taylor expansions, a few folks here generally take the perspective that one should do both Taylor expansions and synthetic-data uncertainty propagation for the purposes of comparing the two. The benefit of this is that one can verify the consistency of the approaches when they agree in the small-noise limit while also assessing the regime of validity for the Taylor expansion. Is it worth mentioning this? Taylor expansions certainly have their place, I would guess for example when numerical propagation is expensive.

dmzuckerman commented 7 years ago

@mangiapasta @ajschult @agrossfield Doesn't bootstrapping implicitly do error propagation since it provides confidence intervals for derived quantities? This issue must be in the Tibshirani book. Do we want to tell readers that bootstrapping/synthetic data are more modern and systematic ways of propagating error compared to linear formulas? Organizationally, the propagation subsection can refer forward to bootstrapping/synthetic as alternatives (or superior methods if that's what you think).

ajschult commented 7 years ago

@mangiapasta Yes, bootstrapping can be used for uncertainty propagation. What I called "simple" bootstrapping can be used when the Taylor series approach fails due to nonlinearity (such as for free energy), which I should probably highlight more clearly.

I consider the Taylor series approach to be the primary means to propagate uncertainty (due to its simplicity), with the caveat that you have to be aware of its limitations. If we're doing something simple, then we just use the Taylor series. If we're doing something more complicated and there are signs of trouble (unable to capture scatter in the data, or unable to describe variation when repeating the calculation), then we would (as you say) try other approaches to see which works best.

ajschult commented 7 years ago

@dmzuckerman Simple bootstrapping should be more robust than the Taylor series approach. Parametric bootstrapping has many of the same problems. It avoids the need to express y=f(x), but can still have trouble with nonlinear functions. In the free energy example, if <exp(-beta deltaU)> = 1 +/- 0.5 and you generate new estimates for that quantity from a Gaussian centered at 1 with a width of 0.5, then you will at some point generate a negative number that causes catastrophic failure. You'd need to be aware that the distribution of the average exponentials is not Gaussian and use a different model to generate data (and your results might still be sensitive to the choice of model).
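
A small numerical illustration of that failure mode, using the assumed values <exp(-beta deltaU)> = 1 +/- 0.5 and beta = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic estimates of <exp(-beta*deltaU)> drawn from a Gaussian N(1, 0.5^2)
synthetic = rng.normal(loc=1.0, scale=0.5, size=10_000)
print("fraction of negative draws:", (synthetic < 0).mean())        # ~0.023
with np.errstate(invalid="ignore"):
    free_energy = -np.log(synthetic)                                 # beta = 1
print("non-finite free-energy samples:", np.sum(~np.isfinite(free_energy)))
```

Roughly 2% of the draws are negative, so the log blows up and the bootstrap distribution of the free energy is ruined.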

dmzuckerman commented 7 years ago

@ajschult Thanks for explaining. Please clarify in the article as needed.

ajschult commented 7 years ago

I've tried to address most of what we've talked about here. I'll need to make a second pass once @mangiapasta folds in the synthetic data section.

ajschult commented 7 years ago

I also noticed that we don't have anywhere in the document that discusses repeating simulations to either check or compute uncertainties (except to mention this idea in the definition of precision). I could add a sentence or two to the block averaging section (it's a lot like using the whole simulation as a block). Or it could go in a new section. I see that we originally planned a "basics" section (how to report, what goal to shoot for, significant figures) and it could also live there if we do that.

agrossfield commented 7 years ago

Agreed. That's actually far and away the best way to estimate uncertainty (you get to use the classic statistical formulas!), but there is a catch. If you're going to claim the individual trajectories are independent measurements, you need to make the starting structures as different as you can. What that means is system-dependent. For a protein, you need to start with the native state, but you could rebuild the water and environment. If it's a liquid or membrane simulation, you need to rebuild the box with a new random seed. For materials, there's probably something else you can do.

I'm not going to have a chance to work on this this weekend, which is why I wrote the long paragraph here. Can somebody put something like that in?
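
A minimal sketch of the multiple-independent-runs approach Alan describes (the run means below are invented for illustration):

```python
import numpy as np

def mean_and_sem(run_means):
    """Overall estimate and standard error of the mean from N independent runs."""
    x = np.asarray(run_means, dtype=float)
    return x.mean(), x.std(ddof=1) / np.sqrt(len(x))

# observable averaged over each of five independently prepared trajectories
runs = [1.02, 0.97, 1.05, 0.99, 1.01]
print(mean_and_sem(runs))
```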


ajschult commented 7 years ago

OK, the basics section exists and discusses computing uncertainties from multiple simulations. The original outline suggested that this section would include "what to shoot for", but I'm not sure how to prescribe a target uncertainty.

mangiapasta commented 7 years ago

@ajschult @agrossfield The dark uncertainty method actually does provide a way to estimate uncertainties on the basis of multiple simulations. It might make sense to fold the relevant sections together.

mangiapasta commented 7 years ago

I merged the Bootstrap / synthetic data section.

Below I've written some general comments / thoughts on this section. I'm happy to make changes now, let others do it, or hold off until a later stage, but I wanted to get my thoughts down now.

1) Title: I would change "Computing error in specific observables" to "Computing uncertainty in specific observables."

2) If we use a standard document / set of documents as a source of definitions, we should state that upfront.

3) In that vein, I would change the first instances of "standard error" to "standard uncertainty."

4) In the bootstrap section, I added Andrew's example about the pitfalls of using a Gaussian to model free energies. It would be good if someone gives that a thorough look-over to make sure I didn't say anything stupid (I don't typically work with free energies).

dmzuckerman commented 7 years ago

@mangiapasta - thanks for all that. some replies ...

  1. Title: I would change "Computing error in specific observables" to "Computing uncertainty in specific observables." ** seems fine
  2. If we use a standard document / set of documents as a source of definitions, we should state that upfront. ** do you have a suggestion for this?
  3. In that vein, I would change the first instances of "standard error" to "standard uncertainty" ** i think standard error (of the mean) is well known. i'm not familiar with standard uncertainty ... are you trying to get at the difference between a confidence interval and the std err? certainly that is important!
  4. In the bootstrap section, I added Andrew's example about the pitfalls of using a Gaussian to model free energies. It would be good if someone gives that a thorough look-over to make sure I didn't say anything stupid (I don't typically work with free-energies). ** hopefully @ajschult can have a quick look

ajschult commented 7 years ago

@mangiapasta @dmzuckerman yes, the bit about applying parametric bootstrapping to free energy looks good to me.

agrossfield commented 7 years ago

“Standard error” is the technical term from statistics



mangiapasta commented 7 years ago

Re: documents for source of definitions. Dan Siderius had a pretty comprehensive list on the thread about terminology / NIST internal review. I attached the "Guide to the expression of uncertainty in measurement" (GUM), which is one of our references for issues like this.

So, for example, standard uncertainty is defined on page 3 as "uncertainty of the result of a measurement expressed as a standard deviation." Note 3 on p. 36 states, '“Experimental standard deviation of the mean” is sometimes incorrectly called standard error of the mean.'

So, for example, in the first paragraph of the text, it might make sense to change, "...standard error. The standard error is the standard deviation of the distribution of the results that would be obtained by repeating the simulation."

to

"...standard uncertainty. The standard uncertainty is an estimate expressed as a standard deviation ( e.g. over the result of many simulations) -- of the width of the true distribution in a prediction," or something similar.

Maybe we can even just use the definition straight from GUM.

JCGM_100_2008_E (1).pdf

dmzuckerman commented 7 years ago

@mangiapasta @ajschult @agrossfield I would like to get your input on something sorta important ...

Currently the draft says, "When reporting these estimates, it is important to also provide an estimate of the uncertainty of the result, typically as a standard error." On the one hand, this is perfectly reasonable, and once you know what is being reported, you essentially have all the info you need ... but on the other hand, I feel that folks will confuse this single std err with a healthy confidence interval - which it is not. I'm not optimistic that our readers (and equally importantly, the readers of their papers) will convert from uncertainty scale to confidence interval in their minds. And anything shown visually (like error bars!) leaves a lasting impression regardless of any more careful wording that may be in a paper. So I would say we should encourage folks to show plus/minus TWO standard errors (std uncertainties).

All this was recently brought home to me when we were doing some validation studies comparing two data sets that should have agreed within noise, but the error bars only overlapped when we used two standard errors.

I wanted to get feedback on this before I edit the document. If folks agree (or even if not) I can make a figure from the data I mentioned to illustrate the issue.

mangiapasta commented 7 years ago

I would actually argue a slightly different perspective, namely that the concept of a "healthy" confidence interval can't be divorced from a particular application or conclusion that one is trying to draw from the data.

I think degenerate cases illustrate the point.

If I'm trying to design an airplane wing using simulation, then my uncertainty bounds had better be conservative. I want to be 99.999% sure that the wing isn't going to break in flight. So I might pick a max-min confidence interval, or perhaps 5-sigma. Admittedly this may be an unrealistic application for MD simulation, but I seem to recall that in astronomy, for example, 5-sigma is the minimum uncertainty bound that the community accepts as the basis for concluding that a measurement is statistically significant.

On the conservative side, I find Chebyshev's inequality useful for deciding what is a lower bound on uncertainty associated with a given sigma value. See, e.g. the first few paragraphs of https://en.wikipedia.org/wiki/Chebyshev%27s_inequality . The 68-95-99.7 rule is only valid for Gaussian distributions, which do not apply to all physical phenomena.
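
For reference, a small script comparing the distribution-free Chebyshev lower bound with the exact Gaussian coverage for k-sigma error bars; this is just the arithmetic behind the inequality and the 68-95-99.7 rule, nothing specific to the manuscript:

```python
from scipy.stats import norm

for k in (1, 2, 3, 5):
    chebyshev = max(0.0, 1.0 - 1.0 / k ** 2)   # distribution-free lower bound on coverage
    gaussian = 2.0 * norm.cdf(k) - 1.0         # exact coverage if the data are normal
    print(f"k = {k}: Chebyshev >= {chebyshev:.3f}, Gaussian = {gaussian:.4f}")
```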

On the other hand, if one is only trying to elicit trends in data, one sigma may be perfectly reasonable. I've done this in materials design problems where we only care about rank-ordering properties of different materials.

If I'm just reporting a generic uncertainty, e.g. for archival data, I usually go with 3-sigma. By Chebyshev's inequality, that corresponds roughly to a 90% confidence interval.



dmzuckerman commented 7 years ago

@mangiapasta Good points - especially about the purpose driving the approach.

Part of me would love to advocate for 3-sigma, but in the biomolecular world, that might invalidate all conclusions ever reported! (Only half joking.) In any case, we can explain the inequality issue.

FYI in the biomolecular world, a common goal is to compare two 'conditions' (a protein bound to a ligand or not; a wild-type vs a mutant protein; high [salt] vs low; ...) so conservative confidence intervals are indeed needed to enable confident distinctions.

@agrossfield - I'd love to get your view when you have the time. This is our chance to set a meaningful standard for our field. At a minimum we need to explain the issue carefully, I think.

agrossfield commented 7 years ago

I’ll write more when I have time, but part of me would be happy if people actually reported anything meaningful in terms of standard error. The most common thing still is just the standard dev from the sim itself, which is worthless as a convergence measure. If someone reports +/- 1 se (e.g. computed from multiple trajectories), I’m happy, because I can mentally double it and see the difference.

I think it’s much more important to focus on getting people to actually compute a meaningful standard error, rather than a wish-it-was standard error. :)

Alan


ajschult commented 7 years ago

I prefer reporting 1 sigma as the uncertainty. I think of the error bars as an indication of the variability of the data rather than a bound on the true value (because there is no bound). 1 sigma is a far better description of the variability than 3 or 5 sigma. Beyond that, conservative error bars just create a different problem. Someone looks at two values (say 6 +/- 4 and 9 +/- 4) and says they agree because they're within each other's error bars. But they actually disagree because sigma=1 and "4" is actually 4 sigma.
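
To make the 6 +/- 4 vs 9 +/- 4 example concrete (assuming, as stated, that the "4" is really 4 sigma, i.e. sigma = 1 for each value):

```python
import math

x1, s1 = 6.0, 1.0   # value and its 1-sigma uncertainty
x2, s2 = 9.0, 1.0
z = abs(x1 - x2) / math.sqrt(s1 ** 2 + s2 ** 2)   # difference in combined-sigma units
print(f"difference is {z:.2f} sigma")             # ~2.1 sigma: the values likely disagree
```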

I do think it would be helpful to provide more explanation about confidence limits and how they relate to sigma. We currently only mention it in passing as part of the "Best Practices" checklist. And we can say that reporting something other than 1 sigma is an option, but that you should always explain what the reported uncertainties are (1 sigma, 3 sigma, etc).

agrossfield commented 7 years ago

Beautifully said, Andrew.



dwsideriusNIST commented 7 years ago

All, I've stayed out of this thread for various reasons. To get a digest version, can you clarify whether the group is leaning toward declaring an "error bars mean X (and only X)" statement in the paper?

My issue is that error bars are meaningless in the absence of stated terminology (see https://github.com/dmzuckerman/Sampling-Uncertainty/issues/9)

mangiapasta commented 7 years ago

Fair point. It seems that we're conflating several issues here (or perhaps anticipating that junior scientists will...?).

Based on the previous comments, it seems to me that we are conflating the choice of error bar (e.g. 1-sigma, 2-sigma, etc.) with an assessment of statistical significance. That is, if two data points have overlapping error bars, is the difference between them statistically significant or not? One can always change the error bars so that there is much overlap or none at all. So, in that sense, error bars are not statements about statistical comparison per se. Maybe this is what Andrew was saying earlier.

On the other hand, Dan's comment is consistent with Alan's perspective (I think -- @agrossfield, chime in). A lot of the time when we read papers, we take error bars as a proxy for statistical comparison. The latter can in fact be a complicated task, because one has to take into account what type of uncertainty is being reported, whether to interpret standard deviations using the 68-95-99.7 rule or Chebyshev's inequality, etc. If we don't take the time to consider these issues, we may be misled by the error bars. This is probably more of a problem for junior readers.

So, @dwsideriusNIST I think we're trying to answer exactly your question. Do we take error bars to always be 1-sigma because error bars are just a convention to represent uncertainty? Or, do we suggest that error bars should reflect the underlying statistical comparison that we are trying to make with the uncertainty quantification?

@dmzuckerman Is this the question you were asking?

agrossfield commented 7 years ago

I agree with what you wrote, but my point was even simpler. If we’re going to introduce a new term (standard uncertainty) to replace an earlier one (standard error), which is itself often used wrong, we have to not only tell people what the correct way to say it is, but also what is often meant when using the term incorrectly, especially because a lot of papers are not clear about what they did.

Said another way: we need to equip people to contribute to literature in the best way, while also helping them digest the existing literature, warts and all.



dmzuckerman commented 7 years ago

I agree with @agrossfield that we need to be careful with terms and err on the side of being overly explanatory and repetitive. I don't think it's sufficient to state the definition once and assume readers will get it.

@mangiapasta wrote "Do we take error bars to always be 1-sigma because error bars are just a convention to represent uncertainty? Or, do we suggest that error bars should reflect the underlying statistical comparison that we are trying to make with the uncertainty quantification? @dmzuckerman Is this the question you were asking?"

Yes, more or less. I'm thinking about statistical comparisons (both in a given paper and perhaps performed post facto by another simulator or experimentalist) and people's understanding. We have to assume that a good chunk of the ultimate consumers of reported error bars (i.e., our readers AND THEIR READERS) will not be energetic or sophisticated enough to extrapolate from one sigma, as trivial as that may seem to us.

So my own initial thought would be to recommend reporting 2*sigma (and also to say so!) ... and let the experts extrapolate back to one sigma if they like. The experts will always be ok. I worry about subtly promoting the continued mis-use of simulation data by giving our official blessing to misleadingly small error bars.

mangiapasta commented 7 years ago

@dmzuckerman

"So my own initial thought would be to recommend reporting 2*sigma (and also to say so!) ... and let the experts extrapolate back to one sigma if they like. The experts will always be ok. I worry about subtly promoting the continued mis-use of simulation data by giving our official blessing to misleadingly small error bars."

I think you and Alan are correct to point out the social engineering aspects of this document. So I'd agree with your recommendation, provided we emphasize the distinction between error bars and statistical comparison. To the extent that we can point readers to the fact that there is a "higher level" of thinking (that sounds really hoity-toity ... sorry) with regard to statistical comparison, I think we only help the community. This emphasis need not be overly drawn out, but I do think it would be good to mention some of the issues we've written about in this thread.

dmzuckerman commented 7 years ago

From today's discussion

dmzuckerman commented 7 years ago

@dwsideriusNIST @mangiapasta @ajschult @agrossfield Today I really sweated and updated the specific observables section following our discussion ... to recommend using 90% confidence intervals. I think this is the most important section of the paper (after the definitions, of course, and the quick-and-dirty checks which will eliminate 95% of studies) so please read it when you have time and critique/edit. Thank you! --Dan

dmzuckerman commented 7 years ago

Paul @mangiapasta, would you please see whether you agree that the section you drafted (7.3, now called 'Assessing qualitative behavior of data') should be moved to the 'quick and dirty' section?

Also, I would appreciate your review of the sections I revised/added

Thank you!

mangiapasta commented 7 years ago

I'll take a look today and get back to you soon.



mangiapasta commented 7 years ago

I'm fine with merging the "qualitative behavior of data" section with the quick and dirty section. But can we perhaps rename the latter? "Quick and dirty" sounds a little pejorative, like it shouldn't be taken too seriously. Maybe something like "Preliminary checks that can rule out..." or similar?



dmzuckerman commented 7 years ago

@mangiapasta you make a very good point - I agree folks might not want to invest time into something quick and dirty!

What if we call it "Essential Qualitative Analysis" or "Essential Preliminary Analysis"? I like having the word 'essential' (or similar) in there to indicate the importance.

mangiapasta commented 7 years ago

I'm fine with some variation on that. As an alternative, you might also consider something like "Initial data screening." I like the word screening, if only because it suggests filtering out bad data, but (hopefully) in a systematic way.



mangiapasta commented 7 years ago

Overall I like section 7.1 but thought it needed a bit of reorganization.

Basically, it seems to me that there are two main issues you are trying to address.

First, why should we use confidence intervals instead of standard uncertainties? It sounds like there are two related answers to this:

A) Confidence intervals are more relatable to everyday experience because they reflect a frequentist approach to uncertainty (i.e., we expect an outcome 90% of the time).

B) Confidence intervals don't suffer from under- (or over-) estimating uncertainty. Admittedly, I added the part about overestimating, but I think that can be as much of a sin as underestimating.

Second, why are we emphasizing the issue of how to report error bars so much up front?

Here it seems that your argument is that we are dealing with the intersection of a societal and a scientific issue. The main points seem to me to be that (i) error bars are necessary to accurately assess work, and (ii) even accurate but (visually) misleading error bars can be detrimental if readers aren't careful.

Is this an accurate summary of what you had in mind? I reorganized the text to (hopefully) conform to this structure. I pasted it below instead of making changes in the main file. Let me know your thoughts and/or edit as appropriate. I'll look at the other section tonight/tomorrow.

Also, I'd like to drop the word "reproducibility" in here somewhere, since indicating the level of reproducibility of a result is one of the goals of UQ.

Paul

``What error bar should I report?'' Here we address this simple but critical question.

In general, there is no single best practice for choosing error bars. However, in the context of simulations, we can nonetheless identify common goals when reporting such estimates: (i) to help authors and readers better understand uncertainty in data; and (ii) to provide readers with realistic information about the reproducibility of a given result.

With this in mind, we recommend the following: (a) in fields where there is a definitive standard for reporting uncertainty, the authors should follow existing conventions; (b) otherwise, such as for biomolecular simulations, \emph{authors should report (and graph) their best estimates of 90\% confidence intervals.} As explained in the glossary above, a 90\% confidence interval is a range of values that is expected to bracket 90\% of the computed predictions \emph{if statistically equivalent simulations are repeated a large number of times.}

We emphasize that as opposed to standard uncertainties (reported as a standard deviation $\sigma$), confidence intervals have several practical benefits that justify their usage. In particular, they directly quantify the statistical frequency with which we expect a given outcome, which is more relatable to everyday experience than moments of a probability distribution. As such, confidence intervals can help authors and readers better understand the implications of an uncertainty analysis. Moreover, downstream users/readers of a given paper may include less statistically-oriented readers for whom confidence intervals are a more meaningful measure of variation.

In a related vein, error bars expressed in terms of $n$ $\sigma$ can be misinterpreted as unrealistically under- or overestimating uncertainty if taken at face value. For example, reporting $3$ $\sigma$ uncertainties for a normal random variable amounts to a $99.7$ \% confidence interval, which is likely to be a significant overestimate for many applications. On the other hand, $1$ $\sigma$ uncertainties only correspond to a $68$ \% confidence interval, which may be too low. Given that many readers may not take the time to make such conversions in their heads, we feel that it is safest for modelers to do this work up front.

In recommending 90 \% confidence intervals, we are admittedly attempting to address a social issue that nevertheless has important implications for science as a whole. In particular, the authors of a study and the reputation of their field do not benefit in the long run by under-representing uncertainty, since this may lead to incorrect conclusions. But perhaps just as importantly, many of the same problems can arise if uncertainties are reported in a technically correct but obscure and difficult-to-interpret manner. For example, 1 $\sigma$ error bars may not overlap and thereby mask the inability to statistically distinguish two quantities, since the corresponding confidence intervals are only 68 \%. With this in mind, we therefore wish to emphasize that visual impressions conveyed by figures in a paper are of primary importance. Regardless of what a research paper may explain carefully in text, error bars on graphs create a lasting impression and must be as informative and accurate as possible. If 90\% confidence intervals are reported, the expert reader can easily estimate the smaller standard uncertainty (especially if it is noted in the text), but showing a graph with overly small error bars is bound to mislead most readers -- even experts who do not search out the fine print.
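
As a possible companion to the draft above, a minimal sketch of converting a standard uncertainty (standard error of the mean) into the recommended 90% confidence interval, assuming a Gaussian sampling distribution; with only a few independent runs, a t-based coverage factor (discussed later in this thread) would be more appropriate:

```python
from scipy.stats import norm

def ci90(mean, sem):
    """90% confidence interval from a mean and its standard uncertainty (SEM),
    assuming a Gaussian sampling distribution."""
    k = norm.ppf(0.95)            # ~1.645 for a two-sided 90% interval
    return mean - k * sem, mean + k * sem

print(ci90(mean=10.0, sem=0.3))   # illustrative numbers only
```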



mangiapasta commented 7 years ago

Overall I think section 7.6 is fine. The only change I made was to edit one of the table entries, which was originally k=1.03; I think it should have been 1.73, interpolating between the two values surrounding it. It would be good to check that I put in the correct value. Otherwise the section reads well to me. I think it's fine to suggest that folks show the data for 5 or fewer values and start using the t-distribution for 6 or more data points.
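
For reference, a sketch of where coverage factors in that neighborhood can come from, if the table lists two-sided 90% coverage factors from the Student t distribution (an assumption on my part; the table's actual confidence level and layout should be checked against the manuscript):

```python
from scipy.stats import t

# two-sided 90% coverage factors for N independent runs (N - 1 degrees of freedom)
for n_runs in (3, 4, 5, 6, 10, 20, 60):
    k = t.ppf(0.95, df=n_runs - 1)   # one-sided 0.95 quantile = two-sided 90% factor
    print(f"N = {n_runs:2d}: k = {k:.2f}")
```

Under that assumption, k is about 1.73 near 19-20 degrees of freedom, which is in the same neighborhood as the 1.72-1.73 values discussed here.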

dmzuckerman commented 7 years ago

@mangiapasta thanks a lot. I looked up that number - happens to be 1.72 - and fixed it. Good catch.

dmzuckerman commented 7 years ago

@mangiapasta somehow I missed that whole revision of Sec. 7.1 in this thread! I think your clarifications/edits are excellent. The one quibble I have is with this wording: "standard uncertainties (reported as a standard deviation $\sigma$)." Since we are (I think) referring to what our readers will know as the std err of mean, we should say so. A lot of readers won't have the statistical sophistication to appreciate that the std err is just a std dev (for the mean) and so saying 'reported as a std dev' will be very confusing. If we don't say it's the std err, perhaps just omit further explanation. I had said something about the scale of uncertainty, but that may also not make sense to some ... it's physics-y jargon.

Anyway please go ahead and push your edits to the master. Thanks a lot for doing that so thoughtfully.

mangiapasta commented 7 years ago

Sounds good. I'm fine with the changes you suggest. I'll push edits to the master today



mangiapasta commented 7 years ago

@dmzuckerman

I read your comment more carefully and wanted to follow up on the main point. It's not clear to me that we're interested in the standard error of the mean every time we report an error bar or confidence interval. That's why I originally wrote "standard uncertainties (reported as a standard deviation $\sigma$)." I think Andrew Schultz at one point mentioned the perspective that error bars can be useful for showing spread in the data, which would just be the standard deviation of the data itself. What are your thoughts?

For the time being, I've left the revision as is but will go in and change the language if appropriate.

Thanks