lwjohnst86 opened 4 years ago
Hey @lwjohnst86, sorry for the delayed response, but that's a major discussion and chances are this issue will remain open for a looong time.
My current personal opinion is the following. While I definitely agree on the problems of "significant" and the general misuse of the p-value and the significance concept, I am not sure that simply removing interpret_p
from the output is a good option.
One practical reason is that the goal of the package is to provide a textual interpretation for all indices. Of course, this is an inherently problematic step, and people can disagree on interpretations (cf. the abundance of guidelines for indices of effect size or indices of fit). So until the p-value completely ceases to be reported (in tables etc.), report should still reformat it, as it does for the other indices.
The question then becomes, what is the best way of reformulating it. We agree that "significant" is not the right term for the p-value; it's very misleading and also not relevant with regard to the information it provides (cf. Makowski et al., 2019). That being said, what are the established alternatives? Although most researchers know it's not perfect, it's still the accepted and common way of reporting it. Again here, I'd wait for the establishment of new guidelines (and we are actively working for that to happen).
One might then say, "but this package should be cutting edge and set the right way instead of continuing old-fashioned ways". While we try to push the boundaries of good practices in general in easystats, and also in report by facilitating complete description of things that are usually left out, I feel like this isn't the right time for the p-value interpretation to be removed. But this is something we are monitoring and pushing for, so as soon as a good alternative is accepted, we'll be the first to adopt it.
Part of my reasoning comes from the lme4 package example. Its creator, by far one of the best statisticians out there, decided to speed up the shift away from the p-value (as you pointed out, this has been recommended for decades) and simply removed the p-value from the output of his package (note that he had additional arguments because of other issues related to the estimation of p-values in mixed models, but still).
Critically, he explained in many posts why he did it, and how people should adjust to a post-p world. He was right, and everybody knew it. And yet, the first thing that happened is that a package came out providing the missing p-values and everybody started reporting p-values again. Interestingly, in his new implementation of lme4 in Julia (MixedModels.jl), Bates now displays p-values by default.
Long story short, there is a time for everything. I endorse for that matter a Kuhnian perspective: people will drop the p-value once a satisfactory alternative is available (as I do believe that some decision index and threshold is still needed in [some fields of] science).
What do you think?
@DominiqueMakowski thanks for the detailed response.
I agree it is a major discussion. But I think you may have misinterpreted what I was requesting. My opinion is to go with the guidelines laid out by the American Statistical Association (arguably the leading authority on this topic, one that all other guidelines should follow given, well, they are the experts). I'm not requesting that the p-value be removed. The change is fairly simple and is what the ASA recommends as the "alternative": remove any "implied or assigned judgement" about the number and simply report the p-value as is.
So, if the p-value was 0.0012, the textual report would be:
"The association had an estimate of ## (p = 0.0012)."
That way, using your words, you "facilitate complete description of things that are usually left out": you no longer "leave out" results by interpreting p > 0.05 as "non-significant" and assuming "no effect", which in reality is not a valid interpretation. Not reporting these things because of some arbitrary threshold is, in my view, exactly the kind of thing that gets "left out".
And while I agree with lme4's creator on the inherent (lack of) value of the p-value, I want to reinforce that my recommendation is not at all to remove the p-value. My recommendation is to remove our assigned judgement of what the value means and let the reader decide for themselves. So, as another example, if the p-value is 0.09, the textual report would be something like:
"The association had an estimate of ## (p = 0.09)."
You remove the judgement of "causal association" or "true effect association" and instead focus on interpreting all the numbers together in the context of the research question and study design. You as the researcher make the judgement and do the "detective work" in the discussion, rather than letting some arbitrary threshold that is inappropriately used and widely misunderstood get in the way of good scientific thinking and critique. Leave the "decision index" for the specific discipline, rather than within a package that I assume you want to be more general and usable for a broader audience.
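To make this concrete, here is a minimal R sketch of what judgement-free formatting could look like. This is not report's actual implementation; format_p_plain() is a made-up helper purely for illustration.

```r
# Minimal sketch of judgement-free p-value reporting; format_p_plain() is a
# made-up helper for illustration, not part of report or effectsize.
format_p_plain <- function(p, digits = 3) {
  # below the printable precision, report an inequality rather than "p = 0"
  if (p < 10^-digits) {
    sprintf("p < %s", format(10^-digits, scientific = FALSE))
  } else {
    sprintf("p = %s", formatC(p, digits = digits, format = "f"))
  }
}

format_p_plain(0.0012)  # "p = 0.001"
format_p_plain(0.09)    # "p = 0.090"
```

The point is simply that the number is printed as-is, with no "significant"/"non-significant" label attached.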
Just to chime in here, I believe there are two independent but related discussions occurring here. I'm familiar with both papers you suggested. Though there's much more nuance to the perspectives, for brevity I believe there are two philosophies being discussed here:

1. Abandon reliance on any "decision index" or significance threshold entirely, and interpret the evidence as a whole.
2. Retain threshold-based decisions, but choose and apply the thresholds more carefully (e.g., adapted to the discipline and context).
Personally, I'm in favour of the former philosophy and would suggest reading McShane et al. (2019) - an excellent piece which argues that we should be abandoning reliance on any "decision index". However, I don't think that is entirely practical for the scope of report, much to my dismay.
Therefore on a philosophical basis I don't agree with you that:
Leave the "decision index" for the specific discipline, rather than within a package that I assume you want to be more general and usable for a broader audience.
...Because I think thresholds should be removed entirely. Though I see that you are also arguing for the reader to interpret the evidence themselves, which overall fits better with point 1; however, there are inherent challenges to this approach that are beyond the scope of debate on this particular issue.
However, what I think you are actually arguing for (in part) is not automatically attaching a significance interpretation to the 0.05 threshold. In some ways, it doesn't really fit either of the perspectives perfectly. Pragmatically speaking, I do think it makes a lot of sense to have a customizable significance threshold within report, forcing the user to make a decision. However, with regards to completely changing report to remove threshold-based NHST, I would agree with @DominiqueMakowski and say it's probably not timely (although I would love to see it).
This was an extremely rushed post, apologies for any typos/misunderstanding
McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon statistical significance. The American Statistician, 73(sup1), 235-245.
Interestingly, in his new implementation of lme4 in Julia (MixedModels.jl), Bates now displays p-values by default
Maybe no one is using Julia and hence there's no package that calculates the p-values when requested, so he had to bite the bullet and implement p-values? ;-)
One of the interesting parts of this general discussion is, imho, that you can recommend Bayesian methods. Because then you have a posterior distribution, which allows you to derive probability statements about an effect being positive, negative, or larger than X. Now, when the sample size is large enough, you get (almost) the same point estimates and uncertainty (CI) with ML (frequentist) estimation, so essentially you can derive the same inferences. Thus, no matter whether Bayesian or frequentist, the most important thing is to describe the size of an effect (is it (clinically) "significant"?) and its uncertainty. I think this is also one of the conclusions from our paper, and especially authors like Sander Greenland, Valentin Amrhein, or also Andrew Gelman (among others) have contributed a lot of good things to this discussion...
Still difficult to decide now what this will mean for report... We may change the wording, or we may add a "Greenland" option and a default option, the first removing that wording, the latter including "significant"...? In either case, I think we should include these discussions in the documentation of the functions, to raise awareness. Any step that raises awareness of the misuse of NHST is useful, I'd say. How radical our implementation in report will be, we will see; and we are always open to becoming more "radical" and removing p-values entirely later on. ;-)
Suggest replacing "significant" with "reliable."
Or "statistically significant" or "statistically reliable." Also, I'm not sure the passive formulation has been fixed by "can be considered significant" is vague (who considers? significant in what way?). I'd go with "is reliable" or "is not reliable," though the choice of wording is easy enough for the user to amend by hand. Would be nice to implement for BRMS one day, no doubt in the works...
In theory, I agree about replacing "significant", which is a poor choice of words. Now, about changing that in report right away, here's my perspective:
Not to sound too Marxist or such, but I think the evolution of scientific guidelines and methodological best practices is a struggle between different powers. Some of them (e.g., established old-school figures) embody a conservative force (slowing and constraining the "progress"), some (e.g., young and impetuous minds) are ready to take and impose very radical changes "for the greater good", and some others represent in-between authority instances trying to reconcile these different movements into solid and coherent standards (e.g., societies like the APA and their rules). Importantly, I think that these strong dynamics at play are healthy, and all sides of this battle are useful in some way or another, as powers and counter-powers.
In this context, my colleagues and I have sometimes been active agents of change (i.e., on the progressive side of things), for instance by publishing papers pushing for new guidelines in Bayesian stats. We are also very involved in documenting (creating or sharing tutorials, posts, tweets) why the "old way" of doing stats is bad and how to transition to the new way. Many of our packages allow for that (if they are not outright designed for it). Which brings me to the case of report.
Contrary to some other productions of ours, I think that its place is more on the "accepted tradition" ground: less subject to our own ideologies and desires, but a reference on which users can safely rely. That's why the goal is, for instance, to first implement APA's standards by default. There could be some exceptions, but even though such standards are sometimes imperfect, or evolve too slowly, I think that report will gain from being an organ of such independent reference, rather than a tool for us to spread what we believe is the best way of doing things (because sometimes we might be wrong, p > .1). In other words, we will continue to push for progress and improvement in statistical and methodological practices by creating, providing and sharing information, with the hope that our ideas permeate into "accepted" guidelines (and end up in report).
Though again, flexibility is key, and I'm not advocating blindly and rigidly following this or that even when it obviously doesn't make sense. Also, we could totally envision some argument controlling which "reference" to use, with the inclusion of some more edgy and radical guidelines. To circle back to the "significant" term issue, I think for now it would be premature to change it, as the vast majority of users still expect it. I think the first step is to help people understand how this term is problematic and how it carries with it a lot of negative connotations, and indeed to suggest alternatives that will hopefully someday be made official.
Not to sound too Marxist or such
You guys are doing great work. Just getting all the results coming out as text is a huge time saver. BRMS or MCMCglmm integration would be nice one day, but you're already doing lots... thanks!
We have rstanarm support, so brms should be easy to integrate. We'll look at it, but can't promise for initial submission.
If you like, you can try the latest report version from GitHub, which should support brmsfit models. There might be some functions that yield an error; this is mostly because you need to define the ci_method argument. So our work is mostly to add functions that call the main functions with the ci_method argument set...
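For anyone wanting to try it, a rough sketch of what this looks like. It assumes the development version from the easystats/report repository, and whether ci_method needs to be passed explicitly may vary between functions and versions.

```r
# Sketch only: install the development version of report from GitHub and
# describe a brms model. Repository name and argument passing are assumptions.
# install.packages("remotes")
# remotes::install_github("easystats/report")

library(brms)
library(report)

# a small illustrative Bayesian model
fit <- brm(mpg ~ wt + cyl, data = mtcars, refresh = 0)

# textual description of the model; some functions may require an explicit
# ci_method (assumption based on the comment above; check the current docs)
report(fit)
```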
Wahoo, thanks Daniel!
My perspective here is that I've never reported a p value in a paper ;-)
A useful compromise I think would be a set of arguments to control which parameters are interpreted, such as to turn off standardized parameters or p values. (Personally, I'd argue that p value interpretation should be disabled by default, but that's probably not an argument I would win.)
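To illustrate, something like the following purely hypothetical interface would let users opt out of specific interpretations. Neither argument name exists in report; they are placeholders for the kind of switches described above.

```r
# Purely hypothetical sketch: these argument names do not exist in report,
# they only illustrate the kind of switches discussed above.
library(report)

model <- lm(mpg ~ wt + cyl, data = mtcars)

# report(model,
#        interpret_p   = FALSE,  # hypothetical: drop "significant"/"not significant" wording
#        interpret_std = TRUE)   # hypothetical: keep interpretation of standardized parameters

report(model)  # current behaviour, for comparison
```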
Describe the solution you'd like
Don't use the wording "significant" when describing p-values. See the American Statistical Association statement (https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1154108) and this Nature paper (https://www.nature.com/articles/d41586-019-00857-9) for more detail. This is something that has been stated by statisticians for decades. A threshold-based measure of "significant" is inappropriate, leads to bad science and "p-hacking", and potentially harms health (for studies done in humans and with disease). It should especially not be included in any automated reporting. "Significance" should be determined by humans in a context-dependent manner (e.g. higher thresholds for GWAS studies, very high thresholds for particle physics experiments, etc.).
How could we do it?
Remove use of interpret_p() from the effectsize package and from this package.

Thanks :)
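For reference, this is roughly the behaviour in question. The output shown in the comments is a guess; the exact wording and default threshold rules depend on the effectsize version.

```r
# Roughly the behaviour this issue is about (output shown as a guess; the
# exact wording and default rules depend on the effectsize version).
library(effectsize)

interpret_p(0.0012)
#> presumably something like "significant"
interpret_p(0.09)
#> presumably something like "not significant"
```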