I'm drafting a function, cet(), and reading that paper again. I think it's very similar (if not the same) to Kruschke's equivalence testing. What I don't exactly understand is: how is delta calculated in this figure?
https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0195145.g001
The delta in this figure is the ROPE in equivalence_test(). The major difference I see is that with an approach similar to equivalence_test(), we would be more conservative than the suggestions in the CET paper, if I understand it right. But that's no disadvantage.
My impression is that the method the authors suggest is a bit complicated and less intuitive. See the third error bar from the top in the left panel here, which is inconclusive. The larger (blue) CI crosses zero, while the smaller CI (orange) does not. However, the 4th error bar in the group "negative" has the same properties. The only difference is that both upper CIs also lie within the deltas; that's apparently what makes it negative instead of inconclusive...
Edit: The rules seem to be: 1) inconclusive: zero is crossed, but at least one of the smaller CIs (orange) lies outside the ROPE; 2) negative: zero is crossed and the smaller CIs (orange) lie inside the ROPE; 3) positive: zero is not crossed (which means: statistical significance).
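In code, these rules might look like this for a single estimate (a minimal sketch only; cet_label() and its arguments are hypothetical, assuming we already have the wide/blue and narrow/orange CI bounds and the ROPE):

cet_label <- function(ci_wide, ci_narrow, rope) {
  # ci_wide / ci_narrow / rope are all c(lower, upper)
  crosses_zero  <- ci_wide[1] < 0 && ci_wide[2] > 0
  narrow_inside <- ci_narrow[1] >= rope[1] && ci_narrow[2] <= rope[2]
  if (!crosses_zero) {
    "positive"      # zero is not crossed: statistically significant
  } else if (narrow_inside) {
    "negative"      # zero crossed, narrow CI fully inside the ROPE
  } else {
    "inconclusive"  # zero crossed, narrow CI extends outside the ROPE
  }
}

cet_label(ci_wide = c(-0.2, 1.1), ci_narrow = c(-0.1, 1.0), rope = c(-0.6, 0.6))
#> [1] "inconclusive"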
We can follow this rule, but for the moment I will write this function in accordance with Kruschke's equivalence testing.
I feel like the Conditional ET paradigm is indeed a bit more convoluted than a straight ET, and after reading the paper I was not sure of the rationale and necessity for this conditional decision rule involving alpha and 2*alpha.
But yes, let's first add "regular" ET for frequentist models, and then we can add the conditional variant!
Test... Some differences look impressive, but given the small ranges of the CIs, the ratios show large differences very quickly.
library(bayestestR)
library(parameters)
library(rstanarm)
m1 <- lm(mpg ~ gear + wt + cyl + hp, data = mtcars)
m2 <- stan_glm(mpg ~ gear + wt + cyl + hp, data = mtcars)
equivalence_test(m2, ci = .95)
#> # Test for Practical Equivalence
#>
#> ROPE: [-0.60 0.60]
#>
#> Parameter H0 inside ROPE 95% HDI
#> (Intercept) rejected 0.00 % [24.96 48.78]
#> gear undecided 43.86 % [-1.59 2.32]
#> wt rejected 0.00 % [-4.70 -1.29]
#> cyl undecided 37.38 % [-2.09 0.58]
#> hp accepted 100.00 % [-0.05 0.01]
cet(m1)
#> # Conditional Equivalence Test
#>
#> Region of equivalence: [-0.60 0.60]
#>
#> parameter H0 inside ROPE 95% CI
#> (Intercept) rejected 0.00 % [24.44 48.94]
#> gear undecided 29.37 % [-1.69 2.41]
#> wt rejected 0.00 % [-4.77 -1.28]
#> cyl undecided 42.29 % [-2.17 0.55]
#> hp accepted 100.00 % [-0.05 0.01]
Created on 2019-03-21 by the reprex package (v0.2.1)
But yes, let's first add "regular" ET for frequentist models, and then we can add the conditional variant!
Yeah, I agree. Yet I think we should probably find another name than cet(), to avoid confusion? Or is it still ok?
(I can't commit from here, so I will commit later...)
That's interesting! Could the fact that the Bayesian case uses the HDI and the frequentist case the quantile method add to the difference?
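(As a quick check of that hunch, the two interval types can be compared directly on a skewed distribution standing in for a posterior - a sketch using bayestestR's hdi() and eti():)

library(bayestestR)

set.seed(123)
x <- rgamma(4000, shape = 2)  # a skewed stand-in for a posterior

hdi(x, ci = 0.95)  # highest-density interval (used for Bayesian models)
eti(x, ci = 0.95)  # equal-tailed (quantile) interval, like a frequentist CI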
For the name, I think it makes sense to keep the equivalence_test() and rope() methods and expand them to support freq models. cet() (or equivalence_test(..., conditional=TRUE) or such) could be reserved for the method described by the paper. What do you think?
Yes, I was also thinking about equivalence_test(), but hesitating given the fact that arguments were named posterior. We should rename it into something more general, probably x? Since we have a numeric method and a method for models, model does not quite fit well... hm.
Note that emmeans has equivalence testing baked into the test() function via the delta arg, so it should be super easy to implement this in modelbased.
library(emmeans)
fit <- lm(Sepal.Length ~ Species, iris)
em_ <- emmeans(fit, ~ Species)
pair_test <- pairs(em_)
test(pair_test)
#> contrast estimate SE df t.ratio p.value
#> setosa - versicolor -0.930 0.103 147 -9.033 <.0001
#> setosa - virginica -1.582 0.103 147 -15.366 <.0001
#> versicolor - virginica -0.652 0.103 147 -6.333 <.0001
#>
#> P value adjustment: tukey method for comparing a family of 3 estimates
test(pair_test, delta = 1)
#> contrast estimate SE df t.ratio p.value
#> setosa - versicolor -0.930 0.103 147 -0.680 0.5761
#> setosa - virginica -1.582 0.103 147 5.653 1.0000
#> versicolor - virginica -0.652 0.103 147 -3.380 0.0014
#>
#> P value adjustment: sidak method for 3 tests
#> Statistics are tests of equivalence with a threshold of 1
#> P values are left-tailed
Created on 2020-04-14 by the reprex package (v0.3.0)
I re-read the paper again. Based on what @DominiqueMakowski wrote in his initial post, and the figure in my answer, it looks like the following:
Step 2 (conclusive) seems to be a simple statistically significant result (because CIs are within ROPE, but exclude 0). If the narrow CI is within the ROPE, it's negative, else inconclusive.
I'd say we stick to our rules. I have implemented functions for more frequentist models and for simulated model parameters (https://easystats.github.io/see/articles/parameters.html#equivalence-testing-1), maybe we can do this for bootstrapped parameters as well...
We might take a closer look at the Lakens paper to decide whether we want to investigate this further. I would rather not rely on the emmeans function for now.
The Lakens paper is not related to regression models (but might be interesting for the effectsize package).
@strengejacke Do you mean to use the CIs from effectsize as criteria? This would mean labelling the output according to some "rules", no? 🤔 Or maybe something in-between - have the CI bounds interpreted with the interpret_* functions?
Sounds like it would work well with report (:
I'm re-opening this as I believe the implementation we provide is not the common one (see also https://github.com/easystats/effectsize/issues/70).
In the frequentist NHST logic, a significant test lets you reject a hypothesis; a non-significant test lets you conclude nothing.
Using CIs, this can be reframed as: a hypothesis is rejected if its value(s) lie outside the CI; if the CI covers them, nothing can be concluded.
And so, with frequentist equivalence testing: if the point null was rejected but the SESOI hypothesis was not, then we conclude that H0 is rejected.
Our decision matrix then looks like this:
| | H_alt rejected | H_alt not rejected |
|---|---|---|
| H_null rejected | Accept null* | Reject null |
| H_null not rejected | Accept null | Undecided |
* Even though our result is significantly larger than 0, it is also significantly smaller than SESOI. So we accept the null, as it is defined by the ROPE.
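To make the matrix concrete, here is a minimal sketch (et_decision() is a hypothetical helper), assuming "H_null rejected" means the CI excludes 0 and "H_alt rejected" means the CI lies entirely within the ROPE:

et_decision <- function(ci, rope) {
  # ci and rope are both c(lower, upper)
  null_rejected <- ci[1] > 0 || ci[2] < 0                # CI excludes 0
  alt_rejected  <- ci[1] >= rope[1] && ci[2] <= rope[2]  # CI inside ROPE
  if (alt_rejected) {
    "Accept null"     # even if the point null was also rejected (*)
  } else if (null_rejected) {
    "Reject null"
  } else {
    "Undecided"
  }
}

et_decision(ci = c(0.01, 0.19), rope = c(-0.2, 0.2))
#> [1] "Accept null"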
The rules then are: 1) if the hypothesis |θ| ≥ SESOI is rejected (the CI lies entirely within the ROPE), accept the null; 2) else, if the point null is rejected (the CI excludes 0), reject the null; 3) else, the result is undecided.
It seems like the CET paper suggests switching the order of 1 and 2, so: first, if the point null is rejected, reject the null; only otherwise, if the CI lies entirely within the ROPE, accept the null; else undecided.
(sometimes the CI for testing the SESOI is smaller than that used for testing the 0.)
In either case, the weird results that @strengejacke spotted are a "feature" of the frequentist framework, not of CET in particular.
equivalence_test.lm should:
ROPE is just a region of practical equivalence, and not specific to Bayesian analysis. Thus, I don't see any problems using the term here.
And looking at the rules from the CET paper (see https://github.com/easystats/parameters/issues/12#issuecomment-475238450), the "positive" results are just results with p < 0.05, while all others are not statistically significant. So there's no difference from classic NHST. From Lakens' paper, it seems to me that there's the same issue, but I'm not sure anyway whether these conclusions can be transferred to regression models or not (t-tests and correlations are discussed).
Something is not practically equivalent to zero if the estimate and the uncertainty in estimation are outside a range that is considered as practically equivalent to zero. That is what equivalence_test() does, and I think this concept is transferable from Bayesian to frequentist.
I would say we can make this explicit in the docs, but I would not change the function design.
The term ROPE is fine, but the calculation of % in ROPE is not in this instance.
The idea of EQ is transferable to any testing in the frequentist framework.
Something is not practically equivalent to zero if the estimate and the uncertainty in estimation are outside a range that is considered as practically equivalent to zero. That is what equivalence_test() does
That is not what equivalence testing is in the frequentist framework - in which you cannot conclude anything from lack of significance, only from a significant result. Unlike the Bayesian framework, where you can talk about probabilities, and uncertainty in ranges and estimates...
But in the frequentist framework, all you have are two tests (from the original paper): a test of the point null (θ = 0) and a test of non-equivalence (|θ| ≥ SESOI).
If only the first is significant, θ is not 0 (reject). If only the first is significant, θ is at the very least small (accept).
These two conditions are the same.
Sorry, fixed: If only the first is significant, θ is not 0 (reject). If only the second is significant, θ is at the very least small (accept).
I think the point of contention is this:
In Bayes, to reject the null you want to test if |θ| > SESOI (is it outside the ROPE).
But here the SESOI is not used to reject the null - the point null is used for that. The SESOI is used to accept the null by rejecting the hypothesis that |θ| > SESOI.
Is this dumb? Yes. But is this what people have been doing? Or more importantly, what users would expect from this function? Yes.
But what you're describing is Lakens' rule, not the one from the other CET paper, right? I don't see how these rules apply to the figure from my comment above.
To summarise: if a significant coefficient's confidence interval includes or exceeds the ROPE, this would be not practically equivalent to null?
From Lakens == Hauck and Anderson - yes.
What the CET paper does is say:
Only if the null that |θ| = 0 is not rejected, then test for equivalence.
So this changes the rules a little:
| | H_alt rejected | H_alt not rejected |
|---|---|---|
| H_null rejected | CET: Reject null / ET: Accept null | Reject null |
| H_null not rejected | Accept null | Undecided |
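(Continuing the earlier sketch, the CET ordering just flips the first two checks - again with hypothetical names:)

cet_decision <- function(ci, rope) {
  null_rejected <- ci[1] > 0 || ci[2] < 0
  alt_rejected  <- ci[1] >= rope[1] && ci[2] <= rope[2]
  if (null_rejected) {
    "Reject null"     # CET flips this cell relative to plain ET
  } else if (alt_rejected) {
    "Accept null"
  } else {
    "Undecided"
  }
}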
To summarise: if a significant coefficient's confidence interval includes or exceeds the ROPE, this would be not practically equivalent to null?
Yes, precisely ^_^
2nd Summary:
According to https://github.com/easystats/parameters/issues/12#issuecomment-475238450, relevant parameters are those where the coefficient is statistically significant.
This approach just adds an "inconclusive" category.
According to https://github.com/easystats/parameters/issues/12#issuecomment-618342209, relevant parameters are coefficients that are statistically significant, but also need to be either large enough (to be outside the ROPE), or have a very high uncertainty (i.e. a large CI) ranging outside the ROPE?
This approach just distinguishes small effect sizes with low and high uncertainty, where small effect sizes with higher uncertainty are preferred.
My conclusion: I'll mention in the docs that we do something different and stick to the current implementation, as I don't like either of the two methods above... Maybe we can add an additional argument that mimics the two approaches.
What do you mean by "relevant parameters"?
This approach just adds an "inconclusive" category.
No - because without EQ, any non-significant result is inconclusive! ET adds the "accept" category! (This literal backwards thinking is such a mindf@#$...)
I think we should give the ET results, and maybe add an option for CET (if so, I will do the same in effectsize). I think having a mix of Bayesian logic on frequentist methods is misleading - think what would happen if someone tried to publish something based on this "mixed" method... (and the % in ROPE does not belong here either - in the frequentist framework you cannot slice up CIs to get probabilities like that...).
Again, I totally agree that this logic has flaws (I mean, I am a developer of bayestestR after all!), but it is what it is... Instead we can add in the docs something like "If you want to truly accept the null, consider becoming a Bayesian".
(Also I think the p-value feature is redundant when you switch to ET - the label "rejected" is enough...)
(Also I think the p-value feature is redundant when you switch to ET - the label "rejected" is enough...)
(True, but the p-value is adjusted for multiple testing and should at least "mimic" the possible problem of non-dependent parameters)
What do you mean by "relevant parameters"?
Statistically significant and not practically equivalent to zero.
Oh - so close, but you have to flip the thinking to some weird somersault 😅
In ET:
relevant parameters are coefficients that are statistically significant, but also need to ~be either large enough (to be outside the ROPE), or have a very high uncertainty (i.e. large CI) to range outside the ROPE~ not have their CI entirely in the ROPE (= not reject the hypothesis that |θ| >= SESOI).
But really both sig-tests can be viewed independently: both sig means non-0, but small.
rules:
- "classic"
- "cet" - conditional
- "bayes" - test non-equivalence

How's that?
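Roughly, the "bayes" rule would test non-equivalence like this (a sketch only; bayes_rule() is a hypothetical helper, not the actual implementation):

bayes_rule <- function(ci, rope) {
  if (ci[1] >= rope[1] && ci[2] <= rope[2]) {
    "Accepted"   # CI entirely inside the ROPE
  } else if (ci[2] < rope[1] || ci[1] > rope[2]) {
    "Rejected"   # CI entirely outside the ROPE
  } else {
    "Undecided"  # CI overlaps a ROPE boundary
  }
}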
Oh - so close, but you have to flip the thinking to some weird somersault 😅 In ET: relevant parameters are coefficients that are statistically significant, but also need to ~be either large enough (to be outside the ROPE), or have a very high uncertainty (i.e. large CI) to range outside the ROPE~ not have their CI entirely in the ROPE (= not reject the hypothesis that |θ| >= SESOI).
That's what I meant by "large enough", i.e. exceeding the ROPE limits ;-)
How's that?
Ok, and "classic" = Lakens?
yeah
That's what I meant by "large enough", i.e. exceeding the ROPE limits ;-)
But it's the other way around! It's not "large enough", it's "not too small"... 🤪🤪🤪🤪🤪
Anyway... :-D
What would be "accept", "reject" and "undecided"? We have four cases here...
I'd say: A, D=accept; B=reject; C=undecided (though D would be not equivalent, so from the "test perspective", it also could be "reject").
Daniellllllll STOP. BEING. BAYESIAN! Ask what is not rejected!
A + C - Accept
B - Reject
D - Undecided
But it's the other way around! It's not "large enough", it's "not too small"...
We're approaching from different points, but we arrive at the same goal. ;-)
These are the 3 rules:
library(effectsize)
library(magrittr)
ds <- t_to_d(t = c(0.45, -0.65, -2.2, 2.25, 7),
df_error = c(675, 525, 900, 1875, 2000))
ds %>%
equivalence_test(range = 0.2) %>%
plot()
ds %>%
equivalence_test(range = 0.2, rule = "cet") %>%
plot()
ds %>%
equivalence_test(range = 0.2, rule = "bayes") %>%
plot()
Created on 2020-04-23 by the reprex package (v0.3.0)
For the CET, remember that we have narrow CIs for the 2nd step.
I'm not using the TOST approach here, just whatever CI is passed (I have no way to recover the original data to do this without breaking the code) 😖
I will add in the docs that users should change the CI level if they want to use the TOST approach.
I've also added this info in the printing:
library(effectsize)
ds <- t_to_d(t = c(0.45, -0.65, -2.2, 2.25, 7),
df_error = c(675, 525, 900, 1875, 2000),
ci = 0.95)
equivalence_test(ds, range = 0.2)
#> # Test for Practical Equivalence
#>
#> ROPE: [-0.20 0.20]
#>
#> d | 95% CI | H0
#> ----------------------------------
#> 0.03 | [-0.12, 0.19] | Accepted
#> -0.06 | [-0.23, 0.11] | Undecided
#> -0.15 | [-0.28, -0.02] | Rejected
#> 0.10 | [ 0.01, 0.19] | Accepted
#> 0.31 | [ 0.22, 0.40] | Rejected
equivalence_test(ds, range = 0.2, rule = "cet")
#> # Conditional Test for Practical Equivalence
#>
#> ROPE: [-0.20 0.20]
#>
#> d | 95% CI | H0
#> ----------------------------------
#> 0.03 | [-0.12, 0.19] | Accepted
#> -0.06 | [-0.23, 0.11] | Undecided
#> -0.15 | [-0.28, -0.02] | Rejected
#> 0.10 | [ 0.01, 0.19] | Rejected
#> 0.31 | [ 0.22, 0.40] | Rejected
equivalence_test(ds, range = 0.2, rule = "bayes")
#> # Test for Practical Equivalence
#>
#> ROPE: [-0.20 0.20]
#>
#> d | 95% CI | H0
#> ----------------------------------
#> 0.03 | [-0.12, 0.19] | Accepted
#> -0.06 | [-0.23, 0.11] | Undecided
#> -0.15 | [-0.28, -0.02] | Undecided
#> 0.10 | [ 0.01, 0.19] | Accepted
#> 0.31 | [ 0.22, 0.40] | Rejected
#>
#> (Using Bayesian guidlines)
Created on 2020-04-23 by the reprex package (v0.3.0)
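(To illustrate the docs note above: to mimic the TOST approach, a user would simply request a narrower CI themselves, e.g. 90% for two one-sided tests at alpha = .05:)

library(effectsize)

# same effects as above, but with a 90% CI to mimic TOST
ds90 <- t_to_d(t = c(0.45, -0.65, -2.2, 2.25, 7),
               df_error = c(675, 525, 900, 1875, 2000),
               ci = 0.90)
equivalence_test(ds90, range = 0.2)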
The TOST (using narrow CIs) isn't just for the CET, it is also for the "classic".
I'm not talking about TOST (Lakens), but about CET:
Not sure what you mean - TOST is the same as using a narrow CI (the orange CIs) for the equi-test... 🤔
hm, ok. But for "classic" I'm just using the regular CI, for "cet" the regular and narrow CI. Should I use the narrow ones only for "classic"?
I think you should use the narrow one for both (maybe even all 3?)... Or for none, and set the default ci to 0.9? (As in, let the user control this and explain it in the docs? Which is what I have over at effectsize.)
Daniel, I feel like we've been to war together after all of this... I'm tired and I want to go home! (Ignoring the fact that I've been home for nearly 8 weeks...)
TOST is the same as using a narrow CI
Daniel, I feel like we've been to war together after all of this... I'm tired and I want to go home! (Ignoring the fact that I've been home for nearly 8 weeks...)
Yes, yes! Take a break ;-) I'll use the narrow CI for cet and classic, and that's it, I'd say. We can think about changing this later...
Alright!
So I think... I think that's it? Can it be?!!?!
Follow up on here
From the paper: