Conditional Equivalence Testing (CET)

DominiqueMakowski commented 5 years ago

Follow up on here

Similar to equivalence_test() in bayestestR, we might think about implementing a similar method for the frequentists framework. See this paper: https://doi.org/10.1371/journal.pone.0195145

From the paper:

Step 1- Calculate a (1 − α1)% Confidence Interval for θ.
Step 2- If this C.I. excludes θ0, then declare a positive result. Otherwise, if θ0 is within the C.I., proceed to Step 3.
Step 3- Calculate a (1 − 2α2)% Confidence Interval for θ.
Step 4- If this C.I. is entirely within δ, declare a negative result. Otherwise, proceed to Step 5.
Step 5- Declare an inconclusive result. There is insufficient evidence to support any conclusion.

strengejacke commented 5 years ago

I'm drafting a function, cet(), and reading that paper again. I think it's very similar (if not the same) to Kruschke's equivalence-testing. What I don't exactly understand, is: how is delta calculated in this figure?

https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0195145.g001

The delta in this figure is the rope in equicalence_test(). The major difference I see is that with a similar approach to equivalence:test(), we would be more conservative than the suggestions by the CET-paper, if I understand it right. But that's no disadvantage.

My impression is, that the method the authors suggest is a bit complicated and less intuitive. See the third error bar from top in the left panel here, which is inconclusive. The larger (blue) CI crosses the zero, while the smaller CI (organge) does not. However, the 4th errorbar in the group "negative" has the same properties. The only difference is that both upper CIs also lie within the deltas, that's what it apparantly makes it negative instead of inconclusive...

Edit: The rules seems to be: 1) inconclusive: If zero is crossed, but at least on of the smaller CI (orange) are outside the ROPE 2) negative: Zero is crossed and smaller CI (orange) are inside rope. 3) positive: not crossing zero (which means: statistical significance)

We can follow this rule, but at the moment I will write this function in accordance with Kruschke's equi-testing.

DominiqueMakowski commented 5 years ago

I feel like the Conditional ET paradigm is indeed a bit more convoluted than a straight ET,, and after reading the paper I was not sure of the rationale and necessity for this conditional decision rule involving alpha and 2*alpha.

But yes, let's first add "regular" ET for frequentist models, and then we can add the conditional variant.!

strengejacke commented 5 years ago

Test... Some differences look impressive, but given the small ranges of CI, ratios have larger difference very quickly.

library(bayestestR)
library(parameters)
library(rstanarm)

m1 <- lm(mpg ~ gear + wt + cyl + hp, data = mtcars)
m2 <- stan_glm(mpg ~ gear + wt + cyl + hp, data = mtcars)

equivalence_test(m2, ci = .95)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.60 0.60]
#> 
#>    Parameter        H0 inside ROPE       95% HDI
#>  (Intercept)  rejected      0.00 % [24.96 48.78]
#>         gear undecided     43.86 % [-1.59  2.32]
#>           wt  rejected      0.00 % [-4.70 -1.29]
#>          cyl undecided     37.38 % [-2.09  0.58]
#>           hp  accepted    100.00 % [-0.05  0.01]

cet(m1)
#> # Conditional Equivalence Test
#> 
#>   Region of equivalence: [-0.60 0.60]
#> 
#>    parameter        H0 inside ROPE        95% CI
#>  (Intercept)  rejected      0.00 % [24.44 48.94]
#>         gear undecided     29.37 % [-1.69  2.41]
#>           wt  rejected      0.00 % [-4.77 -1.28]
#>          cyl undecided     42.29 % [-2.17  0.55]
#>           hp  accepted    100.00 % [-0.05  0.01]

^{Created on 2019-03-21 by the reprex package (v0.2.1)}

strengejacke commented 5 years ago

But yes, let's first add "regular" ET for frequentist models, and then we can add the conditional variant.!

Yeah, I agree. Yet I think we should probably find another name than cet(), to avoid confusion? Or is it still ok?

(I can't commit from here, so I will commit later...)

DominiqueMakowski commented 5 years ago

That's interesting! Could the fact that in the Bayesian case in uses hdi and quantile method for the frequentist add to the difference?

For the name, I think it makes sense to keep the equivalence_test() and rope() methods and expand them to support freq models. cet (or equivalence_test(..., conditional=TRUE) or such) could be reserved for the method described by the paper. What do you think?

strengejacke commented 5 years ago

Yes, I was also thinking about equivalence_test(), but hesitating given the fact that arguments were named posterior. We should rename it into something more general, probably x? Since we have a numeric-method and a method for models, model does not quite fit well... hm.

strengejacke commented 4 years ago

https://journals.sagepub.com/doi/10.1177/1948550617697177

mattansb commented 4 years ago

Note that emmeans has equivalence testing baked into the test() function via the delta arg, so it should be super easy to implament this in modelbased.

library(emmeans)

fit <- lm(Sepal.Length ~ Species, iris)

em_ <- emmeans(fit, ~ Species)

pair_test <- pairs(em_)

test(pair_test)
#>  contrast               estimate    SE  df t.ratio p.value
#>  setosa - versicolor      -0.930 0.103 147  -9.033 <.0001 
#>  setosa - virginica       -1.582 0.103 147 -15.366 <.0001 
#>  versicolor - virginica   -0.652 0.103 147  -6.333 <.0001 
#> 
#> P value adjustment: tukey method for comparing a family of 3 estimates

test(pair_test, delta = 1)
#>  contrast               estimate    SE  df t.ratio p.value
#>  setosa - versicolor      -0.930 0.103 147 -0.680  0.5761 
#>  setosa - virginica       -1.582 0.103 147  5.653  1.0000 
#>  versicolor - virginica   -0.652 0.103 147 -3.380  0.0014 
#> 
#> P value adjustment: sidak method for 3 tests 
#> Statistics are tests of equivalence with a threshold of 1 
#> P values are left-tailed

^{Created on 2020-04-14 by the reprex package (v0.3.0)}

strengejacke commented 4 years ago

I re-read the paper again. Based on what @DominiqueMakowski wrote in his initial post, and the figure in my answer, it looks like following:

Step 2 (conclusive) seems to be a simple statistically significant result (because CIs are within ROPE, but exclude 0). If the narrow CI is within the ROPE, it's negative, else inconclusive.

I'd say we stick to our rules. I have implemented functions for more frequentist models and for simulated model parameters (https://easystats.github.io/see/articles/parameters.html#equivalence-testing-1), maybe we can do this for bootstrapped parameters as well...

We might take a close look at the Lakens-paper to decide whether we want to investigate this further. I would rather not rely on the emmeans-function for now.

strengejacke commented 4 years ago

The Lakens paper is not related to regression models (but could maybe interesting for the effectsize package).

mattansb commented 4 years ago

@strengejacke Do you mean to use the CIs from effectsize as criteria? This would mean to label the output according to some "rules", no? 🤔

Or maybe some in-between - have the CI bounds interpreted with the interpret_* function? Sounds like would work well with report (:

mattansb commented 4 years ago

I'm re-opening this as I believe the implementation we provide is not the common one (see also https://github.com/easystats/effectsize/issues/70).

Frequentist "logic"

In the frequentist NHST logic:

If P(data | H_x) < α
Then reject H_x

Using CIs, this can be reframed as:

If CI_1-α does not contain the value expected by H_x
Then reject H_x

And so, with frequientist equivalence testing: If the null was rejected, and the SESOI was not, then we conclude that H0 is rejected.

Or decision matrix then looks like this:

	H_alt rejected	Not rejected
H_null rejected	Accept null*	Reject null
Not rejected	Accept null	Undecided

* Even though our result is significantly larger than 0, it is also significantly smaller than SESOI. So we accept the null, as it is defined by the ROPE.

From Laken's paper:

The rules

The rules then are:

If CI does not contain the SESOI (or larger) - accept null.
Else, if CI does not contain 0 - reject the null.
Else, inconclusive.

It seems like the CET paper suggests switching the order of 1 and 2, so:

Else, if CI does not contain 0 - reject the null.
If CI does not contain the SESOI (or larger) - accept null.
Else, inconclusive.

(sometimes the CI for testing the SESOI is smaller than that used for testing the 0.)

In either case, the weird results that @strengejacke spotted, are a "feature" of the frequentist framework, not of CET in particular.

Required changes

equivalence_test.lm should:

remove the %inROPE, which is a very Bayesian and not appropriate here
only have the labels, as dictated by the rules above.

strengejacke commented 4 years ago

ROPE is just a region of practical equivalence, and not specific to Bayesian analysis. Thus, I don't see any problems using the term here.

And looking at the rules from the CET paper (see https://github.com/easystats/parameters/issues/12#issuecomment-475238450), the "positive" results are just results with p < 0.05, while all other are not statistically significant. So there's no difference to classic NHST. From Lakens' paper, it seems to me that there's the same issue, but I'm not sure anyway if these conclusions can be transferred to regression models or not (t-tests and correlation are discussed).

Something is not practically equivalent to zero, if the estimate and the uncertainty in estimation are outside a range that is considered as practically equivalent to zero. That is what equivalence_test() does, and I think this concept is transferable from Bayesian to frequentist.

I would say we can make this explicit in the docs, but I would not change the function design.

mattansb commented 4 years ago

The term ROPE is fine, but the calculation of % in ROPE is not in this instance.

The idea of EQ is transferable to any testing in the frequentist framework.

Something is not practically equivalent to zero, if the estimate and the uncertainty in estimation are outside a range that is considered as practically equivalent to zero. That is what equivalence_test() does

That is not what equivalence testing is in the frequentist framework - in which you cannot conclude anything from lack of significance, only from a significant result. Unlike the Bayesian framework, where you can talk about probabilities, and uncertainty in ranges and estimates...

But in the frequentist framework, all you have are two tests (from the original paper):

Is |θ| > 0? (Can we reject the null: |θ| = 0)
Is |θ| < SESOI? (Can we reject the null: |θ| > SESOI)

If both are significant, than θ is not 0, but it at the very least small (accept).
If only the first is significant, θ is not 0 (reject).
If only the first is significant, θ is at the very least small (accept).
If neither, then the results are inconclusive (undecided).

strengejacke commented 4 years ago

If only the first is significant, θ is not 0 (reject). If only the first is significant, θ is at the very least small (accept).

These two conditions are the same.

mattansb commented 4 years ago

Sorry, fixed:

If both are significant, than θ is not 0, but it at the very least small (accept).
If only the first is significant, θ is not 0 (reject).
If only the second is significant, θ is at the very least small (accept).
If neither, then the results are inconclusive (undecided).

I think the point of contention is this:

In Bayes, to reject the null you want to test if |θ| > SESOI (is it outside the rope).

But here the SESOI is not used to reject the null - the point null is used for that. The SESOI is used to accept the null by rejecting the hypothesis that |θ| > SESOI.

Is this dumb? Yes. But is this what people have been doing? Or more importantly, what users would expect from this function? Yes.

strengejacke commented 4 years ago

But what you're describing is Lakens' rule, not the one from the other CET paper, right? I don't see these rules apply to the figure from my above comment.

strengejacke commented 4 years ago

To summarise: if a significant coefficient's confidence intervals includes or is greater than the ROPE, this would be not practically equivalent to null?

mattansb commented 4 years ago

From Lakens == Hauck and Anderson - yes.

What the CET paper does is say:

Only if the null that |θ| = 0 is not rejected, than test for equivalence.

So this changes the rules a little:

	H_alt rejected	Not rejected
H_null rejected	CET: Reject null / ET: Accept null	Reject null
Not rejected	Accept null	Undecided

To summarise: if a significant coefficient's confidence intervals includes or is greater than the ROPE, this would be not practically equivalent to null?

Yes, precisely ^_^

strengejacke commented 4 years ago

2nd Summary:

According to https://github.com/easystats/parameters/issues/12#issuecomment-475238450, relevant parameters are those where the coefficient is a statistically significant coefficient.

This approach just adds an "inconclusive" category.

According to https://github.com/easystats/parameters/issues/12#issuecomment-618342209, relevant parameters are coefficients that statistically significant, but also need to be either large enough (to be outside rope), or have a very high uncertainty (i.e. large CI) to range outside the ROPE?

This approach just distinguishes small effect sizes with low and high uncertainty, where small effect sizes with higher uncertainty are preferred.

My conclusion: I mention in the docs that we do something different and stick to that current implementation, as I don't like both of the above two methods... Maybe we can add an additional argument that mimics the two above approaches.

mattansb commented 4 years ago

What do you mean by "relevant parameters"?

This approach just adds an "inconclusive" category.

No - because without EQ, any non-significant result is inconclusive! ET adds the "accept" category! (This literal backwards thinking is such a mindf@#$...)

I think we should give the ET results, and maybe add an option for CET (if so, I will do the same in effectsize). I think having a mix of Bayesian logic on frequentist methods is misleading - think if someone tried to publish anything based on this "mixed" method.... (and the %inROPE does not belong here either - in the frequentists framework you cannot slice up CIs to get probabilities like that...).

Again, I totally agree that this logic has flaws (I mean, I am a developer of bayestestR after all!), but it is what it is... Instead we can add in the docs something like "If you want to truly accept the null, consider becoming a Bayesian \".

mattansb commented 4 years ago

(Also I think the p-value feature is redundant when you switch to ET - the label "rejected" is enough...)

strengejacke commented 4 years ago

(Also I think the p-value feature is redundant when you switch to ET - the label "rejected" is enough...)

(True, but the p-value is adjusted for multiple testing and should at least "mimic" the possible problem of non-dependent parameters)

strengejacke commented 4 years ago

What do you mean by "relevant parameters"?

Statistically significant and not practically equivalent to zero.

mattansb commented 4 years ago

Oh - so close, but you have to flip the thinking to some weird somersault 😅

In ET:

relevant parameters are coefficients that statistically significant, but also need to ~be either large enough (to be outside rope), or have a very high uncertainty (i.e. large CI) to range outside the ROPE~ not have their CI entirely in the ROPE (= not reject the hypothesis that |θ| >= SESOI).

mattansb commented 4 years ago

But really both sig-tests can be viewed independently: both sig means non-0, but small.

mattansb commented 4 years ago

rules:

"classic"
"cet" - conditional
"bayes" - test non-equivalence

Hows that?

strengejacke commented 4 years ago

Oh - so close, but you have to flip the thinking to some weird somersault 😅 In ET: relevant parameters are coefficients that statistically significant, but also need to ~~be either large enough (to be outside rope), or have a very high uncertainty (i.e. large CI) to range outside the ROPE~~ not have their CI entirely in the ROPE (= not reject the hypothesis that |θ| >= SESOI).

That's what I meant by "large enought", i.e. exceeding the ROPE limits ;-)

strengejacke commented 4 years ago

Hows that?

Ok, and "classic" = Lakens?

mattansb commented 4 years ago

yeah

mattansb commented 4 years ago

That's what I meant by "large enought", i.e. exceeding the ROPE limits ;-)

But it's the other way around! It not "large enough" it's "not too small"... 🤪🤪🤪🤪🤪

strengejacke commented 4 years ago

Anyway... :-D

What would be "accept", "reject" and "undecided"? We have four cases here...

strengejacke commented 4 years ago

I'd say: A, D=accept; B=reject; C=undecided (though D would be not equivalent, so from the "test perspective", it also could be "reject").

mattansb commented 4 years ago

Daniellllllll STOP. BEING. BAYESIAN! Ask what is not rejected!

A + C - Accept B - Reject" D - Undecided

strengejacke commented 4 years ago

But it's the other way around! It not "large enough" it's "not too small"...

We're approaching from different points, but we arrive at the same goal. ;-)

mattansb commented 4 years ago

These are the 3 rules:

library(effectsize)
library(magrittr)

ds <- t_to_d(t = c(0.45, -0.65, -2.2, 2.25, 7),
       df_error = c(675, 525, 900, 1875, 2000))

ds %>% 
  equivalence_test(range = 0.2) %>% 
  plot()


ds %>% 
  equivalence_test(range = 0.2, rule = "cet") %>% 
  plot()


ds %>% 
  equivalence_test(range = 0.2, rule = "bayes") %>% 
  plot()

^{Created on 2020-04-23 by the reprex package (v0.3.0)}

strengejacke commented 4 years ago

For the CET, remember that we have narrow CI's for the 2nd step.

mattansb commented 4 years ago

I'm not using the TOAST approach here, just what ever CI as passed (I have no way to recover the original data to do this without breaking the code) 😖

I will add in the docs that users should change the CI level if they want to use the TOAST approach.

I've also added this info in the printing:

library(effectsize)

ds <- t_to_d(t = c(0.45, -0.65, -2.2, 2.25, 7),
             df_error = c(675, 525, 900, 1875, 2000),
             ci = 0.95)

equivalence_test(ds, range = 0.2)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.20 0.20]
#> 
#>     d |         95% CI |        H0
#> ----------------------------------
#>  0.03 | [-0.12,  0.19] |  Accepted
#> -0.06 | [-0.23,  0.11] | Undecided
#> -0.15 | [-0.28, -0.02] |  Rejected
#>  0.10 | [ 0.01,  0.19] |  Accepted
#>  0.31 | [ 0.22,  0.40] |  Rejected

equivalence_test(ds, range = 0.2, rule = "cet")
#> # Conditional Test for Practical Equivalence
#> 
#>   ROPE: [-0.20 0.20]
#> 
#>     d |         95% CI |        H0
#> ----------------------------------
#>  0.03 | [-0.12,  0.19] |  Accepted
#> -0.06 | [-0.23,  0.11] | Undecided
#> -0.15 | [-0.28, -0.02] |  Rejected
#>  0.10 | [ 0.01,  0.19] |  Rejected
#>  0.31 | [ 0.22,  0.40] |  Rejected

equivalence_test(ds, range = 0.2, rule = "bayes")
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.20 0.20]
#> 
#>     d |         95% CI |        H0
#> ----------------------------------
#>  0.03 | [-0.12,  0.19] |  Accepted
#> -0.06 | [-0.23,  0.11] | Undecided
#> -0.15 | [-0.28, -0.02] | Undecided
#>  0.10 | [ 0.01,  0.19] |  Accepted
#>  0.31 | [ 0.22,  0.40] |  Rejected
#> 
#> (Using Bayesian guidlines)

^{Created on 2020-04-23 by the reprex package (v0.3.0)}

mattansb commented 4 years ago

I'm not using the TOST approach here, just what ever CI as passed (I have no way to recover the original data to do this without breaking the code) 😖

I will add in the docs that users should change the CI level if they want to use the TOST approach.

I've also added this info in the printing:

library(effectsize)

ds <- t_to_d(t = c(0.45, -0.65, -2.2, 2.25, 7),
             df_error = c(675, 525, 900, 1875, 2000),
             ci = 0.95)

equivalence_test(ds, range = 0.2)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.20 0.20]
#> 
#>     d |         95% CI |        H0
#> ----------------------------------
#>  0.03 | [-0.12,  0.19] |  Accepted
#> -0.06 | [-0.23,  0.11] | Undecided
#> -0.15 | [-0.28, -0.02] |  Rejected
#>  0.10 | [ 0.01,  0.19] |  Accepted
#>  0.31 | [ 0.22,  0.40] |  Rejected

equivalence_test(ds, range = 0.2, rule = "cet")
#> # Conditional Test for Practical Equivalence
#> 
#>   ROPE: [-0.20 0.20]
#> 
#>     d |         95% CI |        H0
#> ----------------------------------
#>  0.03 | [-0.12,  0.19] |  Accepted
#> -0.06 | [-0.23,  0.11] | Undecided
#> -0.15 | [-0.28, -0.02] |  Rejected
#>  0.10 | [ 0.01,  0.19] |  Rejected
#>  0.31 | [ 0.22,  0.40] |  Rejected

equivalence_test(ds, range = 0.2, rule = "bayes")
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.20 0.20]
#> 
#>     d |         95% CI |        H0
#> ----------------------------------
#>  0.03 | [-0.12,  0.19] |  Accepted
#> -0.06 | [-0.23,  0.11] | Undecided
#> -0.15 | [-0.28, -0.02] | Undecided
#>  0.10 | [ 0.01,  0.19] |  Accepted
#>  0.31 | [ 0.22,  0.40] |  Rejected
#> 
#> (Using Bayesian guidlines)

^{Created on 2020-04-23 by the reprex package (v0.3.0)}

mattansb commented 4 years ago

The TOST (using narrow CIs) isn't just for the CET, it is also the the "classic".

strengejacke commented 4 years ago

I'm not talking about TOST (Lakens), but about CET:

mattansb commented 4 years ago

Not sure what you mean - TOST is the same as using a narrow CI (the orange CIs) for the equi-test... 🤔

strengejacke commented 4 years ago

hm, ok. But for "classic" I'm just using the regular CI, for "cet" the regular and narrow CI. Should I use the narrow ones only for "classic"?

mattansb commented 4 years ago

I think you should use the narrow one for both (maybe even all 3?)... Or for none, and set the default ci to 0.9? (As in let the user control this and explain in the docs? Which is what I have over at effectsize).

Daniel, I feel like we've been to war together after all of this... I'm tired and I want to go home! (Ignoring the fact that I've been home for nearly 8 weeks...)

mattansb commented 4 years ago

TOST is the same as using a narrow CI

strengejacke commented 4 years ago

Daniel, I feel like we've been to war together after all of this... I'm tired and I want to go home! (Ignoring the fact that I've been home for nearly 8 weeks...)

Yes, yes! Take a break ;-) I'll use the narrow CI for cet and classic, and that's it, I'd say. We can think about changing this later...

mattansb commented 4 years ago

Alright!

So I think... I think that's it? Can it be?!!?!

strengejacke commented 4 years ago

mattansb commented 4 years ago

easystats / parameters