easystats / bayestestR

:ghost: Utilities for analyzing Bayesian models and posterior distributions
https://easystats.github.io/bayestestR/
GNU General Public License v3.0

ROPE sensitivity to scales? #144

Closed mattansb closed 5 years ago

mattansb commented 5 years ago

From ROPE article:

However, keep in mind that unlike indices of effect existence (such as the pd), indices of effect size, such as the ROPE percentage, depend on the unit of its parameter, and can thus be easily changed via the scaling of the predictors.

  1. I'm far from a ROPE expert, but shouldn't the %ROPE, in theory, be insensitive to scale? Shouldn't the doc suggest that the ROPE range be adjusted according to the IV's scale?

  2. Also, I'm not sure how the %ROPE is a measure of effect size - it only indicates the degree to which the effect is not small, but gives no indication of the size of the effect (whether it is medium or large).

(I've only now started reading Kruschke's book, after reading some of his papers, so sorry if these are dumb Qs)

mattansb commented 5 years ago
  1. I guess my question is: if the ROPE range is adjusted for the scale of the DV, why not for the IV? (Maybe this could also be implemented with something like what is used for parameters::standardize_parameters(x, method = "full")?)
strengejacke commented 5 years ago

Regarding 1.: The rope range is not adjusted for the range of the DV, it is based on that range (at least for linear models). If you standardize your predictors, you'll change the ROPE results, because the rope range does not change (since it's based on the unchanged DV), but the posterior of the scaled IV does. So "everything working as intended". For p_direction, it doesn't matter if an IV is scaled or not.

strengejacke commented 5 years ago

Also, I'm not sure how %ROPE is a measure of effect size

True, maybe we should revise the wording here?

DominiqueMakowski commented 5 years ago

Also, I'm not sure how %ROPE is a measure of effect size

I might have been unclear, let me try to rephrase:

The rope percentage is an index of significance (in the sense of "practical importance", rather than some esoteric and conceptual definition); thus, it tries to discriminate between significant and non-significant effects. It is not an index of effect size per se. However, it uses effect size to define what significance is: it defines a region assimilable to a null (or negligible) effect, based solely on the effect's magnitude (i.e., size). Hence it is not directly an index of effect size, but it uses effect size to define significance. Due to this close relationship, both the arguments for and against the focus on effect size apply.

what do you think?

but yes that is worth reformulating and detailing in the docs.

mattansb commented 5 years ago

If I understand both of you correctly, the %ROPE is a measure of significance whose criterion is based on effect size.

It still seems to me that the ROPE range should be based on the scale of both the DV and the IV. Otherwise we can end up with a situation like this, where the posterior falls completely inside the ROPE, but the effect is anything but practically equivalent to 0:

library(bayestestR)
library(rstanarm)

# Nearly deterministic relationship between X and Y
X <- rnorm(200)
Y <- X + rnorm(200, sd = .1)

# Rescale the predictor so a 1-unit change in X is tiny relative to Y's scale
df <- data.frame(X = X * 100, Y)

junk <- capture.output(
  stan_model <- stan_glm(Y ~ X, df, family = gaussian())
)

equivalence_test(stan_model)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.10 0.10]
#> 
#>    Parameter       H0 inside ROPE      89% HDI
#>  (Intercept) accepted    100.00 % [-0.01 0.01]
#>            X accepted    100.00 % [ 0.01 0.01]
performance::r2(stan_model)
#> # Bayesian R2 with Standard Error
#> 
#>   Conditional R2: 0.991 [0.000]

Created on 2019-05-22 by the reprex package (v0.2.1)
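
As a rough check of the point above that the default limits come only from the DV, and that pd ignores the scaling entirely (reusing df and stan_model from the reprex; I believe rope_range() for linear models simply returns 0.1 * SD(Y), but treat this as a sketch):

# Default ROPE limits depend on the response only, so rescaling X leaves them untouched
rope_range(stan_model)
0.1 * sd(df$Y) * c(-1, 1)

# p_direction, in contrast, is unaffected by the scale of X:
# the pd for X stays near 100 % even though the equivalence test "accepted" H0
p_direction(stan_model)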

mattansb commented 5 years ago

At the very least, I suggest adding a "warning" in the text of the vignette and the .rds along the lines of:

However, keep in mind that unlike other indices of effect existence (such as the pd), the ROPE percentage depends on the unit of its parameter; thus, correct interpretation of the ROPE as representing a region of practical equivalence to zero depends on the scaling of the predictors.

In the vignette, the current example of not-sig to sig could be accompanied by an explanation that even though the result is sig after scaling, the effect is still practically 0 (or something). Maybe also add the example I provided above, of the opposite happening?

(I'm thinking of the layperson who might use rope or equivalence_test without giving too much thought...)

strengejacke commented 5 years ago

It still seems to me like the ROPE range should be based on the scale of both the DV and the IV.

The ROPE is (relatively) clearly defined, so we can't change this. The idea behind the ROPE is similar to the "clinical" (not statistical) difference of effects: if effects differ by at least 1/2 SD, they're probably also practically relevant: https://www.ncbi.nlm.nih.gov/pubmed/19807551

In our case, we don't have .5 SD but .1 SD, and we're not comparing point estimates that are .5 SD away from zero, but the (almost) whole posterior distribution, which should not cover the ROPE (+/- .1 SD around zero). So indeed, it seems like this "test" is sensitive to scaling - here, frequentist methods may have the advantage of being more stable.

We could probably check the SD of the response, and if it's approx. 1, we might indicate a warning that scaled DVs (and scaled IVs) may bias the results. At least, we should also add the text from the vignette to the docs.

DominiqueMakowski commented 5 years ago

Yes, I am not convinced by throwing too many warnings either. After all, it is not this package's role to judge if what the user does makes sense or not. I would prefer to properly document things and leave warnings to a minimum :)

strengejacke commented 5 years ago

agree.

strengejacke commented 5 years ago

else, one would end up with a warning for every single function (lm() - don't trust p-values!, confint() - not precise if you assume normal-dist... etc.)

mattansb commented 5 years ago

Right, I also don't think the functions themselves should give a warning (because as @strengejacke says, where would this end???). But I do think that this sensitivity to scale (and how to account for it) should be documented in the vignette and the .rds.

DominiqueMakowski commented 5 years ago

But I do think that this sensitivity to scale (and how to account for it) should be documented in the vignette and the .rds.

Yes, as a matter of fact it would be good to have the documentation, the docstrings (the .rd) and the README up to date with one another :) But it's a thing we will revise and improve over and over.

DominiqueMakowski commented 5 years ago

I think we can close this?

mattansb commented 5 years ago

Has any of the suggested clarification been made to the doc/vignette?

DominiqueMakowski commented 5 years ago

I added a bit here https://easystats.github.io/bayestestR/articles/region_of_practical_equivalence.html

mattansb commented 5 years ago

It still says rope is a measure of effect size. Do you mind if I make some minor tweaks and additions? You can then remove them if you don't think they're appropriate.


DominiqueMakowski commented 5 years ago

Yes yes please go ahead 😅

mattansb commented 5 years ago

@DominiqueMakowski take a look. I was trying to channel (from the rope doc):

Kruschke (2018) suggests that such null value could be set, by default, to the -0.1 to 0.1 range of a standardized parameter

In the sense that a standardized parameter is standardized based both on the DV and the predictors.

strengejacke commented 5 years ago

Standardizing the DV makes no sense, because a standardized variable has an SD of 1, and then you don't need the 0.1 * SD formula.

strengejacke commented 5 years ago

For parameters that have the same scale as the data, it is relatively straightforward to think about a ROPE. For example, in the case of IQ scores with a normal distribution, the mean, μ, is on the IQ scale, and its ROPE limits are in IQ points. Other models may have parameters that are less directly related to the scales of the data, and therefore ROPE limits may need to be derived more indirectly. Consider linear regression. We might want to say that a regression coefficient, βx, is practically equivalent to zero if a change across the “main range of x” produces only a negligible change in the predicted value, ŷ. Suppose we specify a negligible change in ŷ as ±0.1 Sy, where Sy is the standard deviation of y (a range that may be motivated by the convention that 0.1 S is half of a “small” effect), and we specify the “main range of x” as Mx ± 2 Sx (because if x were normally distributed, this range would cover just over 95% of the distribution). Given these specifications, a regression coefficient is practically equivalent to zero when a change of x from Mx – 2 Sx to Mx + 2 Sx yields a change of ŷ only from My – 0.1 Sy to My + 0.1 Sy, which implies ROPE limits of βx = ±0.05 for standardized variables.

Kruschke 2018, p.277
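
Spelling out the arithmetic in that passage (my reading of it, so treat this as a sketch): the allowed change in ŷ is 0.2·Sy in total (from My − 0.1·Sy to My + 0.1·Sy), spread over a change in x of 4·Sx (from Mx − 2·Sx to Mx + 2·Sx), so

|βx| ≤ (0.2 · Sy) / (4 · Sx) = 0.05 · (Sy / Sx)

which reduces to ±0.05 when both variables are standardized (Sy = Sx = 1), and to 0.05 · (Sy/Sx) · [-1, 1] otherwise.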

mattansb commented 5 years ago

This quote from Kruschke is exactly what I was trying to say - the ROPE range needs to also account for the scale of the predictor... or first standardize the coefficients, or something - any way you put it, it's just two sides of the same coin: you can either standardize the DV and predictor and have a rope of [-0.1 0.1], or not, and have a rope of (Sy/Sx)*[-0.1 0.1].
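
To make that concrete (a rough sketch reusing df and stan_model from my reprex above and the range argument of equivalence_test(); not a proposal for a default):

# Rescale the default ROPE by Sy/Sx instead of standardizing the data
# (equivalence_test() applies a single range to all parameters,
#  so this only works cleanly here because there is a single predictor X)
rope_rescaled <- (sd(df$Y) / sd(df$X)) * c(-0.1, 0.1)
equivalence_test(stan_model, range = rope_rescaled)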

strengejacke commented 5 years ago

So, why haven't you said this clearly before? :-D

strengejacke commented 5 years ago

Kruschke mentions .05 for standardized variables, so we can probably check different models that have unstandardized variables, compare them to the same models with standardized variables and a ROPE limit of +/-.05, and see if the percentage coverage is similar.

If so, we just need to check the model for standardized variables... a quick check would indeed be checking the SD for being ~1.

strengejacke commented 5 years ago
library(rstanarm)
library(bayestestR)

data(mtcars)
mtcars_z <- parameters::standardize(mtcars)

# Same models, fitted on raw vs. standardized data
m1a <- stan_glm(disp ~ mpg + hp + qsec + drat, data = mtcars, seed = 123)
m1b <- stan_glm(disp ~ mpg + hp + qsec + drat, data = mtcars_z, seed = 123)

m2a <- stan_glm(mpg ~ hp + wt + drat, data = mtcars, seed = 123)
m2b <- stan_glm(mpg ~ hp + wt + drat, data = mtcars_z, seed = 123)

equivalence_test(m1a)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-12.39 12.39]
#> 
#>    Parameter        H0 inside ROPE          89% HDI
#>  (Intercept)  rejected      0.00 % [  60.87 848.51]
#>          mpg  accepted    100.00 % [ -12.13  -1.07]
#>           hp  accepted    100.00 % [   0.20   1.26]
#>         qsec undecided     91.27 % [ -13.57  15.38]
#>         drat  rejected      0.00 % [-113.89 -22.76]

equivalence_test(m1b)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.10 0.10]
#> 
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     85.71 % [-0.14  0.13]
#>          mpg undecided      5.34 % [-0.59 -0.03]
#>           hp  rejected      0.00 % [ 0.14  0.72]
#>         qsec undecided     59.73 % [-0.19  0.23]
#>         drat undecided      0.98 % [-0.50 -0.09]

equivalence_test(m1b, range = c(-.05, .05))
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.05 0.05]
#> 
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     49.76 % [-0.14  0.13]
#>          mpg undecided      1.74 % [-0.59 -0.03]
#>           hp  rejected      0.00 % [ 0.14  0.72]
#>         qsec undecided     33.98 % [-0.19  0.23]
#>         drat  rejected      0.00 % [-0.50 -0.09]

equivalence_test(m2a)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.60 0.60]
#> 
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept)  rejected      0.00 % [19.32 39.68]
#>           hp  accepted    100.00 % [-0.05 -0.02]
#>           wt  rejected      0.00 % [-4.46 -1.86]
#>         drat undecided     17.24 % [-0.48  3.60]

equivalence_test(m2b)
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.10 0.10]
#> 
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     91.15 % [-0.13  0.11]
#>           hp  rejected      0.00 % [-0.53 -0.19]
#>           wt  rejected      0.00 % [-0.73 -0.32]
#>         drat undecided     33.39 % [-0.03  0.33]

equivalence_test(m2b, range = c(-.05, .05))
#> # Test for Practical Equivalence
#> 
#>   ROPE: [-0.05 0.05]
#> 
#>    Parameter        H0 inside ROPE       89% HDI
#>  (Intercept) undecided     53.97 % [-0.13  0.11]
#>           hp  rejected      0.00 % [-0.53 -0.19]
#>           wt  rejected      0.00 % [-0.73 -0.32]
#>         drat undecided     15.87 % [-0.03  0.33]

Created on 2019-05-23 by the reprex package (v0.3.0)

strengejacke commented 5 years ago

Somehow, this casts doubt on the usefulness of equivalence testing...

strengejacke commented 5 years ago

On the other hand, these results may indicate that multicollinearity, or disregarding the "joint" distributions, is a problem, and that the "standardized" model probably gives more reliable results?

mattansb commented 5 years ago

I guess I really should finish reading Kruschke's book ^_^

If a small (standardized) effect size is 0.2, and half of that is 0.1, why isn't the ROPE range defined as (Sy/Sx) * [-0.1 0.1]?? How then is the standardized ROPE range [-0.05 0.05]? This section seems off to me:

...“main range of x” as Mx ± 2Sx (because if x were normally distributed, this range would cover just over 95% of the distribution).

Is he defining the posterior range of the parameter based on Sx? That seems wrong - those two things are unrelated...

DominiqueMakowski commented 5 years ago

On the other hand, these results may indicate that multicollinearity or disregarding "joint" distributions are a problem, and probably the "standardized" model gives more reliable results?

I am not sure in your example that multicollinearity is the problem involved (and I am not sure that standardization helps address multicollinearity).

After reading the previous comments and quotes, let us try to reformulate here. The ROPE is a region of negligible effect. That means that adding 1 to the (linear) predictor (which is what regression coefs represent) will result in a negligible effect, i.e., a negligible change in the response variable. Thus, the negligible region must refer to the response and its scale, as is currently the case.

But does the predictor need to be scaled? Well, I am a big fan of standardization and standardized parameters. Acknowledging their limits, I still find them very useful and informative, and that's why I recommend computing them by default in parameters. As a general rule, yes, of course, to me it makes more sense when the "adding 1 to the (linear) predictor" mentioned above means "moving by 1 SD". This way, you're comparing a statistically meaningful change (a change in the variance of the predictor) to a meaningful effect (a change in the variance of the response). Otherwise, people might just feed in raw reaction times in milliseconds as predictors and be surprised that their equivalence test is never significant... Well indeed, moving by 1 ms is hardly related to non-negligible effects. So I definitely think that standardizing variables should be considered whenever using the ROPE.

That being said, I do not think we should enforce this rule or make it automatic or the default. First of all, standardization should be avoided "behind the scenes", as it is not always appropriate or meaningful (e.g., in the case of non-normal distributions, etc.). Moreover, many people do not need standardization, as the raw units of their parameters make sense to them and are appropriate. I don't know, if you're studying pizza-eating behaviour and you have number of pizzas / week as a predictor, then the raw parameter, corresponding to one more pizza per week, probably makes sense (to the researcher of that topic). So automatically standardizing this under the carpet would 1) not be easy to implement (as correct standardization is not that straightforward to do, which is why it is one of the main features of parameters) and 2) feel too imposing on the user.

Thus, the course of action I would suggest is not to change the current behaviour of the ROPE and, at the same time, to emphasize its close relationship with the scale of the parameter in the docs (maybe expanding or detailing the current paragraph). This way, the user can understand what the values mean and make an informed decision about what is appropriate in their particular case.

DominiqueMakowski commented 5 years ago

Nevertheless, I could envision in parameters::model_parameters() an argument like rope_standardized = TRUE that would return the ROPE / equivalence test results based on the standardized parameters, which the function computes anyway. But I don't think doing this in bayestestR is a good idea; its scope is more to provide tools and information than to implement and enforce "good practices", as is a bit more the case in other, more directive packages.

This would require cleaning up the code for other types of standardization (method = "full"), as it is currently quite messy and doesn't allow standardizing whole posteriors, only point estimates. But that could be done.

DominiqueMakowski commented 5 years ago

Eventually, although I would probably vote against it, we could just check whether the SD of the predictor is far from 1. If it is (let's say more than 10 or less than 0.1, or something), we could throw a warning like "The scale of your predictor is somethingsomething. The ROPE results might be unreliable. Consider standardizing your data."
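
Something along those lines, perhaps (purely a sketch of that check, with a made-up function name and the arbitrary thresholds mentioned above):

# Hypothetical helper: warn if any numeric predictor's SD is far from 1
check_predictor_scale <- function(data, predictors, lower = 0.1, upper = 10) {
  sds <- vapply(data[predictors], stats::sd, numeric(1), na.rm = TRUE)
  off <- sds < lower | sds > upper
  if (any(off)) {
    warning("The scale of predictor(s) ", paste(names(sds)[off], collapse = ", "),
            " may make the ROPE results unreliable. Consider standardizing your data.",
            call. = FALSE)
  }
  invisible(sds)
}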

mattansb commented 5 years ago

But does the predictor needs to be scaled?

I think something needs to be done - we don't have to scale the predictor, we can (reverse-)scale the rope range (so rope_range would return a different range for each parameter).
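
Hypothetically, something like this (a made-up helper, just to illustrate what "a different range for each parameter" could look like - not the current behaviour of rope_range()):

# Per-parameter ROPE ranges: rescale the default +/- 0.1 * SD(y) by each predictor's SD
rope_range_by_parameter <- function(data, response, predictors) {
  s_y <- stats::sd(data[[response]], na.rm = TRUE)
  lapply(data[predictors], function(x) {
    (s_y / stats::sd(x, na.rm = TRUE)) * c(-0.1, 0.1)
  })
}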

...emphasizing its close relationship with the scale of the parameter in the docs ...

Absolutely!

I do not think we should enforce this rule and makes it automatic or default.

I disagree here - if we don't do this, we set our users up for wrong inferences, as the default rope range is only appropriate if Sx = 1, which is unlikely (improbable? Ha!) to be the case...

strengejacke commented 5 years ago

I think we could email John Kruschke. I can do this. What do you think?

mattansb commented 5 years ago

I think this is an excellent idea! Ask him what he suggests as the default behavior for the rope range / scaling, etc...

DominiqueMakowski commented 5 years ago

I think we could email John Kruschke. I can do this. What do you think?

Sounds fair :)

strengejacke commented 5 years ago

Thus, the course of action I would suggest is not to change the current behaviour of ROPE and, at the same time, emphasizing its close relationship with the scale of the parameter in the docs (maybe expanding or detailing the current paragraph).

I agree, I think, no matter what reply we will get, keep the current behaviour and explain in the docs.

strengejacke commented 5 years ago

"The scale of your predictor is somethingsomething. The ROPE results might be unreliable. Consider standardizing your data."

My impression is that the opposite is true, i.e. after standardizing, results might be somewhat strange?

DominiqueMakowski commented 5 years ago

I agree, I think, no matter what reply we will get, keep the current behaviour and explain in the docs.

Yup

As they say in Singapore, "see how" ☺️ (i.e., we'll wait and see)

mattansb commented 5 years ago

How about "The ROPE range might not be appropriate for the scale of your predictor. The ROPE results might be unreliable. Consider standardizing your data. See ?rope for more details."

strengejacke commented 5 years ago

Consider standardizing your data

I think this is not in line with the basic idea of the ROPE + HDI decision rule. Similar to defining priors, you should rather know your data, have your data in shape, and then the .1 * SD(y) rule can be applied.

The quote starts with:

For parameters that have the same scale as the data, it is relatively straightforward to think about a ROPE.

My impression is that standardized variables may be misleading, because they are not really straightforward to interpret, and we could probably warn if the SD of y is one (i.e. we assume standardized variables).

Taking an example from a paper I just submitted: we have quality of life (QoL) as the outcome, with scores ranging from 0-100. From research we know that a difference of ~6 points can be considered clinically relevant, i.e. if you can improve QoL by ~6 points after any intervention, there seems to be a "relevant benefit" for patients. Of course, you may get a "significant" difference already at smaller deviations, because statistical and clinical relevance are not the same.

For the equivalence test, the "negligible range" was approx. -2 to +2 (i.e. 0.1 * SD(qualidem)). According to Kruschke, if 95% of the distribution lies outside this range (95% = "point estimate" + ~2*SD of x, assuming a normal distribution), you have a "significant" difference from NULL. Now this is what makes sense: we have a range of about 4 points, and everything within this range is definitely clinically (and probably also statistically) not significant - and nothing here is standardized, which is why I can interpret these values in such a straightforward way.


I will check later, but I'm not sure if standardizing helps here and would allow me the same straightforward interpretation.

strengejacke commented 5 years ago

This is probably also a good read: http://www.indiana.edu/~kruschke/KruschkeFreqAndBayesAppTutorial.html

mattansb commented 5 years ago

I'm sorry - I must be missing something... Unless all your predictors are on the same scale, how does it make sense to use the same rope range for all of them? A change of one unit in X has a totally different meaning depending on the scale of X...

Also is this:

95% of the distribution

Referring to the HDI? Why is this defined as point + 2S?

strengejacke commented 5 years ago

Unless all your predictors are on the same scale, how does it make sense to use the same rope range for all of them?

Big confusion - I centered variables, not standardized them...

DominiqueMakowski commented 5 years ago

I'm sorry - I must be missing something... Unless all your predictors are on the same scale, how does it make sense to you the same rope range for all of them? A change in a unit of X has totally different meaning depending on the scale of X...

I think that @mattansb was asking about the scale of X, since the posterior here represents a given change in QoL related to a shift of 1 in the (continuous) predictor. For example, assuming that length of stay is a continuous predictor expressed in days, your posterior relates to the effect of 1 more day of length of stay. However, if you express your length of stay in, let's say, years, then your percentage in ROPE will be highly shrunk, and it might completely shift outside of the ROPE. Conversely, if you convert the unit to seconds, then the percentage in ROPE will be inflated, because 1 more second of length of stay is not the same as 1 more year.
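
In toy numbers (made up, only to illustrate the unit point):

# Same effect, different units of the predictor (made-up values)
rope <- c(-2, 2)                            # e.g. 0.1 * SD(QoL)
beta_per_day    <- 0.05                     # effect of one extra day of stay on QoL
beta_per_year   <- beta_per_day * 365       # ~18 -> far outside the ROPE
beta_per_second <- beta_per_day / 86400     # ~0.0000006 -> deep inside the ROPE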

mattansb commented 5 years ago

The main point of contention is this: if the recommended smallest (standardized) effect size (/slope) is |0.1|, it seems like an un-standardized rope range should be (Sy/Sx) * [-0.1 0.1]. This is equivalent to how - in frequentist / point-estimate terms - determining whether an effect is big always depends on the scales of Y and X. In other words, we cannot determine whether a parameter is clinically significant without accounting for X's and Y's scales (as we've seen in all the examples we've shown here + in the vignette). One final example: say we have a parameter of 2.5, HDI [1.0 4.0], and a rope range of Sy * [-0.1 0.1] = [-0.4 0.4]; then it would seem that the HDI is fully outside the rope. I.e., a change of 1 unit in X produced a non-negligible change in Y. But if we learn that Sx is 0.01, then we would never observe a change of 1 unit in X (1 unit = 100 * Sx). Thus an appropriate rope range should consider not only the scale of Y but also that of X (does a clinically significant change in X correspond to a clinically significant change in Y?).
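
In code (same made-up numbers as above):

hdi      <- c(1.0, 4.0)                     # 89% HDI of the slope
s_y      <- 4                               # SD of Y, so 0.1 * Sy gives [-0.4, 0.4]
s_x      <- 0.01                            # SD of the (oddly scaled) predictor
rope_dv  <- 0.1 * s_y * c(-1, 1)            # [-0.4, 0.4] -> HDI fully outside: "significant"
rope_adj <- (s_y / s_x) * 0.1 * c(-1, 1)    # [-40, 40]   -> HDI fully inside: negligible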

Okay, I think that's all I have to say about this topic.

Sorry for all the trouble - I've been sick at home these last couple of days and have probably had too much time to think...

DominiqueMakowski commented 5 years ago

I agree that the unit/scale of the parameter is super important and should be taken into consideration.

But I think it should be taken into consideration by the user (or eventually in wrappers that also implement some intelligent standardization) and not at bayestestR's level.

The reason for that is conceptual: the ROPE is basically a threshold on the response's scale, below which an effect is considered negligible and above which it is considered relevant. Thus it makes conceptual sense that the ROPE is defined solely based on that scale, i.e. the scale of the response. Then it should be the responsibility of the user to make sure that the parameters in their model make sense. Because this concern (that parameters are scaled in a way that makes their results not optimally interpretable) can be extended to many analyses, and we would have to add tons of warnings to every model (again, we fall back on the "where do we stop" argument)...

mattansb commented 5 years ago

But I think it should be taken into consideration by the user

I definitely agree! But also, I think that we should give a warning simply because we have internal defaults, which in most cases will be inappropriate (i.e., in all cases where the predictors don't have Sx = 1).

As to "where do we stop" - I would say we "owe" our users a warning if our defaults don't give appropriate results (like when in the BF functions when a prior is not provided - the default will nearly never give an appropriate result, and so we say "hey, this is what we did. You probably don't want this, you should do X".)

DominiqueMakowski commented 5 years ago

I think we can close this for now, as we have discussed this issue in docs and vignettes.

strengejacke commented 5 years ago

I still have the email draft to Kruschke in Outlook, and I won't forget it. I'll probably send it to him in the next day or so.