easystats / bayestestR

:ghost: Utilities for analyzing Bayesian models and posterior distributions
https://easystats.github.io/bayestestR/
GNU General Public License v3.0

ROPE range defaults to [-0.1,0.1] instead of '0 +/- .1 * sd(y)' for stan_glm in describe_posterior #587

Closed Kethasi closed 1 year ago

Kethasi commented 1 year ago

I believe this is a bug and apologise in advance if I am just not understanding the intended functionality correctly.

When:

```r
model_bayes <- stan_glm(...)
```

it seems that

```r
describe_posterior(model_bayes)
```

does not use `rope_range = rope_range(model_bayes)` but instead a default `rope_range = [-0.1, 0.1]`.

Or at least, I only get the 0 +/- 0.1 * sd(y) rope_range with

```r
describe_posterior(model_bayes, rope_range = rope_range(model_bayes))
```

while

```r
describe_posterior(model_bayes, rope_range = "default")
```

always leads to ROPE [-0.10, 0.10].

Thanks for your help!

bwiernik commented 1 year ago

Can you give a specific working example? Are you working with a log-normal model by chance?

Kethasi commented 1 year ago

Thanks for the quick response. This is the specific dataset I am working with:

```r
> data$y
 [1]  7 38  6 11  0  8  6  5 28  2 32 14 46 19 20 28 51 12 10  1  2 22

> data$x
 [1] 46 41 30 30 61 42 63 13 52 59 42 45 55 47 36 48 70 67 50 34 72 46

> model_bayes <- stan_glm(y ~ x, data = data)
> describe_posterior(model_bayes)

Parameter   | Median |          95% CI |     pd |          ROPE | % in ROPE |  Rhat |     ESS
---------------------------------------------------------------------------------------------
(Intercept) |   8.83 | [-13.91, 30.92] | 77.98% | [-0.10, 0.10] |     0.61% | 1.001 | 3568.00
x           |   0.16 | [ -0.28,  0.63] | 76.42% | [-0.10, 0.10] |    28.68% | 1.001 | 3418.00
```

even though

```r
> rope_range(model_bayes)
[1] -1.483619  1.483619
```

and correspondingly

```r
> rope(model_bayes)
# Proportion of samples inside the ROPE [-1.48, 1.48]:

Parameter   | inside ROPE
-------------------------
(Intercept) |      8.03 %
x           |    100.00 %
```

So currently the functions rope() and describe_posterior() return different estimates of the % in ROPE for the same stan_glm model.

Kethasi commented 1 year ago

Sorry about the weird formatting, just copy-pasted from R

bwiernik commented 1 year ago

When you copy paste from R, put three back-ticks ``` on the line before and after for clean formatting

Kethasi commented 1 year ago

Thanks, I'll keep that in mind for future posts.

At first glance, do you think this apparent contradiction in the ROPE is due to a bug, or is there some step (standardization?) that is indeed meant to be applied to the ROPE?

I ask because I am currently using these methods in an academic research project, and the % in ROPE for my predictor of interest changes from 0% to 100% depending on which function is used. Interestingly, in both the 0% and 100% cases, the pd returned by describe_posterior() remains unchanged at >99%. I am not sure what to make of this, but based on the ROPE definition of 0 +/- 0.1 * sd(y), I believe the null would need to be accepted despite the pd.

Also, I just discovered this post https://github.com/easystats/bayestestR/issues/144, which makes me question the utility of the % in ROPE: predictors whose values are numerically much larger than the outcome y (and which therefore have very small coefficients) are very likely to fall into a ROPE based solely on the SD of y, even when x and y are highly correlated. If there has been a compelling solution to this issue since those linked posts, I'd be grateful for any pointers!

EDIT: Shortly after posting this, I found the following passage from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6914840/:

Importantly, as [ROPE] is fixed to the scale of the response (it is expressed in the unit of the response), ROPE indices are sensitive to changes in the scale of the predictors. For instance, negligible results may change into non-negligible results when predictors are scaled up (e.g., reaction times expressed in seconds instead of milliseconds), which one inattentive or malicious researcher could misleadingly present as “significant” (note that indices of existence, such as the pd, would not be affected by this).

and another discussion of the parameter scaling issue here: https://mran.microsoft.com/snapshot/2021-03-14/web/packages/bayestestR/vignettes/region_of_practical_equivalence.html

It is curious that there is no straightforward solution to this issue. Wouldn't normalizing all data to a 0-1 range solve this by making the % in ROPE robust to scaling issues?
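A minimal sketch of the scaling sensitivity described in the quoted passage, using the toy data from above (the factor of 1000 is an arbitrary choice for illustration; exact numbers will vary by fit):

```r
library(rstanarm)
library(bayestestR)

# Same information, but x expressed in a larger unit (like seconds instead
# of milliseconds): its numeric values shrink, so its coefficient grows
data_scaled <- transform(data, x = x / 1000)
model_scaled <- stan_glm(y ~ x, data = data_scaled, refresh = 0)

p_direction(model_scaled)  # pd is invariant to linear rescaling of the predictor
rope(model_scaled)         # % in ROPE changes: the coefficient is now 1000x larger
                           # while the ROPE stays fixed at 0 +/- 0.1 * sd(y)
```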

DominiqueMakowski commented 1 year ago

Wouldn't normalizing all data to a 0-1 range solve this by making the % in ROPE robust to scaling issues?

Yes, standardizing can help and is often recommended, but it is not a silver bullet (especially for models with factors, interactions, etc.). In general, I'd recommend using ROPE if you know and can justify what ROPE range is meaningful. If not, I'd go with other methods that require fewer assumptions. Hope that helps!
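For completeness, a hedged sketch of the standardization route (using datawizard::standardize() from the easystats ecosystem is one option, not the only one; refitting on standardized data is an assumption for illustration):

```r
library(rstanarm)
library(bayestestR)
library(datawizard)

# Center and scale both outcome and predictor before fitting, so the
# coefficients and the ROPE are on a common standardized scale
data_z <- standardize(data)
model_z <- stan_glm(y ~ x, data = data_z, refresh = 0)
describe_posterior(model_z)
```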

strengejacke commented 1 year ago

@vincentarelbundock I think the source of this issue is your commit here: https://github.com/easystats/bayestestR/pull/532/commits/3993b37fcc94248c24ce7543cec579b61d4bca59

In .describe_posterior(), we no longer use the original model object, but rather the data frame of posterior draws. This means rope() receives a data frame, so the ROPE range cannot be detected from the model and the default [-0.1, 0.1] is used instead.

Do you remember why you made this change? For some function calls (like rope()/rope_range()) in .describe_posterior(), we still need the original model object. How can we best deal with this?
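A minimal illustration of the failure mode (assuming the model_bayes object from the example above; insight::get_parameters() is one way to obtain the posterior draws as a plain data frame):

```r
library(bayestestR)
library(insight)

rope_range(model_bayes)              # c(-1.48, 1.48), derived from 0.1 * sd(y)

post <- get_parameters(model_bayes)  # posterior draws as a plain data frame
rope(post)                           # no response variable left to inspect, so
                                     # the default range [-0.1, 0.1] is applied
```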

vincentarelbundock commented 1 year ago

I'm pretty sure this was for performance reasons. Some performance::performance calls I was making in modelsummary were extremely slow, and allowing data frames lets us extract the posterior once and reuse it, instead of rebuilding the whole thing many times.

But I don't remember the exact details, I'm afraid.

strengejacke commented 1 year ago

We could work around this by extracting some necessary values at the beginning of .describe_posterior(), e.g. calling rope_range() on the original model object and then continuing with the data frame of posteriors.
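A rough sketch of that idea (a hypothetical simplification for illustration, not the actual internals of .describe_posterior()):

```r
# Hypothetical simplification: resolve the ROPE range while we still have
# the model object, then hand only the posterior data frame downstream
describe_posterior_sketch <- function(model, rope_range = "default", ...) {
  if (identical(rope_range, "default")) {
    rope_range <- bayestestR::rope_range(model)  # 0 +/- 0.1 * sd(y) for stan_glm
  }
  post <- insight::get_parameters(model)  # extract the posterior draws once
  bayestestR::rope(post, range = rope_range, ...)
  # ...remaining summaries would reuse `post`, keeping the performance gain
}
```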

strengejacke commented 1 year ago

I just saw that it must actually have been my own commit here: https://github.com/easystats/bayestestR/pull/532/commits/ad9e5e5c30d38cedd3e477a14c684a89715d776d

And yes, this change resulted in a huge performance gain.