cjvanlissa / bain

Bayes Factors for Informative Hypotheses
GNU General Public License v3.0

Implicitly removing the intercept in `bain` results in seemingly unexpected behavior #39

Open thomvolker opened 2 years ago

thomvolker commented 2 years ago

When estimating the effect of a categorical variable, as in, for example, an ANOVA, bain by default drops the intercept. This seems like a logical decision, but it leads to seemingly unexpected behavior, because no warning message is shown. As a result, the estimate reported by an `lm` model reflects the difference between two categories, while bain uses the same name to refer to the group-specific mean. This issue becomes apparent in the following reprex.

# Load required packages
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
#> Warning: package 'magrittr' was built under R version 4.1.2
library(bain)

set.seed(1)

# Load data and specify sex as a factor variable
dat1 <- sesamesim %>%
  mutate(sex = factor(sex, labels = c("boy", "girl")))

# Fit model
mod1 <- lm(postnumb ~ sex, dat1)

# Summary of model
summary(mod1)
#> 
#> Call:
#> lm(formula = postnumb ~ sex, data = dat1)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -30.096  -8.096  -0.856   8.144  32.904 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   30.096      1.175  25.616   <2e-16 ***
#> sexgirl       -1.240      1.628  -0.761    0.447    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 12.6 on 238 degrees of freedom
#> Multiple R-squared:  0.00243,    Adjusted R-squared:  -0.001761 
#> F-statistic: 0.5799 on 1 and 238 DF,  p-value: 0.4471

# Note that sexgirl here reflects the difference with sexboy
# If we use this model in bain, and specify a negative 
# effect for sexgirl (which seems like a logical thing to do),
# this results in unexpected behavior, because bain implicitly
# transforms the intercept and difference into two separate means.
# However, this only becomes apparent after inspecting the summary
# of the bain output.
bf1 <- bain(mod1, "sexgirl < 0")

bf1
#> Bayesian informative hypothesis testing for an object of class lm (ANOVA):
#> 
#>    Fit   Com   BF.u  BF.c  PMPa  PMPb 
#> H1 0.000 0.500 0.000 0.000 1.000 0.000
#> Hu                               1.000
#> 
#> Hypotheses:
#>   H1: sexgirl<0
#> 
#> Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.
# Fit is 0, which seems weird, because the hypothesis is in line
# with the effect as estimated by the `lm` function.

# Only when calling summary(bf1) does it become apparent that `sexgirl`
# suddenly refers to the mean of the girls, rather than the difference
# between boys and girls.
summary(bf1)
#>   Parameter   n Estimate       lb       ub
#> 1    sexboy 115 30.09565 27.79295 32.39836
#> 2   sexgirl 125 28.85600 26.64732 31.06468

Created on 2022-03-09 by the reprex package (v2.0.0)

On a related note, researchers may also manually dummy code their categories. In that case, bain does not remove the intercept, but estimates the difference. However, this ignores the fact that there are multiple groups involved, which results in an inconsistent Bayes factor (Hoijtink, Gu & Mulder, 2019). Is it indeed problematic to create numeric dummy variables and estimate their effects using bain, given that the resulting Bayes factors are highly similar (see reprex below)? If this is actually a problematic way to estimate the effects of categorical variables, it may be worthwhile to let bain recognize numeric dummy variables and display a warning when they are detected.

# Load required packages
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
#> Warning: package 'magrittr' was built under R version 4.1.2
library(bain)

set.seed(1)

# Make the site variable a factor
bf_group_means <- sesamesim %>%
  mutate(site = as.factor(site)) %$%
  lm(postnumb ~ site + prenumb) %>%
  bain("site3 < site1 < site2 & prenumb > 0")

bf_group_means
#> Bayesian informative hypothesis testing for an object of class lm (ANCOVA):
#> 
#>    Fit   Com   BF.u  BF.c   PMPa  PMPb 
#> H1 0.443 0.057 7.794 13.189 1.000 0.886
#> Hu                                0.114
#> 
#> Hypotheses:
#>   H1: site3<site1<site2&prenumb>0
#> 
#> Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.

# Create dummies manually; I'll use site = 1 as the reference
bf_differences <- sesamesim %>%
  mutate(site1 = ifelse(site == 1, 1, 0),
         site2 = ifelse(site == 2, 1, 0),
         site3 = ifelse(site == 3, 1, 0),
         site4 = ifelse(site == 4, 1, 0),
         site5 = ifelse(site == 5, 1, 0)) %$%
  lm(postnumb ~ site2 + site3 + site4 + site5 + prenumb) %>%
  bain("site3 < 0 & site2 > 0 & prenumb > 0")

bf_differences
#> Bayesian informative hypothesis testing for an object of class lm (continuous predictors):
#> 
#>    Fit   Com   BF.u  BF.c   PMPa  PMPb 
#> H1 0.448 0.055 8.118 13.888 1.000 0.890
#> Hu                                0.110
#> 
#> Hypotheses:
#>   H1: site3<0&site2>0&prenumb>0
#> 
#> Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.
# Almost identical fit & complexity

summary(bf_group_means)
#>   Parameter   n   Estimate         lb        ub
#> 1     site1  60 27.4293820 25.1603844 29.698380
#> 2     site2  55 34.9818244 32.5529993 37.410650
#> 3     site3  64 27.6904082 25.4002689 29.980548
#> 4     site4  43 26.9840163 24.3250440 29.642989
#> 5     site5  18 31.4298837 27.3413936 35.518374
#> 6   prenumb 240  0.7159311  0.5986552  0.833207
summary(bf_differences)
#>   Parameter   n  Estimate         lb        ub
#> 1     site3 240 0.2610262 -3.0455948  3.567647
#> 2     site2 240 7.5524424  4.3017084 10.803176
#> 3   prenumb 240 0.7159311  0.5986552  0.833207
# In the latter, the grouping structure is ignored

Created on 2022-03-09 by the reprex package (v2.0.0)

cjvanlissa commented 2 years ago

Dear @thomvolker , I agree with your assessment, and I had raised similar questions/concerns during early development of bain. @herberthoijtink we should discuss Tom's concerns; shall I plan a call with him?

herberthoijtink commented 2 years ago

With respect to the first issue: in case users omit the -1, we could add a message to the output stating that we have added it. I like the way it is now, because it protects the user against specifying hypotheses with respect to contrasts, which we can only handle using the one-group approach, which in the case of multiple groups is inferior to the multiple-group approach. With respect to the second issue: this one is tricky. If users treat groups as groups, they should use the multiple-group approach. If users treat groups as predictors, they can actually use the one-group approach, because then it becomes a multiple regression. As far as I know, it is an unresolved discussion whether the latter would be wrong or inferior; for me, at the moment, it is a choice. I'm open to a meeting.
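The point about the `-1` can be made concrete: the parameterization bain uses internally is the one `lm` produces when the intercept is removed explicitly in the formula. A minimal sketch, reusing the `sesamesim` data from the reprexes above:

```r
# Sketch: removing the intercept with `- 1` makes lm() report one mean
# per group, matching the parameterization bain switches to internally.
library(bain)

dat1 <- sesamesim
dat1$sex <- factor(dat1$sex, labels = c("boy", "girl"))

# Default parameterization: intercept plus a boy-girl contrast,
# so `sexgirl` is the *difference* from the reference category.
coef(lm(postnumb ~ sex, dat1))

# Intercept removed: `sexboy` and `sexgirl` are the group means,
# which is how bain's summary() labels them.
coef(lm(postnumb ~ sex - 1, dat1))
```

Fitting the model with `- 1` up front would also make the hypothesis `"sexgirl < 0"` in the first reprex visibly wrong, since `sexgirl` is then clearly a mean rather than a difference.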

thomvolker commented 2 years ago

Thank you for your responses!

@herberthoijtink I don't entirely understand your second remark. Specifically, I don't entirely understand the difference between "treating groups as groups" and "treating groups as predictors". Is this to say that it depends on whether a researcher is interested in differences between the groups (in which case groups are treated as groups) or merely includes the groups as a covariate (in which case groups are included as a predictor)? Say I have run an experiment with two groups, a control group and a group with an experimental manipulation, and I am interested in the effect of the manipulation on an outcome. Would this imply treating groups as groups or treating groups as predictors?

Additionally, I noticed some other strange behavior of bain (I'm posting it here because I didn't want to spam your issues board, but if you would rather have it as a new issue, I can edit this post and open one). The "issue" here is that bain accepts models of class glm (which I'm not sure is supposed to happen), but that when an intercept is included, it refits the model as an lm model (this works if the outcome is continuous, but throws an error if the outcome is a factor). I would assume that you either want to accept glms, and then also refit models as glms, or do not want to accept glms at all. The following reprex shows what I mean (comments are included within the reprex).

In short, bain accepts glm models when only numeric variables are included as predictors, or when the outcome is numeric but on a 0/1 scale, but not when factor predictor variables are involved and the outcome is also a factor. Accordingly, I wonder whether bain can actually deal with glms, especially because when I run bain on a glm (and no error is returned), bain prints the line Bayesian informative hypothesis testing for an object of class glm:

# Load required packages
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
#> Warning: package 'magrittr' was built under R version 4.1.2
library(bain)

dat1 <- sesamesim %>%
  mutate(sex = sex - 1,
         site = as.factor(site))

dat2 <- sesamesim %>%
  mutate(sex = factor(sex, labels = c("boy", "girl")),
         site = as.factor(site))

fit1 <- glm(sex ~ site + prenumb,
            family = binomial,
            data = dat1)

summary(fit1)
#> 
#> Call:
#> glm(formula = sex ~ site + prenumb, family = binomial, data = dat1)
#> 
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -1.358  -1.206   1.029   1.130   1.341  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)
#> (Intercept)  0.15210    0.41475   0.367    0.714
#> site2        0.34588    0.37716   0.917    0.359
#> site3        0.04317    0.38179   0.113    0.910
#> site4        0.25095    0.40818   0.615    0.539
#> site5       -0.18144    0.54183  -0.335    0.738
#> prenumb     -0.00917    0.01360  -0.674    0.500
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 332.29  on 239  degrees of freedom
#> Residual deviance: 330.43  on 234  degrees of freedom
#> AIC: 342.43
#> 
#> Number of Fisher Scoring iterations: 3

bf1 <- bain(fit1, "site2 > 0 & site5 < 0")

bf1
#> Bayesian informative hypothesis testing for an object of class glm:
#> 
#>    Fit   Com   BF.u  BF.c  PMPa  PMPb 
#> H1 0.000 0.249 0.000 0.000 1.000 0.000
#> Hu                               1.000
#> 
#> Hypotheses:
#>   H1: site2>0&site5<0
#> 
#> Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.
summary(bf1)
#>   Parameter   n     Estimate           lb          ub
#> 1     site1  60  0.490438410  0.361223888 0.619652932
#> 2     site2  55  0.576339379  0.438023028 0.714655730
#> 3     site3  64  0.501324850  0.370906355 0.631743344
#> 4     site4  43  0.552872746  0.401450005 0.704295487
#> 5     site5  18  0.445372841  0.212542146 0.678203536
#> 6   prenumb 240 -0.002273624 -0.008952232 0.004404983
## Although in line with the hypothesis, the fit is zero,
## because bain seems to fit a linear probability model,
## and probabilities cannot be below 0 (this also explains
## why the estimated coefficient of prenumb is much smaller
## here).

fit2 <- glm(sex ~ site + prenumb,
            family = binomial,
            data = dat2)

bf2 <- bain(fit2, "site2 > 0 & site5 < 0")
#> Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
#> response will be ignored
#> Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
#> Warning in Ops.factor(r, 2): '^' not meaningful for factors
#> Error in eigen(Sigma): infinite or missing values in 'x'
## Throws an error, because `lm` is not applicable
## to factors.

fit3 <- glm(sex ~ prenumb,
            family = binomial,
            data = dat2)

## But the problem disappears when only a continuous predictor is included.
bf3 <- bain(fit3, "prenumb < 0")
bf3
#> Bayesian informative hypothesis testing for an object of class glm:
#> 
#>    Fit   Com   BF.u  BF.c  PMPa  PMPb 
#> H1 0.712 0.500 1.423 2.468 1.000 0.587
#> Hu                               0.413
#> 
#> Hypotheses:
#>   H1: prenumb<0
#> 
#> Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.
summary(bf3)
#>   Parameter   n     Estimate          lb         ub
#> 1   prenumb 240 -0.006814645 -0.03074668 0.01711739

Created on 2022-03-09 by the reprex package (v2.0.0)

herberthoijtink commented 2 years ago

Your first question is one to deal with face to face. With respect to glm: I did not know that bain could process glm objects. They are not discussed in the vignette and have not been debugged, so don't use them as input for bain; use named vectors (see the vignette) instead. In the next version of bain I will arrange that glm objects cannot be processed. I'm rather certain that the results obtained will not be correct.
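For readers landing here: the named-vector route mentioned above means extracting the estimates and their covariance matrix yourself and handing them to `bain`. A rough sketch for the logistic regression from the reprex, modeled on the named-vector examples in the vignette (the argument names `n`, `Sigma`, `group_parameters`, and `joint_parameters`, and the one-group settings, are my reading of those examples and should be checked against the vignette):

```r
# Sketch: feed a glm's estimates to bain via the named-vector interface
# instead of passing the glm object itself.
library(bain)

dat <- sesamesim
dat$sex <- dat$sex - 1                       # recode to 0/1

fit <- glm(sex ~ prenumb, family = binomial, data = dat)

est <- coef(fit)                             # named vector of estimates
names(est) <- c("int", "prenumb")            # rename "(Intercept)"; parentheses
                                             # would clash with hypothesis syntax
covm <- vcov(fit)                            # asymptotic covariance matrix

bain(est, "prenumb < 0",
     n = nrow(dat),
     Sigma = covm,
     group_parameters = 0,                   # one-group approach (assumed setting)
     joint_parameters = 2)                   # int and prenumb
```

This keeps the (correct) glm estimates and covariance matrix, rather than letting bain silently refit the model with `lm`.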

cjvanlissa commented 2 years ago

Dear @herberthoijtink , I'm pretty sure that BFpack uses bain:::get_estimates.glm(). So it's important not to simply throw an error with that function. More suitable would be to give an error in bain:::bain.default() if an unsupported object is passed. The estimates object contains information about the class of the original object.

herberthoijtink commented 2 years ago

@CasparVanLissa. Perfect. We will do that in the next version (probably winter 2022). Please note that if BFpack feeds glm objects into bain, it will not give correct results. The proper processing of a glm object is quite different from that of an lm object and can currently only be achieved using the named-vector input. If you know that BFpack feeds glm into bain, maybe it would be good to warn Joris Mulder!?

thomvolker commented 2 years ago

@cjvanlissa @herberthoijtink If the next update is scheduled for winter 2022, you might want to consider adding a bain.glm function. I figure that it should be relatively similar to the bain.lm function (but perhaps I am missing some important details).

I assume that the most important difference is that it is not at all straightforward to accommodate the multiple-group approach, which depends on which member of the glm family is fitted.

On a side note, you might also want to consider adding robust standard errors to bain as an input argument. With the sandwich package in R, it is relatively straightforward to calculate variance-covariance matrices that are adjusted for clustered residuals, heteroscedasticity, etc.
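The sandwich route can already be approximated today through the named-vector interface, since `bain` takes whatever covariance matrix you supply. A sketch under the assumption that the named-vector arguments are `n`, `Sigma`, `group_parameters`, and `joint_parameters`, as in the vignette's examples:

```r
# Sketch: pass a heteroscedasticity-robust covariance matrix (sandwich)
# to bain via named-vector input; bain itself has no robust-SE argument.
library(bain)
library(sandwich)

fit <- lm(postnumb ~ prenumb, data = sesamesim)

est <- coef(fit)
names(est) <- c("int", "prenumb")        # rename "(Intercept)" for bain

vrob <- vcovHC(fit, type = "HC3")        # robust variance-covariance matrix

bain(est, "prenumb > 0",
     n = nrow(sesamesim),
     Sigma = vrob,                       # robust matrix instead of vcov(fit)
     group_parameters = 0,               # one-group approach (assumed setting)
     joint_parameters = 2)
```

A built-in argument would mainly save users this manual extraction step.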

I don't have time to dive into this right now, but I might be able to spend some time on this from May onwards.

herberthoijtink commented 2 years ago

@ThomVolker. Let me know if you want to work on these things. Look, e.g., at the logistic regression examples in the bain vignette; with named-vector input all of this can be done. However, constructing a user-friendly wrapper is another thing. I may have a few pointers that might speed things along.