thomvolker opened this issue 2 years ago
Dear @thomvolker , I agree with your assessment and I had raised similar questions/concerns during early development of bain. @herberthoijtink we should discuss Tom's concerns, shall I plan a call with him?
With respect to the first issue: in case users omit the -1, we could add a message to the output stating that we have added it. I like the way it is now, because it protects the user against specifying hypotheses with respect to contrasts, which we can only handle using the one-group approach, which in the case of multiple groups is inferior to the multiple-group approach.

With respect to the second issue: this one is tricky. If users treat groups as groups, they should use the multiple-group approach. If users treat groups as predictors, they can actually use the one-group approach, because then it becomes a multiple regression. As far as I know, it is an unresolved discussion whether the latter would be wrong or merely inferior; for me, at the moment, it is a choice. I'm open to a meeting.
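For readers following along, the contrast-versus-group-means distinction behind the -1 can be seen directly in base R. A minimal sketch, using the built-in iris data instead of sesamesim:

```r
# With an intercept, R parameterizes a factor via contrasts:
# the intercept is the first group's mean and the remaining
# coefficients are *differences* from it.
fit_contrasts <- lm(Sepal.Length ~ Species, data = iris)
names(coef(fit_contrasts))
# "(Intercept)" "Speciesversicolor" "Speciesvirginica"

# Dropping the intercept with -1 reparameterizes the same model
# so that each coefficient is a group-specific mean, which is the
# parameterization bain's multiple-group approach expects.
fit_means <- lm(Sepal.Length ~ Species - 1, data = iris)
names(coef(fit_means))
# "Speciessetosa" "Speciesversicolor" "Speciesvirginica"
```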
Thank you for your responses!
@herberthoijtink I don't entirely understand your second remark. Specifically, I don't entirely understand the difference between "treating groups as groups" and "treating groups as predictors". Is this to say that it depends on whether a researcher is interested in differences between the groups (in which case groups are treated as groups) or merely includes the groups as a covariate (in which case groups are treated as predictors)? Say I have run an experiment with two groups, a control group and a group with an experimental manipulation, and I am interested in the effect of the experimental manipulation on an outcome. Would this imply treating groups as groups or treating groups as predictors?
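To make the question concrete, here is a hedged sketch of the two framings for a two-group experiment (the data and variable names are invented for illustration):

```r
set.seed(1)
treat <- rep(c(0, 1), each = 50)           # numeric 0/1 indicator
outcome <- 2 + 0.5 * treat + rnorm(100)

# "Groups as predictors": a regression with a numeric dummy;
# the single slope estimates the difference between the groups.
coef(lm(outcome ~ treat))

# "Groups as groups": a factor without an intercept;
# each coefficient estimates one group's mean.
group <- factor(treat, labels = c("control", "experimental"))
coef(lm(outcome ~ group - 1))
```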
Additionally, I noticed some other strange behavior of bain (I'm posting it here because I didn't want to spam your issues board, but if you would rather have it as a new issue, I can edit this post and open a new one). The "issue" here is that bain accepts models of class glm (which I'm not sure is supposed to happen), but when an intercept is included, it refits the model as an lm model (this works if the outcome is continuous, but throws an error if the outcome is a factor). I would assume that you either want to accept glms, and then also refit models as glms, or do not want to accept glms at all. The following reprex shows what I mean (comments are included within the reprex).

In short, bain accepts glm models either when only numeric variables are included as predictors, or when the outcome is numeric but on a 0/1 scale, but not when there are factor predictor variables involved and the outcome is also a factor. Accordingly, I wonder whether bain can actually deal with glms, especially because when I run bain on a glm (and no error is returned), bain returns the line

Bayesian informative hypothesis testing for an object of class glm:
# Load required packages
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(magrittr)
#> Warning: package 'magrittr' was built under R version 4.1.2
library(bain)
dat1 <- sesamesim %>%
  mutate(sex = sex - 1,
         site = as.factor(site))
dat2 <- sesamesim %>%
  mutate(sex = factor(sex, labels = c("boy", "girl")),
         site = as.factor(site))
fit1 <- glm(sex ~ site + prenumb,
            family = binomial,
            data = dat1)
summary(fit1)
#>
#> Call:
#> glm(formula = sex ~ site + prenumb, family = binomial, data = dat1)
#>
#> Deviance Residuals:
#>     Min      1Q  Median      3Q     Max
#>  -1.358  -1.206   1.029   1.130   1.341
#>
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)
#> (Intercept)   0.15210    0.41475   0.367    0.714
#> site2         0.34588    0.37716   0.917    0.359
#> site3         0.04317    0.38179   0.113    0.910
#> site4         0.25095    0.40818   0.615    0.539
#> site5        -0.18144    0.54183  -0.335    0.738
#> prenumb      -0.00917    0.01360  -0.674    0.500
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 332.29 on 239 degrees of freedom
#> Residual deviance: 330.43 on 234 degrees of freedom
#> AIC: 342.43
#>
#> Number of Fisher Scoring iterations: 3
bf1 <- bain(fit1, "site2 > 0 & site5 < 0")
bf1
#> Bayesian informative hypothesis testing for an object of class glm:
#>
#> Fit Com BF.u BF.c PMPa PMPb
#> H1 0.000 0.249 0.000 0.000 1.000 0.000
#> Hu 1.000
#>
#> Hypotheses:
#> H1: site2>0&site5<0
#>
#> Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.
summary(bf1)
#> Parameter n Estimate lb ub
#> 1 site1 60 0.490438410 0.361223888 0.619652932
#> 2 site2 55 0.576339379 0.438023028 0.714655730
#> 3 site3 64 0.501324850 0.370906355 0.631743344
#> 4 site4 43 0.552872746 0.401450005 0.704295487
#> 5 site5 18 0.445372841 0.212542146 0.678203536
#> 6 prenumb 240 -0.002273624 -0.008952232 0.004404983
## Although in line with the hypothesis, the fit is zero,
## because bain seems to fit a linear probability model,
## and probabilities cannot be below 0 (this also explains
## why the estimated coefficient of prenumb is much smaller
## here).
fit2 <- glm(sex ~ site + prenumb,
            family = binomial,
            data = dat2)
bf2 <- bain(fit2, "site2 > 0 & site5 < 0")
#> Warning in model.response(mf, "numeric"): using type = "numeric" with a factor
#> response will be ignored
#> Warning in Ops.factor(y, z$residuals): '-' not meaningful for factors
#> Warning in Ops.factor(r, 2): '^' not meaningful for factors
#> Error in eigen(Sigma): infinite or missing values in 'x'
## Throws an error, because `lm` is not applicable
## to factors.
fit3 <- glm(sex ~ prenumb,
            family = binomial,
            data = dat2)
## But the problem dissolves when only a continuous predictor is included.
bf3 <- bain(fit3, "prenumb < 0")
bf3
#> Bayesian informative hypothesis testing for an object of class glm:
#>
#> Fit Com BF.u BF.c PMPa PMPb
#> H1 0.712 0.500 1.423 2.468 1.000 0.587
#> Hu 0.413
#>
#> Hypotheses:
#> H1: prenumb<0
#>
#> Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.
summary(bf3)
#> Parameter n Estimate lb ub
#> 1 prenumb 240 -0.006814645 -0.03074668 0.01711739
Created on 2022-03-09 by the reprex package (v2.0.0)
Your first question is one to deal with face to face. With respect to glm: I did not know that bain could process glm objects. They are not discussed in the vignette and they have not been debugged, so don't use them as input for bain; use named vectors (see the vignette) instead. In the next version of bain I will arrange that glm objects cannot be processed. I'm rather certain that the results obtained will not be correct.
Dear @herberthoijtink, I'm pretty sure that BFpack uses bain:::get_estimates.glm(), so it's important not to simply throw an error from that function. More suitable would be to give an error in bain:::bain.default() if an unsupported object is passed. The estimates object contains information about the class of the original object.
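One hedged way to implement that suggestion (the function name, the supported-class list, and the attribute that stores the original class are assumptions for illustration, not bain's actual internals):

```r
# Hypothetical sketch of an early guard for bain.default();
# "analysisType" and the supported vector are assumed names.
check_input_class <- function(estimates,
                              supported = c("lm", "t_test", "numeric")) {
  original_class <- attr(estimates, "analysisType")  # assumed attribute
  if (!is.null(original_class) && !original_class %in% supported) {
    stop("bain does not support objects of class '", original_class,
         "'; use named-vector input instead (see the vignette).",
         call. = FALSE)
  }
  invisible(estimates)
}
```

This keeps get_estimates.glm() usable for downstream packages such as BFpack while still refusing unsupported input at the point where bain's own computations begin.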
@CasparVanLissa Perfect. We will do that in the next version (probably winter 2022). Please note that if BFpack feeds glm objects into bain, it will not give correct results. The proper processing of a glm object is quite different from that of an lm object and can currently only be achieved using the named-vector input. If you know that BFpack feeds glm into bain, maybe it would be good to warn Joris Mulder!?
@cjvanlissa @herberthoijtink If the next update is scheduled for winter 2022, you might want to consider adding a bain.glm function. I figure that, to a large extent, it should be relatively similar to the bain.lm function (but perhaps I am missing some important details). I assume that the most important difference is that it is not at all straightforward to accommodate the multiple-group approach, which depends on which member of the glm family is fitted.
On a side note, you might also want to consider adding robust standard errors to bain as an input argument. With the sandwich package in R, it is relatively straightforward to calculate variance-covariance matrices that are adjusted for clustered residuals, heteroscedasticity, etc.
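Assuming the named-vector interface accepts an arbitrary covariance matrix through its Sigma argument, a robust variance-covariance matrix from sandwich could be plugged in along these lines (a sketch, not tested against bain; the bain call is commented out and follows my reading of its named-vector signature):

```r
library(sandwich)

fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

# Heteroscedasticity-consistent (HC3) covariance matrix
vc_robust <- vcovHC(fit, type = "HC3")

# For clustered residuals one would instead use, e.g.:
# vc_cluster <- vcovCL(fit, cluster = ~ cluster_id)

est <- coef(fit)
# bain(est, "Sepal.Width > 0", n = nrow(iris), Sigma = vc_robust,
#      group_parameters = 0, joint_parameters = length(est))
```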
I don't have time to dive into this right now, but I might be able to spend some time on this from May onwards.
@ThomVolker Let me know if you want to work on these things. Look, e.g., at the logistic regression examples in the bain vignette: with named-vector input all of this can be done; however, constructing a user-friendly wrapper is another thing. I may have a few pointers that might speed things along.
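For anyone landing here from search, the named-vector route for a logistic regression looks roughly like the following. The argument names (n, Sigma, group_parameters, joint_parameters) follow my reading of the bain vignette; double-check them there before relying on this:

```r
library(bain)

fit <- glm(sex ~ prenumb, family = binomial, data = sesamesim)

est <- coef(fit)                  # estimates on the logit scale
names(est) <- c("Int", "prenumb")

bain(est, "prenumb < 0",
     n = nrow(sesamesim),
     Sigma = vcov(fit),
     group_parameters = 0,
     joint_parameters = length(est))
```

Note that this evaluates the hypothesis on the logit scale, unlike the linear probability model that the glm-refitting path appears to fall back on.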
When estimating the effect of a categorical variable, such as, for example, in an ANOVA, bain by default drops the intercept. This seems like a logical decision, but it results in seemingly unexpected behavior, because no warning message is shown. Accordingly, the estimate as given in an lm model reflects the difference between two categories, while bain uses the same name to refer to the group-specific mean. This issue becomes apparent in the following reprex.

Created on 2022-03-09 by the reprex package (v2.0.0)
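The reprex itself is not reproduced above, but the naming mismatch is easy to illustrate with base R alone (iris stands in for the actual data):

```r
# lm with an intercept: "Speciesversicolor" denotes a *difference*
# from the first group's mean.
coef(lm(Sepal.Length ~ Species, data = iris))

# lm without an intercept: a label of the same style now denotes a
# *group mean*; bain's intercept-dropping yields names of this kind,
# so an identical-looking name carries a different meaning.
coef(lm(Sepal.Length ~ Species - 1, data = iris))
```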
On a related note, researchers may also manually dummy code their categories. If this is the case, bain does not remove the intercept, but estimates the difference. However, this ignores the fact that there are multiple groups involved, which results in an inconsistent Bayes factor (Hoijtink, Gu & Mulder, 2019). Is it indeed problematic to create numeric dummy variables and estimate their effects using bain? I ask because the resulting Bayes factors are highly similar (see reprex below). If this is actually a problematic way to estimate the effects of categorical variables, it may be worthwhile to let bain recognize numeric dummy variables and display a warning if these are observed.

Created on 2022-03-09 by the reprex package (v2.0.0)
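A hedged sketch of the kind of check that could power such a warning (the function names are invented, and a 0/1 numeric column is of course only a heuristic signal of a dummy variable):

```r
# Flag numeric predictors that take only the values 0 and 1,
# which is what manually coded dummy variables look like.
looks_like_dummy <- function(x) {
  is.numeric(x) && all(x %in% c(0, 1)) && length(unique(x)) == 2
}

warn_on_dummies <- function(data) {
  flagged <- names(data)[vapply(data, looks_like_dummy, logical(1))]
  if (length(flagged) > 0) {
    warning("Predictor(s) ", paste(flagged, collapse = ", "),
            " look like manually coded dummy variables; consider ",
            "using a factor so the multiple-group approach applies.",
            call. = FALSE)
  }
  invisible(flagged)
}
```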