cccnrc / bayer

Bayesian analysis in R made easy
0 stars 1 forks source link

Enhancement proposals #1

Open rduesing opened 7 months ago

rduesing commented 7 months ago

Dear Enrico!

As requested, I will post my enhancement proposals here on github. Let's start with the generic example I gave. For example, I have 2 or more time points/groups and I want to predict sigma as well. How can I formulate a repeated measures model, which allows for heterogenity? My generic code was:

brm(data =data,
        family = student(),
        formula = bf(outcome ~ 0 + Intercept + time + (1 | ID),
        sigma ~ 0 + Intercept + time)
)

You "translated" it to the mtcars data set as follows:

brm( data = mtcars,
        family = student(),
        formula = bf( mpg ~ 0 + Intercept + hp, 
        sigma ~ 0 + Intercept + hp) 
)

Since my request specifically was concerned with nominal predictors, I would adjust you example and use "cyl" as predictor and reformulate the model for sigma, since this directly incorporates an enhancement request:

library(brms)
library(tidyverse)

data <- mtcars |> 
  mutate(cyl = as.factor(cyl))

mod.1 <- brm( data = data ,
              family = student(),
              formula = bf( mpg ~ 0 + Intercept + cyl, 
                            sigma ~ 0 + cyl)
)

Which should give you something like this as output with default priors:

Family: student 
  Links: mu = identity; sigma = log; nu = identity 
Formula: mpg ~ 0 + Intercept + cyl 
         sigma ~ 0 + cyl
   Data: data (Number of observations: 32) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     26.57      1.64    23.37    29.81 1.00     1239     1370
cyl6          -6.80      1.78   -10.40    -3.22 1.00     1346     1487
cyl8         -11.42      1.82   -14.90    -7.85 1.00     1370     1648
sigma_cyl4     1.53      0.25     1.08     2.06 1.00     2433     2145
sigma_cyl6     0.42      0.32    -0.13     1.11 1.00     2976     2371
sigma_cyl8     0.91      0.24     0.46     1.41 1.00     2716     2181

Further Distributional Parameters:
   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
nu    22.27     14.60     4.13    59.58 1.00     2809     2007

Now I tried it with bayer.

mod.2 <- bayer_linear(data = data,
                      outcome = "mpg",
                      covariates = c("cyl"),
                      robust = TRUE 
                      )

There were the first issues:

  1. I cannot explicitly request the Intercept. If you do not use centered or standardized variables (in case of only grouping variables this shouldn't matter, as in this example) it is recommended to use the formula 0 + Intercept + predictor instead of 1 + predictor . This is because brms assumes for the latter centered/standardized predictors and will automatically adjust the priors for the intercept. With the former formula, you can put specific priors on the Intercept itself. It is not clear in bayer, what exactly will be formulated and transferred to brms and hence stan.
  2. There seems to be no option to model sigma itself, to allow for heterogeneity depending on predictor(s).
  3. As can be seen in the difference between the formulas for mu and sigma, the former uses the indicator variable approach, setting one group as reference, whereas the latter uses the index variable approach, so that all groups get a separate intercept. This can be very helpful, especially in formulating priors, which is easier for the groups as for the slopes/differences in most cases (see McElreath, 2022) on this topic.
  4. What is missing also (or did I miss it) that you cannot formulate a random effect, like in my original code ... + (1|ID). There is no analogue in the mtcars example, but I think this is an important feature. (That's why I have chosen "time" as a generic example, which calls for the random intercept effect for ID/level 2 variable)

The bayer output looks like this:

Family: student 
  Links: mu = identity; sigma = identity; nu = identity 
Formula: mpg ~ cyl 
   Data: data2 (Number of observations: 32) 
  Draws: 20 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 20000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    26.50      1.10    24.28    28.64 1.00    10480    11372
cyl6         -6.77      1.62   -10.01    -3.55 1.00    11307    12558
cyl8        -11.37      1.42   -14.20    -8.60 1.00    11285    12426

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     3.16      0.49     2.30     4.23 1.00    11268     9940
nu       21.87     14.01     4.19    56.97 1.00    12136     8947

Of course the output looks very similar to the first model, if you look at the estimates for Intercept and slopes. But the difference is in the sigmas. The first model requested sigmas for each group separately, what is not possible (yet) in bayer. Additionally, for the first model, a log link was used to estimate the sigmas as compared to the identity link of the bayer model. This is automatically done by brms if you estimate sigma. Therefore an enhancement request:

  1. It would be nice if the estimated would automatically back transformed into the original metric. For example, to get sigma for "cyl4", I need to exponentiate the posterior of sigma_cyl4 and then look at the mean/median/MAP, to the the estimate.
  2. Likewise, this could be helpful for other generalized linear models

If I draw your posterior with the inbuilt function and plot it:

post.2 <- extract_posterior(mod.2, var = "cyl")
post.2

I get the following figure:

mod 2 post

  1. How does "log risk distribution" correspond to my model? Why "risk"? Shouldn't this be more flexible?
  2. Why is only the posterior shown for cyl8 and not cyl4 and cyl6 (or more specifically, shown is the difference of cyl8 as compared to the reference cyl4 --> it would be nice if I would be able to get the posterior for the differences [here cyl6 and cyl8], but also for the calculated posterior of the group, e.g. to get the posterior for cyl8 I need to wrangle the posterior cyl8 = b_Intercept + b_cyl8)?

Other requests concerning the posterior and posterior wrangling:

  1. In case of categorical predictors (for mu and sigma), it would be nice if I would get the mu_i and sigma_i as well as the differences between them automatically. Otherwise I need to wrangle it manually with the posterior myself.
  2. If I have more "advanced" posterior, it would be nice to get this plotted with automatically added information about specified ETI/HDI, ROPE, central tendencies (median, mean, MAP).
  3. It would be totally nice, if prior elicitation would be made easier, if the link function is not "identity". For example: I have a good grasp in which range my sigma for each group (in case of categorical predictor) will be, but it isn't easy to "transform" it into an appropriate log distribution. Assume I know sigma should be normally distributed trunc_normal(15, 3, lb = 0). How do I translate it in an appropriate prior distribution on the log scale. This gets more complicated with other prior distributions of course.

So, I really appreciate your effort and would like to help. My problem is not so much with the model formulation itself, but to "translate" the priors and to handle the posterior. I know what I want, but the scripting is time consuming and it would be a great help to automate some processes, which would make it also less prone for errors.

I am really looking forward to hear from you Rainer

cccnrc commented 6 months ago

Hi @rduesing , I have been implementing features in bayer. I will follow your points in my reply stick with the model example you provided with data from mtcars:

  1. specify Interpcept and 0/1: you can simply pass them as covariates: bayer_linear( ..., covariates = c( 0, "Intercept", "cyl", "qsec" ), ... )
  2. sigma specification: you can now use it in the model: bayer_linear( ..., sigma_covariates = c( 0, "cyl", "carb" ), nu_link = "logm1", ... )
  3. I guess it is answered in point 1, if not please let me know
  4. random effect variable: bayer_linear( ..., random_covariates = "gear", ... )
  5. estimate back-transformed to the original metric: you can now specify bayer_linear( ..., transformed = FALSE, ...) and you will get results (and plots) with original metrics
  6. this is applied to all families with log or logit transformed variables and sigma
  7. The default word for plots is now posterior, I also added a extract_posterior( ..., plot_risk_word = 'posterior', ...) option to allow for more variability based on users need
  8. now all levels for a factor variables (and relative plots) are returned, user can also specify a specific level with extract_posterior( ..., var = "cyl", var_level = 6, var_level_sigma=TRUE, ...) and the function will return this level values only. var_level_sigma is to specify if you want b_cyl6 or b_sigma_cyl6, in case you passed it as a variable for sigma analysis
  9. I think this is in point 8, in case is not please let me know
  10. advanced posterior: now you get a $stats in the returned object from extract_posterior() with several values (mean, sd, se, median, q1, q3, min, max, MAP, ETI, HDI, ROPE) for each level of the variable of interest (or the specified level if var_level was set. These are returned accordingly to transformed as well
  11. I guess this comes back to transformed, if not please let me know

Please note, to see changes you need to reinstall bayer: remotes::install_github( 'cccnrc/bayer' )

I am still working on adapting few things, I would love to have your suggestion on plot text values as example. I kindly ask you to give it a try to the new version and let me know if you find any problem, bugs, and further enhancement!

Thanks again for reaching out!

Enrico

rduesing commented 6 months ago

HI Enrico,

many thanks. I tried two models, my own and the mtcars example, but I couldnt get it running. I installed the latest version of bayer, on the help site random_covariates is displayed, so that cant be the issue.

The code:

bayer.1 <- bayer_linear(data = data_s_long_nm,
                        outcome = "values",
                        robust = TRUE,
                        covariates = c("0", "Intercept", "CBLC"),
                        random_covariates = "Pat_ID",
                        sigma_covariates = c("0", "CBLC"),
                        cores = 4,
                        brm_iter = 6000)

dat <- mtcars

bayer.test <- bayer_linear(data = dat,
                           outcome = "mpg",
                           covariates = c( 0, "Intercept", "cyl", "carb"),
                           sigma_covariates = c( 0, "cyl", "carb" ), 
                           random_covariates = "gear")

I get the following errors:

"outcome" passed is numeric: suitable for Linear regression: [1] "num: 164 -> mean: 65.95 (9.54) - median: 65.5 [60 - 72] - min: 37, max: 90 - NA: 0 (0%)" Fehler in riptw::get_formula(outcome = outcome, covariates = covariates, : unbenutztes Argument (random_covariates = random_covariates)

and

"outcome" passed is numeric: suitable for Linear regression: [1] "num: 32 -> mean: 20.09 (6.03) - median: 19.2 [15.43 - 22.8] - min: 10.4, max: 33.9 - NA: 0 (0%)" Fehler in riptw::get_formula(outcome = outcome, covariates = covariates, : unbenutztes Argument (random_covariates = random_covariates)

It translates "unused argument (random_covariates)..." I get the same message, even if I do not specify any random effects.

cccnrc commented 6 months ago

Hi @rduesing this was an error due to a change I needed to operate on get_formula() from riptw that is imported in bayer

Can you try to reinstall riptw too and see if this solves it?

remotes::install_github( "cccnrc/riptw" )
rduesing commented 6 months ago

Hi Enrico,

it runs now, many thanks. I am still testing, but what I found with my real dataset: your default max_tree_depth is 15, which seems quite large to me and maybe unnecessary high. This might slow down the sampling (but I am not an expert here!). Additionally, I found that with a tree depth of 15, I had divergent chains and a very low ESS, as well as high Rhat values. When I reduced it to 12 (10 is default), everything worked fine.

Maybe you can find answers here: https://discourse.mc-stan.org/t/the-role-of-max-treedepth-in-no-u-turn/24155/3

One more question/request: is it possible to have random slopes and if "yes" already, how to formulate it? So far, we only used random intercepts.

I will get back to you, as soon as I have new insights. Thanks for your help!

cccnrc commented 6 months ago

Hi @rduesing thanks for pointing this out! I changed the default to 10 and you can always customize it through brm_max_treedepth option

Regarding random slopes, can you provide a model (just a simple regression one, not a brms) formula in order to see what is implementable yet?

Thanks again, Enrico

rduesing commented 6 months ago

I dont know if this can "simply" be written as an regression model, since it will have several error terms. In the literature you will find different notations to write multilevel models. I try to use one that is understandable for me.

Let's start with a simple regression model, without any random effects. This would look like this for example:

yi = β0 + β1xi + ei

So, no random effect and the intercept and slope(s) are fixed/population effects for all participants. In R you can use the same formula with lm and brms:

y ~ 1 + x or alternatively in brms also: y ~ 0 + Intercept + x

Now, if you have a random intercept, there will be a fixed/population part for the intercept, as well as a random part. You can write this as two equations:

yij = β0j + β1 x + eij

where β0j can be written as:

β0j = γ00 + U0j

where γ00 represents the population/fixed effect Intercept and U0j the group specific effect, i.e. the random part.

Some authors substitute it into one formula (whereas I myself find it simpler to think in several formulas...):

yij = γ00 + U0j + β1 x + eij

It is not possible simply write it as a regression in R. Brms uses the syntax of the lme4 package, which is quite intuitive

y ~ 1 + x + (1 | ID) where the (1 | ID) is for the random Intercept

In a next step you not only allow for random Intercepts but also random slopes. Therefore, you need to specify which slopes (in case of several predictors) are allowed to vary within which clustering variable (there may be more than a two level model). This may look like this for example:

yij = β0j + β1j x + eij with

β0j = γ00 + U0j but also

β1j = γ10 + U1j

We can see that γ10 is he population/fixed effect of x on y, but that there is a cluster specific/random effect U1j, similar to the random intercept. Put together:

yij = γ00 + U0j + γ10 x + U1j x + eij
or yij = γ00 + γ10 x + U0j + U1j x + eij
to make the fixed and random parts clearer.

In lme4 and brms this would look like y ~ 1 + x + (x | ID)

Does this help?

cccnrc commented 6 months ago

I @rduesing I added the option random_slope to allow for random slope as you suggested:

mod.2 <- bayer_linear(data = data,
                       outcome = "mpg",
                       covariates = c("cyl", 'hp'),
                       random_slope = 'cyl',
                       random_covariates = "gear",
                       robust = TRUE
                       )

 > mod.2
# ...
 Formula: mpg ~ cyl + hp + (cyl | gear) 
# ...

Remember you need to update both bayer and riptw to the latest version:

remotes::install_github( "cccnrc/riptw" )
remotes::install_github( "cccnrc/bayer" )

Looking forward to hearing from you for everything!