eric-pedersen / mixed-effect-gams

Repo for tutorial/paper on mixed-effect GAMs
MIT License
120 stars 32 forks source link

What is the correct interpretation of a "global smoother" in a logistic GAMM? #58

Open isabellaghement opened 2 years ago

isabellaghement commented 2 years ago

Not sure this is the best place to ask this question, but it's been bugging me for a while now.

The HGAM paper talks about Model GS (global smoother plus group-level smoothers with a shared penalty). I have difficulty understanding what this global smoother means in a logistic HGAM model. Using simulated data from the itsadug package, here is an example of such model (I simplified the data to include only 10 subjects: 5 adults and 5 children).

The R commands below show how I prepared the data:

library(itsadug)
library(dplyr)
library(magrittr)
library(mgcv)

data(simdat)

simdat$Subject <- as.factor(simdat$Subject)
simdat$Y <- ifelse(simdat$Y > mean(simdat$Y), 1, 0)
table(simdat$Y)

unique(simdat$Y)

data(simdat)

simdat1 <- filter(simdat, 
                  Subject %in% c("a01", 
                                 "a02", 
                                 "a03", 
                                 "a04", 
                                 "a05", 
                                 "c01", 
                                 "c02", 
                                 "c03", 
                                 "c04", 
                                 "c05")) %>% 
            droplevels()

Here is the logistic HGAM:

m <- bam(Y ~ Group + s(Time, by=Group) +
            s(Time, Subject, bs="fs", m=1), 
          family = binomial(link = "logit"), 
          data = simdat1)

The smooths included in this model are given by:

gratia::smooths(m)

and are "s(Time):GroupChildren", "s(Time):GroupAdults" and "s(Time,Subject)".

The first two smooths are "global" smooths of Time for Children and Adults, respectively. These smooths can be visualized with the R command:

gratia::draw(m, select = c("s(Time):GroupChildren", "s(Time):GroupAdults"))

and look as follows:

image

The random factor smooths encompassed by "s(Time,Subject)" look like this:

image

The R code for visualizing the random factor smooths is:

plot_random_smooths <- function(model, select, predictor) {

  library(itsadug)
  library(ggplot2)
  library(directlabels)

  theme_set(theme_bw())

  # Plot random effects in separate panels:
  pp <- get_modelterm(model, 
                      select = select, 
                      as.data.frame=TRUE)

  pg <- ggplot(data = pp, 
               aes(x = {{ predictor }}, y = fit)) + 
    geom_line(aes(colour = Subject), size = 0.9) + 
    ylab("Partial Effect (Random Factor Smooth)") 

  pg <- direct.label(pg)

  pg 
}

plot_random_smooths(model = m, 
                    select = 3, 
                    predictor = Time) 

The effect of Time for each of the actual Subject can be plotted on the log odds scale, leading to the following plot:

image

In this plot, I also show the effect that would correspond to the "global" smooth in each of the two groups: Adults and Children.

Furthermore, the effect of Time for each of the actual Subject can be plotted on the probability scale, leading to the following plot:

image

Usually, with these types of mixed effects models, a non-linear link function (in this case, logit) means that conditional effects (e.g., effect of Time on probability that Y = 1 for a specific Subject in a specific Group) are different animals than marginal effects (e. g., effect of Time on probability that Y = 1 averaged across all Subjects in a specific Group).

On the log odds scale, I think this differentiation between conditional and marginal effects is immaterial, as the two effects would be the same (?).

While I feel relatively comfortable describing the effect of Time on the log odds that Y = 1 for Adults or for Children as representing an "average effect" across all the children in that group in the underlying population, I am not sure what language to use to describe the same effect when expressed on the probability scale. I don't think it would be an "average effect" anymore? (One thing that bothers me about the "average" interpretation is that the random factor smooths do NOT average to 0.)

Usually, if the model included only random intercepts and/or a random slopes for Time, I would use language like "effect for typical subject" to describe the effect of Time on the log odds that Y = 1 for Adults or for Children. A "typical subject" would be one for whom the random intercepts and/or random slopes were equal to 0. But when the model includes by-subject random factor smooths for Time and a non-linear link, I am not sure this "typical subject" language makes sense when we "exclude" the random factor smooth "s(Time,Subject)" from the model.

I would very much appreciate any thoughts or pointers on language that is safe to use to describe what the "global" smooths represent in the context of a logistic HGAM. (Maybe I am over-complicating this in my mind, but that is how my mind works - it builds on what it already knows and that can be a hindrance at times.)

In case this may help, here is the R code I used for constructing the plot of effects of Time on the log odds scale:


devtools::install_github("m-clark/gammit")
library(gammit)

eff_Time_typical_subject_log_odds_Adults <- 
     predict_gamm(m, 
                      newdata = expand.grid(
                                Time = seq(0, 2000, by = 100),
                                Group = "Adults", 
                                Subject = c("a01", 
                                           "a02", 
                                           "a03", 
                                           "a04",
                                           "a05")), 
                      re_form = NULL, 
                      exclude = "s(Time,Subject)", 
                      keep_prediction_data = TRUE,
                      se.fit = TRUE)

unique(eff_Time_typical_subject_log_odds_Adults$Subject)

eff_Time_typical_subject_log_odds_Children <- 
  predict_gamm(m, 
               newdata = expand.grid(
                 Time = seq(0, 2000, by = 100),
                 Group = "Children", 
                 Subject = c("c01", 
                             "c02", 
                             "c03", 
                             "c04",
                             "c05")), 
               re_form = NULL, 
               exclude = "s(Time,Subject)", 
               keep_prediction_data = TRUE, 
               se.fit = TRUE)

unique(eff_Time_typical_subject_log_odds_Children$Subject)

eff_Time_actual_subjects_log_odds_Adults <- 
  predict_gamm(m, 
               newdata = expand.grid(
                 Time = seq(0, 2000, by = 100),
                 Group = "Adults", 
                 Subject = c("a01", 
                             "a02", 
                             "a03", 
                             "a04",
                             "a05")), 
               re_form = NULL, 
               keep_prediction_data = TRUE, 
               se.fit = TRUE)

unique(eff_Time_actual_subjects_log_odds_Adults$Subject)

eff_Time_actual_subjects_log_odds_Children <- 
  predict_gamm(m, 
               newdata = expand.grid(
                 Time = seq(0, 2000, by = 100),
                 Group = "Children", 
                 Subject = c("c01", 
                             "c02", 
                             "c03", 
                             "c04",
                             "c05")), 
               re_form = NULL, 
               keep_prediction_data = TRUE, 
               se.fit = TRUE)

unique(eff_Time_actual_subjects_log_odds_Children$Subject)

# plot log odds for each actual subject and for typical subject
# (log odds for typical subject are shown as a reference 
#  with dark grey dotted line)

library(magrittr)
library(ggeasy)

plot_eff_Time_log_odds_Adults <- 
  eff_Time_actual_subjects_log_odds_Adults %>%
  ggplot(aes(x = Time, y = prediction.fit, colour = Subject, fill = Subject)) +
  geom_ribbon(aes(x = Time, 
                  ymin = prediction.fit - 2*prediction.se.fit, 
                  ymax = prediction.fit + 2*prediction.se.fit), 
              alpha = 0.3, size = 0.9) +
  geom_line(aes(x = Time, y = prediction.fit), size = 0.9) +
  geom_line(data = eff_Time_typical_subject_log_odds_Adults, 
            colour = "#374E55", 
            size = 1, linetype = 2) + 
  facet_wrap(~Subject,  nrow = 1) + 
  ggtitle(paste0("Group = ", "Adults")) + 
  easy_remove_legend() + 
  ylab("Log Odds") + 
  theme(panel.spacing = unit(2, "lines")) 

plot_eff_Time_log_odds_Adults

plot_eff_Time_log_odds_Children <- 
  eff_Time_actual_subjects_log_odds_Children %>%
  ggplot(aes(x = Time, y = prediction.fit, colour = Subject, fill = Subject)) +
  geom_ribbon(aes(x = Time, 
                  ymin = prediction.fit - 2*prediction.se.fit, 
                  ymax = prediction.fit + 2*prediction.se.fit), 
              alpha = 0.3, size = 0.9) +
  geom_line(aes(x = Time, y = prediction.fit), size = 0.9) +
  geom_line(data = eff_Time_typical_subject_log_odds_Children, 
            colour = "#374E55", 
            size = 1, linetype = 2) + 
  facet_wrap(~Subject, nrow = 1) + 
  ggtitle(paste0("Group = ", "Children")) + 
  easy_remove_legend() + 
  ylab("Log Odds") + 
  theme(panel.spacing = unit(2, "lines")) 

plot_eff_Time_log_odds_Children

library(patchwork)
plot_eff_Time_log_odds_Adults/plot_eff_Time_log_odds_Children

Furthermore, here is the R code I used for constructing the plot of effects of Time on the probability scale:


family(m)

ilink <- family(m)$linkinv

eff_Time_typical_subject_probability_Adults <- 
  eff_Time_typical_subject_log_odds_Adults %>% 
  rename(fit_link = prediction.fit, se_link = prediction.se.fit) %>% 
  mutate(fit_resp  = ilink(fit_link),
         right_upr = ilink(fit_link + (2 * se_link)), 
         right_lwr = ilink(fit_link - (2 * se_link)))

eff_Time_typical_subject_probability_Children <- 
  eff_Time_typical_subject_log_odds_Children %>% 
  rename(fit_link = prediction.fit, se_link = prediction.se.fit) %>% 
  mutate(fit_resp  = ilink(fit_link),
         right_upr = ilink(fit_link + (2 * se_link)), 
         right_lwr = ilink(fit_link - (2 * se_link)))

eff_Time_actual_subjects_probability_Adults <- 
  eff_Time_actual_subjects_log_odds_Adults %>% 
  rename(fit_link = prediction.fit, se_link = prediction.se.fit) %>% 
  mutate(fit_resp  = ilink(fit_link),
         right_upr = ilink(fit_link + (2 * se_link)), 
         right_lwr = ilink(fit_link - (2 * se_link)))

eff_Time_actual_subjects_probability_Children <- 
  eff_Time_actual_subjects_log_odds_Children %>% 
  rename(fit_link = prediction.fit, se_link = prediction.se.fit) %>% 
  mutate(fit_resp  = ilink(fit_link),
         right_upr = ilink(fit_link + (2 * se_link)), 
         right_lwr = ilink(fit_link - (2 * se_link)))

# plot probability for each actual subject and for typical subject
# (probability for typical subject is shown as a reference with dark grey dotted line)

plot_eff_Time_probability_Adults <- 
  eff_Time_actual_subjects_probability_Adults %>%
  ggplot(aes(x = Time, y = fit_resp, colour = Subject, fill = Subject)) +
  geom_ribbon(aes(x = Time, ymin = right_lwr, ymax = right_upr), 
              alpha = 0.3, size = 0.9) +
  geom_line(aes(x = Time, y = fit_resp), size = 0.9) +
  geom_line(data = eff_Time_typical_subject_probability_Adults, 
            colour = "#374E55", 
            size = 1, linetype = 2) + 
  facet_wrap(~Subject, nrow = 1) + 
  ggtitle(paste0("Group = ", "Adults")) + 
  easy_remove_legend() + 
  ylab("Probability") + 
  theme(panel.spacing = unit(2, "lines")) 

plot_eff_Time_probability_Adults 

plot_eff_Time_probability_Children <- 
  eff_Time_actual_subjects_probability_Children %>%
  ggplot(aes(x = Time, y = fit_resp, colour = Subject, fill = Subject)) +
  geom_ribbon(aes(x = Time, ymin = right_lwr, ymax = right_upr), 
              alpha = 0.3, size = 0.9) +
  geom_line(aes(x = Time, y = fit_resp), size = 0.9) +
  geom_line(data = eff_Time_typical_subject_probability_Children, 
            colour = "#374E55", 
            size = 1, linetype = 2) + 
  facet_wrap(~Subject, nrow = 1) + 
  ggtitle(paste0("Group = ", "Children")) + 
  easy_remove_legend() + 
  ylab("Probability") + 
  theme(panel.spacing = unit(2, "lines")) 

plot_eff_Time_probability_Children

plot_eff_Time_probability_Adults/plot_eff_Time_probability_Children
eric-pedersen commented 2 years ago

Hi @isabellaghement,

Great question! Give me a couple days to get back to you on this, though; I'm deep into final exams and grading right now, and I'd rather give this and the other fun GAM questions I have right now the time they deserve.

isabellaghement commented 2 years ago

Thank you so much for adding this question on your To Do list, Eric! I guess ultimately I am wondering what those random factor (aka non-linear) smooths integrate to when we take their expectation over all subjects represented by the ones included in the data. It doesn't seem to me like they are integrating to 0 (which would be the case with random intercepts or random slopes in lmer or glmer models), which makes it tricky to talk about "average" effect even when interpreting effects on the linear predictor scale.

Good luck with your final exams and grading - it will be good to get everything done and out of the way so that you can get into holiday mode. This has been a hard and exhausting term for everyone who has been teaching.