ddsjoberg / gtsummary

Presentation-Ready Data Summary and Analytic Result Tables
http://www.danieldsjoberg.com/gtsummary
Other
1.03k stars 113 forks source link

Feature request: native interaction analyses #1241

Closed zaddyzad closed 2 years ago

zaddyzad commented 2 years ago

The gtsummary package has been incredibly helpful for conducting univariable regression with the tbl_uvregression() function.

However, it would be immensely helpful if the function allowed users to specify an interaction term (e.g., age or sex). Ideally, this argument would cause the tbl_uvregression() results to be displayed according to interaction term that was selected.

Currently, interaction analyses with gtsummary is quite difficult. What I have used as a work around is Woolf's Test For Homogeneity. I use it as follows:

p_load(DescTools); xtabs(~Response + Independent_Variable + Interaction_Term) %>% WoolfTest

ddsjoberg commented 2 years ago

Hey @zaddyzad ! Thanks for the post and the suggestion.

Is a table that looks like this acceptable for the interaction analysis, i.e. a logistic regression interaction or are you suggesting incorporating Woolf's test somehow?

library(gtsummary)
library(tidyverse)
packageVersion("gtsummary")
#> [1] '1.6.0'

tbl <-
  tibble(variable = c("age", "marker")) %>%
  rowwise() %>%
  mutate(
    formula = 
      str_glue("response ~ trt + {variable} + trt:{variable}") %>%
      as.formula() %>%
      list(),
    glm = glm(formula, data = trial, family = binomial) %>% list(),
    tbl = 
      # only show the interaction term
      tbl_regression(glm, include = contains(":"), show_single_row = contains(":")) %>% 
      list()
  ) %>%
  pull(tbl) %>%
  tbl_stack()

image

Created on 2022-05-09 by the reprex package (v2.0.1)

zaddyzad commented 2 years ago

Thank you for the hasty reply @ddsjoberg !

What you provided is really great. However, what is still difficult is including interaction terms that categorical. For example, in the data set that I am currently working on, we have gender (3 levels) and race (6 levels) as interaction terms we would like to evaluate.

When I try your code above with two different categorical interaction terms, grade and stage in trial, it returns an error (below). What would be brilliant is if the proposed feature also returned stratum specific odds ratio in the case of interaction terms that are categorical.

tbl <-
  tibble(variable = c("grade", "stage")) %>%
  rowwise() %>%
  mutate(
    formula = 
      str_glue("response ~ trt + {variable} + trt:{variable}") %>%
      as.formula() %>%
      list(),
    glm = glm(formula, data = trial, family = binomial) %>% list(),
    tbl = 
      # only show the interaction term
      tbl_regression(glm, include = contains(":"), show_single_row = contains(":")) %>% 
      list()
  ) %>%
  pull(tbl) %>%
  tbl_stack()

#> x Variable(s) "trt:grade" were incorrectly requested to be printed on a single row.
Error in `mutate()`:
! Problem while computing `tbl = ... %>% list()`.
i The error occurred in row 1.
Caused by error:
! Incorrect call with `show_single_row=`. Quitting execution."

Created on 2022-05-10 by the reprex package (v2.0.1)

ddsjoberg commented 2 years ago

You can remove the show_single_row= argument, and it works with categorical data as well.

library(gtsummary)
library(tidyverse)
packageVersion("gtsummary")
#> [1] '1.6.0'

tbl <-
  tibble(variable = c("grade", "age")) %>%
  rowwise() %>%
  mutate(
    formula = 
      str_glue("response ~ trt + {variable} + trt:{variable}") %>%
      as.formula() %>%
      list(),
    glm = glm(formula, data = trial, family = binomial) %>% list(),
    tbl = 
      # only show the interaction term
      tbl_regression(glm, include = contains(":")) %>% 
      list()
  ) %>%
  pull(tbl) %>%
  tbl_stack()

image

Created on 2022-05-10 by the reprex package (v2.0.1)

ddsjoberg commented 2 years ago

I am trying to think how something like this could be incorporated into tbl_uvregression(), but no smooth API is coming to mind..... 🤔

zaddyzad commented 2 years ago

Would it be possible to do something such that a p-interaction term is also provided in perhaps another column?

zaddyzad commented 2 years ago

The p-interaction is why I mentioned Woolf's test earlier on.

ddsjoberg commented 2 years ago

You can add a global p-value with add_global_p(). With the keep= you can choose whether to include or exclude the individual p-values.

library(gtsummary)
library(tidyverse)
packageVersion("gtsummary")
#> [1] '1.6.0'

tbl <-
  tibble(variable = c("grade", "age")) %>%
  rowwise() %>%
  mutate(
    formula = 
      str_glue("response ~ trt + {variable} + trt:{variable}") %>%
      as.formula() %>%
      list(),
    glm = glm(formula, data = trial, family = binomial) %>% list(),
    tbl = 
      # only show the interaction term
      tbl_regression(glm, include = contains(":")) %>% 
      add_global_p(keep = FALSE) %>% # keep=TRUE includes individual p-values
      list()
  ) %>%
  pull(tbl) %>%
  tbl_stack()

image

Created on 2022-05-10 by the reprex package (v2.0.1)

ddsjoberg commented 2 years ago

Anyway, if you have any suggestions for an API that will get you this table that fits within tbl_uvregression() let me know. Otherwise, the code above isn't too bad to use.

zaddyzad commented 2 years ago

Thank you so much for the advice @ddsjoberg.

Finally, I wondered how could this be modified to also include the OR for Grade I? I suppose what I still have trouble seeing is how the package can be used to simultaneously do a subgroup analysis and test for the significance of an interaction term, which divides the stratum.

You can remove the show_single_row= argument, and it works with categorical data as well.

library(gtsummary)
library(tidyverse)
packageVersion("gtsummary")
#> [1] '1.6.0'

tbl <-
  tibble(variable = c("grade", "age")) %>%
  rowwise() %>%
  mutate(
    formula = 
      str_glue("response ~ trt + {variable} + trt:{variable}") %>%
      as.formula() %>%
      list(),
    glm = glm(formula, data = trial, family = binomial) %>% list(),
    tbl = 
      # only show the interaction term
      tbl_regression(glm, include = contains(":")) %>% 
      list()
  ) %>%
  pull(tbl) %>%
  tbl_stack()

image

Created on 2022-05-10 by the reprex package (v2.0.1)

ddsjoberg commented 2 years ago

This is showing the interaction term from a logistic regression model (a model that includes main effects). These are not ORs for Grade I, II, or III (stratified or not): just the slope coefs for the interaction terms.

zaddyzad commented 2 years ago

I see. In that case, could subgroupAnalysis() from library(Publish) be possibly integrated as an API into gtsummary for native subgroup analyses with interaction tests?

ddsjoberg commented 2 years ago

I think this is what you're after? You can use tbl_strata() rather than tbl_uvregression()

library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.6.0'

# calculate het p-value
p.value.heterogeneity <-
  with(trial, xtabs(~response + trt + grade)) |> 
  DescTools::WoolfTest() %>%
  broom::tidy() %>%
  dplyr::pull(p.value) %>%
  style_pvalue(prepend_p = TRUE)

# calculate stratified ORs for trt
tbl <-
  trial %>%
  mutate(grade = paste("Grade", grade)) %>%
  tbl_strata(
    grade, 
    ~ glm(response ~ trt, data = .x, family = binomial) %>%
      tbl_regression(exponentiate = TRUE),
    .header = "{strata}, N = {n}",
    .combine_with = "tbl_stack"
  ) %>%
  # add footnote for het p-value
  modify_footnote(
    label = stringr::str_glue("Woolf Test for Heterogeneity, {p.value.heterogeneity}"))

image

Created on 2022-05-10 by the reprex package (v2.0.1)

zaddyzad commented 2 years ago

This is quite close! However, when working with survey data, we often have multiple "treatments" that we're testing. For example, the project I'm working on now is regressing frequency of mHealth application use on different attitudinal variables (agree or disagree).

Could this code be modified such that tbl_uvregression can integrate subgroup analyses with interaction analyses?

The output would be similar to:

Row 1, Survey Q1 * Sex, OR --, 95% CI --, p-value = Woolf's Test (or something else) Row 2, indented Female, OR X.XX, 95% CI X.XX Row 3, indented Male, OR X.XX, 95% CI X.XX

Row 4, Survey Q2 * Sex, OR --, 95% CI --, p-value = Woolf's Test (or something else) Row 5, indented Female, OR X.XX, 95% CI X.XX Row 6, indented Male, OR X.XX, 95% CI X.XX

and so on...

ddsjoberg commented 2 years ago

If you can create a reproducible example showing exactly what you're looking for and the code to create those stats, I can help further. It seems I have not been good at guessing ;)

ddsjoberg commented 2 years ago

Maybe this slide will be helpful http://www.danieldsjoberg.com/gtsummary-Louisiana-State-University-Presentation/#41

zaddyzad commented 2 years ago

I am able to create a reproducible example with library(Publish). How could this be coerced into gtsummary?

library(Publish)
#> Loading required package: prodlim
library(prodlim)
library(tidyverse)
library(gtsummary)

trial <- trial

data.table::setDT(trial)
trial$grade <- as.factor(trial$grade)
trial$stage <- as.factor (trial$stage)
trial$trt <- as.factor (trial$trt)
trial$response <- as.factor (trial$response)

glm_fit1 <- 
  tibble(variable=c("trt","response")) %>%
  rowwise() %>%
  mutate(
    formula = 
      str_glue("death ~ {variable}") %>%
      as.formula() %>% list(),

    glm = 
      glm(formula,data=trial,family=binomial) %>% list(),

    sub_glm = subgroupAnalysis(glm,data=trial,treatment={variable}, subgroups=c("grade")) %>% list()
    ); pull(glm_fit1)
#> [[1]]
#>    subgroups level sample_Drug A sample_Drug B event_Drug A event_Drug B
#> 1:     grade     I            35            33           16           17
#> 2:     grade    II            32            36           16           20
#> 3:     grade   III            31            33           20           23
#>    OddsRatio pinteraction subgroup Lower    confint
#> 1:  1.261719            1     <NA> 0.486 0.486-3.27
#> 2:  1.250000            1     <NA> 0.481 0.481-3.25
#> 3:  1.265000            1     <NA> 0.445 0.445-3.60
#> 
#> [[2]]
#>    subgroups level sample_NA sample_0 sample_1 event_NA event_0 event_1
#> 1:     grade     I         1       46       21        1      25       7
#> 2:     grade    II         5       44       19        3      28       5
#> 3:     grade   III         1       42       21        1      30      12
#>    OddsRatio pinteraction subgroup Lower     confint
#> 1: 0.4200000        0.476     <NA> 0.143 0.143-1.233
#> 2: 0.2040816        0.476     <NA> 0.062 0.062-0.672
#> 3: 0.5333333        0.476     <NA> 0.179 0.179-1.591

Created on 2022-05-11 by the reprex package (v2.0.1)

ddsjoberg commented 2 years ago

What results from tbl_uvregression() do you want to appear in the table? The output from glm_fit1$sub_glm contains no information that overlaps with the regression output?

Did you take a look at the slide I linked to?

zaddyzad commented 2 years ago

The slides were very helpful. Tbl_strata was great for descriptive tables I was working on.

In the case of uvregression, the interaction analysis and subgroup analysis has been done with subgroupAnalysis.

What I wondered is whether the output from glm_fit1$sub_glm could be coerced into a gtsummary object such that it is summarized in a publication ready format.

ddsjoberg commented 2 years ago

The package is pretty generalizable, so we can construct it. Here's some code you could funcitonalize if this is a table you prepare often.

library(gtsummary)
library(Publish)
#> Loading required package: prodlim
library(tidyverse)
packageVersion("gtsummary")
#> [1] '1.6.0'

trial2 <-
  trial %>%
  mutate(across(c(grade, stage, trt, response), as.factor))

glm_fit1 <- 
  tibble(variable=c("trt","response")) %>%
  rowwise() %>%
  mutate(
    # build univariate regression model
    formula = str_glue("death ~ {variable}"),
    glm = 
      glm(as.formula(formula), data = trial2, family = binomial) %>% list(),
    # perform subgroup analysis
    sub_glm = 
      subgroupAnalysis(glm, data = as.data.frame(trial2), treatment = variable, subgroups = "grade") %>% list(),
    p.value.interaction = sub_glm$pinteraction[[1]],
    # create gtsummary of the regression model
    gts_glm = 
      tbl_regression(
        glm, 
        exponentiate = TRUE, 
        label = list(attr(trial[[variable]], "label")) %>% set_names(variable)
      ) %>%
      list(),
    # build a separate gtsummary table of the subgroup analyses
    gts_sub_glm =
      sub_glm %>%
      as_tibble() %>%
      select(label = level, estimate = OddsRatio, conf.low = Lower, 
             conf.high = Upper, p.value = Pvalue) %>%
      add_row(label = "Subgroup Analysis", p.value = p.value.interaction, .before = 0L) %>%
      mutate(
        label = as.character(label),
        ci =
          ifelse(row_number() > 1L,
                 str_glue("{style_ratio(conf.low)}, {style_ratio(conf.high)}"),
                 NA),
        variable = variable,
        row_type = ifelse(row_number() == 1L, "label", "level")
      ) %>%
      # convert the tibble to gtsummary table
      .create_gtsummary_object() %>%
      modify_column_indent(
        columns = label, rows = row_type == "level", double_indent = TRUE) %>%
      modify_column_indent(
        columns = label, rows = row_type == "label", double_indent = FALSE) %>%
      modify_fmt_fun(
        list(c(estimate, conf.low, conf.high) ~ style_ratio,
             p.value ~ style_pvalue)
      ) %>%
      modify_table_styling(
        columns = p.value, rows = row_type == "label", footnote = "Hetergeneity p-value") %>%
      list(),
    # stack the main and subgroup gtsummary tables
    tbl_combined =
      tbl_stack(list(gts_glm, gts_sub_glm), quiet = TRUE) %>%
      list()
  )

# stack all tables together
tbl_final <- 
  glm_fit1$tbl_combined %>%
  tbl_stack()

image

Created on 2022-05-14 by the reprex package (v2.0.1)

ddsjoberg commented 2 years ago

Hey @zaddyzad ! What do you see as the next steps here?

ddsjoberg commented 2 years ago

@zaddyzad please feel free to re-open this issue if you'd like to continue the conversation