MSKCC-Epi-Bio / bstfun

Miscellaneous collection of functions
http://mskcc-epi-bio.github.io/bstfun
Other
34 stars 24 forks source link

Add a row selection mechanism that hides variables from an already created table #111

Closed dereksonderegger closed 2 years ago

dereksonderegger commented 2 years ago

In teams of data scientists, the creation of a table might belong to one group, but another is responsible for using the table in, for example, submission to regulatory agencies. Many times a gtsummary table might have extra variables included in the "official table" that gets validated, but aren't necessary to the current presentation or report. Instead, it would be great to be able to select only certain variables we want reporting on. Essentially, we want to hide some variables instead of rebuilding the table with a reduced variable set and possibly introducing errors and/or inconsistencies with the "official table".

Because this is hiding rows, I don't think footnotes or p-values or anything else should change.

It seems that tbl_split() is close to this, but is challenging to use if the variables to be kept are not adjacent to each other. As a first pass I wrote the following which seems to work. I could wrap this into a pull request if the approach I used (editing the table.body and meta.data ) is appropriate.

#' Filter row variables from gtsummary table
#'
#' \lifecycle{experimental}
#' The `tbl_filter` function subsets a single gtsummary table by filtering
#' out specific variables.
#'
#' @param x gtsummary table
#' @param keep A vector of variables to keep. 
#' @param remove A vector of variables to remove. If both `keep` and `remove` 
#'               set, then the `remove` list is ignored.
#' @param ... not used
#'
#' @return `tbl_summary` object
#'
#' @examples
#' tbl_summary(trial) %>%
#'   tbl_filter(keep = c(marker, grade))
#' tbl_summary(trial) %>%
#'   tbl_filter(remove = c(stage, marker, grade))
#'
#' @name tbl_filter
NULL

#' @export
#' @rdname tbl_filter
tbl_filter <- function(x, ...) {
  UseMethod("tbl_filter")
}

#' @export
#' @rdname tbl_split
tbl_filter.gtsummary <- function(x, keep=NULL, remove=NULL, ...) {
  check_dots_empty(error = function(e) inform(c(e$message, e$body)))

  # check/parse inputs ---------------------------------------------------------
  keep <-  broom.helpers::.select_to_varnames(
    {{ keep }},
    var_info = x$table_body,
    arg_name = "variable")
  remove <- broom.helpers::.select_to_varnames(
    {{ remove }},
    var_info = x$table_body,
    arg_name = "variable")

  # If remove is specified, figure out the keep variables --------------------
  if( is.null(keep) & !is.null(remove)){
    keep <- x$table_body |> 
      filter(!(variable %in% remove)) |> 
      pull(variable)
  }

  # Remove the variables from the `table_body` and `meta_data` ---------
  out <- x
  out$table_body <- out$table_body |> filter(variable %in% keep)
  out$meta_data  <- out$meta_data  |> filter(variable %in% keep)
  return(out)
}  

# This would be the testthat file
testthat::test_that('tbl_filter works with tbl_summary() objects', {
  A <- trial |>
    tbl_summary() |>
    tbl_filter( keep = c(trt, death))
  B <- trial |>
    select(trt, death) |>
    tbl_summary()
  testthat::expect_equal( 
    A |> as_tibble(),
    B |> as_tibble() )

  A <- trial |>
    tbl_summary() |>
    tbl_filter( remove = c(trt, death))
  B <- trial |>
    select(-trt, -death) |>
    tbl_summary()
  testthat::expect_equal( 
    A |> as_tibble(),
    B |> as_tibble() )

  A <- trial |>
    tbl_summary(by=trt) |> add_p() |>
    tbl_filter( keep = c(age, death))
  B <- trial |>
    select(trt, age, death) |>
    tbl_summary(by=trt) |> add_p()
  testthat::expect_equal( 
    A |> as_tibble(),
    B |> as_tibble() )

})
ddsjoberg commented 2 years ago

Dear @dereksonderegger , Thank you for the wonderfully detailed and thoughtful post. Below I've shown two somewhat simpler methods for filtering.

Overall I like the idea, but want to think on the API more before I would add it to the pkg however.

Anyway, I"ll leave this open and continue to think on it. Realistically it could be some time before I actually think about it more deeply! FYI

library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.6.1'

tbl <-
  trial %>%
  tbl_summary(by = trt) 

tbl %>%
  # only keep rows for age and grade
  modify_table_body(
    ~.x %>% dplyr::filter(variable %in% c("age", "grade"))
  ) %>%
  as_kable() # convert to kable to display in GH
Characteristic Drug A, N = 98 Drug B, N = 102
Age 46 (37, 59) 48 (39, 56)
Unknown 7 4
Grade
I 35 (36%) 33 (32%)
II 32 (33%) 36 (35%)
III 31 (32%) 33 (32%)
# now export as a function
tbl_filter <- function(x, expr) {
  x %>%
    modify_table_body(
      ~.x %>% dplyr::filter({{ expr }})
    )
}

tbl %>%
  tbl_filter(variable %in% c("age", "grade")) %>%
  as_kable() # convert to kable to display in GH
Characteristic Drug A, N = 98 Drug B, N = 102
Age 46 (37, 59) 48 (39, 56)
Unknown 7 4
Grade
I 35 (36%) 33 (32%)
II 32 (33%) 36 (35%)
III 31 (32%) 33 (32%)

Created on 2022-07-13 by the reprex package (v2.0.1)

ddsjoberg commented 2 years ago

PS @dereksonderegger if you're in the regulatory space, stay tuned for a gtsummary-based pkg for adverse event reporting. It'll be presented at the R/Medicine conference in Aug (which is super inexpensive to join--between $10 and $50 depending on your role).

dereksonderegger commented 2 years ago

All of your concerns make sense and I really appreciate your thoughtfulness.

The approach of using modify_table_body() is quite elegant/simple, but requires the user to know a bit about how gtsummary stores the information internally and that it is the right function to use for something like that. Ultimately that can be addressed through documentation, but my expectation is that this is the type of action needed by folks that aren't necessarily experts with gtsummary() and a user-friendly version might be nice.

Thanks for the heads up about the gtsummary based package for adverse events. That is one aspect of the reports I have to generate and have been toying with infrastructure to automate that using gtsummary. I'll wait to see what is already being built.

ddsjoberg commented 2 years ago

FYI @dereksonderegger i migrated this issue to bstfun, which serves somewhat as a sandbox for some gtsummary features.

ddsjoberg commented 2 years ago

@dereksonderegger what do you think of something like this? Removes all rows associated with the grade variable from the table. You can specify multiple variables and uses tidyselect

library(gtsummary)

trial %>%
  tbl_summary(include = c(age, grade)) %>%
  # remove all rows for grade
  remove_row_type(variable = grade, type = "all") %>%
  as_kable()
Characteristic N = 200
Age 47 (38, 57)
Unknown 11

Created on 2022-07-19 by the reprex package (v2.0.1)

dereksonderegger commented 2 years ago

This is probably the cleanest way to do it and has minimal impact on the existing code base. Nice! Thanks for thinking about this!