Cascading Frequency Table By Exclusion Criteria and Hospital

loffleraSMH commented 7 months ago

New Feature Request

A function that creates a table showing the number/percent of encounters/rows remaining at each inclusion/exclusion step during cohort creation.

Some initial code snippets can be found below. Ideally, we can add the following optional input arguments:

A user-defined stratification variable (character) whose levels correspond to separate columns in the output table (typically hospital_id or hospital_num, but it could also be hospital_type etc., so let's keep it flexible).
A show_prct flag that lets users control whether percentages are included (default = TRUE); if FALSE, only counts are shown.
[As discussed with Yishan, leave this out for now to keep the function logic simple: idx (numeric vector) indicating that multiple rows are part of the same inclusion/exclusion step, i.e., they should share the same denominator. For example, inclusion steps 2a and 2b for COVID vs. flu patients should both be included, with numbers listed separately and percentages based on the cohort from inclusion step 1 in both cases, e.g., idx = c(1, 2, 2).]
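To make the idx idea concrete, here is a minimal base R sketch of how shared denominators could be derived. The helper name step_denominators and the counts are made up for illustration and are not part of Rgemini. The sketch assumes the preceding step is a single row and returns NA otherwise.

```r
# Hypothetical helper (not part of Rgemini): for each row, find the count
# that should be used as the percentage denominator, given `idx`.
step_denominators <- function(n, idx) {
  stopifnot(length(n) == length(idx))
  vapply(seq_along(n), function(i) {
    earlier <- idx[idx < idx[i]]
    if (length(earlier) == 0) return(NA_real_)   # first step: no denominator
    rows <- which(idx == max(earlier))            # rows of the preceding step
    # ambiguous if the preceding step was itself split into several rows
    if (length(rows) == 1) as.numeric(n[rows]) else NA_real_
  }, numeric(1))
}

n     <- c(100, 30, 20)   # made-up counts: step 1, step 2a (COVID), step 2b (flu)
idx   <- c(1, 2, 2)       # 2a and 2b belong to the same step
denom <- step_denominators(n, idx)
pct   <- round(100 * n / denom, 1)
```

Both 2a and 2b are expressed relative to the step 1 cohort here, rather than 2b being expressed relative to 2a.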

library(dplyr)
library(data.table)

set.seed(1)
my_data <- Rgemini::dummy_ipadmdad(100) %>% data.table()

cohort_creation_table <- function(cohort, stepnames = NULL, stepnames_short = NULL) {

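  # fall back to generic labels so the function also runs without step names
  if (is.null(stepnames)) stepnames <- paste("Step", seq_along(cohort))
  if (is.null(stepnames_short)) stepnames_short <- rep("", length(cohort))

  # get exclusion steps (will be shown as -n (-X%) to show what number/% of patients was excluded)
  excl <- grepl("^ex", stepnames_short, ignore.case = TRUE)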

  steps <- sapply(cohort, nrow)

  # create table: V2 holds the row count of the previous step
  tabs <- data.table(steps, V2 = lag(steps))
  tabs[, percent.change := round(1000 * steps / V2) / 10]

  # for exclusion steps, show removal as -n (-X%); indexing by `excl` also
  # works when exclusion steps are not all at the end
  if (any(excl)) {
    tabs[excl, percent.change := -round(1000 * (1 - steps / V2)) / 10]
    tabs[excl, steps := steps - V2]
  }

  # the first step has no previous step, so only the count is shown
  tabs[, ':='(`N (%)` = ifelse(!is.na(percent.change),
                               paste0(steps, " (", percent.change, "%)"),
                               as.character(steps)),
              steps = NULL, percent.change = NULL, V2 = NULL)]

  tabs <- cbind(stepnames, tabs)

  # add row with final cohort number (only needed if the last step was shown
  # as an exclusion)
  if (length(excl) > 0 && excl[length(excl)]) {
    tabs <- rbind(tabs, data.table(stepnames = "Final cohort",
                                   `N (%)` = as.character(nrow(cohort[[length(cohort)]]))))
    stepnames_short <- c(stepnames_short, " ")
  }

  tabs <- cbind(stepnames_short, tabs)

  colnames(tabs) <- c("", "Cohort creation step", "N (%)")

  return(tabs)
}

cohort_table <- cohort_creation_table(
  cohort = list(my_data, 
                my_data[gender == "F"], 
                my_data[gender == "F" & age >= 80]),
  stepnames = c("All GEMINI encounters", 
                "Gender = Female", 
                "Age < 80"),
  stepnames_short = c("Incl. 1", "Incl. 2", "Excl. 1")
)

print(cohort_table)
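The optional arguments requested above (a stratification variable and a show_prct flag) could look roughly like the following. This is only a sketch in base R, not the Rgemini implementation: the function name cohort_counts, the argument name strata, and the toy data are assumptions for illustration. Strata levels are taken from the first cohort so that sites dropped in later steps still get a column.

```r
# Hypothetical sketch (not the Rgemini implementation): counts per step,
# overall and by a user-defined stratification variable.
cohort_counts <- function(cohort, stepnames, strata = NULL, show_prct = TRUE) {
  # one column of counts for a list of data frames, formatted as "n (x%)"
  count_col <- function(steps) {
    n <- vapply(steps, nrow, integer(1))
    if (!show_prct) return(as.character(n))
    prev <- c(NA, head(n, -1))                 # row count of the previous step
    pct <- round(100 * n / prev, 1)
    ifelse(is.na(pct), as.character(n), paste0(n, " (", pct, "%)"))
  }
  res <- data.frame(Step = stepnames, Overall = count_col(cohort),
                    check.names = FALSE)
  if (!is.null(strata)) {
    # strata levels are taken from the first (largest) cohort
    for (lev in sort(unique(cohort[[1]][[strata]]))) {
      sub <- lapply(cohort, function(d) d[d[[strata]] == lev, , drop = FALSE])
      res[[as.character(lev)]] <- count_col(sub)
    }
  }
  res
}

# made-up data: 3 encounters at each of 2 hospitals
d <- data.frame(hospital_num = c(1, 1, 1, 2, 2, 2),
                age = c(85, 70, 90, 60, 82, 88))
steps <- list(d, d[d$age >= 80, ])
tab <- cohort_counts(steps, c("All encounters", "Age >= 80"),
                     strata = "hospital_num")
tab_counts_only <- cohort_counts(steps, c("All encounters", "Age >= 80"),
                                 show_prct = FALSE)
```

Returning a plain data.frame leaves the formatting choice to the caller.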

shijiaSMH commented 6 months ago

I recommend making % optional or just keeping the numbers. I have had PIs report that showing number & percentage together is too overwhelming. I then showed only numbers and got no such complaints after this change.

My code is here:

create_inclusion.exclusion_table = function(cohort.formation, stepnames = NULL){

  sites = unique(cohort.formation[[length(cohort.formation)]][["hospital_num"]])
  sites <- c("Overall", sites)

  sites_list = vector("list")
  for (i in sites) {
    if (i == "Overall") {
      sites_list[[i]] <- cohort.formation
    } else {
      sites_list[[i]] <- lapply(cohort.formation,
                                function(x) x[hospital_num == i])
    }
  }

  sites_list_tabs = vector("list")
  for (i in sites) {
    steps <- sapply(sites_list[[i]], nrow)

    # counts only: percentages are intentionally not shown
    tabs <- data.table(N = steps)
    colnames(tabs) <- i

    sites_list_tabs[[i]] <- data.frame(tabs )
  }

  res <-  do.call(cbind, sites_list_tabs)
  rownames(res) = stepnames
  return(res)
}

gemini-wenb commented 4 months ago

Two suggestions:

1) It would be more flexible to return a data.frame instead of a kable. I feel that functions returning a formatted HTML object are too restrictive to meet different presentation needs, e.g., changing colnames, fonts etc. For publications, journals may ask for tables in a different format, e.g., .csv or .pdf. So it would be more usable and flexible to return a data.frame object, and users can then decide which packages to use to present the table in RMD as needed.

2) Would the logic be cleaner if we kept all criteria as exclusion criteria? For example, "Incl. 1 Valid OHIP" could be changed to "Excl. 1 Invalid OHIP", so that the data elimination steps follow a consistent logic.

shijiaSMH commented 4 months ago

1) Sounds right - data.frame or data.table would work.
2) I think that's decided outside of the function (or is it just me)? What I have usually done is stick with whatever inc/exc criteria are mentioned in the protocol, hoping this resonates better with PIs. However, I see your point that pure exclusion makes things clearer.

loffleraSMH commented 4 months ago

1) Yep, sounds good - I changed the example code to remove the printing.
2) Yes, I think it's good to have some flexibility with that. Right now, the function assumes that everything is an inclusion step by default: only steps whose short name matches "^ex" are interpreted as removals, with the % change shown as -X%. Users can easily skip this by not providing any short names, or by providing short names that are all interpreted as inclusion steps. We could change the default to interpret everything as an exclusion instead (but then the current use of -X% for exclusion steps probably doesn't make sense...). Let's discuss at the refinement meeting?

gemini-wenb commented 4 months ago

re: 2. Yes, as long as we give users the flexibility to present the steps in different ways (all inclusion steps, all exclusion steps, or a mix of inclusion and exclusion steps). There are pros and cons to any approach, and the most suitable one is case-dependent. For example, when there are 10 steps stratified by hospital, a mix of inclusion and exclusion steps, where the % is shown as positive X% for some rows and negative -X% for others, can introduce confusion/complexity.
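As a rough illustration of the three presentation options (all inclusion, all exclusion, or a mix), the same step counts can be converted per row depending on a logical flag. The helper present_steps and the counts below are hypothetical. Unlike the -n (-X%) convention in the snippet at the top of this issue, this sketch reports removed rows as positive numbers, which is just one possible choice.

```r
# Hypothetical helper: show each step either as rows remaining (inclusion)
# or as rows removed relative to the previous step (exclusion).
present_steps <- function(n, is_excl) {
  stopifnot(length(n) == length(is_excl))
  prev  <- c(NA, head(n, -1))                    # count before each step
  shown <- ifelse(is_excl, prev - n, n)          # removed vs. remaining rows
  data.frame(type    = ifelse(is_excl, "Excl.", "Incl."),
             N       = shown,
             percent = round(100 * shown / prev, 1))
}

n <- c(1000, 900, 450)                           # made-up cohort sizes
all_incl <- present_steps(n, c(FALSE, FALSE, FALSE))
all_excl <- present_steps(n, c(FALSE, TRUE, TRUE))  # first row = starting cohort
```

With these counts, the all-inclusion view reports 900 and 450 rows remaining, while the all-exclusion view reports 100 and 450 rows removed.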

GEMINI-Medicine / Rgemini

Cascading Frequency Table By Exclusion Criteria and Hospital #77
