GEMINI-Medicine / Rgemini

A custom R package that provides a variety of functions to perform data analyses with GEMINI data
https://gemini-medicine.github.io/Rgemini/

Cascading Frequency Table By Exclusion Criteria and Hospital #77

Open loffleraSMH opened 7 months ago

loffleraSMH commented 7 months ago

New Feature Request

A function that creates a table showing the number/percent of encounters/rows at each inclusion/exclusion step during cohort creation.

Some initial code snippets can be found below. Ideally, we can add the following optional input arguments:

library(dplyr)
library(data.table)

set.seed(1)
my_data <- Rgemini::dummy_ipadmdad(100) %>% data.table()

cohort_creation_table <- function(cohort, stepnames = NULL, stepnames_short = NULL) {

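  # fall back to generic labels if none are provided (otherwise the NULL
  # defaults cause errors further down)
  if (is.null(stepnames)) stepnames <- paste("Step", seq_along(cohort))
  if (is.null(stepnames_short)) stepnames_short <- rep("", length(cohort))

  # flag exclusion steps (short name starts with "ex"); these are shown as
  # -n (-X%), i.e., the number/% of encounters removed at that step
  excl <- grepl('^ex', stepnames_short, ignore.case = TRUE)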

  steps <- sapply(cohort, nrow)

  # create table
  tabs <- data.table(steps, lag(steps))
  tabs[, percent.change := round(1000*steps/V2)/10] 

  # for exclusion steps, show removal as -n (-X%)
  # (index by which(excl) so exclusion steps don't have to come last)
  if (any(excl)) {
    tabs[which(excl), percent.change := -1*round(1000*(1-steps/V2))/10]
    tabs[which(excl), steps := -(V2-steps)]
  }
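  # combine n and % into one column; the first row has no previous step
  # (percent.change is NA), so only n is shown there
  tabs[, ':='(`N (%)` = ifelse(!is.na(percent.change),
                               paste0(steps, " (", percent.change, "%)"),
                               as.character(steps)),
              steps = NULL, percent.change = NULL, V2 = NULL)]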

  tabs <- cbind(stepnames, tabs)

  # add row with final cohort number (only needed if the last step shows an
  # exclusion)
  if (length(excl) > 0 && excl[length(excl)]) {
    tabs <- rbind(tabs, data.table(stepnames = "Final cohort",
                                   `N (%)` = as.character(nrow(cohort[[length(cohort)]]))))
    stepnames_short <- c(stepnames_short, " ")
  }

  tabs <- cbind(stepnames_short, tabs)

  colnames(tabs) <- c("", "Cohort creation step", "N (%)")

  return(tabs)
}

cohort_table <- cohort_creation_table(
  cohort = list(my_data, 
                my_data[gender == "F"], 
                my_data[gender == "F" & age >= 80]),
  stepnames = c("All GEMINI encounters", 
                "Gender = Female", 
                "Age < 80"),
  stepnames_short = c("Incl. 1", "Incl. 2", "Excl. 1")
)

print(cohort_table)


shijiaSMH commented 6 months ago

I recommend making the % optional, or just keeping the numbers. I have had PIs tell me that reporting number & percentage together is too overwhelming. I then showed only numbers, and there were no such complaints after this change.

My code is here:

create_inclusion.exclusion_table = function(cohort.formation, stepnames = NULL){

  sites = unique(cohort.formation[[length(cohort.formation)]][["hospital_num"]])
  sites <- c("Overall", sites)

  sites_list = vector("list")
  for (i in sites) {
    if (i == "Overall") {
      sites_list[[i]] <- cohort.formation
    } else {
      sites_list[[i]] <- lapply(cohort.formation,
                                function(x) x[hospital_num == i])
    }
  }

  sites_list_tabs = vector("list")
  for (i in sites) {
    steps <- sapply(sites_list[[i]], nrow)

    tabs = data.table(steps, lag(steps))
    # change and percent.change are computed here but not shown;
    # only the number of rows at each step is reported
    tabs[, change := (V2-steps)]
    tabs[, percent.change := round(change/V2*100)]
    tabs[, ':='(N = as.character(steps),
                change = NULL,
                steps = NULL,
                percent.change = NULL,
                V2 = NULL)]
    colnames(tabs) <- i

    sites_list_tabs[[i]] <- data.frame(tabs )
  }

  res <-  do.call(cbind, sites_list_tabs)
  rownames(res) = stepnames
  return(res)
}
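
To illustrate the "% as optional" idea in isolation, here is a minimal base R sketch. The helper name `format_step_counts` and its `show_percent` argument are hypothetical (not part of Rgemini), and the counts are made up for illustration.

```r
# Hypothetical helper: format the counts at each cohort step, with the
# percentage of the previous step shown only if requested.
format_step_counts <- function(n, show_percent = TRUE) {
  prev <- c(NA, head(n, -1))        # count at the previous step (NA for step 1)
  pct <- round(100 * n / prev, 1)   # % of the previous step that remains
  out <- as.character(n)
  if (show_percent) {
    has_pct <- !is.na(pct)          # the first step has no previous step
    out[has_pct] <- paste0(n[has_pct], " (", pct[has_pct], "%)")
  }
  out
}

format_step_counts(c(100, 52, 13))
# "100" "52 (52%)" "13 (25%)"
format_step_counts(c(100, 52, 13), show_percent = FALSE)
# "100" "52" "13"
```

With a switch like this, the numbers-only table becomes a matter of one argument rather than a separate function.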

gemini-wenb commented 4 months ago

Two suggestions:

1) It would be more flexible to return a data.frame instead of a kable. I feel that functions returning a formatted HTML object are too restrictive to meet different presentation needs, e.g., changing colnames, fonts, etc. For publications, journals may ask for tables in a different format, e.g., .csv or .pdf. So it would be more usable and flexible to return a data.frame object instead, and users can decide which packages to use to present the table in RMD as needed.

2) Would the logic be cleaner if we keep all criteria as exclusion criteria? For example, "Incl. 1 Valid OHIP" could be modified to "Excl. 1 Invalid OHIP", so that the data elimination steps follow a consistent logic.
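
As a rough illustration of point 1, the sketch below shows what a plain data.frame return value would let users do. The table contents are hypothetical placeholders standing in for the function's output, not real GEMINI numbers.

```r
# Hypothetical stand-in for the table returned by the function
cohort_table <- data.frame(
  step = c("All encounters", "Excl. 1: Invalid OHIP", "Final cohort"),
  n = c(100, -12, 88),
  stringsAsFactors = FALSE
)

# Users can rename columns to match a journal's requirements ...
names(cohort_table) <- c("Cohort creation step", "N")

# ... export the table as .csv ...
out_file <- tempfile(fileext = ".csv")
write.csv(cohort_table, out_file, row.names = FALSE)

# ... or render it in R Markdown with a package of their choice, e.g.:
# knitr::kable(cohort_table)
```

None of these steps would be possible without extra work if the function returned an already formatted HTML object.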

shijiaSMH commented 4 months ago

  1. Sounds right - data.frame or data.table would work
  2. I think that's decided outside of the function (or is it just me)? What usually happened to me was trying to stick with whatever inclusion/exclusion criteria are mentioned in the protocol, hoping this resonates better with PIs. However, I see your point that pure exclusion makes things clearer

loffleraSMH commented 4 months ago

  1. Yep, sounds good - I changed the example code to remove the printing
  2. Yes, I think it's good to have some flexibility with that. Right now, the function assumes by default that every step is an inclusion step: only steps whose short name matches "^ex" are interpreted as removals, with the % change shown as -X%. Users can easily skip this by not providing any short names, or by providing short names that are all interpreted as inclusion steps. We could change the default to interpret everything as an exclusion instead (but then the current use of -X% for exclusion steps probably doesn't make sense...). Let's discuss at the refinement meeting?
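
For reference, a quick sketch of how the "^ex" matching classifies short names. It uses the same `grepl()` call as the example code; the short names in the second call are just illustrative.

```r
# Short names from the example: only the last one is treated as an exclusion
is_exclusion <- grepl("^ex", c("Incl. 1", "Incl. 2", "Excl. 1"), ignore.case = TRUE)
is_exclusion
# FALSE FALSE TRUE

# Matching is case-insensitive and anchored at the start of the name
other <- grepl("^ex", c("Exclusion: age", "excl. 2", "Step 3"), ignore.case = TRUE)
other
# TRUE TRUE FALSE
```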

gemini-wenb commented 4 months ago

re: 2. Yes, as long as we give users the flexibility to present the steps in different ways (all inclusion steps, all exclusion steps, or a mix of inclusion and exclusion steps). There are pros and cons to any approach, and the most suitable approach will be case dependent. For example, when there are 10 steps and we stratify them by hospital, having a mix of inclusion and exclusion steps, where the % is presented as positive X% and negative -X%, can introduce confusion/complexity.
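
One possible way to support all three presentations is an explicit per-step type instead of inferring it from the short names. The sketch below is only an assumption about how that could look: `summarise_steps` and its `type` argument are hypothetical names, and the counts are made up.

```r
# Hypothetical sketch: each step is flagged as "incl" or "excl".
# Inclusion steps show the number of rows remaining; exclusion steps show
# the number of rows removed (as a negative number).
summarise_steps <- function(n, type) {
  stopifnot(length(n) == length(type), all(type %in% c("incl", "excl")))
  prev <- c(NA, head(n, -1))        # rows remaining after the previous step
  removed <- prev - n               # rows removed at this step (NA for step 1)
  data.frame(
    type = type,
    n_remaining = n,
    n_shown = ifelse(type == "excl", -removed, n),
    stringsAsFactors = FALSE
  )
}

n <- c(100, 52, 13)
all_incl <- summarise_steps(n, c("incl", "incl", "incl"))  # all inclusion steps
all_excl <- summarise_steps(n, c("incl", "excl", "excl"))  # exclusions after the starting cohort
mixed    <- summarise_steps(n, c("incl", "incl", "excl"))  # mix of both
all_excl$n_shown
# 100 -48 -39
```

Keeping `n_remaining` alongside `n_shown` means the final cohort size is always available, whichever presentation is chosen.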