Causal-LDA / TrialEmulation

https://causal-lda.github.io/TrialEmulation/
Apache License 2.0
14 stars 6 forks source link

check example from course #102

Open gravesti opened 1 year ago

gravesti commented 1 year ago

Recreate results from stata code https://github.com/bldestavola/TTE-Short-Course

krz commented 1 month ago

Hi,

I'm trying to understand this package better so I attempted to replicate the analysis from the short course you linked.

datalong <- as.data.frame(haven::read_dta("https://github.com/bldestavola/TTE-Short-Course/raw/main/Data/Practical2_contY_long.dta")
datalong$eligible <- 1

head(datalong)
  id t A    Y    L eligible
1  1 0 0 1.64 0.70        1
2  1 1 0 1.64 1.85        1
3  2 0 1 6.08 2.21        1
4  2 1 0 6.08 3.83        1
5  3 0 1 5.48 1.49        1
6  3 1 1 5.48 2.72        1

then I did

result <- initiators(
  data = datalong,
  id = "id",
  period = "t",
  eligible = "eligible",
  treatment = "A",
  estimand_type = "ITT",
  outcome = "Y",
  model_var = "assigned_treatment",
  outcome_cov = c("L"),
  use_censor_weights = FALSE
)
tidy(result$model)
# A tibble: 7 × 5
  term                estimate std.error statistic p.value
  <chr>                  <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)        -2.66e+ 1     4476. -5.94e- 3   0.995
2 assigned_treatment  3.39e-14     4122.  8.23e-18   1    
3 trial_period        3.93e-14     5510.  7.13e-18   1    
4 I(trial_period^2)  NA              NA  NA         NA    
5 followup_time       3.85e-14     5036.  7.64e-18   1    
6 I(followup_time^2) NA              NA  NA         NA    
7 L                   2.69e-15     1762.  1.53e-18   1  

but I have no idea how to interpret this. In the short course they found a treatment effect of around 1.11 for this example.

gravesti commented 1 month ago

Hi @krz, I never made progress on this issue because this package does not (yet) implement the models shown in that course. We have so far focused on survival outcomes where time 0 is not well defined, and a sequence of target trials modelling approach. We can do something similar to what is shown in practical 3, but we don't have IPTW. I'd be happy to try to explore this again.

krz commented 1 month ago

thanks for the quick response @gravesti . Would it be possible to replicate the García-Albéniz example in this post using your package?

gravesti commented 1 month ago

Yes, this is exactly the kind of analysis we can do. @krz Here is a sketch of the code based on that data

library(tibble)
library(dplyr)
library(tidyr)
library(ggplot2)
library(TrialEmulation)

df <- tibble::tribble(
  ~id, ~trial,             ~arm, ~fup_start, ~fup_end, ~cancer, ~covariate,
  "02",     0L, "No Colonoscopy",         0L,       9L,      0L,         1L,
  "03",     0L,    "Colonoscopy",         0L,       1L,      1L,         0L,
  "06",     0L, "No Colonoscopy",         0L,       6L,      0L,         1L,
  "02",     1L, "No Colonoscopy",         1L,       9L,      0L,         1L,
  "05",     1L, "No Colonoscopy",         1L,       5L,      1L,         0L,
  "06",     1L, "No Colonoscopy",         1L,       6L,      0L,         1L,
  "01",     2L, "No Colonoscopy",         2L,       4L,      0L,         2L,
  "05",     2L, "No Colonoscopy",         2L,       5L,      1L,         0L,
  "06",     2L, "No Colonoscopy",         2L,       6L,      0L,         1L,
  "01",     3L, "No Colonoscopy",         3L,       4L,      0L,         2L,
  "05",     3L, "No Colonoscopy",         3L,       5L,      1L,         0L,
  "06",     3L, "No Colonoscopy",         3L,       6L,      0L,         1L,
  "04",     4L, "No Colonoscopy",         4L,       6L,      1L,         1L,
  "05",     4L,    "Colonoscopy",         4L,       5L,      1L,         0L,
  "06",     4L, "No Colonoscopy",         4L,       6L,      0L,         2L,
  "07",     4L, "No Colonoscopy",         4L,       9L,      0L,         0L,
  "04",     5L, "No Colonoscopy",         5L,       6L,      1L,         2L,
  "06",     5L, "No Colonoscopy",         5L,       6L,      0L,         2L,
  "07",     5L, "No Colonoscopy",         5L,       9L,      0L,         0L,
  "06",     6L, "No Colonoscopy",         6L,       6L,      0L,         2L,
  "07",     6L, "No Colonoscopy",         6L,       9L,      0L,         0L,
  "07",     7L, "No Colonoscopy",         7L,       9L,      0L,         1L,
  "07",     8L, "No Colonoscopy",         8L,       9L,      0L,         1L,
  "07",     9L, "No Colonoscopy",         9L,       9L,      0L,         2L
)

# convert to longitudinal format
long_df <- df |> 
  full_join(expand_grid(id = unique(df$id), trial = 0L:9L), by = c("id", "trial")) |> # add all followup time
  group_by(id) |> 
  filter(min(fup_start, na.rm = TRUE) <= trial) |> # remove any periods before first eligibility of each id
  arrange(id, trial) |> # sort
  fill(fup_start, fup_end, cancer, covariate, arm, .direction = "down") |> # fill in values for all followup time
  mutate(
    eligible = if_else(trial == fup_start, 1L, 0L), # determine if eligible to start a trial at this time
    arm_binary = if_else(arm == "Colonoscopy", 1, 0),
    cancer_outcome = if_else(trial == fup_end, cancer, 0L) # if this is the final period, what is the outcome
  ) |> 
  rename(period = trial) # we are looking now at time, not "trial"

trial_data <- data_preparation(
  data = long_df,
  id = "id",
  period = "period",
  treatment = "arm_binary",
  outcome = "cancer_outcome",
  estimand_type = "ITT",
  outcome_cov = ~ covariate
)

trial_data$data # you can see we have rederived the trial sequence data but in a pool-logistic compatible format

msm_result <- trial_msm(
  data = trial_data,
  outcome_cov = ~ covariate,
  include_followup_time = ~followup_time + I(followup_time^2),
  include_trial_period = ~1
)  

pred_data <- trial_data$data |> 
  group_by(id) |> 
  filter(trial_period == min(trial_period)) # take the first time period per id

# Now do g-computation to get marginal effect on population
preds <- predict.TE_msm(msm_result, newdata = pred_data, predict_times = 0:9)

# plot the difference in cumulative incidence
ggplot(preds$difference, aes(x = followup_time, y = cum_inc_diff, ymin = `2.5%`, ymax = `97.5%`)) +
  geom_ribbon(fill = "blue", alpha = 0.25) +  geom_line()
krz commented 1 month ago

thank you so much @gravesti , that helped me understand it better. How do I interpret the result, is this a varying effect over time for each period? Can I get an overall treatment effect or hazard ratio?

gravesti commented 1 month ago

You can see the cumulative incidence (or survival) curves for each treatment strategy in the preds object. So you can interpret the results in preds$difference as a risk difference up to each time point, eg the difference in probability of surviving up to time=9.

(Alternatively, you might consider that the odds ratio estimate from the glm could approximate a hazard ratio as in Danaei et al https://journals.sagepub.com/doi/10.1177/0962280211403603)

There is some discussion in our draft manuscript: https://arxiv.org/pdf/2402.12083 see ~p3