amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0
428 stars 107 forks source link

[Question] Is there any way to pool Kaplan-Meier curves (including surv. probability and getting their CIs)? #545

Closed Generalized closed 1 year ago

Generalized commented 1 year ago

Dear Authors of mice.

I have a big imputed dataset of many variables. Some of them are used to derive a compound endpoint, let's call it "success of therapy". Now I want to plot the curves the cumulative probability of the success (= just 1-Kaplan-Meier survival) for two treatments, surrounded by their pointwise CIs.

The problem is I have several imputed datasets. Pooling works out of the box with Cox, but I want assumption-free estimator across 2 groups, the Kaplan-Meier.

Let's assume this trivial example. TIt has no gaps in survival or the used predictor - for the sake simplicity in this example. I want to see, if multiply generated the same dataset will give - after pooling - the same results as the raw approach.

Of course my REAL data will have some gaps. After the imputation the status will vary a bit between the samples. Times will be always the same.

So this my exemplary data. Not much meaningful, but suffices:

surv_data <- as.data.frame(list(time=c(4,3,1,1,2,2,3,5,2,4,5,1), 
                            status=c(1,1,1,0,1,1,0,0,1,1,0,0), 
                            x=c(0,2,1,1,4,7,0,1,1,2,0,1), 
                            sex=c(0,0,0,0,1,1,1,1,0,1,0,0)))

# Let's "impute" (generate) just 10 identical datasets. All will have same survival (no imputation in time, status and sex
imp <- mice(surv_data,m=10)

Should it be something like this?

km_imp <- with(imp, with(summary(survfit(Surv(time, status) ~ sex), times = 1:5, extend = TRUE),
                          data.frame(strata, time, surv, std.err, lower, upper)))

> km_imp$analyses %>% 
  bind_rows() %>%
  group_by(strata, time) %>% 
  summarize(pooled = data.frame(pool.scalar(Q=surv, U=std.err^2, n=12, k = 1)[c("qbar", "t")])) %>% 
  unpack(pooled) %>% 
  mutate(surv=qbar, SE = sqrt(t), LCI = qbar - 1.96*SE, UCI=qbar + 1.96*SE)

   strata  time  qbar      t  surv    SE     LCI   UCI
   <fct>  <int> <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl>
 1 sex=0      1 0.857 0.0175 0.857 0.132  0.598  1.12 
 2 sex=0      2 0.643 0.0443 0.643 0.210  0.230  1.06 
 3 sex=0      3 0.429 0.0503 0.429 0.224 -0.0110 0.868
 4 sex=0      4 0.214 0.0355 0.214 0.188 -0.155  0.584
 5 sex=0      5 0.214 0.0355 0.214 0.188 -0.155  0.584
 6 sex=1      1 1     0      1     0      1      1    
 7 sex=1      2 0.6   0.048  0.6   0.219  0.171  1.03 
 8 sex=1      3 0.6   0.048  0.6   0.219  0.171  1.03 
 9 sex=1      4 0.3   0.057  0.3   0.239 -0.168  0.768
10 sex=1      5 0.3   0.057  0.3   0.239 -0.168  0.768

which agrees with the result from complete data:

> with(summary(survfit(Surv(time, status) ~ sex, data=surv_data, conf.type="plain"), times = 1:5, extend = TRUE),
+      data.frame(strata, time, surv, lower, upper))
   strata time   surv  lower  upper
1   sex=0    1 0.8571 0.5979 1.0000
2   sex=0    2 0.6429 0.2304 1.0000
3   sex=0    3 0.4286 0.0000 0.8681
4   sex=0    4 0.2143 0.0000 0.5837
5   sex=0    5 0.2143 0.0000 0.5837
6   sex=1    1 1.0000 1.0000 1.0000
7   sex=1    2 0.6000 0.1706 1.0000
8   sex=1    3 0.6000 0.1706 1.0000
9   sex=1    4 0.3000 0.0000 0.7679
10  sex=1    5 0.3000 0.0000 0.7679

The negatives come from the normal approximation. I'll try to transform it later.

I tried also the "veteran" dataset.

> veteran = survival::veteran
> imp <- mice(veteran,m=10)
> km_imp <- with(imp, with(summary(survfit(Surv(time, status) ~ trt, data=veteran), times = c(3, 50, 100), extend = TRUE),
                          data.frame(strata, time, surv, std.err, lower, upper)))

km_imp$analyses %>% 
  bind_rows() %>%
  group_by(strata, time) %>% 
  summarize(pooled = data.frame(pool.scalar(Q=surv, U=std.err^2, n=Inf, k = 1)[c("qbar", "t")])) %>% 
  unpack(pooled) %>% 
  mutate(surv=qbar, SE = sqrt(t), LCI = qbar - 1.96*SE, UCI=qbar + 1.96*SE)

  strata  time  qbar        t  surv     SE   LCI   UCI
  <fct>  <dbl> <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1 trt=1      3 0.986 0.000207 0.986 0.0144 0.957 1.01 
2 trt=1     50 0.680 0.00317  0.680 0.0563 0.569 0.790
3 trt=1    100 0.502 0.00368  0.502 0.0606 0.383 0.621
4 trt=2      3 0.956 0.000620 0.956 0.0249 0.907 1.00 
5 trt=2     50 0.559 0.00363  0.559 0.0602 0.441 0.677
6 trt=2    100 0.333 0.00334  0.333 0.0578 0.219 0.446

> with(summary(survfit(Surv(time, status) ~ trt, data=veteran, conf.type="plain"), times = c(3, 50, 100), extend = TRUE),
     data.frame(strata, time, surv, lower, upper))

  strata time   surv  lower  upper
1  trt=1    3 0.9855 0.9573 1.0000
2  trt=1   50 0.6797 0.5693 0.7902
3  trt=1  100 0.5020 0.3831 0.6208
4  trt=2    3 0.9559 0.9071 1.0000
5  trt=2   50 0.5588 0.4408 0.6768
6  trt=2  100 0.3326 0.2195 0.4458

Good! They agree being are the same datasetes. Just wanted to check if the construction is correct, so when will pass truly different imputed datasets it should work too.

There is a minor rounding issue at upper CI or it was truncated by the survfit summary equally at 1.

I tried also the log-log to avoid these artefacts:

> km_imp$analyses %>% 
  bind_rows() %>%
  mutate(surv_log = log(-log(surv))) %>%
  group_by(strata, time) %>% 
  summarize(pooled = data.frame(pool.scalar(Q=surv_log, U=std.err^2, n=12, k = 1)[c("qbar", "t")])) %>% 
  unpack(pooled) %>% 
  mutate(surv=exp(-exp(qbar)), SE = (sqrt(t)/surv)/log(surv), 
         LCI = exp(-exp(qbar - 1.96*SE)) , UCI=exp(-exp(qbar + 1.96*SE)))

   strata  time     qbar        t  surv      SE       LCI     UCI
   <fct>  <int>    <dbl>    <dbl> <dbl>   <dbl>     <dbl>   <dbl>
 1 sex=0      1   -1.87    0.0175 0.857  -1.00    0.334     0.979
 2 sex=0      2   -0.817   0.0443 0.643  -0.741   0.151     0.902
 3 sex=0      3   -0.166   0.0503 0.429  -0.618   0.0583    0.777
 4 sex=0      4    0.432   0.0355 0.214  -0.571   0.00894   0.605
 5 sex=0      5    0.432   0.0355 0.214  -0.571   0.00894   0.605
 6 sex=1      1 -Inf     NaN      1     NaN     NaN       NaN    
 7 sex=1      2   -0.672   0.048  0.6    -0.715   0.126     0.882
 8 sex=1      3   -0.672   0.048  0.6    -0.715   0.126     0.882
 9 sex=1      4    0.186   0.057  0.3    -0.661   0.0123    0.719
10 sex=1      5    0.186   0.057  0.3    -0.661   0.0123    0.719

> with(summary(survfit(Surv(time, status) ~ sex, data=surv_data, conf.type="log-log"), times = 1:5, extend = TRUE),
     data.frame(strata, time, surv, lower, upper))

   strata time   surv    lower  upper
1   sex=0    1 0.8571 0.334054 0.9786
2   sex=0    2 0.6429 0.151467 0.9017
3   sex=0    3 0.4286 0.058274 0.7768
4   sex=0    4 0.2143 0.008937 0.6047
5   sex=0    5 0.2143 0.008937 0.6047
6   sex=1    1 1.0000 1.000000 1.0000
7   sex=1    2 0.6000 0.125730 0.8818
8   sex=1    3 0.6000 0.125730 0.8818
9   sex=1    4 0.3000 0.012302 0.7192
10  sex=1    5 0.3000 0.012302 0.7192

Perfect agreement.

But isn't it too simple, too naive? Would you accept such solution? For pooling with Rubin's rules, we assume approximate normality, and I heard the complementary log-log transformation should be applied to the survival probability. This is not the same as the log-log method of obtaining CIs for the survival probability used in survdif(), but related - it's the "mirror". What's your opinion on cloglog vs. log-log?

PS: If so, maybe you might consider adding pooling for KM estimates just to shorten the calculations and minimize the risk of errors/typos?