chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
53 stars 28 forks source link

flexsurvreg ignores factors' contrast matrices? #178

Closed anddis closed 5 months ago

anddis commented 6 months ago

Hello Chris,

flexsurvreg seems to ignore the contrast matrix associated with a factor.

In the reproducible toy example at the end of this message, the object fit.c ignored the contrast matrix associated with the celltype2 factor variable. In fact, the estimates were similar (but not identical!) to those in fit.noc, where no contrast matrix was associated to the factor specified on the rhs of the formula passed to flexsurvreg (celltype).

As a benchmark, I used the glm function, which correctly handle contrast matrices, to fit an equivalent Poisson model (fit.poi).

                     fit.noc      fit.c    fit.poi  empirical
rate              -5.4206777 -5.4206576 -4.7764993 -4.7764993
celltypesmallcell  1.0841135  1.0840659  1.0841135  1.0841135
celltypeadeno      1.2223197  1.2222489  0.1382062  0.1382062
celltypelarge      0.2702805  0.2702119 -0.9520392 -0.9520392

Is there anything I should be aware of to use contrast matrices with flexsurvreg?

Thank you for this great package!

library(survival)
library(flexsurv)
library(MASS)

veteran$celltype <- factor(veteran$celltype)
fit.noc <- flexsurvreg(Surv(time, status) ~ celltype,
                       data = veteran, 
                       dist = "exp")

veteran$celltype2 <- veteran$celltype
contrasts(veteran$celltype2) <- MASS::contr.sdif(4)
fit.c <- flexsurvreg(Surv(time, status) ~ celltype2,
                       data = veteran, 
                       dist = "exp")

fit.poi <- glm(status ~ celltype2 + offset(log(time)),
               data = veteran,
               family = "poisson")

log.ir <- log(tapply(veteran$status, veteran$celltype, sum) / 
  tapply(veteran$time, veteran$celltype, sum))

cbind(fit.noc = coef(fit.noc), 
      fit.c = coef(fit.c), 
      fit.poi = coef(fit.poi), 
      empirical = c(mean(log.ir), diff(log.ir)))
chjackson commented 6 months ago

I think this should now be fixed in https://github.com/chjackson/flexsurv/commit/577b54c7aaf1ed42947f266ae0e420ce2d979574 - see here for how to install the development version. I've never used non-default contrasts (never seen a need for them in practice?) so have never noticed this behaviour. Thanks for the report and the concise test case.

anddis commented 5 months ago

Thank you for the quick fix!

For your information: predict.flexsurvreg ignores the contrast matrix. Parameter estimates are correct now but predictions are not.

> predict(fit.noc, type = "hazard", time = 1) |> unique()
# A tibble: 4 × 2
  .eval_time .pred_hazard
       <dbl>        <dbl>
1          1      0.00442
2          1      0.0131 
3          1      0.0150 
4          1      0.00580
> predict(fit.c, type = "hazard", time = 1) |> unique()
# A tibble: 4 × 2
  .eval_time .pred_hazard
       <dbl>        <dbl>
1          1      0.00843
2          1      0.0249 
3          1      0.00967
4          1      0.00325
Warning message:
contrasts dropped from factor celltype2 
chjackson commented 5 months ago

Does https://github.com/chjackson/flexsurv/commit/640a58e13ad7422196a6ee9188e54228733a205a fix this issue for you (models will have to be refitted)? As I understand it, since changing the contrasts just results in a reparameterisation of the same model, the predictions should be the same (assuming that the optimizer identifies the correct MLE in each case)?

anddis commented 5 months ago

Yes, it does fix the prediction issue. Thank you very much! And yes again, the predictions should be the same for the very reason you mentioned.

OT: your remark about the optimizer finding the correct MLEs is interesting. I have run some simple piecewise constant hazard models and noticed that when using non-default contrast matrices the predicted hazards won't always replicate their empirical counterparts. I never noticed this behaviour with the default contrast coding. See below for a reproducible example.

For me, you can close this issue as solved. Cheers!

library(survival)
library(flexsurv)
library(MASS)
cancer$status <- cancer$status - 1
cancer.long <- survSplit(Surv(cancer$time, cancer$status) ~ ., 
                         cancer,
                         cut = c(150, 400),
                         episode = "timeband",
                         event = "status")
cancer.long$timeband <- factor(cancer.long$timeband)

fit.freg <- flexsurvreg(Surv(tstart, tstop, status) ~ timeband,
                        data = cancer.long,
                        dist = "exp")
pred.fit.freg <- unique(predict(fit.freg, type = "hazard", time = 1)$.pred_hazard)

cancer.long$timebandc <- cancer.long$timeband
contrasts(cancer.long$timebandc) <- codingMatrices::contr.diff(3)

fit.freg.diff <- flexsurvreg(Surv(tstart, tstop, status) ~ timebandc, 
                             data = cancer.long,
                             dist = "exp")
pred.fit.freg.diff <- unique(predict(fit.freg.diff, type = "hazard", time = 1)$.pred_hazard) 

ir <- tapply(cancer.long$status, cancer.long$timeband, sum) / 
  tapply(cancer.long$tstop-cancer.long$tstart, cancer.long$timeband, sum)

cbind(ir = ir, 
      pred.fit.freg =  pred.fit.freg, 
      pred.fit.freg.diff = pred.fit.freg.diff) 
           ir pred.fit.freg pred.fit.freg.diff
1 0.001530048   0.001530049        0.001531380
2 0.002918468   0.002918466        0.002919967
3 0.003303405   0.003303404        0.003302016