chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
53 stars 28 forks source link

FlexSurvMix Data Format / Model #141

Closed AustinFournierKL closed 1 year ago

AustinFournierKL commented 1 year ago

Hello,

I am currently trying to fit a model using flexsurvmix. However, I'm having a bit of a difficult time figuring out how the input is supposed to be formatted. I found this piece of documentation (https://github.com/chjackson/flexsurv-dev/blob/master/R/flexsurvmix.R#L29) but I'm having some issues interpreting it (possibly related - it seems some formatting tags attached to this doc didn't get processed properly by the website).

Mostly, I think what I'm not understanding here as where I'm supposed to put the labels for the multiple competing events. Does is replace the column that usually consists of 0s and 1s? Is it an additional column tacked on somewhere, such that for a censored case the data input would be one column saying for how long the case was observed, a label column with a NA indicating that we never saw an end to the wait and don't know which type of event would have ended up happening, and the normal event column with a 0 to indicate right censoring? In any case, an example of the syntax and the object being fed into it might be very helpful; flexsurvmix doesn't currently seem to have code examples.

Also, I'd like to double check that I'm understanding the model fit by flexsurvmix correctly. What I'm getting from reading the documentation is that flexsurvmix fits the model you'd want to use if some observations were drawn from one distribution (lets call it time to death) that's distributed Weibull(shape1, scale1), and others are from another distribution (time to recovery) that's distributed Weibull(shape2, scale2), and you simply don't know which one it's going to be until you actually observe the event. This is in contrast to other models where the event would be decided by a race between the two distributions (in which case the observed survival time for non-censored cases would be Min(Weibull1, Weibull2). Does it seem like I'm understanding this correctly?

chjackson commented 1 year ago

That interpretation sounds correct. Did you see the vignette? The data formatting is demonstrated in Section 3.2. I'll put a link to it in the help page.

AustinFournierKL commented 1 year ago

Thanks for the reply. I finally had time to take a look at it the vignette you linked and run a test.

It runs, but I'm getting an unexpected trend in the parameter estimates, so I was wondering if you could see anything wrong with what I'm doing:

library(tidyverse)
library(flexsurv)

testsamp1 = 400
testsamp2 = 800
shape1 = 2 #Death
scale1 = 5 #Death
shape2 = 3 #Recovery
scale2 = 4 #Recovery

s1 = cbind(rweibull(n = testsamp1, shape = shape1, scale = scale1), runif(n=testsamp1, min = 0, max = 6))
s2 = cbind(rweibull(n = testsamp2, shape = shape2, scale = scale2), runif(n=testsamp2, min = 0, max = 6))
sg = rbind(s1,s2)
sg = data.frame(sg)
colnames(sg) <- c("censtime", "quake_wait_time")
sg <- mutate(sg, status = as.numeric(censtime > quake_wait_time))
sg <- mutate(sg, wtime = as.numeric(ifelse(status == 1, quake_wait_time, censtime)))
sg <- mutate(sg, event = as.factor(ifelse(status == 0, NA, ifelse(row_number() <= testsamp1, "Death", "Recovery"))))
mod3 <- flexsurvmix(Surv(wtime, status) ~ 1, event = event, dists = c(Death = "weibull", Recovery = "weibull"), data = sg)
mod3

The problem with this is that the shape parameters keep getting estimated at about 1.5, weirdly consistently. For example:

Estimates: 
   component     dist  terms    est  est.t      se
1      Death           prob1  0.397     NA  0.0988
2   Recovery           prob2  0.603  0.417  0.0988
3      Death  weibull  shape  1.478  0.391  0.0536
4      Death  weibull  scale  4.210  1.437  0.0601
5   Recovery  weibull  shape  1.351  0.301  0.0409
6   Recovery  weibull  scale  3.324  1.201  0.0497

Log-likelihood = -2171.355, df = 5
AIC = 4352.71

Can you see something I'm doing wrong that could lead to this? Or this just bad luck / ML estimation being biased under some circumstances?

chjackson commented 1 year ago

Did you label the columns the wrong way round? I'd have expected the event time to be the Weibull random variable and the censoring time to be the uniform one.

AustinFournierKL commented 1 year ago

Yes, it seems that was indeed the problem! With that error in my code fixed, flexsurvmix is working as I expected, so I can now move on to applying it in practice. Thank you for the help!