chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
53 stars 28 forks source link

Problem with flexsurvreg Weibull AFT #139

Closed AustinFournierKL closed 1 year ago

AustinFournierKL commented 1 year ago

Hello,

I'm attempting to fit an AFT Weibull model with flexsurvreg, which, if I understand from the help page on the function, should be the default model that the function fits when you specify that you want a Weibull regression. I managed to figure out the CDF for a Weibull AFT, take its inverse, and simulate data from it. However, when I use flexsurvreg on my simulated data, it does not appear to be correctly estimating the coefficient on my covariate (shape and scale estimates, on the other hand, seem quite accurate).

That said, I am not sure whether I am perhaps making a mistake with the model fitting procedure, or perhaps misinterpreting the outfit window. My code was as follows:

set.seed(1)
#sample size
testsamp = 400
#weibull default parameters
shape = 2
scale = 5
#generating the covariate and its effect. 40% of datapoints should come from
#an "accelerated" weibull, representing a place where the wait time to an 
#earthquake is half as long
percent_scary = .4
coeff = log(2)
covsamp = rbinom(testsamp,1,percent_scary)
c= coeff * covsamp

#initial model with no coveriate used to test the model fitting procedure generally
quake_wait_time = rweibull(n = testsamp, shape = shape, scale = scale)
censtime = runif(n = testsamp, min = 0, max = 6)
event = censtime > quake_wait_time
cyframe <- data.frame(cbind(quake_wait_time,censtime,event))
cyframe <- mutate(cyframe, time = as.numeric(ifelse(event == 1, quake_wait_time, censtime)))
head(cyframe)
hist(cyframe$quake_wait_time)
hist(cyframe$time)
mod1 <- flexsurvreg(formula = Surv(time = time, event = event) ~ 1, data = cyframe, dist = "weibull")
mod1

#version two of the model, where inverse transform sampling is used to generate
#data according to the distributional assumptions of a weibull AFT
usamp = runif(testsamp, min = 0, max = 1)
AFTsamp = (((-scale^shape)*log(1-usamp))/(exp(c*shape)))^(1/shape)
event = censtime > AFTsamp
cyframe2 <- data.frame(cbind(covsamp,AFTsamp,censtime,event))
cyframe2 <- mutate(cyframe2, time = as.numeric(ifelse(event == 1, AFTsamp, censtime)))
#checks on the data transformation
head(cyframe2)
hist(cyframe2$AFTsamp)
#model fitting, attempting to use covsamp as the control variable in a Weibull
#AFT
mod2 <- flexsurvreg(formula = Surv(time = time, event = event) ~ covsamp, data = cyframe2, dist = "weibull")
mod2

Here, the final model I got returned:

Estimates: 
         data mean  est      L95%     U95%     se       exp(est)  L95%   
shape         NA     1.8651   1.6595   2.0960   0.1111       NA        NA
scale         NA     4.7404   4.2016   5.3483   0.2918       NA        NA
covsamp   0.3700    -0.5853  -0.7512  -0.4193   0.0847   0.5570    0.4718
         U95%   
shape         NA
scale         NA
covsamp   0.6575

N = 400,  Events: 169,  Censored: 231
Total time at risk: 814.5138
Log-likelihood = -378.7206, df = 3
AIC = 763.4412

So the estimate for the coefficient on on covsamp would seem to be -0.5853, and when exponentiated it comes out to ~.557. In contrast, I was expecting a coefficient of ~.69, which would come out to around 2 when exponentiated. What could be going on here?

In case this part of my code is suspect, I will note that the CDF on which I based my random sampling was: 1 - exp(-(((t^shape)*(exp(c*shape)))/scale**shape)) In the above equation, shape and scale are exactly what they say on the label, t is time, and c is the sum of all covariates times their coefficients. I tested and confirmed that if c = log(2), this expression will return the same probability as the function below, at half the value of t: pweibull(t, shape = shape, scale = scale) Likewise, if I set c to log(3), it would gave the same results at one third the value of t.

With this confirmed I took the inverse function: (((-scale^shape)*log(1-prob))/(exp(c*shape)))^(1/shape) In the above, prob is the input to the function; I tested and confirmed that it would give me back the correct value of t when I put in the output from CDF shown above.

AustinFournierKL commented 1 year ago

On further review of the results, it looks to me like maybe the issue here is that I implemented the AFT in such a way that I have to take the negative of the coefficient flexsurvreg estimates to get the correct coefficient for my equation. Can you tell if that's correct?

chjackson commented 1 year ago

There's a lot of detail in your post, so it's hard to tell instantly, but that's probably what is happening. The model that flexsurvreg fits here is fully specified in https://cran.r-project.org/web/packages/flexsurv/vignettes/distributions.pdf so you could cross-check with that.

AustinFournierKL commented 1 year ago

Okay, the difference between my final equation and yours was more or less as I expected. On that note, what I wrote above was in error on one point; for my equation to be correct, c would be the negative of the sum of all covariates times their coefficients; I seem to have dropped the negative at some point. The resulting equation was still an AFT, but parameterized slightly differently than originally intended, and thus different from what flexsurv assumed.