chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
53 stars 28 forks source link

simulate implementation #138

Closed mclements closed 1 year ago

mclements commented 1 year ago

Chris,

I would like to implement simulate for different survival models for the forthcoming RHTA book. I have drafted implementations of simulate.stpm2 and simulate.survreg. I then drafted simulate.flexsurvreg -- and noticed that you .gitignore'd for R/simulate* etc.

Three questions: first, have you already written this function? Second, is the following function useful? Third, do you agree with the use of a t0 argument for delayed entry?

As an aside, I like the addition of the cross argument in the summary.flexsurvreg function.

Sincerely, Mark Clements.

#' Simulate event times from a flexsurvreg object
#' @param object flexsurvreg object
#' @param nsim number of simulations per row in newdata
#' @param seed random number seed
#' @param newdata data-frame for defining the covariates for the simulations. Required.
#' @param t0 delayed entry time. Defaults to NULL (which assumes that t0=0)
#' @param ... other arguments (not currently used)
#' @return vector of event times with nsim repeats per row in newdata
#' @importFrom stats simulate
#' @rdname simulate
#' @export
#' @examples
#' fit <- flexsurvreg(formula = Surv(futime, fustat) ~ rx, data = ovarian, dist="weibull")
#' fit2 <- flexsurvspline(formula = Surv(futime, fustat) ~ rx, data = ovarian, k=3)
#' nd = data.frame(rx=1:2)
#' simulate(fit, seed=1002, newdata=nd)
#' simulate(fit, seed=1002, newdata=nd, t0=500)
#' simulate(fit2, nsim=3, seed=1002, newdata=nd)
#' simulate(fit2, nsim=3, seed=1002, newdata=nd, t0=c(500,1000))
simulate.flexsurvreg =
    function(object, nsim=1, seed=NULL, newdata, t0=NULL, ...) {
        stopifnot(is.data.frame(newdata))
        if (!is.null(seed)) set.seed(seed)
        nd = nrow(newdata)
        n = nd*nsim
        if (!is.null(t0)) {
            stopifnot(length(t0) %in% c(1, nd))
            if (length(t0)==1) t0=rep(t0,nd)
            S0 = summary(object, newdata=newdata,
                         type="survival", t=t0,
                         cross=FALSE, ci=FALSE, se=FALSE, tidy=TRUE)$est
            F0 = 1-S0
            F0 = rep(F0, each=nsim)
        } else F0 = rep(0,n)
        U = runif(n, F0, 1)
        newdata = newdata[rep(1:nd, each=nsim), , drop=FALSE]
        summary(object, newdata=newdata, 
                type="quantile", quantiles=U,
                cross=FALSE, ci=FALSE, se=FALSE, tidy=TRUE)$est
    }
chjackson commented 1 year ago

Hi Mark - yes, I had done a rough version for my own use, but never got around to polishing it for the package. Your version is helpful for that because it implements it in a more sensible way than I had done (wrapping summary() rather than using the r function corresponding to the fitted distribution. this makes newdata easier to handle). I'll polish this and put it in the package - thanks.

mclements commented 1 year ago

Chris,

This is this good news.

Regarding my third point: do you agree with the use of the t0 argument for delayed entry?

On a related point: I had an interesting discussion with Terry Therneau about the use of left truncation with AFT models (the following doesn't apply for the Royston-Parmar models). Terry has deliberately not implemented left truncation in survival::survreg because it can be wrong. Following Cox and Oakes, $$h(t|x) = h_0\left(\int_0^t \exp( \beta^T x(u)) du\right) \exp(\beta^T x(t))$$ where $x(t)$ encodes for a time-varying exposure, rather than a time-varying effect. To calculate $$t^*=\int_0^t \exp(\beta^T x(u)) du$$ we need the full exposure history between 0 and $t$. If the covariate is constant over time, then I believe that the standard log-likelihood formulation for left truncation holds, such that we can add the cumulative hazard at the delayed entry time to condition on survival to that time. In general, one could warn about possible model misspecification with delayed entry data. Alternatively, one could (a) include an id for individuals in the fit, (b) warn if an individual has any gaps in their exposure history, and (c) calculate the likelihood based on their exposure history. I don't know whether any AFT implementations follow the second approach. Your thoughts?

chjackson commented 1 year ago

It's fine to have an argument for delayed entry. Though I'd call it start rather than t0 for consistency with the name of the argument to summary.flexsurvreg. The implementation could also be made simpler by getting the quantiles of the truncated distribution with summary(..., start=t0, type="quantile") - though I just noticed the implementation of that was wrong - I think this is now fixed in https://github.com/chjackson/flexsurv-dev/commit/5111d20c6a325dcdf3d2d2c808709e6cc2bd3d91 .

That is a useful warning about time-dependent covariates. Support for time-dependent covariates is patchy in flexsurv - I'll add this warning to the vignette section on model fitting:

"Whether this is a valid approach in \pkg{flexsurv} depends on whether the probability of survival up to the left-truncation time can be represented by a term of the form $S_i(s_i) = exp(-H_i(s_i))$ in the likelihood (equation 2). The cumulative hazard $H$ over the interval from time 0 to time $s_i$ depends on how the covariates change on this time interval. If the covariates are constant between time 0 and time $s_i$, or if covariates are modelled with proportional hazards, then this cumulative hazard is $H_i(s_i)$, hence the likelihood used by \pkg{flexsurv} is valid. It is not valid in general however for other forms of dependence on covariates, e.g. accelerated failure time models."

chjackson commented 1 year ago

I've implemented a simulate.flexsurvreg function in this commit https://github.com/chjackson/flexsurv-dev/commit/b62bb57e26b385a8abc28f250b7220c5a6cbb1c1 . I've added a few features and defaults compared to your version. Does it look as if it does the right thing to you, and fit with what you were doing for for the RHTA example?

mclements commented 1 year ago

Chris,

Your code looks good. I have also adjusted simulate in rstpm2 to use the start argument.

Thank you for looking at this.

Sincerely, Mark.

chjackson commented 1 year ago

Thanks, Mark