chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
54 stars 28 forks source link

Cumulative incidence function from cause-specific flexsurv models #90

Closed mattwarkentin closed 3 years ago

mattwarkentin commented 3 years ago

Hi @chjackson,

I am wondering if there is currently any functionality to obtain the cumulative incidence function in the presence of competing risks when one uses flexsurvreg() or flexsurvspline() to fit cause-specific hazards models. I don't think such a function exists in flexsurv, unless I've missed it. Maybe something exists in another package that I'm not aware of.

I guess more broadly, what, if anything, is currently missing from flexsurv to be able to "follow along" with an example such as the one presented here, which is implemented in Stata using stpm2 and stpm2cif?

Specifically, in the linked paper, the hazard functions in Eq. (4) could come from flexsurv, right? The predict.flexsurvreg and summary.flexsurvreg methods already provide tools for obtaining hazard functions, so would we just need a function that essentially implements Eq. (4) (shown below)?

Screen Shot 2021-01-12 at 6 26 46 PM

Many months ago I hacked something together to do just this, but I could certainly clean it up and spin up a PR, if you would be amenable to it. I look forward to hearing from you.

I could envision an API something like:

mod1 <- flexsurvreg(Surv(time, status==1) ~ x + y + z, data = dd, dist  = "gengamma")
mod2 <- flexsurvreg(Surv(time, status==2) ~ x + y + z, data = dd, dist  = "gengamma")

# Assemble an arbitrary set of N cause-specific models
set <- assemble(mod1, mod2)

# predict method for obtaining CIF from set
predict(set, ...)
chjackson commented 3 years ago

As I understand it, this quantity is the probability that someone has had event k by time t. This is what pmatrix.fs does, but in the more general situation of continuous-time multi-state models. For a competing risks model, this is the transition probability between the "starting state" and "had event k" over a time interval t. It's computed using numerical ODE solver, so perhaps this is overkill for competing risks models if the formula (4) above is all that's needed - this can be taken directly from the hazards and cumulative hazards.

mattwarkentin commented 3 years ago

So if I understand correctly, since an absorbable competing risk model where a subject can transition to only one of k mutually exclusive events is just a special case of a multi-state model, you can fit a competing risk model with flexsurvreg() and obtain the cumulative incidences for time t using pmatrix.fs()?

I guess the downside of this approach is you are constrained to using the same distribution and the same set of variables for all events. Right? I guess you could create interactions between the transition ID and variables to have event-dependent covariate effects.

Whereas in the original post, you could theoretically fit separate cause-specific models with flexsurvreg() or flexsurvspline() with different variables and different distributions, and the CIF could be obtained with (4). Am I on the right track?

chjackson commented 3 years ago

Different distributions are allowed for different transitions in the devel version. There's a function fmsm to group together different fitted models for different transitions to form a multi-state model. pmatrix.fs and other summary functions can be used on fmsm objects. You could mix spline and non-spline models.

chjackson commented 3 years ago

See also the new vignette on multi-state modelling https://github.com/chjackson/flexsurv-dev/blob/master/vignettes/multistate.pdf

mattwarkentin commented 3 years ago

Amazing! I think fmsm amd pmatrix.fs connect all the dots. Thanks for sharing that vignette, it looks extremely useful and I look forward to reading it over.

Do you know when you might submit the current devel version to CRAN?

chjackson commented 3 years ago

Not yet. I've been actively working on some multi-state model stuff recently, so I've been cautious at declaring the new features "complete". However if the CRAN folks ask for a new version to make it work with an updated version of R , as they sometimes do, then I'll push the new version straight away.

mattwarkentin commented 3 years ago

Thanks again for the reply. The vignette was very helpful.

I'm not sure if this is the right place to ask this question, but in following with our discussion about fitting cause-specific hazards models and then getting cumulative incidences/transition probabilities, I wanted to get your thoughts on whether the following simple example looks right to you.

I am interested in using age as the time scale, rather than zero-based follow-up time. I fit separate models for cause 1 and 2 with a single predictor X. And lastly, I want to predict transition probabilities at, say, 2 years conditional on age at entry (i.e. based on your age at entry and covariate X, what is your risk of cause k in 2 years?).

library(flexsurv)
#> Loading required package: survival
library(tidyverse)

set.seed(12345)
data <-
  tibble(
    age_enter = rnorm(100, 60, 5),
    age_exit = age_enter + abs(rnorm(100, 5)),
    event = sample(0:2, 100, replace = TRUE),
    x  = rnorm(100)
  )

mod1 <- flexsurvreg(Surv(age_enter, age_exit, event==1) ~ x,
                    data = data, dist = "gengamma")

mod2 <- flexsurvreg(Surv(age_enter, age_exit, event==2) ~ x,
                    data = data, dist = "gengamma")

tmat <- rbind(
  c(NA, 1, 2),
  c(NA, NA, NA),
  c(NA, NA, NA)
)

both <- fmsm(mod1, mod2, trans = tmat)

data %>% 
  mutate(
    # Everyones age plus 2 years
    age_plus2 = age_enter + 2,
    # Iterate over X and age_plus2 and extract risk of 0->1 transition prob.
    prob = map2_dbl(
      age_plus2, x,
      ~ pmatrix.fs(both, tmat, t = .x, newdata = tibble(x = .y))[1, 2]
      )
  )
#> # A tibble: 100 x 6
#>    age_enter age_exit event       x age_plus2  prob
#>        <dbl>    <dbl> <int>   <dbl>     <dbl> <dbl>
#>  1      62.9     68.2     2  0.876       64.9 0.349
#>  2      63.5     67.4     1 -0.369       65.5 0.405
#>  3      59.5     64.9     0  0.391       61.5 0.327
#>  4      57.7     61.4     2  0.928       59.7 0.292
#>  5      63.0     68.2     0  1.37        65.0 0.331
#>  6      50.9     55.4     0 -0.238       52.9 0.210
#>  7      63.2     67.8     1 -0.468       65.2 0.404
#>  8      58.6     65.2     1  0.0940      60.6 0.325
#>  9      58.6     63.1     1  0.0951      60.6 0.324
#> 10      55.4     60.7     0  0.313       57.4 0.273
#> # … with 90 more rows
chjackson commented 3 years ago

Can't see anything wrong with the syntax for the model you intended, on a very superficial glance!

I think this models the time from age zero to the event as a generalized gamma, and the time to event is left truncated at the start of each person's follow up. I haven't done much of this kind of model with age as a timescale. I wonder if you would need to have observations from a wide range of ages including near zero, to be able to identify the shape of the distribution? By the looks of it your data contain only older people.

mattwarkentin commented 3 years ago

Thanks for looking it over.

I haven't done much of this kind of model with age as a timescale. I wonder if you would need to have observations from a wide range of ages including near zero, to be able to identify the shape of the distribution? By the looks of it your data contain only older people.

Thats a good question. In all of the examples I've seen which used age as the timescale, it was always the case that it was an aged population with everyone 40+. In that case, birth is the time of origin or when a person becomes at-risk, however we only observe their risk when they enter the study at baseline (i.e. enter the risk set).

mattwarkentin commented 3 years ago

Hi @chjackson,

I have a question that I am hoping I could get your thoughts on.

When using flexsurvspline(), I get to to jointly estimate both the hazard ratios and the baseline hazard. If, let's say, I had some evidence to suggest that the baseline hazard in my data (and as such, the model-estimated hazards) are biased downwards by some known factor, is there a way to use this factor to adjust the model-estimated hazards by this factor? Is this sort of information only able to be used post-hoc or can it somehow be useful during fitting?

I suppose the answer is, if you think your data is biased for the question at hand, don't use it, but let's ignore that for discussions sake!

chjackson commented 3 years ago

With the bhazard argument you can express the hazard as a known population hazard plus a fitted "excess" hazard from the parametric model - look up "relative survival" methods. But that sounds slightly different from what you want, which is to adjust the hazard from the data to represent a different population. I think it depends how you parameterise the bias - e.g. how does the bias vary as a function of time and with covariates? If it can be expressed so you can adjust post hoc, then that'd be easiest.

mattwarkentin commented 3 years ago

Hi @chjackson,

I don't know if this is the right place to ask this question but I am wondering about how to use the condstates argument for pmatrix.fs(). The documentation for the argument does not make it clear what type of input it expects:

Instead of the unconditional probability of being in state s at time t given state r at time 0, return the probability conditional on being in a particular subset of states at time t. This subset is specified in the condstates argument. This is used, for example, in competing risks situations, e.g. if the competing states are death or recovery from a disease, and we want to compute the probability a patient has died, given they have died or recovered. If these are absorbing states, then as t increases, this converges to the case fatality ratio. To compute this, set t to a very large number, Inf will not work.

As a follow-up, it is my understanding that pmatrix.fs() returns the probabilities of being in state s at time t given that a person was in state r at time 0. For a competing risk model, we can assume r is some healthy/baseline state, and s is any one of the absorbing states.

If I was instead interested in the conditional probability of being in state s at time t given that a person was in state r at time t0, where t0 could be any time before time t, is this achievable with pmatrix.fs(). In other words, the probability in being in some state at time t given you were healthy up to time t0. I was unsure whether I needed to use condstates, or simply just subtract the estimate at time t0 from the estimate at time t.

mattwarkentin commented 3 years ago

Just in case the above was unclear, I believe pmatrix.fs() gives:

But I am interested in: Screen Shot 2021-03-11 at 4 29 33 PM

I believe subtracting pmatrix.fs() at time t0 from pmatrix.fs() at time t0 + 2 give this quantity, but wanted to get your thoughts.

chjackson commented 3 years ago

condstates is a vector of character labels or integers indicating the states to condition on. I'll clarify the doc.

For your case, condstates won't work, as it conditions on being in a subset of states at time t, where as you want to condition on being in one specific state at time t0. For a competing risks model with no onward transitions as you say, it looks like you can subtract as you say to get the formulae you are after. But I've never used this personally.