Adding accelerated failure time modelling to primarycensoreddist

SamuelBrand1 commented 2 weeks ago

What is this issue?

This issue is about discussing exposing accelerated failure time models in primarycensordist. The goal is to be as flexible as https://package.epinowcast.org/articles/model.html#generalised-model-1 Gunther et al, but to integrate directly with the work done on survival functions here.

Accelerated failure time (AFT) models

The basic idea of AFT survival models is that we start with a baseline delay model with hazard function for time since primary $t$: $\lambda_0(t)$ and survival function $S_0(t)$. Then we treat effects as either accelerating (i.e. secondary event arrives relatively sooner) or decelerating (later) the time until secondary event. For fixed effects:

\begin{aligned}
\lambda(t | \theta) &= \theta \lambda_0 (\theta t),\\
S(t | \theta) &= S_0(\theta t), \\
\theta &= \exp\left(X \beta \right).
\end{aligned}

Fixed effect on the hazard can be extended time-varying effects (e.g. a particular time period has decelerated hazard in reporting) (cf. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10346080/pdf/kxac009.pdf ). Rewriting the linked paper in approx common notation:

\begin{aligned}
H(t) &= \int_0^t \theta_s ds, \\
\lambda(t | (\theta_s)_{s\geq0}) &= \theta_t \lambda_0 (H(t)),\\
S(t | (\theta_s)_{s\geq0}) &= S_0(H(t)), \\
\theta_t &= \exp\left(X \beta + \eta_t\right).
\end{aligned}

Possible usage in `primarycensordist`

We have access to various analytic expressions for the survival function of censored delays (e.g. $S_0$ in notation above). Therefore, given a calculation of the cumulative hazard $H(t)$, the probability of a secondary time in $[T, T+ w_S]$ in the AFT framework is $S_0(H(T)) - S_0(H(T+w_S))$, which is potentially not that hard to calculate (maybe?) and avoids having to renormalize as per the Gunther discrete time model.

The three hazard aspects consider here https://package.epinowcast.org/articles/model.html#generalised-model-1 seem to be able to be address pretty well in this framework.

Parametric baseline hazard: We have the baseline hazard/survival func in many cases as a function of its distribution parameters. That makes date of reference regression on them relatively more convenient.
Non-parametric reference date effect: This can be modelled as a fixed effect on delays originating within a primary censor window of the reference date (e.g. the $X\beta$ term above).
Non-parametric report date effect: This can be modelled as a time-varying effect which is constant on report dates.

This is not very different from the Gunther model, but it brings things into a common continuous time framework and potentially eliminates normalisation. Might be a bit more efficient (I'm not sure though).

SamuelBrand1 commented 2 weeks ago

PS I didn't have the same stats vocabulary but its the underlying modelling approach me and @MattKeeling used for temperature driven event modelling for biting midges https://royalsocietypublishing.org/doi/full/10.1098/rsif.2016.0481

The code is fairly horrible MATLAB. But the underlying computational idea is that for any given model parameters you can do one sweep of $H(t)$ relative to some start time before the data, cache this as an array, then you can get the relevant values of $H(t)$ for specific delays as differences along points of the array.

e.g.

The accumulated hazard between $t_1$ and $t_2$:

H(t_1,t_2) = \int_{t_1}^{t_2} \theta_s ds = H(0, t_2) - H(0, t_1).

seabbs commented 2 weeks ago

My first pass take is I think expressing the primary censored distributions as disrete hazards seems very in scope here if we first do some thinking about whether or not we expect it to be more numerically efficient/stable vs going log pmf -> log inverse logit hazard (which is the scale everything in say epinowcast and other Gunther like nowcast approaches operate on if being try hard).

I assume that this will be the case as we need the ccdf (i.e. see https://github.com/epinowcast/epinowcast/blob/12c24bf509704c4384902408f36dff86e73fef5b/inst/stan/functions/discretised_logit_hazard.stan#L195).

Non-parametric reference date effect

Non-parametric report date effect

I'm less convinced that these ideas live here and/or approaching this way provides substantial benefits vs the discrete hazard approach take in epinowcast though open to being convinced. I think even if convinced I would push for hazard tooling that doesn't work directly on the dists to be out of scope and supported downstream or in another module.

SamuelBrand1 commented 2 weeks ago

The approach in epinowcast documentation is a proportional hazards approach, i.e. it has a regression directly on the logit(hazard) that represent shifting the hazard by a log odds-ratio.

The infrastructure you have built in primarycensoreddist lends itself towards a AFT approach because we have the ccdf functions directly. In terms of compute, in either approach we need to calculate the hazard/time acceleration effect sizes. In the epinowcast approach we also need to work with the normalisation (e.g. we can't use a spare data trick to calc only parts of the log-pmf because the norm factor also depends on covariates); with AFT we don't need that.

Proportional hazards and AFT aren't equivalent, and its a bit of design decision to make (e.g. this discussion https://stats.stackexchange.com/questions/621041/choice-between-proportional-hazard-and-accelerated-failure-time-survival-models). The historical preference for proportional hazard survival analysis is, at least partly, influenced by the ability to scale out the baseline hazard rate and concentrate on the covariate effect sizes. I'm not sure thats such a relevant consideration here.

From Cox himself https://estadisticafciencias.wordpress.com/wp-content/uploads/2019/08/cox_d._r_analysis_of_survival_dataz-lib.org_.pdf Chapter 5 is useful reference.

seabbs commented 2 weeks ago

I think in epinowcast you can implement non-proportional hazards using the non-parametric component of the reference module but I could be wrong.

(e.g. we can't use a spare data trick to calc only parts of the log-pmf because the norm factor also depends on covariates); with AFT we don't need that.

This could potentially be a big advantage

epinowcast / primarycensored