chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
53 stars 28 forks source link

Negative hazard from hsurvspline #113

Closed Hark-Jin closed 2 years ago

Hark-Jin commented 2 years ago

Hi there,

I've encountered an issue when trying to calculate the raw hazard from a spline model using the hsurvspline function. In certain scenarios (3 knots in my analysis) where t is very small, sometimes we could get negative raw hazard from the function. Just wondering if that's possible to get a raw hazard that is negative.

Any explanation or suggestion would be much appreciated! Thanks


Example here:

hazv[[k]][j,]=hsurvspline(time, gamma=alpha,knots = dataset2$knots_logtime,scale = paste0(scenario))

time [1] 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 [20] 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90

alpha [1] -4.047342 2.939003 -1.326890 3.381928 -2.590202

dataset2$knots_logtime [1] -1.113090 1.003165 1.451829 2.120673 3.261408

scale [1] "odds"

Example of negative hazard: hazv[[2]][65,] [1] -5.257192e-02 -2.475285e-02 -1.593112e-02 -1.165337e-02 -9.143578e-03 -7.499765e-03 -6.195762e-03 -4.259204e-03 [9] -2.067614e-03 9.852299e-05 2.142156e-03 4.040872e-03 5.801855e-03 7.442824e-03 8.983965e-03 1.044465e-02

chjackson commented 2 years ago

Could you make the example reproducible please? I can't tell from this exactly what times, parameter values, knots and scale produced that negative hazard.

Hark-Jin commented 2 years ago

Hi Chris,

Thanks so much for your quick reply. Please see attached for an example below, feel free to let me know if you have any comment or suggestion!

Input data: time=seq(0.05,0.5,0.05) alpha = c(-3.76245380,-0.08924915,-2.97106783,5.08610636,-2.84812296) knots_time = c(-1.113090,1.003165,1.451829,2.120673,3.261408)

Output hazard: hsurvspline(time, gamma=alpha,knots = knots_time,scale = 'odds') [1] -5.257192e-02 -2.475285e-02 -1.593112e-02 -1.165337e-02 -9.143579e-03 -7.499765e-03 -6.195764e-03 -4.259212e-03 [9] -2.067629e-03 9.850067e-05

chjackson commented 2 years ago

Thanks, OK I think I can guess what is happening. The Royston-Parmar model defines the log cumulative hazard function as a spline function of log time. The problem is that a cumulative hazard function must be a non-decreasing function of time, but there is no easy way of defining mathematically what set of spline coefficients correspond to valid cumulative hazard functions. So some values of gamma will give models that don't make sense for all possible times.

dsurvspline sets the PDF to zero for invalid values of gamma. This is used when fitting models, so the likelihood of an observed dataset should be zero if the coefficients are invalid. For consistency, I should probably change hsurvspline to behave in the same way.

I am not sure whether NaN would be a more appropriate value to return. It's a fundamental problem with this class of models. Models defined with spline functions of the log hazard (see the vignette) are better-behaved in this respect, but they are slower to fit because the likelihood needs numerical integration to calculate.

Hark-Jin commented 2 years ago

Thanks so much for this clarification, I really appreciated it! The explanation is super helpful.

I agree that NaN might be a more appropriate value to return if we have input invalid gamma coefficients in some extreme scenarios.