fauvernierma / survPen

hazard and excess hazard modelling with multidimensional penalized splines
GNU General Public License v3.0
8 stars 2 forks source link

Problems with piecewise constant models #23

Closed rsund closed 1 year ago

rsund commented 2 years ago

First, thank you for very useful and nice piece of software that provides a lot of new tools for working with hazard.based models. In one of my applications I, however, encountered a problem that I was not able to solve by myself.

To provide you a reproducible example, I tried to replicate the problem with the data generated in your own vignette. So, here is first the code borrowed directly from the vignette to create the data and estimate the models presented in the vignette (I was working with R 4.1.1 and survPen 1.5.2 in a Windows 10 environment):

n <- 10000
don <- data.frame(num=1:n)

shape_men <- 0.90 # shape for men (first weibull parameter)
shape_women <- 0.90 # shape for women

scale_men <- 0.6 # second weibull parameter
scale_women <- 0.7

prop_men <- 0.5 # proportion of men

set.seed(50)
don$sex <- factor(sample(c("men","women"),n,replace=TRUE,prob=c(prop_men,1-prop_men)))
don$sex.order <- factor(don$sex,levels=c("women","men"),ordered=TRUE)

don$shape <- ifelse(don$sex=="men",shape_men,shape_women)
don$scale <- ifelse(don$sex=="men",scale_men,scale_women)

don$fu <- rweibull(n,shape=don$shape,scale=don$scale)
don$dead <- 1 # no censoring

hazard <- function(x,shape,scale){
exp(dweibull(x,shape=shape,scale=scale,log=TRUE) - pweibull(x,shape=shape,scale=scale,log.p=TRUE,lower.tail=FALSE))
}

nt <- seq(0.01,5,by=0.1)

mar1 <- c(3,3,1.5,0.5)
mgp1 <- c(1.5,0.5,0)

lwd1 <- 2
par(mfrow=c(1,2),mar=mar1,mgp=mgp1)
plot(nt,hazard(nt,shape_women,scale_women),type="l",
xlab="time",ylab="hazard",lwd=lwd1,main="Theoretical hazards",
ylim=c(0,max(hazard(nt,shape_women,scale_women),hazard(nt,shape_men,scale_men))))
lines(nt,hazard(nt,shape_men,scale_men),col="red",lwd=lwd1,lty=2)
legend("bottomleft",c("women","men"),lty=c(1,2),lwd=rep(lwd1,2),col=c("black","red"))

plot(nt,hazard(nt,shape_men,scale_men)/hazard(nt,shape_women,scale_women),type="l",
xlab="time",ylab="hazard ratio",lwd=lwd1,
ylim=c(0,2),
main="Theoretical HR men / women")

# knots for time
knots.t <- quantile(don$fu,seq(0,1,length=10))

# stratified analysis via the two models
m.men <- survPen(~smf(fu,knots=knots.t),t1=fu,event=dead,data=don[don$sex=="men",])
m.women <- survPen(~smf(fu,knots=knots.t),t1=fu,event=dead,data=don[don$sex=="women",])

# by variable with same.rho = FALSE
m.FALSE <- survPen(~sex + smf(fu,by=sex,same.rho=FALSE,knots=knots.t),t1=fu,event=dead,data=don)

# by variable with same.rho = TRUE
m.TRUE <- survPen(~sex + smf(fu,by=sex,same.rho=TRUE,knots=knots.t),t1=fu,event=dead,data=don)

# difference smooth via ordered factor by variable
m.difference <- survPen(~sex.order + smf(fu,knots=knots.t) +smf(fu,by=sex.order,same.rho=FALSE,knots=knots.t),t1=fu,event=dead,data=don)

All those models are working just fine. In the case of my application, I though that it would be nice to check how the constant and piecewise constant hazard (as demonstrated in the beginning of the vignette with datCancer data) would look like, but I got very suspicious results. I was also able to repeat the problem with the data generated above.

In the following, the constant hazard model appears to provide quite reasonable estimates, but piecewise constant model "extends" the scale (although it seems to capture something from the shape of true hazard).

Here is the code I used to create variables, estimate models and to draw figures of the estimated hazards. I had problems to use the code in the first part of the vignette to create a variable for piecewise constants (didn't work in the model), so I created required variable(s) separately with dplyr functions.

# constant hazard by variable
m.cst <- survPen(~sex,t1=fu,event=dead,data=don,n.legendre=200)

# piecewise constant hazard (borrowed from vignette won't work)
# f.pwcst <- ~cut(fu,breaks=seq(0,5,by=0.5),include.lowest=TRUE)
# m.pwcst <- survPen(f.pwcst,t1=fu,event=dead,data=don,n.legendre=200)    # Error in f(init, x[[i]]) : non-conformable arrays

library(dplyr)
don <- don %>%
  mutate(
    fucut=factor(if_else(fu>1,1,0)),
    fucut2=factor(
      case_when(
        fu < 0.5 ~ "<0.5",
        fu < 1 ~ "0.5-1",
        fu < 2 ~ "1-2",
        TRUE ~ "2+"
      )
    )
  )

# piecewise constant hazard
# m.pwcst <- survPen(~fucut,t1=fu,event=dead,data=don,n.legendre=200)

# piecewise constant hazard by variable
m.pwcst <- survPen(~sex + fucut2,t1=fu,event=dead,data=don,n.legendre=200)

# predictions
newt <- seq(0,5,by=0.1)
data.pred <- expand.grid(fu=newt,sex=c("women","men"))
data.pred$men <- ifelse(data.pred$sex=="men",1,0)
data.pred$women <- ifelse(data.pred$sex=="women",1,0)
data.pred$sex.order <- data.pred$sex # no need to reorder here as the model keeps track of the factor's structure

data.pred <- data.pred %>%
  mutate(
    fucut=factor(if_else(fu>1,1,0)),
    fucut2=factor(
      case_when(
        fu < 0.5 ~ "<0.5",
        fu < 1 ~ "0.5-1",
        fu < 2 ~ "1-2",
        TRUE ~ "2+"
      )
    )
  )

data.pred$haz.men <- predict(m.men,data.pred)$haz
data.pred$haz.women <- predict(m.women,data.pred)$haz
data.pred$haz.FALSE <- predict(m.FALSE,data.pred)$haz
data.pred$haz.TRUE <- predict(m.TRUE,data.pred)$haz
data.pred$haz.difference <- predict(m.difference,data.pred)$haz
data.pred$haz.cst <- predict(m.cst,data.pred)$haz
data.pred$haz.pwcst <- predict(m.pwcst,data.pred)$haz

# predicting hazard
ylim1 <- c(0,max(data.pred$haz.men,data.pred$haz.women,
data.pred$haz.FALSE,data.pred$haz.TRUE,data.pred$haz.difference))

par(mfrow=c(1,2),mar=mar1,mgp=mgp1)
plot(newt,data.pred[data.pred$sex=="men",]$haz.men,type="l",main="Men",lwd=lwd1,
ylim=ylim1,xlab="time since diagnosis",ylab="hazard")
lines(newt,data.pred[data.pred$sex=="men",]$haz.FALSE,col="red",lwd=lwd1,lty=2)
lines(newt,data.pred[data.pred$sex=="men",]$haz.TRUE,col="green3",lwd=lwd1,lty=4)
lines(newt,data.pred[data.pred$sex=="men",]$haz.difference,col="orange",lwd=lwd1,lty=5)
lines(newt,data.pred[data.pred$sex=="men",]$haz.cst,col="pink",lwd=lwd1,lty=1)
lines(newt,data.pred[data.pred$sex=="men",]$haz.pwcst,col="cyan",lwd=lwd1,lty=1)
lines(nt,hazard(nt,shape_men,scale_men),col="blue3",lty=3)
legend("bottomleft",c("stratified","same.rho=FALSE","same.rho=TRUE","difference smooth","true"),lty=c(1,2,4,5,3),
col=c("black","red","green3","orange","blue3"),lwd=c(rep(lwd1,4),1))

plot(newt,data.pred[data.pred$sex=="women",]$haz.women,type="l",main="Women",lwd=lwd1,
ylim=ylim1,xlab="time since diagnosis",ylab="hazard")
lines(newt,data.pred[data.pred$sex=="women",]$haz.FALSE,col="red",lwd=lwd1,lty=2)
lines(newt,data.pred[data.pred$sex=="women",]$haz.TRUE,col="green3",lwd=lwd1,lty=4)
lines(newt,data.pred[data.pred$sex=="women",]$haz.difference,col="orange",lwd=lwd1,lty=5)
lines(newt,data.pred[data.pred$sex=="women",]$haz.cst,col="pink",lwd=lwd1,lty=1)
lines(newt,data.pred[data.pred$sex=="women",]$haz.pwcst,col="cyan",lwd=lwd1,lty=1)
lines(nt,hazard(nt,shape_women,scale_women),col="blue3",lty=3)

As can be seen from the figures, the estimates from the piecewise constant model are much larger or smaller than the true hazard or the estimates from the other models for most parts. That is not what I expected as probably those estimates should be approximately "in the middle" of true hazard for each piece. Increasing the n.legendre -parameter did not help.

Despite of that, the hazard ratio is not too far from the true value as can be seen from the following code and figures:

# predicting hazard ratio men / women

HR.stratified <- data.pred[data.pred$sex=="men",]$haz.men / data.pred[data.pred$sex=="women",]$haz.women
HR.FALSE <- data.pred[data.pred$sex=="men",]$haz.FALSE / data.pred[data.pred$sex=="women",]$haz.FALSE
HR.TRUE <- data.pred[data.pred$sex=="men",]$haz.TRUE / data.pred[data.pred$sex=="women",]$haz.TRUE
HR.difference <- data.pred[data.pred$sex=="men",]$haz.difference / data.pred[data.pred$sex=="women",]$haz.difference
HR.cst <- data.pred[data.pred$sex=="men",]$haz.cst / data.pred[data.pred$sex=="women",]$haz.cst
HR.pwcst <- data.pred[data.pred$sex=="men",]$haz.pwcst / data.pred[data.pred$sex=="women",]$haz.pwcst

par(mfrow=c(1,1))
plot(newt,HR.stratified,type="l",main="Hazard ratio, Men/Women",lwd=lwd1,
ylim=c(0,2),xlab="time since diagnosis",ylab="hazard ratio")
lines(newt,HR.FALSE,col="red",lwd=lwd1,lty=2)
lines(newt,HR.TRUE,col="green3",lwd=lwd1,lty=4)
lines(newt,HR.difference,col="orange",lwd=lwd1,lty=5)
lines(newt,HR.cst,col="pink",lwd=lwd1,lty=1)
lines(newt,HR.pwcst,col="cyan",lwd=lwd1,lty=1)
abline(h=hazard(nt,shape_men,scale_men)/hazard(nt,shape_women,scale_women),lty=3,col="blue3")
legend("bottomright",c("stratified","same.rho=FALSE","same.rho=TRUE","difference smooth","true"),lty=c(1,2,4,5,3),
col=c("black","red","green3","orange","blue3"),lwd=c(rep(lwd1,4),1)) 

I understand that survPen has not been created to primarily estimate piecewise constant models, but as those were demonstrated in the vignette and as those may sometimes be useful for various purposes, it would be nice to know what is going wrong here. I also acknowledge that the problem could be in the way I defined the model or used it to create predictions, but if that is the case, some guidance would be highly appreciated.

fauvernierma commented 2 years ago

Hi and sorry for the huge delay

The problem comes from the fact that survPen has no way to know that the variable 'fucut2' actually represents time

For the integration part to work in the likelihood computation, everytime you want to specify a function of time you have to make the 't1' variable (so here 'fu') appear explicitly inside the formula

Using the following formula should work (in your first attempt the bug comes from the fact that the follow-up exceeds 5 years)

f.pwcst <- ~cut(fu,breaks=c(seq(0,5,by=0.5),8),include.lowest=TRUE)

Have a great day !