DenisRustand / INLAjoint

Joint modeling multivariate longitudinal and time-to-event outcomes with INLA
16 stars 0 forks source link

Undefined columns selected with assoc="CV" #32

Closed NewUser36 closed 2 weeks ago

NewUser36 commented 3 months ago

Hi,

I am having an issue currently with the most recent version of INLAjoint (24.5.19). I am also using the most recent version of INLA (24.05.10). I have a problem when trying to do a joint model.

Here is the data for a reproducible example:

library(INLAjoint)
library(INLA)

train_cox <- structure(list(tstart = c(0, 0, 0.028, 0.306, 0.334, 0.36, 0.388, 
0.429, 0.666, 0.684, 0.72, 0.748, 1.387, 1.442, 1.652, 1.706, 
1.748, 1.802, 1.963, 2.014, 2.066, 2.108, 2.162, 2.373, 2.426, 
0, 0.306, 0.307, 0.36, 0.667, 0.348, 0.486, 0.559, 0.614, 0.698, 
0.699, 1.003, 1.058, 1.364, 1.418, 1.727, 1.779, 2.086, 2.139, 
2.445, 2.5, 2.543, 2.716, 2.797, 2.86), tstop = c(0.011, 0.028, 
0.306, 0.334, 0.36, 0.388, 0.429, 0.666, 0.684, 0.72, 0.748, 
0.839, 1.442, 1.652, 1.706, 1.748, 1.802, 1.963, 2.014, 2.066, 
2.108, 2.162, 2.373, 2.426, 2.445, 0.306, 0.307, 0.36, 0.667, 
0.672, 0.486, 0.559, 0.614, 0.698, 0.699, 1.003, 1.058, 1.364, 
1.418, 1.727, 1.779, 2.086, 2.139, 2.445, 2.5, 2.543, 2.716, 
2.797, 2.86, 3.035), event = c(1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 
0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 
0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1), var1.surv = c(0L, 
0L, 0L, 0L, 1L, 0L, 2L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 
1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 2L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 
3L), id = c("1", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", 
"3", "3", "4", "4", "4", "4", "4", "5", "5", "5", "5", "5", "5", 
"5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", 
"5")), row.names = c(NA, -50L), class = "data.frame")

train_mixed <- structure(list(tstart = c(0, 0, 0.033, 0.066, 0.098, 0.13, 0.162, 
0.195, 0.228, 0.259, 0.292, 0.487, 0.519, 0.552, 0.585, 0.616, 
0.649, 0.682, 0.714, 0.746, 0.779, 0.811, 0.844, 0.876, 0, 0.032, 
0.065, 0.098, 0.13, 0.162, 0.195, 0.227, 0.26, 0.292, 0.325, 
0.358, 0.39, 0.422, 0.455, 0.487, 0.519, 0.551, 0.584, 0.617, 
0.649, 0.681, 0.098, 0.13, 0.163, 0.196, 0.228, 0.26, 0.292, 
0.325, 0.358, 0.39, 0.422, 0.455, 0.487, 0.519, 0.552, 0.585, 
0.617, 0.649, 0.682, 0.715, 0.747, 0.779, 0.812, 0.845, 0.877, 
0.909, 0.941, 0.974, 1.007, 1.038, 1.071), var1.long = c(0, 0, 
0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 
1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 
0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 
1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1), var2.long = c(0, 0, 0, 0, 
0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 
0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 
0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 
0, 1, 0, 1, 0, 0, 0, 0, 1, 0), long1 = c(1L, 4L, 2L, 1L, 1L, 
3L, 3L, 1L, 0L, 3L, 2L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
3L, 0L, 2L, 4L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 5L, 1L, 2L, 
3L, 0L, 0L, 0L, 4L, 2L, 0L, 2L, 2L, 1L, 6L, 1L, 3L, 0L, 1L, 1L, 
1L, 3L, 3L, 1L, 2L, 0L, 1L, 5L, 2L, 2L, 1L, 1L, 7L, 3L, 5L, 3L, 
3L, 1L, 3L, 6L, 1L, 1L, 3L, 0L), my_offset = c(83L, 81L, 85L, 
81L, 79L, 71L, 86L, 70L, 73L, 90L, 90L, 74L, 81L, 84L, 83L, 84L, 
86L, 84L, 87L, 70L, 77L, 90L, 79L, 76L, 76L, 80L, 87L, 78L, 89L, 
71L, 74L, 80L, 71L, 89L, 72L, 84L, 75L, 78L, 84L, 74L, 83L, 74L, 
80L, 80L, 86L, 70L, 82L, 70L, 88L, 73L, 85L, 86L, 76L, 75L, 89L, 
85L, 71L, 82L, 72L, 79L, 75L, 76L, 86L, 70L, 70L, 72L, 80L, 82L, 
87L, 74L, 85L, 77L, 71L, 73L, 75L, 86L, 78L), id = c("1", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", 
"3", "3", "3", "3", "3", "3", "3", "3", "3", "4", "4", "4", "4", 
"4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", 
"4", "4", "4", "4", "4", "5", "5", "5", "5", "5", "5", "5", "5", 
"5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", 
"5", "5", "5", "5", "5", "5", "5", "5", "5", "5")), row.names = c(NA, 
-77L), class = "data.frame")

When using the association parameter as the current slope, there is no error.

surv_inla = inla.surv(truncation=train_cox$tstart, time=train_cox$tstop, event=train_cox$event)

jm_CS <- joint(
  formSurv = surv_inla ~ var1.surv,
  formLong = long1 ~ 1 + offset(log(my_offset)) + tstart + var1.long + var2.long + (1+tstart|id),
  dataSurv = train_cox,
  dataLong = train_mixed, 
  id = "id", 
  timeVar="tstart",
  assoc="CS",
  family = "poisson"
)

Unfortunately, if I try to do the same thing with the current value (assoc="CV"), I get the following error :

jm_CV <- joint(
  formSurv = surv_inla ~ var1.surv,
  formLong = long1 ~ 1 + offset(log(my_offset)) + tstart + var1.long + var2.long + (1+tstart|id),
  dataSurv = train_cox,
  dataLong = train_mixed, 
  id = "id", 
  timeVar="tstart",
  assoc="CV",
  family = "poisson"
)
# Error in `[.data.frame`(data_cox[[m]], , grep(modelFE[[k]][[1]][j], names(data_cox[[m]]))[1]) : 
#   undefined columns selected

Looking at the source code of the joint function, it looks like the error appears around line 950, but I was not able to correct the error.

Could you please explain what I am doing wrong or change the source code if this is a bug?

Thank you!

DenisRustand commented 3 months ago

Hi,

The issue comes from the covariates included in the longitudinal submodel "var1.long" and "var2.long" that cannot be recovered properly in the survival submodel to account for the current value of the mixed-effects model in the survival submodel. It runs fine with CS as the derivative of the linear predictor does not includes these covariates in the survival submodel.

There are two challenges here, you have multiple lines (and multiple events per individual) in the survival submodel, which is uncommon (do you have recurrent events? then you may want to include a frailty term?). Second, these covariates included in the longitudinal submodel are time-dependent. In order to properly account for these covariates in the survival submodel, I think the best option at the moment is to add them in the survival dataset, which involves adding some lines to account for their evolution over time. For example if individual 1 has an event at time 3 and two measurements of the covariates at time 0 and 2, you need a line that accounts for the [0-2] interval with right censoring and the corresponding value of the covariates and a line that accounts for the [2-3] interval with left truncation and observed event time with the new corresponding value of the covariates.

I also made some changes in the source code to avoid some issues and properly account for the offset as it was not there yet, so you have to update to the latest GitHub version of the package.

Best, Denis Rustand

NewUser36 commented 3 months ago

Hi Denis,

Thank you for answering. The covariates for the longitudinal submodel are indeed time-dependent, and my original data (as does the data in the example above) does have "artificial censoring" as you describe in your example. Regarding the recurring events, this is a problem with the data I generated to create the example, sorry about that. There is no recurring event. Here is an update for the data (I only changed the column event in train_cox):

train_cox <- structure(list(tstart = c(0, 0, 0.028, 0.306, 0.334, 0.36, 0.388, 
0.429, 0.666, 0.684, 0.72, 0.748, 1.387, 1.442, 1.652, 1.706, 
1.748, 1.802, 1.963, 2.014, 2.066, 2.108, 2.162, 2.373, 2.426, 
0, 0.306, 0.307, 0.36, 0.667, 0.348, 0.486, 0.559, 0.614, 0.698, 
0.699, 1.003, 1.058, 1.364, 1.418, 1.727, 1.779, 2.086, 2.139, 
2.445, 2.5, 2.543, 2.716, 2.797, 2.86), tstop = c(0.011, 0.028, 
0.306, 0.334, 0.36, 0.388, 0.429, 0.666, 0.684, 0.72, 0.748, 
0.839, 1.442, 1.652, 1.706, 1.748, 1.802, 1.963, 2.014, 2.066, 
2.108, 2.162, 2.373, 2.426, 2.445, 0.306, 0.307, 0.36, 0.667, 
0.672, 0.486, 0.559, 0.614, 0.698, 0.699, 1.003, 1.058, 1.364, 
1.418, 1.727, 1.779, 2.086, 2.139, 2.445, 2.5, 2.543, 2.716, 
2.797, 2.86, 3.035), event = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), var1.surv = c(0L, 
0L, 0L, 0L, 1L, 0L, 2L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 
1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 2L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 
3L), id = c("1", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", 
"3", "3", "4", "4", "4", "4", "4", "5", "5", "5", "5", "5", "5", 
"5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", 
"5")), class = "data.frame", row.names = c(NA, -50L))

train_mixed <- structure(list(tstart = c(0, 0, 0.033, 0.066, 0.098, 0.13, 0.162, 
0.195, 0.228, 0.259, 0.292, 0.487, 0.519, 0.552, 0.585, 0.616, 
0.649, 0.682, 0.714, 0.746, 0.779, 0.811, 0.844, 0.876, 0, 0.032, 
0.065, 0.098, 0.13, 0.162, 0.195, 0.227, 0.26, 0.292, 0.325, 
0.358, 0.39, 0.422, 0.455, 0.487, 0.519, 0.551, 0.584, 0.617, 
0.649, 0.681, 0.098, 0.13, 0.163, 0.196, 0.228, 0.26, 0.292, 
0.325, 0.358, 0.39, 0.422, 0.455, 0.487, 0.519, 0.552, 0.585, 
0.617, 0.649, 0.682, 0.715, 0.747, 0.779, 0.812, 0.845, 0.877, 
0.909, 0.941, 0.974, 1.007, 1.038, 1.071), var1.long = c(0, 0, 
0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 
1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 
0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 
1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1), var2.long = c(0, 0, 0, 0, 
0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 
0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 
0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 
0, 1, 0, 1, 0, 0, 0, 0, 1, 0), long1 = c(1L, 4L, 2L, 1L, 1L, 
3L, 3L, 1L, 0L, 3L, 2L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
3L, 0L, 2L, 4L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 5L, 1L, 2L, 
3L, 0L, 0L, 0L, 4L, 2L, 0L, 2L, 2L, 1L, 6L, 1L, 3L, 0L, 1L, 1L, 
1L, 3L, 3L, 1L, 2L, 0L, 1L, 5L, 2L, 2L, 1L, 1L, 7L, 3L, 5L, 3L, 
3L, 1L, 3L, 6L, 1L, 1L, 3L, 0L), my_offset = c(83L, 81L, 85L, 
81L, 79L, 71L, 86L, 70L, 73L, 90L, 90L, 74L, 81L, 84L, 83L, 84L, 
86L, 84L, 87L, 70L, 77L, 90L, 79L, 76L, 76L, 80L, 87L, 78L, 89L, 
71L, 74L, 80L, 71L, 89L, 72L, 84L, 75L, 78L, 84L, 74L, 83L, 74L, 
80L, 80L, 86L, 70L, 82L, 70L, 88L, 73L, 85L, 86L, 76L, 75L, 89L, 
85L, 71L, 82L, 72L, 79L, 75L, 76L, 86L, 70L, 70L, 72L, 80L, 82L, 
87L, 74L, 85L, 77L, 71L, 73L, 75L, 86L, 78L), id = c("1", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "3", "3", "3", "3", 
"3", "3", "3", "3", "3", "3", "3", "3", "3", "4", "4", "4", "4", 
"4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", "4", 
"4", "4", "4", "4", "4", "5", "5", "5", "5", "5", "5", "5", "5", 
"5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", "5", 
"5", "5", "5", "5", "5", "5", "5", "5", "5", "5")), row.names = c(NA, 
-77L), class = "data.frame")

The datasets have the following characteristics:

For completion's sake, if I finish the example with the same code than last time (using INLAjoint version 24.5.24):

library(INLA) # 24.05.10
library(INLAjoint) # 24.5.24
surv_inla = inla.surv(truncation=train_cox$tstart, time=train_cox$tstop, event=train_cox$event)

jm_CS <- joint(
  formSurv = surv_inla ~ var1.surv,
  formLong = long1 ~ 1 + offset(log(my_offset)) + tstart + var1.long + var2.long + (1+tstart|id),
  dataSurv = train_cox,
  dataLong = train_mixed, 
  id = "id", 
  timeVar="tstart",
  assoc="CS",
  family = "poisson"
)
#Error in `[.data.frame`(dataSurv, , names(RW[RWi])) : 
#  undefined columns selected

jm_CV <- joint(
  formSurv = surv_inla ~ var1.surv,
  formLong = long1 ~ 1 + offset(log(my_offset)) + tstart + var1.long + var2.long + (1+tstart|id),
  dataSurv = train_cox,
  dataLong = train_mixed, 
  id = "id", 
  timeVar="tstart",
  assoc="CV",
  family = "poisson"
)
#Error in `[.data.frame`(dataSurv, , names(RW[RWi])) : 
#  undefined columns selected

I now get the same error twice, saying the columns cannot be found in the survival dataset (even with assoc="CS"!).

Let's look at an hypothetical example for my use case. Consider patients who go to the doctor periodically. Every visit to the doctor generates a line in the survival dataset. Here, tstart represents the first time a person goes to the doctor, and tstop the next time there is a visit (tstart of the next visit). There is an event when the patient has developed a certain disease. At each visit, we take measure of the patient's health (var1.surv in train_cox). We want to use the previous covariates to predict if there will be a disease at tstop.

Some of the patients will visit the doctor once a month, some once every three months, others maybe twice a year when they don't feel well. For our longitudinal submodel, I want to have the same number of observations per year for each patient (I understand it might not be clear why I want that; simply assume it makes sense). Let's say we want 12 lines per patient per year, so one line every month. If patients A visits the doctor twice in January, we would need to summarize the two lines in the survival submodel in one row. If patient B does not visit in January, February and March, we would need to generate 3 rows in the longitudinal submodel while there is none for this period in the survival submodel. In our data, the second scenario happens more often, and that's why we have more rows in the longitudinal data compared to the survival data (77 rows vs 50 rows).

Finally, when we summarize or create new rows for the longitudinal data, the variables are not exactly the same than in the survival data. To see why, consider once again the example of patient A who visited the doctor twice in January. If we measure the patient's pressure every visit, we would have two rows, each with one pressure measurements. To summarise into one line in our longitudinal data, we could use the average patient's pressure and the maximum pressure in January. In this case, there would be two variables in the longitudinal submodel "linked" to one variable in the survival submodel. In the data I provided, this could be the variable var1.surv with the two variables var1.long and var2.surv. This is why the variables are not the same and I cannot simply include them in the survival dataset. I hope it makes sense.

Do you think that could be implemented in INLAjoint? I know it can be done with JMbayes2 since I already have trained a few models with my data. But since speed is an important consideration in my situation, I would prefer to use INLAjoint.

Here is the code with JMbayes2:

library(JMbayes2) # 0.4-5
cox_model <-
  coxph(
    Surv(tstart, tstop, event) ~ var1.surv, 
    data=train_cox
  )
mixed_model <-
  mixed_model(
    fixed=long1 ~ 1 + offset(log(my_offset)) + tstart + var1.long + var2.long, 
    random= ~ 1 + tstart | id,
    data=train_mixed,
    family=poisson()
)
joint_model_1 <-
  jm(
    cox_model,
    mixed_model,
    time_var = "tstart"
)
# Data Descriptives:
# Number of Groups: 5       Number of events: 3 (6%)
# Number of Observations:
#   long1: 77
# 
#                  DIC     WAIC      LPML
# marginal    290.2345 310.5340 -225.6495
# conditional 279.7788 287.6561 -146.9054
# 
# Random-effects covariance matrix:
#                     
#        StdDev   Corr
# (Intr) 0.1851 (Intr)
# tstart 0.1997 0.0687
# 
# Survival Outcome:
#                 Mean  StDev    2.5%  97.5%      P   Rhat
# var1.surv    -1.1746 1.3895 -4.3471 0.9722 0.4112 1.0050
# value(long1) -0.6233 0.4695 -1.5503 0.2727 0.2084 1.2487
# 
# Longitudinal Outcome: long1 (family = poisson, link = log)
#                Mean  StDev    2.5%  97.5%      P   Rhat
# (Intercept)  0.4049 0.2252 -0.0554 0.8323 0.0800 1.0309
# tstart       0.1694 0.3828 -0.5496 0.9559 0.6822 1.0594
# var1.long    0.0868 0.1735 -0.2496 0.4249 0.6189 1.0017
# var2.long   -0.0688 0.1753 -0.4125 0.2688 0.7096 1.0047
# 
# MCMC summary:
# chains: 3 
# iterations per chain: 6500 
# burn-in per chain: 500 
# thinning: 1

Thank you once again for your time!

DenisRustand commented 3 months ago

Hi, I'm currently out of the office, but I will look into it as soon as I return (~ a week).

DenisRustand commented 3 months ago

Hi,

So I looked into the codes and here are some comments:

When you include time dependent covariates in the linear predictor for the longitudinal submodel and share this linear predictor in the survival submodel, then the time dependent covariates are inherently included in the survival submodel through the association. This is tricky to properly handle internally (I added an explicit warning message today) and thus the best solution at the moment is to decompose the follow-up in survival data to include these time dependent covariates, then it should work fine. I'll try to make it automatic in the future but it is not trivial and much easier to just add these covariates in the dataset before the fit. Note that adding these lines in the survival dataset should not affect the survival submodel except for the association part.

Let's say for example an individual has an event at tstop=3 and var1.surv has two different values for intervals 0-1 and 1-3 while var1.long has two different values for intervals 0-2 and 2-3. You want 3 lines in the survival dataset for intervals 0-1, 1-2 and 2-3, each associated to the values of covariates of the corresponding interval, using right censoring and left truncation to properly reconstruct the follow-up as you already do for var1.surv.

You can also share only the linear combination of the random effects from the linear predictor instead of the entire linear predictor, then those time dependent covariates are not included in the survival submodel (association 'SRE' instead of 'CV'). And for "CS", the error was due to an issue with a modification I made recently, it should also work with latest version.

According to your explanations, you'll need to be careful with interpretation as var1.surv and var1.long/var2.long seems related, when you use "CV" you have an effect of the covariates in the longitudinal submodel shared in survival submodel through the association and a residual effect captured by var1.surv. Using "SRE" could simplify the interpretation as you only share the random effects (i.e., individual deviation from the population average), which adjusts the survival model on the individual heterogeneity captured by the longitudinal submodel. Whatever you want to do should be feasible but it needs to be clear and I'm happy to help.

About the code with JMbayes2, I am not sure it is doing what you think it does.

  1. While the 'mixed_model' function properly accounts for your offset, it seems to be ignored by 'jm'.
  2. Convergence criteria are not satisfying (Rhat is too high), you probably need to increase the number of iterations.
  3. I am not sure that time dependent covariates from longitudinal submodel are properly handled in the survival submodel.

Best, Denis