drizopoulos / JM

Joint Models for Longitudinal & Survival Data under Maximum Likelihood
34 stars 7 forks source link

`survfitJM` expects newdata to be ordered by subject, and by time within each subject, else things go wrong #40

Open erikvona opened 2 years ago

erikvona commented 2 years ago

As said, survfitJM expects the data frame newdata to be ordered by subject, and by time within each subject, else strange things happen. This is not documented, as far as I could find.

An example based on chapter 7 of Joint Models for Longitudinal and Time-to-Event Data with Applications in R:

The code:

# Load packages JM and lattice
library("JM")
library("lattice")
set.seed(123) # we set the seed for reproducibility, early to get reproducible scrambling
# scramble rows
pbc2 <- pbc2[sample(seq_len(NROW(pbc2))), ]

# indicator for the composite event for the PBC dataset
pbc2$status2 <- as.numeric(pbc2$status != "alive")
pbc2.id$status2 <- as.numeric(pbc2.id$status != "alive")

# indicator for abnormal prothrombin time
pbc2.id$Pro <- with(pbc2.id, factor(prothrombin >= 10 & prothrombin <= 13,
                                    labels = c("Abnormal", "Normal")))
pbc2$Pro <- rep(pbc2.id$Pro, tapply(pbc2$id, pbc2$id, length))

#################
# Section 7.1.3 #
#################

lmeFitBsp.pbc <- lme(
  fixed = log(serBilir) ~ bs(year, 4, Boundary.knots = c(0, 15)),
  random = list(
    id = pdDiag(form = ~ bs(year, 4, Boundary.knots = c(0, 15)))),
  data = pbc2)

coxFit.pbc <- coxph(Surv(years, status2) ~ drug + Pro,
                    data = pbc2.id, x = TRUE)

jointFitBsp.pbc <- jointModel(lmeFitBsp.pbc, coxFit.pbc,
                              timeVar = "year", method = "piecewise-PH-aGH")

# conditional survival probabilities for Patient 2 using Monte Carlo

survPrbs <- survfitJM(jointFitBsp.pbc, newdata = pbc2[pbc2$id == 2, ])
survPrbs
survPrbs$last.time[[1]] # 0
max(survPrbs$obs.times[[1]]) # but actually 8.8

plot(survPrbs, include.y = TRUE) # See image below, longitudinal outcome goes through survival plot
# estimate for longitudinal outcome is a line that goes back and forth through time, crossing itself
# When not including y, it may not be obvious that these predictions are incorrect

image

And when we estimate for multiple subjects, it goes even more wrong

survPrbsAll <- survfitJM(jointFitBsp.pbc, newdata = pbc2[pbc2$id %in% 1:5, ])
length(survPrbsAll$fitted.y[[1]]) # 7 ys fitted for first subject
length(survPrbsAll$fitted.times[[1]]) # However, these are fitted on 3 timepoints
plot(survPrbs, include.y = TRUE) # And we can't plot that

A workaround would be to set the order of newdata within survfitJM.jointModel:

newdata <- newdata[order(newdata[[idVar]], newdata[[object$timeVar]]), ]

Or to check order and error if it's not ordered properly:

stopifnot("newdata must be ordered by subject, and by time within each subject" = identical(order(newdata[[idVar]], newdata[[object$timeVar]]), seq_len(NROW(newdata)))

Alternatively, the code could be rewritten to no longer depend on ordering, but that's out of my league.

I can submit a single-line pull request for either of these fixes if that helps, however, I'm unsure if other functions are affected by the same behaviour.