chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
53 stars 28 forks source link

Bare minimum flexsurvreg object for making predictions #127

Closed mattwarkentin closed 2 years ago

mattwarkentin commented 2 years ago

Hi @chjackson,

I am trying to package a flexsurvreg object into an R package in order to share my model for others to use. However, I must strip the object of any raw data since we are not allowed to share this data. Brute force removal of the data from the object triggers an error:

# x is a flexsurvreg object
x$data <- NULL

predict(data, ...)
Covariate contrasts matrix `X` should be supplied explicitly if the data are not included in the model object

Any suggestions for how to strip the data from the model object without affecting the integrity of the model to make predictions?

mattwarkentin commented 2 years ago

Here is a minimal example:

library(flexsurv)
#> Loading required package: survival
x <- flexsurvreg(Surv(time, status) ~ 1, data = cancer, dist = 'exp')
x$data <- NULL
summary(x)
#> Error in newdata_to_X(x, newdata, X, na.action): Covariate contrasts matrix `X` should be supplied explicitly if the data are not included in the model object
mattwarkentin commented 2 years ago

The relevant code: https://github.com/chjackson/flexsurv-dev/blob/c8e2deb570a1ddd8ab0901994574753e33731fae/R/summary.flexsurvreg.R#L214-L215

mattwarkentin commented 2 years ago

While I'm thinking about this - in addition to the data privacy point, when the data are "large", they make up an enormous part of the model object size, which also makes it difficult to package and share:

library(flexsurv)
#> Loading required package: survival
library(purrr)

d <- data.frame(
  time = rexp(1e7),
  status = sample(0:1, 1e7, TRUE)
)

x <- flexsurvreg(Surv(time, status) ~ 1, data = d, dist = 'exp')

map_dbl(x, object.size) %>% 
  sort(decreasing = TRUE) %>% 
  map_chr(prettyunits::pretty_bytes)
#>           data        logliki           dfns          dlist            opt 
#>      "2.08 GB"    "720.00 MB"    "208.19 kB"     "19.33 kB"      "2.02 kB" 
#> concat.formula   all.formulae           call            res          res.t 
#>      "1.33 kB"      "1.29 kB"      "1.23 kB"        "840 B"        "840 B" 
#>            cov             mx          ncovs       ncoveffs       basepars 
#>        "624 B"        "328 B"         "56 B"         "56 B"         "56 B" 
#>            AIC              N         events          trisk   coefficients 
#>         "56 B"         "56 B"         "56 B"         "56 B"         "56 B" 
#>          npars        optpars         loglik             cl      datameans 
#>         "56 B"         "56 B"         "56 B"         "56 B"         "48 B" 
#>            aux        covpars      fixedpars 
#>          "0 B"          "0 B"          "0 B"

Being able to strip out data and logliki would be great for slimming down the object, if it didn't break downstream functionality (i.e. predictions).

chjackson commented 2 years ago

The error message is telling you something here. The issue is that the default behaviour of predict.flexsurvreg is to predict the mean survival for every person in the observed data, i.e. with covariate values set to the values of each person in turn. That's the typical way that predict methods seem to behave, e.g. predict.lm, predict.glm produce "fitted values" from the regression. If it doesn't have access to the observed data, it asks you instead to provide covariate values to use in prediction.

How would you want predict.flexsurvreg to behave if the data aren't included?

fthielen commented 2 years ago

I have a similar problem that I posted on stackoverflow (without any response so far unfortunately).

For my project, I want to use the fitted objects for several distributions to estimate new transition-specific parameters for a multi-state model. Without the underlying data this is a nightmare, and my “workaround” is to define separate functions for each distribution that take coefficients from the model objects (without the underlying data). For the Weibull model this looks like this:

library("flexsurv")

# Fit model
x <- flexsurvreg(Surv(time, status) ~ 1, data = cancer, dist = "weibull")

# Check coefficients
x$coefficients

# The following is part of an own function environment with several parameteric models
cf <- as.matrix(x$coefficients)

if (x$dlist$name == "weibull.quiet") {

  shape <- exp(cf["shape", 1])
  scale <- exp(cf["scale", 1])
               }

# Survival with coefficients
st_weibull <- 1 - pweibull(q = 0:100,
                           shape = shape,
                           scale = scale)

st_weibull

If you want to adjust the coefficients based on other parameters (e.g., age etc.) you need to add this to the right location parameter.

There is certainly a much better solution to this, but this is all I can come up with.

chjackson commented 2 years ago

Setting x$data <- NULL is enough to remove the data, though admittedly this is not documented.

In the development version, the summary.flexsurvreg function was rewritten, and part of the rewrite allowed it to handle better objects that were stripped of data. Though in those cases, as explained above, you would have to provide your own covariate values to do summaries or predictions, because the default behaviour is to take them from the data.

chjackson commented 2 years ago

Let me know if you find any other flexsurv functions which don't work with model objects stripped of data in this way.

fthielen commented 2 years ago

I would love to be able to use use pars.fmsm() especially when newdata is provided within the function call for instance.

Would you, @chjackson, mind to briefly answer my SO question here: https://stackoverflow.com/questions/68955039/de-identifying-survival-or-flexsurvreg-objects-in-r

mattwarkentin commented 2 years ago

Sorry, @chjackson, maybe I'm missing something, but based on what you've said it seems like this should work...but it doesn't. I am providing newdata but it still errors out.

library(flexsurv)
#> Loading required package: survival

df <- data.frame(
  time = rexp(1e5),
  status = sample(0:1, 1e5, TRUE),
  x1 = factor(sample(1:2, 1e5, TRUE))
)

fit <- flexsurvreg(Surv(time, status) ~ x1, data = df, dist = 'exp')

fit$data <- NULL

predict(fit, newdata = df, times = 2, type = 'survival')
#> Error in newdata_to_X(x, newdata, X, na.action): Covariate contrasts matrix `X` should be supplied explicitly if the data are not included in the model object

summary(fit, newdata = df, t = 2, type = 'survival')
#> Error in newdata_to_X(x, newdata, X, na.action): Covariate contrasts matrix `X` should be supplied explicitly if the data are not included in the model object
chjackson commented 2 years ago

Sorry @mattwarkentin , looks like the job was half-finished! The procedure for converting newdata to the "model matrix" X (i.e. with factors expanded to contrasts, etc) still used some code that unnecessarily relied on some attributes defining the model, that were stored alongside the data. I've separated these attributes from the data now, so this should work in https://github.com/chjackson/flexsurv-dev/commit/cc6d802721c5d2f8d7be9399f878c27a3629211a. Your models will need to be refitted under the new version. (and set ci=FALSE in your example to allow it to run instantly!).

mattwarkentin commented 2 years ago

Thats great! Thanks for the quick fix.

I assume it is also safe to remove logliki from the flexsurvreg object as well? In the interest of slimming the object as much as is reasonably worthwhile.

chjackson commented 2 years ago

Yes, logliki isn't currently used in any other code. (Ideally there'd be a specialised method that does all these stripping tasks, and identifies a model that has been stripped.)

chjackson commented 2 years ago

@fthielen try the latest version https://github.com/chjackson/flexsurv-dev/commit/68502321b8e38787c41b6a7bf3f275611555aa6f , pars.fmsm should now work with data-stripped models, e.g.

set.seed(1)
bosms3$x <- rnorm(nrow(bosms3))

## proportional hazards between transitions
bexp.cov <- flexsurvreg(Surv(years, status) ~ trans + x, data=bosms3, dist="exp")
tmat <- rbind(c(NA,1,2),c(NA,NA,3),c(NA,NA,NA))
bexp.cov$data <- NULL
pars.fmsm(bexp.cov, trans=tmat, newdata=list(x=1))

## independent models for each transition 
bexp.list <- vector(3, mode="list")
for (i in 1:3) {
  bexp.list[[i]] <- flexsurvreg(Surv(years, status) ~ x, subset=(trans==i),
                                data = bosms3, dist = "exponential")
  bexp.list[[i]]$data <- NULL
}
pars.fmsm(bexp.list, trans=tmat, newdata=list(x=1))
mattwarkentin commented 2 years ago

I can confirm everything seems to be working great now. Thanks again, @chjackson.

fthielen commented 2 years ago

@fthielen try the latest version 6850232 , pars.fmsm should now work with data-stripped models, e.g.

set.seed(1)
bosms3$x <- rnorm(nrow(bosms3))

## proportional hazards between transitions
bexp.cov <- flexsurvreg(Surv(years, status) ~ trans + x, data=bosms3, dist="exp")
tmat <- rbind(c(NA,1,2),c(NA,NA,3),c(NA,NA,NA))
bexp.cov$data <- NULL
pars.fmsm(bexp.cov, trans=tmat, newdata=list(x=1))

## independent models for each transition 
bexp.list <- vector(3, mode="list")
for (i in 1:3) {
  bexp.list[[i]] <- flexsurvreg(Surv(years, status) ~ x, subset=(trans==i),
                                data = bosms3, dist = "exponential")
  bexp.list[[i]]$data <- NULL
}
pars.fmsm(bexp.list, trans=tmat, newdata=list(x=1))

@chjackson I am sorry to keep this open. Although the posted example works, my own models result in Error in terms.formula(formula, data = data) : argument is not a valid model. The provided model is of class flexsurvreg though. The only difference I can spot is that my models are fitted with data = subset(my_data, trans == i) in flexsurvreg() while you use subset = (trans==i). Could that be a problem?

chjackson commented 2 years ago

@fthielen Did you refit your models with the new version? Sorry if this is a hassle, if the models come from your collaborators. The issue is that there is some information about the model structure, needed by pars.fmsm, which was stored together with the data. In the old version, setting ...$data <- NULL would remove this information. In the new version, this information is kept with the model object, while the individual-level data are deleted.

fthielen commented 2 years ago

@chjackson I really appreciate your prompt responses and help in this and there is really no reason to be sorry.

No, I didn't request a new fitting of the model from the collaborators but will do so. Just to be on the safe side: they would have to install the felsxsurv-dev version as well, I believe?

It might take some time for them to get back, but I will be happy to let you know how this went here (or elsewhere), if you like (although I am sure that this will solve the issue after reading your explanation above).

chjackson commented 2 years ago

That's right - the models will need to be fitted with the latest version on github.

fthielen commented 2 years ago

@fthielen Did you refit your models with the new version? Sorry if this is a hassle, if the models come from your collaborators. The issue is that there is some information about the model structure, needed by pars.fmsm, which was stored together with the data. In the old version, setting ...$data <- NULL would remove this information. In the new version, this information is kept with the model object, while the individual-level data are deleted.

@chjackson I received the newly fitted models and am currently checking them. So far, they seem to work but in the next couple of days I will "dive deeper". With this post I already want to flag that calling the AIC() or BIC() function with the newly fitted models does not work on my end. I get: "Error: $ operator is invalid for atomic vectors"

With the previously fitted models AIC and BIC can be estimated.

In addition to that, I find that lines.flexsurvreg() returns the same error.

chjackson commented 2 years ago

Hopefully AIC and BIC should be fixed in the latest commit (shouldn't need a refit), or you could directly read the $AIC or the $loglik component of the model object. The AIC() and BIC() methods depended on the function logLik.msm which was trying to count the number of observations from the data, when this information is already stored in the $N component of the model object.

plot.flexsurvreg shouldn't work by design I think if the data aren't included, but lines.flexsurvreg should - I'll investigate

chjackson commented 2 years ago

I think lines.flexsurvreg should work now, though it will need a refit unless you explicitly specify the ci argument as TRUE or FALSE.