chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
54 stars 28 forks source link

Running pmatrix.fs in a different setting to that which the models were run, where it is not possible to store/transfer the whole model object #60

Closed dmcalli2 closed 2 years ago

dmcalli2 commented 5 years ago

Hello, Congratulations on a great package, with excellent documentation. The following problem has a workaround-type solution. I need to fit the models in one setting (a "safe haven" type environment), and combine them to estimate the transition probabilities in another where I have access to more computing power. Unfortunately, I cannot save the model object as a whole because of security/privacy issues (staff at the safe haven need to be able to review outputs for security/privacy issues prior to releasing them to me). Nonetheless, the pmatrix.fs function worked fine (including the CI function) after the following steps:-

  1. reduce the data elements of the model object to a single row (or single element in the case of a vector) all values NA
  2. reduce the individual log-likelihoods to a single NA
  3. Write the object to a text file with dput Then outside the "safe haven" environment
  4. dget the model objects
  5. Source the R scripts from the package (as pmatrix.fs couldnt find the functions otherwise)
  6. Run pmatrix

In case anyone else is interested in doing this, and in case there is any interest in incorporating this functionality into the package, I have included the little function I used to strip the data below. I simply applied this to a list of models.

StripData <- function(model_object){
  model_object$data <- lapply(model_object$data, 
                              function(x){
                                          x[] <- NA
                                          if(length(dim(x) == 2)) x[1,] else
                                            x[1]
                                        })
  model_object$logliki <- NA
  model_object$logliki <- model_object$logliki[1]
  model_object
}

Additional processes

I have subsequently gone through the model again, trying to make the object smaller to facilitate review by staff in the safe haven. I see that I can also set the following to NULL, and the model still works - this is for a model which includes covariates and when we estimate confidence intervals

c("call", "aux", "ncoveffs", "mx", "basepars", "AIC", "datameans", 
                "N", "events", "trisk", "concat.formula", "res", "coefficients", 
                "npars", "fixedpars", "loglik", "logliki", "cl")

Finally, I further shortened the text files with the models by dropping elements of the model objects which were identical() across models, keeping these in one model only.

dmcalli2 commented 5 years ago

Any help gratefully received! Please let me know if I should go to stackoverflow with this instead.

The approach outlined above FAILS when I include covariates in the model. I think it is an issue with Surv because when I write the object to text and read it back in (using either dump/source or dput/dget) I get the message that Error in Surv(years, status) : object 'years' not found.

Below is a minimal example based on an example in the manual with an extra covariate added.

# library(deSolve) # for ode function
library(flexsurv)
library(dplyr)

## Create list of models
bosms3 <- bosms3 %>% 
  mutate(var1 = rnorm(nrow(bosms3), mean = status*0.2, sd = 0.5))
bexp.list <- vector(3, mode="list")

for (i in 1:3) { 
  bexp.list[[i]] <- flexsurvreg(Surv(years, status) ~ var1,
                                subset= (trans ==i),
                                data = bosms3, dist="exp")
}
## Create transition matrix
tmat <- rbind(c(NA,1,2),c(NA,NA,3),c(NA,NA,NA))

## Write model object to a text file
dput(bexp.list, file = "temp.txt")
## Read model from text file
bexp.list_read <- dget("temp.txt")

# Succeeds
pmatrix.fs(bexp.list, tmat, newdata = data.frame(var1 = 1))
# Fails with message "Error in Surv(years, status) : object 'years' not found"
pmatrix.fs(bexp.list_read, tmat, newdata = data.frame(var1 = 1))
chjackson commented 5 years ago

That doesn't reproduce for me, the dput() command just puts the following text in "temp.txt", which makes dget() fail. Same happens when using dump()

list(structure(list(call = flexsurvreg(formula = Surv(years, 
    status) ~ var1, data = bosms3, subset = (trans == i), dist = "exp"), 
  ...

It's safer to save and load models using save() and load(), see help(dput).

dmcalli2 commented 5 years ago

Many thanks for getting back in touch. Sorry that doesn't reproduce. It does run as is on my machine, even after an R restart. I have modified the code to dump/source, which exhibits the same behavour (please see below) on my machine, and might reproduce on yours? I appreciate the point about save, and also read that in the manual, but I need to transfer the model out of the safe haven in a format that a human can easily review. I reviewed the output from save as a text file where ascii = TRUE but it is too large for a person to review. I think the problem I am having is that the environment is different when the dumped text is sourced, which is why Surv cannot find the time variable "years". I am currently stepping through the function to see if I can get round this issue. I have been able to get it to work with the model object dget from text if I supply the object produced by form.model.matrix() manually, rather than calculating it in the function. However, this only works where a single value is provided for each covariate, not in the case where a value is supplied for each allowed transition (although I can't seem to get that to work even for the original model object and the function run from the package). If you don't mind me doing so, I will post here if I can get it to work. I know of at least one other project (with Jim Lewsey who I think you may have met) which would benefit from being able to run pmatrix.fs in this way.

# library(deSolve) # for ode function
library(flexsurv)
library(dplyr)

## Create list of models
bosms3 <- bosms3 %>% 
  mutate(var1 = rnorm(nrow(bosms3), mean = status*0.2, sd = 0.5))
bexp.list <- vector(3, mode="list")

for (i in 1:3) { 
  bexp.list[[i]] <- flexsurvreg(Surv(years, status) ~ var1,
                                subset= (trans ==i),
                                data = bosms3, dist="exp")
}
## Create transition matrix
tmat <- rbind(c(NA,1,2),c(NA,NA,3),c(NA,NA,NA))

## Write model object to a text file WITH DUMP
bexp.list.fordump <- bexp.list
dump("bexp.list.fordump", file = "temp.R")
## Read model from text file WITH DUMP
source("temp.R")

# Succeeds
pmatrix.fs(bexp.list, tmat, newdata = data.frame(var1 = 1))
# Fails with message "Error in Surv(years, status) : object 'years' not found"
pmatrix.fs(bexp.list_read, tmat, newdata = data.frame(var1 = 1))
chjackson commented 5 years ago

But if dput and dump aren't guaranteed to save the full contents of the object, surely they're not doing what you need in terms of allowing scrutiny of those contents by a third party? Would it work to use save(), then after using load(), do lapply(bexp.list, unclass) to print the contents of the list of model objects in a readable format for review?

dmcalli2 commented 5 years ago

Thanks for the suggestion. I'll discuss that approach. There is a general worry I have come across in several safe havens about binary file formats, because accidentally letting data is possible, and that would potentially be catastrophic. Some even have hard rules about which formats can be used to export data/results. For GLM models it is easy, as I can just export coefficients and the covariance matrix. However, I suspect this may be a more general issue for many modelling projects. I will explore this issue with some people I know who work in information governance. Thanks for your time.

chjackson commented 2 years ago

Note: Version 2.2 of flexsurv allows the original data to be safely stripped from the fitted model object x with x$data <- NULL, which may be helpful to those working with private data.