chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
55 stars 28 forks source link

summary.flexsurvreg mishandles expression-form variables in formula? #7

Closed WetRobot closed 8 years ago

WetRobot commented 8 years ago

It appears that variables on the right-hand side of the formula in flexsurvsplineat least do not work properly in summary.flexsurvreg:

> library(flexsurv)
> 
> data(lung)
> 
> fl1 <- flexsurvspline(Surv(time, event = status) ~ factor(sex) + ns(age), data = lung, k = 2)
> 
> su <- summary(fl1, newdata = lung, B = 0)
Warning message:
In form.model.matrix(object, as.data.frame(newdata)) :
  Covariates "inst","time","status","ph.ecog","ph.karno","pat.karno","meal.cal","wt.loss" unknown, ignoring
> su <- summary(fl1, newdata = lung[1,], B = 0)
Error in qr.default(t(const)) : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In form.model.matrix(object, as.data.frame(newdata)) :
  Covariates "inst","time","status","ph.ecog","ph.karno","pat.karno","meal.cal","wt.loss" unknown, ignoring
> 
> 
> fl2 <- flexsurvspline(Surv(time, event = status) ~ factor(sex) + cut(age,4), data = lung, k = 2)
> 
> su <- summary(fl2, newdata = lung, B = 0)
Error in `[.data.frame`(newdata, , facnames[i]) : 
  undefined columns selected
In addition: Warning message:
In form.model.matrix(object, as.data.frame(newdata)) :
  Covariates "inst","time","status","ph.ecog","ph.karno","pat.karno","meal.cal","wt.loss" unknown, ignoring
> su <- summary(fl2, newdata = lung[1,], B = 0)
Error in `[.data.frame`(newdata, , facnames[i]) : 
  undefined columns selected
In addition: Warning message:
In form.model.matrix(object, as.data.frame(newdata)) :
  Covariates "inst","time","status","ph.ecog","ph.karno","pat.karno","meal.cal","wt.loss" unknown, ignoring
> 
> 
> 
> fl3 <- flexsurvspline(Surv(time, event = status) ~ sex + age, data = lung, k = 2)
> 
> su <- summary(fl3, newdata = lung, B = 0)
Warning message:
In form.model.matrix(object, as.data.frame(newdata)) :
  Covariates "inst","time","status","ph.ecog","ph.karno","pat.karno","meal.cal","wt.loss" unknown, ignoring
> su <- summary(fl3, newdata = lung[1,], B = 0)
Warning message:
In form.model.matrix(object, as.data.frame(newdata)) :
  Covariates "inst","time","status","ph.ecog","ph.karno","pat.karno","meal.cal","wt.loss" unknown, ignoring

Limiting to only used variables did not solve the problem. I did not test using X instead of newdata. This issue exists in both the newest CRAN (0.6) release and today's dev version.

WetRobot commented 8 years ago

It is notable that creating the variables by hand in data works just fine:

> lung$agespl <- ns(lung$age, df = 4)
> lung$sex <- factor(lung$sex)
> fl4 <- flexsurvspline(Surv(time, event = status) ~ sex + agespl, data = lung, k = 2)
> 
> su <- summary(fl4, newdata = lung, B = 0)
Warning message:
In form.model.matrix(object, as.data.frame(newdata)) :
  Covariates "inst","time","status","age","ph.ecog","ph.karno","pat.karno","meal.cal","wt.loss" unknown, ignoring
> su <- summary(fl4, newdata = lung[1,], B = 0)
Warning message:
In form.model.matrix(object, as.data.frame(newdata)) :
  Covariates "inst","time","status","age","ph.ecog","ph.karno","pat.karno","meal.cal","wt.loss" unknown, ignoring
chjackson commented 8 years ago

Thanks, confirmed and on the todo list. There are (at least) two different bugs here, both in form.model.matrix.

One results in models with spline-type predictors not working, as in the ns() example, I think since the linear predictor for the new data depends on a basis that was calculated using the old data. Not immediately sure of the remedy.

The other is from a line that just identifies the names of variables in a formula by stripping off the brackets, which doesn't work for your cut(age,4) example. That should be easier to fix.

Perhaps the warning is inappropriate as well, I see now that predict.lm, for example, doesn't complain when "newdata" contains variables that are not in the model.

WetRobot commented 8 years ago

I think just using the variables existing in the formula silently would be appropriate.

I'm sure you've already thought of this, but for the second problem all.vars() / all.names() are your friends.

For the first problem... perhaps if information about the original spline fit is saved somewhere in the resulting model object, one could pick that information and apply it somewhere when forming the new model.matrix. Usage of a spline basis transform on a variable could be detected e.g. using all.names()? I think there is better way though.

In any case, in the outputted model object I see e.g.

> spl <- flexsurvspline(Surv(recyrs, censrec) ~ group + splines::ns(1:nrow(bc)), data=bc, k=1, scale="odds")
> 
> str(spl)
$ data          :List of 3
  ..$ Y  : num [1:686, 1:6] 3.68 4.32 4.82 3.16 2.65 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:686] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:6] "time" "status" "start" "stop" ...
  ..$ m  :'data.frame': 686 obs. of  4 variables:
  .. ..$ Surv(recyrs, censrec)  : Surv [1:686, 1:2] 3.6767+ 4.3233+ 4.8219+ 3.1562+ 2.6493+ 1.7233+ 1.2630+ 2.7151  2.0767+ 4.4301+ ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : chr [1:2] "time" "status"
  .. .. ..- attr(*, "type")= chr "right"
  .. ..$ group                  : Factor w/ 3 levels "Good","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ splines::ns(1:nrow(bc)): ns [1:686, 1] 0 0.00117 0.00234 0.00351 0.00468 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : chr "1"
  .. .. ..- attr(*, "degree")= int 3
  .. .. ..- attr(*, "knots")= num(0) 
  .. .. ..- attr(*, "Boundary.knots")= int [1:2] 1 686
  .. .. ..- attr(*, "intercept")= logi FALSE
  .. .. ..- attr(*, "class")= chr [1:3] "ns" "basis" "matrix"

So meta information about the spline basis formation is recorded, though it may be cumbersome to use.

It does not seem like an entirely bad thing to simply throw an informative error if splines are used in the model supplied to summary.flexsurvreg.

chjackson commented 8 years ago

I think I've fixed these in the latest commit. I did what predict.lm does, thus form.model.matrix in flexsurvreg.R is vastly simplified.

In the spline model, the knots are stored in "predvars" in the terms attribute of the model frame. This is interpreted by model.frame.default when forming the new model frame for newdata.

In the case of cut(), I think the second summary from your model fl2 shouldn't work, as written. cut(age,4) uses the quantiles of the data to define the cut points, but the quantiles of lung[1,] will not match those of the original data, lung. Thus the factor levels in newdata will be different from the levels in the original data, and you'll get the error "factor cut(age, 4) has new level...". Though if you supply explicit break points in cut() it should be OK. This is the same behaviour as predict.lm.

It no longer warns when extra variables are supplied in newdata. It should still complain when a required variable is missing from newdata, or if a variable that was originally a factor is numeric in newdata, or if a factor has different levels in newdata.

One change is that it now insists that factors are supplied as factors in newdata, or as character variables that match the original factor levels. It won't try to coerce numerics to factors any more - some may see that as a downgrade but I think it's more proper behaviour, and consistent with other packages.

Thanks again for the report!

WetRobot commented 8 years ago

Great stuff, happy to help!