Closed WetRobot closed 8 years ago
It is notable that creating the variables by hand in data works just fine:
> lung$agespl <- ns(lung$age, df = 4)
> lung$sex <- factor(lung$sex)
> fl4 <- flexsurvspline(Surv(time, event = status) ~ sex + agespl, data = lung, k = 2)
>
> su <- summary(fl4, newdata = lung, B = 0)
Warning message:
In form.model.matrix(object, as.data.frame(newdata)) :
Covariates "inst","time","status","age","ph.ecog","ph.karno","pat.karno","meal.cal","wt.loss" unknown, ignoring
> su <- summary(fl4, newdata = lung[1,], B = 0)
Warning message:
In form.model.matrix(object, as.data.frame(newdata)) :
Covariates "inst","time","status","age","ph.ecog","ph.karno","pat.karno","meal.cal","wt.loss" unknown, ignoring
Thanks, confirmed and on the todo list. There are (at least) two different bugs here, both in form.model.matrix.
One results in models with spline-type predictors not working, as in the ns() example, I think since the linear predictor for the new data depends on a basis that was calculated using the old data. Not immediately sure of the remedy.
The other is from a line that just identifies the names of variables in a formula by stripping off the brackets, which doesn't work for your cut(age,4) example. That should be easier to fix.
Perhaps the warning is inappropriate as well, I see now that predict.lm, for example, doesn't complain when "newdata" contains variables that are not in the model.
I think just using the variables existing in the formula silently would be appropriate.
I'm sure you've already thought of this, but for the second problem all.vars()
/ all.names()
are your friends.
For the first problem... perhaps if information about the original spline fit is saved somewhere in the resulting model object, one could pick that information and apply it somewhere when forming the new model.matrix. Usage of a spline basis transform on a variable could be detected e.g. using all.names()
? I think there is better way though.
In any case, in the outputted model object I see e.g.
> spl <- flexsurvspline(Surv(recyrs, censrec) ~ group + splines::ns(1:nrow(bc)), data=bc, k=1, scale="odds")
>
> str(spl)
$ data :List of 3
..$ Y : num [1:686, 1:6] 3.68 4.32 4.82 3.16 2.65 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:686] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:6] "time" "status" "start" "stop" ...
..$ m :'data.frame': 686 obs. of 4 variables:
.. ..$ Surv(recyrs, censrec) : Surv [1:686, 1:2] 3.6767+ 4.3233+ 4.8219+ 3.1562+ 2.6493+ 1.7233+ 1.2630+ 2.7151 2.0767+ 4.4301+ ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : chr [1:2] "time" "status"
.. .. ..- attr(*, "type")= chr "right"
.. ..$ group : Factor w/ 3 levels "Good","Medium",..: 1 1 1 1 1 1 1 1 1 1 ...
.. ..$ splines::ns(1:nrow(bc)): ns [1:686, 1] 0 0.00117 0.00234 0.00351 0.00468 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : NULL
.. .. .. ..$ : chr "1"
.. .. ..- attr(*, "degree")= int 3
.. .. ..- attr(*, "knots")= num(0)
.. .. ..- attr(*, "Boundary.knots")= int [1:2] 1 686
.. .. ..- attr(*, "intercept")= logi FALSE
.. .. ..- attr(*, "class")= chr [1:3] "ns" "basis" "matrix"
So meta information about the spline basis formation is recorded, though it may be cumbersome to use.
It does not seem like an entirely bad thing to simply throw an informative error if splines are used in the model supplied to summary.flexsurvreg
.
I think I've fixed these in the latest commit. I did what predict.lm does, thus form.model.matrix in flexsurvreg.R is vastly simplified.
In the spline model, the knots are stored in "predvars" in the terms attribute of the model frame. This is interpreted by model.frame.default when forming the new model frame for newdata.
In the case of cut(), I think the second summary from your model fl2 shouldn't work, as written. cut(age,4) uses the quantiles of the data to define the cut points, but the quantiles of lung[1,] will not match those of the original data, lung. Thus the factor levels in newdata will be different from the levels in the original data, and you'll get the error "factor cut(age, 4) has new level...". Though if you supply explicit break points in cut() it should be OK. This is the same behaviour as predict.lm.
It no longer warns when extra variables are supplied in newdata. It should still complain when a required variable is missing from newdata, or if a variable that was originally a factor is numeric in newdata, or if a factor has different levels in newdata.
One change is that it now insists that factors are supplied as factors in newdata, or as character variables that match the original factor levels. It won't try to coerce numerics to factors any more - some may see that as a downgrade but I think it's more proper behaviour, and consistent with other packages.
Thanks again for the report!
Great stuff, happy to help!
It appears that variables on the right-hand side of the formula in
flexsurvspline
at least do not work properly insummary.flexsurvreg
:Limiting to only used variables did not solve the problem. I did not test using
X
instead ofnewdata
. This issue exists in both the newest CRAN (0.6) release and today's dev version.