Error from predict.ds when hr with covariates and se.fit = TRUE

LHMarshall commented 1 year ago

Error reproduces using the code below:

dat <- data.frame(distance=abs(rnorm(100)), x=rnorm(100))
hrcov <- Distance::ds(dat, formula=~x, key="hr")

hr <- Distance::ds(dat, key="hr")

hncov <- Distance::ds(dat, formula=~x)

nd <- data.frame(distance = 0.25, x=0.5)

predict(hr, nd, se.fit=TRUE)

predict(hncov, nd, se.fit=TRUE)

predict(hrcov, nd)

predict(hrcov, nd, se.fit=TRUE)

LHMarshall commented 1 year ago

Also note that the documentation says

But the behaviour of the above example code is identical whether the distance column is included in the newdata or not!

The x column seems to be removed and a distance column added somewhere before line 173 in predict.ds

> predict(hrcov, nd, se.fit=TRUE)
Called from: eval(expr, p)
Browse[1]> n
debug at C:/Users/lhm/Documents/Github/mrds/R/predict.ds.R#172: if (!all(fvars %in% colnames(newdata))) {
    stop("columns in `newdata` do not match those in fitted model\n")
}
Browse[2]> n
debug at C:/Users/lhm/Documents/Github/mrds/R/predict.ds.R#173: stop("columns in `newdata` do not match those in fitted model\n")
Browse[2]> fvars
[1] "x"
Browse[2]> newdata
    distance
101        0

LHMarshall commented 1 year ago

I cannot figure out where the newdata data.frame is being modified. There is no predict.ddf function so I cannot see what is being called before predict.ds and predict.ds does not appear to be called at all if se.fit = FALSE.

> tmp <- hrcov$ddf
> class(tmp)
[1] "ds"  "ddf"
> predict(tmp, nd, se.fit=TRUE)
Called from: eval(expr, p)
Browse[1]> n
debug at C:/Users/lhm/Documents/Github/mrds/R/predict.ds.R#103: model <- object
Browse[2]> newdata
    distance
101        0

# If I explicitly call predict.ds
> mrds:::predict.ds(tmp, nd, se.fit=TRUE)
Called from: eval(expr, p)
Browse[1]> n
debug at C:/Users/lhm/Documents/Github/mrds/R/predict.ds.R#103: model <- object
Browse[2]> newdata
    x
1 0.5

Also note that calling predict with se.fit = FALSE does not appear to initiate a call to predict.ds - break point on first line of function not activated!

On line 305 of predict.ds the predict function is called recursively. While calling mrds:::prectict.ds directly leads to the original new data successfully making it into the function at this point the newdata is modified for the second recursive call to predict.ds.

LHMarshall commented 1 year ago

@lenthomas do you know anything about what might be getting called before predict.ds and / or the way in which predict.ds is called recursively when se.fit = TRUE... it seems to be called recursively many times! Thanks!

LHMarshall commented 1 year ago

The loop starting in line 163 when performed for the shape parameter doesn't have the covariates associated with it and so they are dropped from the newdata data.frame. newdata is supposed to be reset in line 238 but because the covariates are now missing they are no longer included leading to the error above when the function is later called recursively.

Based on the comments in the function it would seem sensible to replace line 238 with

newdata <- newdata_save

but I am not really sure what the current line 238 is trying to achieve it currently reads:

      # reset newdata to be the right thing
      newdata <- newdata[(nrow(model_dat)+1):nrow(newdata), , drop=FALSE]

The newdata data set are appended to the end of the data stored in the model then this line is subsetting the data back down to the similar to the new data was passed in. The only thing I can see that might be different is that the existing code leaves in distend and distbegin columns.

LHMarshall commented 1 year ago

@lenthomas I think the above proposed change fixes this bug. I've tested for binned data, left truncated data, where no newdata is supplied to predict and factor variables... this makes a small difference to the data that is passed forward in the case of binned data but from reading the code and running both with old code and new code the distend and distbegin columns do not seem to be used in the function after this point - although I cannot be sure of what is happening within the DeltaMethod function, it is passed in to that.

LHMarshall commented 1 year ago

@lenthomas can I please give you another poke to have a look at my proposed fix to this. Thanks :)

lenthomas commented 1 year ago

@LHMarshall I agree this is rather complex. TBH it seems poorly written in the sense that newdata appears to be used in multiple contexts at different points in the function, and overwritten several times; good programming practice would call for giving it a new name each time it is used for (and changed for) a different thing. This particlar issue appears to have been introduced into the code in 2016 when the shape and scale loop was introduced here. Anyway, without doing a complete rewrite, I agree your fix seems to put newdata back at about the right point, but I wondered whether it should be put back after (about) line 217 -- i.e., at the end of the if() clause that starts # handle data setup for uniform key case and before # get the bins when you have binned data. I have not tested that suggestion, or analyzed it much.

lenthomas commented 1 year ago

We talked about this at today's DistDev meeting. We decided to put an if statement around the line @LHMarshall suggested to exericse the new line just when there is hazard rate with covars. We also decided to add a test to the suite to exericse this (ie produce the standard error).

DistanceDevelopment / mrds

Error from predict.ds when hr with covariates and se.fit = TRUE #84