USEPA / spmodel

spmodel: Spatial Statistical Modeling and Prediction in R
https://usepa.github.io/spmodel/

predict() se.fit = T default behavior #26

Open henryrscharf opened 2 days ago

henryrscharf commented 2 days ago

Hi all,

Thanks for putting together this great R package. I've enjoyed using it so far, and it looks like it'll be a great teaching tool for my class at the University of Arizona this semester.

I don't have a bug so much as a "thing to point out" related to the consistency of your predict() method with predict.lm() and the like. I noticed that if I use predict(fitted_model, se.fit = T), I appear to get standard errors useful for creating prediction intervals rather than confidence intervals. If I use predict(fitted_model, interval = "confidence", se.fit = T), then I get standard errors for the latter.
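
Here is roughly what I mean (just a sketch, not run; fitted_model stands in for any fitted splm object and new_data for its prediction data):

# sketch only: fitted_model is a fitted splm object, new_data its prediction data
# default: se.fit appears to be on the prediction-interval scale
pred_se <- predict(fitted_model, new_data, se.fit = T)$se.fit

# with interval = "confidence": se.fit is on the confidence-interval scale
conf_se <- predict(fitted_model, new_data, interval = "confidence", se.fit = T)$se.fit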

All that works as expected; my only comment is that for fitted lm and gam objects, the standard errors returned by default correspond to confidence bounds, so it might be worth considering changing the default behavior of predict.splm() to match.

I could be wrong about this, so please let me know if I've made a mistake. Just thought I'd point it out for your consideration.

Thanks again!

michaeldumelle commented 1 day ago

Thanks @henryrscharf for the kind words; we really appreciate it! We are thrilled to hear you are using it for teaching, as this was one of our primary motivations for creating the software.

Your assessment above is correct, and I have included a small reproducible example below to demonstrate:

library(spmodel)

# iid linear model using lm
lmod <- lm(log_cond ~ temp, data = lake)
lm_pred <- predict(lmod, lake_preds, se.fit = TRUE, interval = "prediction")
lm_conf <- predict(lmod, lake_preds, se.fit = TRUE, interval = "confidence")

# iid linear model using splm
spmod <- splm(log_cond ~ temp, data = lake, spcov_type = "none")
splm_pred <- predict(spmod, lake_preds, se.fit = TRUE, interval = "prediction")
splm_conf <- predict(spmod, lake_preds, se.fit = TRUE, interval = "confidence")

# se.fit data frame
se_dat <- data.frame(
  lm_pred_se = lm_pred$se.fit,
  lm_conf_se = lm_conf$se.fit,
  splm_pred_se = splm_pred$se.fit,
  splm_conf_se = splm_conf$se.fit
)

# default lm behavior for se.fit always corresponds to confidence interval
print(se_dat)
#>    lm_pred_se lm_conf_se splm_pred_se splm_conf_se
#> 1  0.24807552 0.24807552    0.9994595   0.24807552
#> 2  0.29241650 0.29241650    1.0113779   0.29241650
#> 3  0.10874041 0.10874041    0.9742701   0.10874041
#> 4  0.09785855 0.09785855    0.9731156   0.09785855
#> 5  0.09648741 0.09648741    0.9729787   0.09648741
#> 6  0.10567708 0.10567708    0.9739330   0.10567708
#> 7  0.09787592 0.09787592    0.9731174   0.09787592
#> 8  0.13882190 0.13882190    0.9780845   0.13882190
#> 9  0.11613230 0.11613230    0.9751228   0.11613230
#> 10 0.13412299 0.13412299    0.9774286   0.13412299

# prediction se for lmod? (this is the default splm behavior for se.fit)
print(lm_pred$residual.scale)
#> [1] 0.9681827

# match se_dat$splm_pred_se above
sqrt(lm_pred$residual.scale^2 + se_dat$lm_pred_se^2)
#>  [1] 0.9994595 1.0113779 0.9742701 0.9731156 0.9729787 0.9739330 0.9731174
#>  [8] 0.9780845 0.9751228 0.9774286

Created on 2024-09-19 with reprex v2.1.1

While creating spmodel, we tried to be as consistent as possible with lm() (and glm()). The issue you raise, however, is one way in which we chose to deviate (but are open to reconsidering).

I know you are familiar with the details I will present below, but I want to include them explicitly for future readers of this issue. To get prediction standard errors from lm(), you either need to back them out manually from the prediction interval bounds or compute sqrt(.model$residual.scale^2 + .model$se.fit^2). For iid linear models, .model$residual.scale is a constant. For spatial linear models, however, the .model$residual.scale analogue is different for each prediction location and depends on proximity to nearby observed data (prediction technical details are available at this link).
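
To make those two routes concrete for future readers, here is a quick sketch (reusing lmod and lake_preds from the reprex above; a 95% interval is assumed):

# route 1: back the prediction standard errors out of the prediction interval bounds
lm_pi <- predict(lmod, lake_preds, interval = "prediction", level = 0.95)
se_from_bounds <- (lm_pi[, "upr"] - lm_pi[, "fit"]) / qt(0.975, df = lmod$df.residual)

# route 2: combine the (constant) residual scale with the mean standard errors
lm_fit <- predict(lmod, lake_preds, se.fit = TRUE)
se_from_pieces <- sqrt(lm_fit$residual.scale^2 + lm_fit$se.fit^2)

# the two routes should agree up to numerical error
all.equal(unname(se_from_bounds), unname(se_from_pieces))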

So I suppose there are two options moving forward for handling se.fit when interval = "prediction":

  1. Keep the current behavior; make it more explicit in the documentation that spmodel prediction differs from base-R prediction with respect to the se.fit argument.
  2. Make .model$residual.scale a vector equal to $\sigma^2 - c_0 \Sigma^{-1} c_0^T$, where $\sigma^2$ is the overall variance, $c_0$ is the covariance between the response at the prediction location and the observed data, and $\Sigma$ is the covariance matrix of the observed data. This way, .model$residual.scale captures residual uncertainty while .model$se.fit captures mean uncertainty, so se.fit would be the same whether interval = "prediction" or interval = "confidence" (a sketch of how this might look from the user's side follows this list).
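
To illustrate how 2. might look from the user's side, here is a hypothetical sketch (not current behavior; it assumes residual.scale would be returned as a per-location standard deviation, mirroring lm()):

# hypothetical usage under option 2 (not current spmodel behavior)
splm_conf <- predict(spmod, lake_preds, se.fit = TRUE, interval = "confidence")

# mean (confidence) uncertainty, as returned now
mean_se <- splm_conf$se.fit

# residual uncertainty, one value per prediction location under the proposal
resid_scale <- splm_conf$residual.scale  # hypothetical: would vary by location

# prediction standard errors recovered lm()-style
pred_se <- sqrt(resid_scale^2 + mean_se^2)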

I will discuss 1. and 2. above with my collaborators, but I certainly welcome your perspectives! What do you think?

It is worth noting that SSN2 (website link here) behaves the same way as spmodel with respect to se.fit, so a change in one will eventually imply a change in the other.