USEPA / spmodel

spmodel: Spatial Statistical Modeling and Prediction in R
https://usepa.github.io/spmodel/

predict() se.fit = T default behavior #26

Open henryrscharf opened 2 days ago

henryrscharf commented 2 days ago

Hi all,

Thanks for putting together this great R package. I've enjoyed using it so far, and it looks like it'll be a great teaching tool for my class at the University of Arizona this semester.

I don't have a bug so much as a "thing to point out" related to the consistency of your predict() method with predict.lm() and the like. I noticed that if I use predict(fitted_model, se.fit = T), I appear to get standard errors useful for creating prediction intervals rather than confidence intervals. If I use predict(fitted_model, interval = "confidence", se.fit = T), then I get standard errors for the latter.
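
Here is roughly what I mean (just a sketch, not run; fitted_model stands in for any fitted splm object and new_data for its prediction data):

# sketch only: fitted_model is a fitted splm object, new_data its prediction data
# default: se.fit appears to be on the prediction-interval scale
pred_se <- predict(fitted_model, new_data, se.fit = T)$se.fit

# with interval = "confidence": se.fit is on the confidence-interval scale
conf_se <- predict(fitted_model, new_data, interval = "confidence", se.fit = T)$se.fit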

All that works as expected; my only comment is that for fitted lm and gam objects, the standard errors returned by default correspond to confidence bounds, so it might be worth considering changing the default behavior of predict.splm() to match.

I could be wrong about this, so please let me know if I've made a mistake. Just thought I'd point it out for your consideration.

Thanks again!

michaeldumelle commented 1 day ago

Thanks @henryrscharf for the kind words; we really appreciate it! We are thrilled to hear you are using it for teaching, as this was one of our primary motivations for creating the software.

Your assessment above is correct, and I have included a small reproducible example below to demonstrate:

library(spmodel)

# iid linear model using lm
lmod <- lm(log_cond ~ temp, data = lake)
lm_pred <- predict(lmod, lake_preds, se.fit = TRUE, interval = "prediction")
lm_conf <- predict(lmod, lake_preds, se.fit = TRUE, interval = "confidence")

# iid linear model using splm
spmod <- splm(log_cond ~ temp, data = lake, spcov_type = "none")
splm_pred <- predict(spmod, lake_preds, se.fit = TRUE, interval = "prediction")
splm_conf <- predict(spmod, lake_preds, se.fit = TRUE, interval = "confidence")

# se.fit data frame
se_dat <- data.frame(
  lm_pred_se = lm_pred$se.fit,
  lm_conf_se = lm_conf$se.fit,
  splm_pred_se = splm_pred$se.fit,
  splm_conf_se = splm_conf$se.fit
)

# default lm behavior for se.fit always corresponds to confidence interval
print(se_dat)
#>    lm_pred_se lm_conf_se splm_pred_se splm_conf_se
#> 1  0.24807552 0.24807552    0.9994595   0.24807552
#> 2  0.29241650 0.29241650    1.0113779   0.29241650
#> 3  0.10874041 0.10874041    0.9742701   0.10874041
#> 4  0.09785855 0.09785855    0.9731156   0.09785855
#> 5  0.09648741 0.09648741    0.9729787   0.09648741
#> 6  0.10567708 0.10567708    0.9739330   0.10567708
#> 7  0.09787592 0.09787592    0.9731174   0.09787592
#> 8  0.13882190 0.13882190    0.9780845   0.13882190
#> 9  0.11613230 0.11613230    0.9751228   0.11613230
#> 10 0.13412299 0.13412299    0.9774286   0.13412299

# prediction se for lmod? (this is the default splm behavior for se.fit)
print(lm_pred$residual.scale)
#> [1] 0.9681827

# match se_dat$splm_pred_se above
sqrt(lm_pred$residual.scale^2 + se_dat$lm_pred_se^2)
#>  [1] 0.9994595 1.0113779 0.9742701 0.9731156 0.9729787 0.9739330 0.9731174
#>  [8] 0.9780845 0.9751228 0.9774286

Created on 2024-09-19 with reprex v2.1.1

While creating spmodel, we tried to be as consistent as possible with lm() (and glm()). The issue you raise, however, is one way in which we chose to deviate (but are open to reconsidering).

I know you are familiar with the details I will present below, but I want to include them explicitly for future readers of this issue. To get prediction standard errors from lm(), you either need to back them out manually from the prediction interval bounds or compute sqrt(.model$residual.scale^2 + .model$se.fit^2). For iid linear models, .model$residual.scale is a constant. For spatial linear models, however, the .model$residual.scale analogue is different for each prediction location and depends on proximity to nearby observed data (prediction technical details are available at this link).
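
To make those two routes concrete for future readers, here is a quick sketch (reusing lmod and lake_preds from the reprex above; a 95% interval is assumed):

# route 1: back the prediction standard errors out of the prediction interval bounds
lm_pi <- predict(lmod, lake_preds, interval = "prediction", level = 0.95)
se_from_bounds <- (lm_pi[, "upr"] - lm_pi[, "fit"]) / qt(0.975, df = lmod$df.residual)

# route 2: combine the (constant) residual scale with the mean standard errors
lm_fit <- predict(lmod, lake_preds, se.fit = TRUE)
se_from_pieces <- sqrt(lm_fit$residual.scale^2 + lm_fit$se.fit^2)

# the two routes should agree up to numerical error
all.equal(unname(se_from_bounds), unname(se_from_pieces))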

So I suppose there are two options moving forward for handling se.fit when interval = "prediction":

  1. Keep the current behavior; make it more explicit in the documentation that spmodel prediction differs from base-R prediction with respect to the se.fit argument.
  2. Make .model$residual.scale a vector equal to $\sigma^2 - c_0 \Sigma^{-1} c_0^T$, where $\sigma^2$ is the overall variance, $c_0$ is the covariance between the response at the prediction location and the observed data, and $\Sigma$ is the covariance matrix of the observed data. This way, .model$residual.scale captures residual uncertainty while .model$se.fit captures mean uncertainty, so se.fit would be the same whether interval = "prediction" or interval = "confidence" (a sketch of how this might look from the user's side follows this list).
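
To illustrate how 2. might look from the user's side, here is a hypothetical sketch (not current behavior; it assumes residual.scale would be returned as a per-location standard deviation, mirroring lm()):

# hypothetical usage under option 2 (not current spmodel behavior)
splm_conf <- predict(spmod, lake_preds, se.fit = TRUE, interval = "confidence")

# mean (confidence) uncertainty, as returned now
mean_se <- splm_conf$se.fit

# residual uncertainty, one value per prediction location under the proposal
resid_scale <- splm_conf$residual.scale  # hypothetical: would vary by location

# prediction standard errors recovered lm()-style
pred_se <- sqrt(resid_scale^2 + mean_se^2)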

I will discuss 1. and 2. above with my collaborators, but I certainly welcome your perspectives! What do you think?

It is worth noting that SSN2 (website link here) behaves the same way as spmodel with respect to se.fit, so a change in one will eventually imply a change in the other.