USEPA / spmodel

spmodel: Spatial Statistical Modeling and Prediction in R
https://usepa.github.io/spmodel/
GNU General Public License v3.0

Issues doing kriging / using `predict()` #11

Closed DavidJesse21 closed 1 year ago

DavidJesse21 commented 1 year ago

Hey there,

first of all, since your package does not yet seem to be that well known in the R community, I want to say thank you for the effort you have invested in creating it! I really think you are filling a gap: the package has a modern design, builds on state-of-the-art software (as opposed to, e.g., geoR), and implements the most common estimation methods (in particular REML, which I don't think is fully implemented in gstat). I also like that you have put a lot of effort into documenting the package well, including in the form of vignettes.

Unfortunately, there is currently also a not-so-small drawback: the predict() method for your model objects does not seem to work reliably. I have created a reproducible example with a data set that we planned to use for a course in spatial statistics at our university. The parameter estimation works very smoothly and is in fact much faster than geoR (though I haven't formally benchmarked the two packages). However, the kriging procedure simply fails, even if I choose a relatively small grid of prediction locations. I suspect it has to do with some matrix inversion, as I sometimes get the error message `Error in forwardsolve(cov_lowchol, c0) : invalid 'k' argument`. In contrast, when I use the gstat package for kriging, no issues appear at all, as demonstrated in the reproducible example.
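To make the failure mode concrete, here is a minimal sketch of the pattern (m_spher and ch_grid are the names from my reproducible example; the data set ch_data, the response z, and the spherical covariance are placeholders for my actual setup):

```r
library(spmodel)
library(sf)

# fit an intercept-only model with a spherical covariance via REML
m_spher <- splm(z ~ 1, data = ch_data, spcov_type = "spherical", estmethod = "reml")

# ch_grid is a bare sfc_POINT geometry column (the grid of prediction locations);
# passing it as newdata is what triggers the failure
kg_spm <- predict(m_spher, newdata = ch_grid)
#> Error in forwardsolve(cov_lowchol, c0) : invalid 'k' argument
```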

In principle, this procedure is not too complicated, but it would be nice if one could just stick with your package for the entire process of estimation and prediction. It would therefore be great if you could improve the predict() method :raised_hands:

Cheers David

PS: I assume that you are familiar with the renv package, but in case you're not: after cloning the repo, you just need to call renv::restore() and you should be ready to go :slightly_smiling_face:
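That is:

```r
# after cloning the repo, restore the project's pinned package versions
renv::restore()
```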

michaeldumelle commented 1 year ago

@DavidJesse21 thanks for the kind words about the package and for the issue submission. My team will look into this and keep you updated, as it is certainly not the intended behavior of predict()! And sorry about the long response time -- I have been taking some parental leave lately!

michaeldumelle commented 1 year ago

@DavidJesse21 thanks again, and great job with the reproducible example. The problem occurs upstream of the `Error in forwardsolve(cov_lowchol, c0) : invalid 'k' argument` error message: newdata (ch_grid) is an sfc_POINT object, not an sf object. The newdata argument must be a data.frame, tibble, or sf object, as noted in the documentation for predict()'s newdata argument. After turning ch_grid into an sf object, the code runs successfully on my computer:

```r
# convert the bare sfc_POINT geometry column into an sf object
ch_grid <- sf::st_as_sf(ch_grid)  # the repro accesses this as sf$st_as_sf()
kg_spm <- predict(m_spher, newdata = ch_grid, interval = "none")
```

Because ch_grid has over 5,000 rows, predict() uses a computationally efficient approximation by default. To override this, explicitly set local = FALSE in predict(). Both options (approximation vs. no approximation) take about a minute on my computer and return nearly identical results.
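For example, building on the lines above:

```r
# use all observed data in prediction instead of the big-data approximation
kg_spm_exact <- predict(m_spher, newdata = ch_grid, interval = "none", local = FALSE)
```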

I imagine others have already encountered this problem or will encounter it in the future, either when fitting models or when predicting. So in the next minor spmodel update, I will add a check that returns an error if data (for model fitting) or newdata (for prediction) is not a data.frame, tibble, or sf object.
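Roughly along these lines (a hypothetical sketch, not the actual spmodel implementation; tibbles and sf objects both inherit from data.frame, so a single class check covers all three):

```r
# hypothetical input check -- not spmodel's actual code
check_data_arg <- function(x, arg_name) {
  if (!inherits(x, "data.frame")) {
    stop("`", arg_name, "` must be a data.frame, tibble, or sf object.", call. = FALSE)
  }
}
```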

I am going to close this issue now, but please feel free to comment if you have any other questions.

DavidJesse21 commented 1 year ago

Hi @michaeldumelle, thank you very much for your reply! Turns out that the problem is much simpler than I was expecting.

> I imagine others have already encountered this problem or will encounter it in the future, either when fitting models or when predicting. So in the next minor spmodel update, I will add a check that returns an error if data (for model fitting) or newdata (for prediction) is not a data.frame, tibble, or sf object.

This definitely sounds like a good idea! In particular, the issue right now is that the code (using the sfc_POINT object) does not even necessarily throw an error, but instead returns a vector or matrix full of NA values, potentially after a long run time.

Although you have already closed the issue, I would like to raise a few further points/questions:

  1. When setting interval = "confidence", I get the same value and confidence limits for every observation. Could you maybe have a look into that?
  2. When setting se.fit = TRUE, the function returns a list with the estimated standard deviations as a separate vector. Is there a particular reason for this design decision? In my opinion, it would be more natural and convenient to return them as an additional column in the matrix of predictions.
  3. Although spmodel::predict() now works perfectly fine, gstat::krige() is much faster. The results differ slightly, however, and I have not yet evaluated whether one of the functions performs systematically better. My initial thought was that gstat::krige() uses fewer observations for the kriging, but according to the documentation it actually uses all observations by default. So I guess it comes down to the computational implementation of the algorithm (I saw that gstat uses a lot of C code). Maybe (if you like) you can make a note of that and investigate it a little further in the future; a rough timing sketch follows after this list.
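The timing sketch I mentioned (vg_fit stands in for a fitted gstat variogram model and ch_sf for the observations as an sf object; both are placeholder names from my setup):

```r
# exact spmodel prediction (no big-data approximation) vs. gstat ordinary kriging
system.time(predict(m_spher, newdata = ch_grid, local = FALSE))
system.time(gstat::krige(z ~ 1, locations = ch_sf, newdata = ch_grid, model = vg_fit))
```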

Cheers David

michaeldumelle commented 1 year ago

Thanks @DavidJesse21, some responses to your inquiries below:

  1. This is intended behavior, as you have fit an intercept-only model in your mean structure. When interval = "confidence", spmodel estimates the mean of the response at a particular location and does not predict the realized response. For an intercept-only model, the estimated mean is the same everywhere. When interval = "prediction", spmodel predicts the realized response (i.e., Kriging), and the uncertainties change across locations, even for an intercept-only model. For more explicit background, see spmodel's technical details vignette; a small illustration follows after this list.
  2. There is a particular reason -- namely, to mirror predict()'s behavior on an lm() object (also illustrated after this list). I don't disagree that your suggestion is likely more natural and convenient for many, but it does not mirror base-R behavior.
  3. You are correct that gstat::krige() is faster here, but I don't currently know enough about gstat's specific implementation to pinpoint the reason (though likely, as you point out, it has something to do with a more efficient back end). Because spmodel::predict() is still fairly quick and has options for large data sets (e.g., set local = TRUE), our current focus is on new features (e.g., generalized linear models) and stability. However, we will likely revisit computational efficiency in the future.
  4. When either the observed data size or the prediction data size exceeds 5,000 (as in your case), spmodel::predict() by default uses an approximation for large data sets. This approximation becomes more efficient (relative to using all of the observed data to predict) as the sample size grows; note that there is not much difference between the two approaches when there are only a few thousand observations and/or prediction locations. To turn this default behavior off (i.e., use all of the observed data to predict), set local = FALSE. You should then see the spmodel::predict() and gstat::krige() predictions match.
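To illustrate points 1 and 2 (the spmodel calls reuse your m_spher and ch_grid; the lm() part uses the built-in cars data purely as a stand-in):

```r
# point 1: "confidence" estimates the mean, which is constant for an
# intercept-only model, so every row is identical; "prediction" predicts the
# realized response, so the limits vary by location
ci <- predict(m_spher, newdata = ch_grid, interval = "confidence")
pi <- predict(m_spher, newdata = ch_grid, interval = "prediction")

# point 2: base R's predict.lm() also returns a list when se.fit = TRUE,
# with the standard errors in a separate component
fit <- lm(dist ~ speed, data = cars)
names(predict(fit, se.fit = TRUE))
#> [1] "fit"            "se.fit"         "df"             "residual.scale"
```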

Again, please reach out if you have any thoughts / feedback!