BlasBenito / spatialRF

R package to fit spatial models with Random Forest
https://blasbenito.github.io/spatialRF/

Apply spatialRF model parameters to unseen data at finer spatial scale #16

Open nikosGeography opened 1 year ago

nikosGeography commented 1 year ago

I have one response variable and 4 predictors and I am performing a spatial random forest regression at a coarse spatial scale. My goal is to take the model parameters and apply them to a finer spatial resolution in order to predict the response variable at the finer spatial scale.

When I run

p <- stats::predict(object = model.spatial,  # the fitted spatialRF model
                    data = s,                # data.frame with the predictors at the fine spatial scale (no NaN values)
                    type = "response")$predictions

I am getting this error: Error in predict.ranger.forest(forest, data, predict.all, num.trees, type,: Error: One or more independent variables not found in data.

This error occurs because, when I run the spatial version of the model, two more (spatial) predictors are created:

[attached plot showing the two extra spatial predictors]
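A quick way to confirm the mismatch is to compare the predictor names stored in the fitted model with the columns of s (a minimal sketch; the forest$independent.variable.names slot is the one ranger uses, which spatialRF models inherit, so check that it exists in your object):

# predictors the spatial model was trained with (includes the spatial predictors)
expected <- model.spatial$forest$independent.variable.names

# predictors available in the fine-scale data frame
available <- colnames(s)

# names returned here are the ones missing from s
setdiff(expected, available)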

Given these two extra predictors, how can I use the spatial model I created at the coarse scale to predict the response variable at a finer spatial scale?

Here is the code:

library(spatialRF)
library(stats)
library(sp)      # coordinates(), gridded()
library(raster)  # raster(), crs(), writeRaster()

wd = "path/"

block.data = read.csv(paste0(wd, "block.data.csv"))

#names of the response variable and the predictors
dependent.variable.name <- "ntl"
predictor.variable.names <- colnames(block.data)[4:7]

#coordinates of the cases
xy <- block.data[, c("x", "y")]

block.data$x <- NULL
block.data$y <- NULL

#spatial distance matrix between cases, computed from the coordinates
distance.matrix <- as.matrix(dist(xy))
min(distance.matrix)
max(distance.matrix)

#distance thresholds (same units as distance_matrix)
distance.thresholds <- c(0, 20, 50, 100, 200, 500)

#random seed for reproducibility
random.seed <- 456

#creating and registering the cluster
local.cluster <- parallel::makeCluster(
  parallel::detectCores() - 1,
  type = "PSOCK")
doParallel::registerDoParallel(cl = local.cluster)

# fitting a non-spatial Random Forest 
model.non.spatial <- spatialRF::rf(
data = block.data,
dependent.variable.name = dependent.variable.name,
predictor.variable.names = predictor.variable.names,
distance.matrix = distance.matrix,
distance.thresholds = distance.thresholds,
xy = xy,
seed = random.seed,
verbose = FALSE)

# Fitting a spatial model with rf_spatial()
model.spatial <- spatialRF::rf_spatial(
  model = model.non.spatial,
  method = "mem.moran.sequential", #default method
  verbose = FALSE,
  seed = random.seed)

#stopping the cluster
parallel::stopCluster(cl = local.cluster)

# export residuals of the spatialRF model at the coarse scale
rsds = as.data.frame(cbind(xy, model.spatial$residuals$values))
colnames(rsds)[3] = "resids"
coordinates(rsds) <- ~ x + y
gridded(rsds) <- TRUE
rsds <- raster(rsds)
crs(rsds) = provoliko  # 'provoliko' is the CRS object defined elsewhere in the original script

writeRaster(rsds, paste0(wd, "srf_resids.tif"), overwrite = TRUE)

# prediction at a finer spatial scale
s = read.csv(paste0(wd, "s.csv"))

p <- stats::predict(object = model.spatial,
                    data = s,
                    type = "response")$predictions

You can download a small sample of the data from here.

BlasBenito commented 1 year ago

Dear Nikolaos,

I appreciate your interest in the package.

The function rf_spatial() creates spatial predictors, namely Moran's Eigenvector Maps, from the distance matrix between your training cases. These spatial predictors represent the spatial structure of the data. When included in a model, these spatial predictors help you understand to what extent the spatial distribution of your training cases is influencing your model outcomes. As such, they are only useful in explanatory models, and not so much in predictive ones.

If you need to make predictions from a spatial model, you need to generate the same Moran's Eigenvector Maps for the training and prediction cases simultaneously (you can do that with the function mem_multithreshold() or with mem()), then separate your training and prediction cases into different data frames, train your model with the training cases, and finally predict over the prediction cases. This will work well if your data is regularly spaced in a grid, but it might not work at all if the data is irregular, because the spatial structures of the training and prediction data can be too different.
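A rough sketch of that workflow (object names such as xy.train and xy.predict are placeholders, and the mem_multithreshold() argument names should be checked against the installed version of the package):

library(spatialRF)

# coordinates of the training and prediction cases stacked together
xy.all <- rbind(xy.train, xy.predict)

# one distance matrix covering all cases, training and prediction alike
distance.matrix.all <- as.matrix(dist(xy.all))

# Moran's Eigenvector Maps computed once for the combined set of cases
mems <- spatialRF::mem_multithreshold(
  distance.matrix = distance.matrix.all,
  distance.thresholds = c(0, 20, 50, 100, 200, 500)
)

# split the MEMs back into training and prediction rows
n.train <- nrow(xy.train)
mems.train <- mems[seq_len(n.train), , drop = FALSE]
mems.predict <- mems[-seq_len(n.train), , drop = FALSE]

# bind the MEMs to the corresponding predictor tables,
# fit with rf() on the training part, then predict on the prediction part
training.data <- cbind(block.data, mems.train)
prediction.data <- cbind(s, mems.predict)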

In your case, this is even harder because the training and the prediction data have different resolutions, and the spatial structures of both datasets will be rather different.

From my perspective, you have two options:

  1. Easy: Keep your spatial model as proof that the spatial structure matters (your best predictor is non-spatial), but do the prediction with a non-spatial model fitted with rf().
  2. Harder but doable: Compute the Moran's Eigenvector Maps for the fine-grain data with mem() or mem_multithreshold(), convert these MEMs to raster with terra::rast(), extract their values at the coarse-grain locations with terra::extract(), train your model on the resulting dataset, and then predict over your fine-grain data with the original MEMs (see the sketch below).
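A minimal sketch of option 2, assuming both grids are regular and the fine-grain cases come with x/y coordinates (object names such as fine.xy and coarse.xy are placeholders; check the mem_multithreshold() and terra argument names against your installed versions):

library(spatialRF)
library(terra)

# 1. MEMs computed on the fine-grain grid
fine.mems <- spatialRF::mem_multithreshold(
  distance.matrix = as.matrix(dist(fine.xy)),
  distance.thresholds = c(0, 20, 50, 100, 200, 500)
)

# 2. one raster layer per MEM column, built from x, y, value triplets
mem.rasters <- terra::rast(cbind(fine.xy, fine.mems), type = "xyz")

# 3. sample the MEM rasters at the coarse-grain case locations
coarse.points <- terra::vect(coarse.xy, geom = c("x", "y"))
coarse.mems <- terra::extract(mem.rasters, coarse.points, ID = FALSE)

# 4. train on the coarse data plus the extracted MEMs with rf(),
#    then predict over cbind(s, fine.mems) with stats::predict()
training.data <- cbind(block.data, coarse.mems)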

These things will be easier to do with the next version of the package, but I don't have a release date yet, so you are on your own in the meantime.

I'll be happy to answer any questions though!

Good luck with your analysis,

Blas

nikosGeography commented 1 year ago

Dear @BlasBenito, thank you for your comment. My dataset does indeed come from satellite images, but I am working with their centroids (i.e., points), which means they are regularly spaced. Of course, as you said, the spatial structures differ because I am building the model at a coarse spatial scale and I want to predict at a finer spatial scale.

If you don't mind, I would like to keep this issue open until I manage to predict at a finer spatial scale following the (helpful) steps you wrote.

BlasBenito commented 1 year ago

I don't mind at all. It might be useful for other people anyway; it's not the first time this question has come up.

barnabasharris commented 11 months ago

@BlasBenito I was wondering: if it's not computationally tractable to calculate a distance matrix for the entire prediction domain, do you think there is any utility in:

  1. aggregating the raster to a manageable resolution and calculating the Moran's Eigenvector Maps for this coarse-grain data with mem() or mem_multithreshold(), then converting them to raster;
  2. using terra::extract() at the training locations and training a non-spatial model with the Moran's Eigenvectors as added covariates;
  3. disaggregating the coarse-grain Eigenvector Maps to a finer resolution and adding them as covariates in the finer-grain data;
  4. making predictions on this finer-resolution data?

Obviously, in this scenario the Moran's Eigenvector values aren't strictly accurate for each prediction location but represent a general value that applies to the aggregated cell.

I can provide a reprex if that would help.
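For reference, the aggregate-then-disaggregate idea described above could look roughly like this with terra (placeholder object names, thresholds, and aggregation factors; whether the disaggregated MEMs are accurate enough at the fine resolution is exactly the open question here):

library(spatialRF)
library(terra)

# 1. aggregate the fine-resolution grid to a tractable cell size
coarse.grid <- terra::aggregate(fine.grid, fact = 10, fun = "mean")

# 2. MEMs from the coarse cell centres
coarse.xy <- as.data.frame(terra::xyFromCell(coarse.grid, seq_len(terra::ncell(coarse.grid))))
coarse.mems <- spatialRF::mem_multithreshold(
  distance.matrix = as.matrix(dist(coarse.xy)),
  distance.thresholds = c(0, 1000, 5000)  # placeholder values, in CRS units
)

# 3. rasterize the MEMs, then disaggregate back to the fine resolution
mem.rasters <- terra::rast(cbind(coarse.xy, coarse.mems), type = "xyz")
mem.rasters.fine <- terra::disagg(mem.rasters, fact = 10)

# 4. extract the MEM values at the training locations with terra::extract(),
#    fit a non-spatial rf() with them as covariates, and predict over the
#    fine grid stacked with mem.rasters.fine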