EvolEcolGroup / tidysdm

R package to fit species distribution models (SDMs) using the 'tidymodels' framework
https://evolecolgroup.github.io/tidysdm/
GNU Affero General Public License v3.0
28 stars 8 forks source link

Can 'const' argument (terra::predict) be made with predict_raster for simple_ensemble obj? #58

Closed rylielrobinson closed 1 month ago

rylielrobinson commented 1 month ago

Hello - I am interesting in using the tidysdm package as the basis for my ensemble ML modelling for a pres/abs study. I have had success with the pipeline up until predicting my simple_ensemble across a raster stack.

I consider seasons as a categorical (now dummy) variable in my workflow. All other predictors are numeric, which I have rasters for.

Before going the ensemble route, I created a ranger (random forest) object and predict it across a raster stack using terra::predict. When I wanted to predict my ranger model across the spatial extent I was successful using the following code:

list_rf <- list()
seasons <- unique(pnts_r$seasons)
# loop through each season
for(i in 1:4) {
  list_rf[[i]] <- terra::predict(raster_stack, model, type = "response", const = data.frame(season = season[i]))
}

However, the const argument does not seem available for the tidysdm::predict_raster function for a simple_ensemble object. I have tried (unsuccessfully) to convert seasons to numeric values in hopes to supplement the season as a empty raster in the stack to no avail.

Error: [`[<-`(i)] the type of index i is unexpected: NULL

Trying to predict the simple_ensemble using terra::predict no longer works either

Error in v[cells, ] <- predv : incorrect number of subscripts on matrix
In addition: Warning message:
Unknown or uninitialised column: `xlevels`.

As a terra::predict argument const is a 'data.frame. Can be used to add a constant value as a predictor variable so that you do not need to make a SpatRaster layer for it'.

I am interested to know if something like this can be applied for tidysdm simple_ensemble objects for spatial/raster predictions?

dramanica commented 1 month ago

Hi @rylielrobinson, could you please provide us with a reprex (minimalistic dataset and code) that shows the issues, and we can then look into it. If you have dummy variables for the categorical variable, I am not clear why predict_raster() would not work.

rylielrobinson commented 1 month ago

HI @dramanica, thanks for the response. I found that I was able to get terra::predict() to work but still no luck with predict_raster(). I am glad this approach using terra::predict() works - I initially was not sure a _simpleensemble obj was compatible.

Here is a reproducible code example of what I have working:

library(tidymodels)
library(tidysdm)
library(raster)
library(terra)

v <- vect(system.file("ex/lux.shp", package="terra"))
r <- rast(system.file("ex/elev.tif", package="terra"))
set.seed(50)
pnts <- spatSample(v, size = 50)
pnts_r <- terra::extract(r, pnts, method = "simple") 

pnts_r$pa <- sample(c("presence","absence"), nrow(pnts_r), replace = TRUE)
pnts_r$pa <- factor(pnts_r$pa, levels = c("presence", "absence"))
pnts_r <- pnts_r %>% dplyr::select(-ID)

seasons <- c('winter', 'spring', 'summer', 'fall')
pnts_r$season <- seasons[seq_len(nrow(pnts_r)) %% length(seasons) + 1]

#tidymodels recipe
pa_recipe <-
  recipe(pa ~ .,
         data = pnts_r) %>%
  step_dummy(season) %>%
  step_normalize(all_numeric()) 

pa_recipe

pnts_r %>% check_sdm_presence(pa) # MUST BE TRUE

pa_models <-
  workflow_set(
    preproc = list(default = pa_recipe),
    models = list(
      glm = sdm_spec_glm(), #no tuning
      rf = sdm_spec_rf(tune = 'all'), #tuning
      gbm = sdm_spec_boost_tree(tune = 'all'), #tuning 
      maxent = sdm_spec_maxent(tune = 'all') #tuning
    ),
    cross = TRUE     # make all combinations of preproc and models,
  ) %>%
  option_add(control = control_ensemble_grid())

set.seed(123)

pa_split <- initial_split(pnts_r,
                          prop = 0.8)
pa_split

pa_cv <- vfold_cv(training(pa_split), v = 5, repeats = 2)
pa_cv # 5-fold cross-validation repeated 2 times using stratification

set.seed(1234567)
pa_models <-
  pa_models %>%
  workflow_map("tune_grid",
               resamples = pa_cv, grid = 9,
               metrics = sdm_metric_set(), verbose = TRUE)
autoplot(pa_models)

pa_models %>% collect_metrics()

pa_ensemble <- simple_ensemble() %>%
  add_member(pa_models, metric = "roc_auc")

autoplot(pa_ensemble)

test <- terra::predict(r, pa_ensemble, na.rm = T, const=data.frame(season="spring"))
test #this works!
plot(test)

However, for tidysdm while I still try: test <- predict_raster(pa_ensemble, r) There is the following error: Error in `validate_column_names()`: ! The following required columns are missing: 'season'.

I was able to find a work around by replacing seasons as a numeric value and supplementing it into a SpatRaster stack for predict_raster() though I am unsure if that is an appropriate work around. I am creating a 'pseudo' empty raster to match the season (ie. winter = 1).

library(raster)
library(terra)

v <- vect(system.file("ex/lux.shp", package="terra"))
r <- rast(system.file("ex/elev.tif", package="terra"))
set.seed(50)
pnts <- spatSample(v, size = 50)
pnts_r <- terra::extract(r, pnts, method = "simple") 
#pnts_r$temp <- sample(0:25, 50, replace = TRUE)

pnts_r$pa <- sample(c("presence","absence"), nrow(pnts_r), replace = TRUE)
pnts_r$pa <- factor(pnts_r$pa, levels = c("presence", "absence"))
pnts_r <- pnts_r %>% dplyr::select(-ID)
seasons <- 1:4
pnts_r$season <- seasons[seq_len(nrow(pnts_r)) %% length(seasons) + 1]
pnts_r$season <- factor(pnts_r$season, levels = 1:4)
#tidymodels recipe
pa_recipe <-
  recipe(pa ~ .,
         data = pnts_r) %>%
 step_dummy(season) %>% 
  step_normalize(all_numeric()) 

pa_recipe
pnts_r %>% check_sdm_presence(pa)

pa_models <-
  workflow_set(
    preproc = list(default = pa_recipe),
    models = list(
      glm = sdm_spec_glm(), #no tuning
      #gam = sdm_spec_gam(), #no tuning 
      rf = sdm_spec_rf(tune = 'all'), #tuning
      gbm = sdm_spec_boost_tree(tune = 'all'), #tuning 
      maxent = sdm_spec_maxent(tune = 'all') #tuning
    ),
    cross = TRUE     # make all combinations of preproc and models,
  ) %>%
 option_add(control = control_ensemble_grid())

library(tidysdm)
set.seed(123)

pa_split <- initial_split(pnts_r,
                          prop = 0.8)
pa_split

pa_cv <- vfold_cv(training(pa_split), v = 5, repeats = 2)
pa_cv # 5-fold cross-validation repeated 2 times using stratification

set.seed(1234567)
pa_models <-
  pa_models %>%
  workflow_map("tune_grid",
               resamples = pa_cv, grid = 10,
               metrics = sdm_metric_set(), verbose = TRUE)
beepr::beep(2)

pa_models
autoplot(pa_models)

pa_ensemble <- simple_ensemble() %>%
  add_member(pa_models, metric = "boyce_cont")

autoplot(pa_ensemble)

#tidysdm::predict_raster() approach - use 'empty' numeric raster for season num 
s <- r
values(s) <- 1
s <- terra::mask(s,r)
stack <- c(r,s)
plot(stack)
names(stack) <- c("elevation","season")

test1 <- predict_raster(pa_ensemble, stack)
plot(test1)
dramanica commented 1 month ago

To predict a factor that you transformed into a dummy variable in your recipe, you simply need to have a raster with the same categorical variable. So, if we use the ensemble with a 3 level topography variable as shown in the example at the end of https://evolecolgroup.github.io/tidysdm/articles/a2_tidymodels_additions.html#using-multi-level-factors-as-predictors, I can simply write:

# load the raster
climate_present <- terra::readRDS(system.file("extdata/lacerta_climate_present_10m.rds",
                                              package = "tidysdm"))
# first we add the categorical topography variable to the climate data (using the same rules used in the original data.frame)
# most of the time, you wouldn't need this, as the variable would already be in your input file
climate_present$topography <- climate_present$altitude
climate_present$topography <- classify(climate_present$topography, rcl = c(-Inf, 200, 800, Inf), include.lowest=TRUE, brackets=TRUE)
levels(climate_present$topography)  <- data.frame(ID = c(0,1,2), topography = c("plains", "hills", "mountains"))
# now we can predict
predict_factor <- predict_raster(lacerta_ensemble, climate_present)
plot(predict_factor)

You need to make sure that your variable is categorical (a factor) both in the data.frame passed to the recipe with step_dummy(), and in the raster, and that the name of the variable (and levels) match. Generally, you would have that categorical variable in a raster and then sample from it, so everything should be in sync. Let me know if you encounter any further problem.

dramanica commented 1 month ago

@rylielrobinson Is this issue resolved? If so, we can close it?

rylielrobinson commented 1 month ago

@dramanica this approach has worked for the reprex data - thanks