EvolEcolGroup / tidysdm

R package to fit species distribution models (SDMs) using the 'tidymodels' framework
https://evolecolgroup.github.io/tidysdm/
GNU Affero General Public License v3.0
28 stars 8 forks source link

Testing for sf object #62

Closed btupper closed 3 weeks ago

btupper commented 3 weeks ago

I'm trying to understand the context of the prep.spatial_recipe function. In particular, the training argument is being tested for a "geometry" column. This may lead to problems if the geometry list column is not named "geometry" - there is no requirement that the the geometry list column have any name in particular. In fact, when reading from the ubiquitous geopackage (".gpkg") formatted files it seems that the column is named "geom".

The same test occurs in bake.spatial_recipe here but with a newdata argument.

Is the point to test that the training (or newdata) object is an sf object? If so, would inherits(training, 'sf') suffice? Maybe like this...

# if we have a geometry
if (inherits(training, "sf")) {
  ## convert_geometry_column
  training <- training %>%
     dplyr::bind_cols(sf::st_coordinates(training)) %>%
      sf::st_drop_geometry()
}

On the other hand, it may be that training data is something I misunderstand. Perhaps it has been modified upstream to explicitly contain a "geometry" column? That would make my head melt, but would explain my confusion, too.

dramanica commented 3 weeks ago

If I remember correctly, this is a hack to deal with the fact that, when running a workflow, datasets that are generated from a spatial_split end up having a geometry column even though they are not of class sf. You could try to implement your change, but I think it would lead to failures if you run a Check (I am not sure whether there is a unit test that deals with that, or whether it would be caught by the full vignette when it maps the workflow). However, even if I am correct, the current solution will fail on sf objects that don't have a geometry column (as you point out). For the recipe methods, the solution would be to implement a hybrid check (check for sf AND for a geometry column). What I am not sure is whether the workflow will then create datasets with a geometry column, or whether the custom name would be adopted. I guess we would have to try that out.

btupper commented 3 weeks ago

Oh, I see. Yep - I knew it would melt my head.

May be a simple solution might be a simple heads-up to the user to name the geometry list column "geometry" ala obj = sf::st_set_geometry(obj, "geometry")? I did bump into issues while trying to use a geometry list column named geom, and it didn't complain until I tried workflow_set().

dramanica commented 3 weeks ago

@btupper Can you have a look at the branch rename_geom? It should now cope with sf objects that have alternative names for the geometry column.

btupper commented 3 weeks ago

Woooohooo! It works without issue after assigning a random name to the geometry list column. The example below borrows from your tutorial but substitutes in some local data. Thank you!

library(tidysdm)
#> Loading required package: tidymodels
#> Loading required package: spatialsample
library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(here)
#> here() starts at /Users/ben/Library/CloudStorage/Dropbox/code/projects/mola-mola

model_input = sf::read_sf("model_input.gpkg") |> 
  sf::st_set_geometry("happy_halloween")
model_input
#> Simple feature collection with 1089 features and 5 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -76 ymin: 35 xmax: -59.5 ymax: 46
#> Geodetic CRS:  WGS 84
#> # A tibble: 1,089 × 6
#>    class      sst windspeed  u_wind v_wind      happy_halloween
#>    <chr>    <dbl>     <dbl>   <dbl>  <dbl>          <POINT [°]>
#>  1 presence  12.5      4.91 -0.270  -0.543  (-66.60798 42.2943)
#>  2 presence  20.2      4.86  1.59    0.756 (-68.00023 42.60569)
#>  3 presence  14.9      8.41  2.05   -0.739 (-72.66639 40.46282)
#>  4 presence  18.1      4.31  1.28    2.79  (-69.75882 42.53246)
#>  5 presence  16.7      8.38  0.864  -1.11    (-75.3774 36.2458)
#>  6 presence  27.8      7.50 -3.31   -1.33  (-75.08756 35.54985)
#>  7 presence  12.6      4.61 -0.155  -0.274 (-67.15288 42.30458)
#>  8 presence  16.6      4.08  0.846   0.505 (-71.46238 41.07934)
#>  9 presence  17.4      4.48  0.906   0.429 (-65.18623 42.52456)
#> 10 presence  17.5      4.15 -0.0541  0.253       (-69.61 41.55)
#> # ℹ 1,079 more rows

rec <- recipe(model_input,  formula = class ~ .)

models <-
  # create the workflow_set
  workflow_set(
    preproc = list(default = rec),
    models = list(
      # the standard glm specs
      glm = tidysdm::sdm_spec_glm(),
      # rf specs with tuning
      rf = tidysdm::sdm_spec_rf(),
      # boosted tree model (gbm) specs with tuning
      gbm = tidysdm::sdm_spec_boost_tree(),
      # maxent specs with tuning
      maxent = tidysdm::sdm_spec_maxent()
    ),
    # make all combinations of preproc and models,
    cross = TRUE  ) |>
  # tweak controls to store information needed later to create the ensemble
  option_add(control = tidysdm::control_ensemble_grid())

input_cv <- spatial_block_cv(data = model_input, v = 3, n = 5)

set.seed(1234567)
models <-  models |>
  workflow_map("tune_grid",
               resamples = input_cv, 
               grid = 3,
               metrics = tidysdm::sdm_metric_set(), 
               verbose = TRUE )
#> i    No tuning parameters. `fit_resamples()` will be attempted
#> i 1 of 4 resampling: default_glm
#> ✔ 1 of 4 resampling: default_glm (264ms)
#> i 2 of 4 tuning:     default_rf
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> ✔ 2 of 4 tuning:     default_rf (2.3s)
#> i 3 of 4 tuning:     default_gbm
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> ✔ 3 of 4 tuning:     default_gbm (6s)
#> i 4 of 4 tuning:     default_maxent
#> ✔ 4 of 4 tuning:     default_maxent (1.9s)

Created on 2024-10-31 with reprex v2.1.1

dramanica commented 3 weeks ago

@btupper thanks for double-checking. Now in dev following #63