Model-R / modleR

An ecological niche model workflow based on dismo
https://model-r.github.io/modleR
51 stars 24 forks source link
dismo ecological-niche-models r r-package rstats

modleR: a workflow for ecological niche models

:registry status
badge

R-CMD-check

modleR is a workflow based on package dismo (Hijmans et al. 2017), designed to automatize some of the common steps when performing ecological niche models. Given the occurrence records and a set of environmental predictors, it prepares the data by cleaning for duplicates, removing occurrences with no environmental information and applying some geographic and environmental filters. It executes crossvalidation or bootstrap procedures, then it performs ecological niche models using several algorithms, some of which are already implemented in the dismo package, and others come from other packages in the R environment, such as glm, Support Vector Machines and Random Forests.

Citation

Andrea Sánchez-Tapia, Sara Ribeiro Mortara, Diogo Souza Bezerra Rocha, Felipe Sodré Mendes Barros, Guilherme Gall, Marinez Ferreira de Siqueira. modleR: a modular workflow to perform ecological niche modeling in R. https://www.biorxiv.org/content/10.1101/2020.04.01.021105v1

Installing

Currently modleR can be installed from GitHub:

# Without vignette
remotes::install_github("Model-R/modleR", build = TRUE)
# With vignette
remotes::install_github("Model-R/modleR",
                        build = TRUE,
                        dependencies = TRUE,
                        build_opts = c("--no-resave-data", "--no-manual"),
                        build_vignettes = TRUE)

Note regarding vignette building: the default parameters in build_opts include --no-build-vignettes. In theory, removing this will include the vignette on the installation but we have found that build_vignettes = TRUE is also necessary. During installation, R may ask to install or update some packages. If any of these return an error you can install them apart by running install.packages() and retry. When building the vignette, package rJava and a JDK will be needed. Also, make sure that the maxent.jar file is available and in the java folder of package dismo. Please download it here. Vignette building may take a while during installation.

Packages kuenm and maxnet should be installed from GitHub:

remotes::install_github("marlonecobos/kuenm")
remotes::install_github("mrmaxent/maxnet")

The workflow

The workflow consists of mainly four functions that should be used sequentially.

  1. Setup: setup_sdmdata() prepares and cleans the data, samples the pseudoabsences, and organizes the experimental design (bootstrap, crossvalidation or repeated crossvalidation). It creates a metadata file with details for the current round and a sdmdata file with the data used for modeling
  2. Model fitting and projecting: do_any() makes the ENM for one algorithm and partition; optionally, do_many() calls do_any() to fit multiple algorithms
  3. Partition joining: final_model() joins the partition models into a model per species per algorithm
  4. Ensemble: ensemble_model() joins the different models per algorithm into an ensemble model (algorithmic consensus) using several methods.

Folder structure created by this package

modleR writes the outputs in the hard disk, according to the following folder structure:

models_dir
├── projection1
│   ├── data_setup
│   ├── partitions
│   ├── final_models
│   └── ensemble_models
└── projection2
    ├── data_setup
    ├── partitions
    ├── final_models
    └── ensemble_models

The example dataset

modleR comes with example data, a list called example_occs with occurrence data for four species, and predictor variables called example_vars.

library(modleR)
str(example_occs)
#> List of 4
#>  $ Abarema_langsdorffii:'data.frame':    104 obs. of  3 variables:
#>   ..$ sp : chr [1:104] "Abarema_langsdorffii" "Abarema_langsdorffii" "Abarema_langsdorffii" "Abarema_langsdorffii" ...
#>   ..$ lon: num [1:104] -40.6 -40.7 -41.2 -41.7 -42.5 ...
#>   ..$ lat: num [1:104] -19.9 -20 -20.3 -20.5 -20.7 ...
#>  $ Eugenia_florida     :'data.frame':    341 obs. of  3 variables:
#>   ..$ sp : chr [1:341] "Eugenia_florida" "Eugenia_florida" "Eugenia_florida" "Eugenia_florida" ...
#>   ..$ lon: num [1:341] -35 -34.9 -34.9 -36.4 -42.1 ...
#>   ..$ lat: num [1:341] -6.38 -7.78 -8.1 -10.42 -2.72 ...
#>  $ Leandra_carassana   :'data.frame':    82 obs. of  3 variables:
#>   ..$ sp : chr [1:82] "Leandra_carassana" "Leandra_carassana" "Leandra_carassana" "Leandra_carassana" ...
#>   ..$ lon: num [1:82] -39.3 -39.6 -40.7 -41.2 -41.5 ...
#>   ..$ lat: num [1:82] -15.2 -15.4 -20 -20.3 -20.4 ...
#>  $ Ouratea_semiserrata :'data.frame':    90 obs. of  3 variables:
#>   ..$ sp : chr [1:90] "Ouratea_semiserrata" "Ouratea_semiserrata" "Ouratea_semiserrata" "Ouratea_semiserrata" ...
#>   ..$ lon: num [1:90] -40 -42.5 -42.4 -42.9 -42.6 ...
#>   ..$ lat: num [1:90] -16.4 -20.7 -19.5 -19.6 -19.7 ...
species <- names(example_occs)
species
#> [1] "Abarema_langsdorffii" "Eugenia_florida"      "Leandra_carassana"   
#> [4] "Ouratea_semiserrata"
library(sp)
par(mfrow = c(2, 2), mar = c(2, 2, 3, 1))
for (i in 1:length(example_occs)) {
  plot(!is.na(example_vars[[1]]),
       legend = FALSE,
       main = species[i],
       col = c("white", "#00A08A"))
  points(lat ~ lon, data = example_occs[[i]], pch = 19)
}
par(mfrow = c(1, 1))
Figure 1. The example dataset: predictor variables and occurrence for four species.

We will filter the example_occs file to select only the data for the first species:

occs <- example_occs[[1]]

Cleaning and setting up the data: setup_sdmdata()

The first step of the workflow is to setup the data, that is, to partition it according to each project needs, to sample background pseudoabsences and to apply some data cleaning procedures, as well as some filters. This is done by function setup_sdmdata()

setup_sdmdata() has a large number of parameters:

args(setup_sdmdata)
#> function (species_name, occurrences, predictors, lon = "lon", 
#>     lat = "lat", models_dir = "./models", real_absences = NULL, 
#>     buffer_type = NULL, dist_buf = NULL, env_filter = FALSE, 
#>     env_distance = "centroid", buffer_shape = NULL, min_env_dist = NULL, 
#>     min_geog_dist = NULL, write_buffer = FALSE, seed = NULL, 
#>     clean_dupl = FALSE, clean_nas = FALSE, clean_uni = FALSE, 
#>     geo_filt = FALSE, geo_filt_dist = NULL, select_variables = FALSE, 
#>     cutoff = 0.8, sample_proportion = 0.8, png_sdmdata = TRUE, 
#>     n_back = 1000, partition_type = c("bootstrap"), boot_n = 1, 
#>     boot_proportion = 0.7, cv_n = NULL, cv_partitions = NULL) 
#> NULL

There are a couple options for data cleaning:

The function also sets up different experimental designs:

Pseudoabsence sampling is performed by function has also some options:

Pseudoabsence points will be sampled (using dismo::randomPoints()) within the buffer and outside the environmental filter, in order to control for the area accessible to the species (M in the BAM diagram).

test_folder <- "~/modleR_test"
sdmdata_1sp <- setup_sdmdata(species_name = species[1],
                             occurrences = occs,
                             predictors = example_vars,
                             models_dir = test_folder,
                             partition_type = "crossvalidation",
                             cv_partitions = 5,
                             cv_n = 1,
                             seed = 512,
                             buffer_type = "mean",
                             png_sdmdata = TRUE,
                             n_back = 500,
                             clean_dupl = TRUE,
                             clean_uni = TRUE,
                             clean_nas = TRUE,
                             geo_filt = FALSE,
                             geo_filt_dist = 10,
                             select_variables = TRUE,
                             sample_proportion = 0.5,
                             cutoff = 0.7)
#> metadata file found, checking metadata
#> running data setup
#> cleaning data
#> cleaning duplicates
#> cleaning occurrences with no environmental data
#> cleaning occurrences within the same pixel
#> 5 points removed
#> 99 clean points
#> creating buffer
#> Applying buffer
#> Warning in RGEOSDistanceFunc(spgeom1, spgeom2, byid, "rgeos_distance"): Spatial
#> object 1 is not projected; GEOS expects planar coordinates
#> Warning: GEOS support is provided by the sf and terra packages among others
#> Warning in rgeos::gBuffer(spgeom = occurrences, byid = FALSE, width =
#> dist.buf): Spatial object is not projected; GEOS expects planar coordinates
#> sampling pseudoabsence points with mean buffer
#> selecting variables...
#> No variables were excluded with cutoff = 0.7
#> saving metadata
#> extracting environmental data
#> extracting background data
#> performing data partition
#> saving sdmdata
#> Plotting the dataset...
#> DONE!

Fitting a model per partition: do_any() and do_many()

Functions do_any() and do_many() create a model per partition, per algorithm. The difference between these functions that do_any() performs modeling for one individual algorithm at a time, that can be chosen by using parameter algorithm, while do_many() can select multiple algorithms, with TRUE or FALSE statements (just as BIOMOD2 functions do).

The available algorithms are:

Details for the implementation of each model can be accessed in the documentation of the function.

Here you can see the differences between the parameters of both functions. do_many() calls several instances of do_any() Sometimes you may only want to call do_many() but for better control and parallelization by algorithm it may be better to call do_any() individually.

args(do_any)
#> function (species_name, predictors, models_dir = "./models", 
#>     algorithm = c("bioclim"), project_model = FALSE, proj_data_folder = "./data/proj", 
#>     mask = NULL, write_rda = FALSE, png_partitions = FALSE, write_bin_cut = FALSE, 
#>     dismo_threshold = "spec_sens", equalize = TRUE, sensitivity = 0.9, 
#>     proc_threshold = 0.5, ...) 
#> NULL
args(do_many)
#> function (species_name, bioclim = FALSE, domain = FALSE, glm = FALSE, 
#>     mahal = FALSE, maxent = FALSE, maxnet = FALSE, rf = FALSE, 
#>     svmk = FALSE, svme = FALSE, brt = FALSE, ...) 
#> NULL

Calling do_many() and setting bioclim = TRUE is therefore equivalent to call do_any() and set algorithm = "bioclim".

sp_maxnet <- do_any(species_name = species[1],
                    algorithm = "maxnet",
                    predictors = example_vars,
                    models_dir = test_folder,
                    png_partitions = TRUE,
                    write_bin_cut = FALSE,
                    equalize = TRUE,
                    write_rda = TRUE)

The resulting object is a table with the performance metrics, but the actual output is written on disk

sp_maxnet
#>                kappa spec_sens no_omission prevalence equal_sens_spec
#> thresholds 0.5466117 0.4121507   0.2633284  0.1707096       0.3985237
#>            sensitivity         species_name algorithm run partition presencenb
#> thresholds   0.3257664 Abarema_langsdorffii    maxnet   1         1         20
#>            absencenb correlation    pvaluecor   AUC AUC_pval AUCratio     pROC
#> thresholds       100    0.747932 9.702981e-23 0.971       NA    1.942 1.882305
#>            pROC_pval TSSmax  KAPPAmax dismo_threshold prevalence.value
#> thresholds         0   0.82 0.8043478       spec_sens        0.1666667
#>                  PPP       NPP TPR  TNR  FPR FNR       CCR     Kappa   F_score
#> thresholds 0.6923077 0.9787234 0.9 0.92 0.08 0.1 0.9166667 0.7321429 0.7826087
#>              Jaccard
#> thresholds 0.6428571

The following lines call for bioclim, GLM, random forests, BRT, svme (from package e1071), and smvk (from package kernlab)

many <- do_many(species_name = species[1],
                predictors = example_vars,
                models_dir = test_folder,
                png_partitions = TRUE,
                write_bin_cut = FALSE,
                write_rda = TRUE,
                bioclim = TRUE,
                domain = FALSE,
                glm = TRUE,
                svmk = TRUE,
                svme = TRUE,
                maxent = FALSE,
                maxnet = TRUE,
                rf = TRUE,
                mahal = FALSE,
                brt = TRUE,
                equalize = TRUE)

In addition:

At the end of a modeling round, the partition folder containts:

Joining partitions: final_model()

There are many ways to create a final model per algorithm per species. final_model() follows the following logic:

args(final_model)
#> function (species_name, algorithms = NULL, scale_models = TRUE, 
#>     consensus_level = 0.5, models_dir = "./models", final_dir = "final_models", 
#>     proj_dir = "present", which_models = c("raw_mean"), mean_th_par = c("spec_sens"), 
#>     uncertainty = FALSE, png_final = TRUE, sensitivity = 0.9, 
#>     ...) 
#> NULL
final_model(species_name = species[1],
            algorithms = NULL, #if null it will take all the algorithms in disk
            models_dir = test_folder,
            which_models = c("raw_mean",
                             "bin_mean",
                             "bin_consensus"),
            consensus_level = 0.5,
            uncertainty = TRUE,
            overwrite = TRUE)

final_model() creates a .tif file for each final.model (one per algorithm) under the specified folder (default: final_models)

The raw_mean final models for each algorithm are these:

Algorithmic consensus with ensemble_model()

The fourth step of the workflow is joining the models for each algorithm into a final ensemble model. ensemble_model() calculates the mean, standard deviation, minimum and maximum values of the final models and saves them under the folder specified by ensemble_dir. It can also create these models by a consensus rule (what proportion of final models predict a presence in each pixel, 0.5 is a majority rule, 0.3 would be 30% of the models).

ensemble_model() uses a which_final parameter -analog to which_model in final_model() to specify which final model(s) (Figure 2) should be assembled together (the default is a mean of the raw continuous models: which_final = c("raw_mean")).

args(ensemble_model)
#> function (species_name, occurrences, lon = "lon", lat = "lat", 
#>     models_dir = "./models", final_dir = "final_models", ensemble_dir = "ensemble", 
#>     proj_dir = "present", algorithms = NULL, which_ensemble = c("average"), 
#>     which_final = c("raw_mean"), performance_metric = "TSSmax", 
#>     dismo_threshold = "spec_sens", consensus_level = 0.5, png_ensemble = TRUE, 
#>     write_occs = FALSE, write_map = FALSE, scale_models = TRUE, 
#>     uncertainty = TRUE, ...) 
#> NULL
ens <- ensemble_model(species_name = species[1],
                      occurrences = occs,
                      performance_metric = "pROC",
                      which_ensemble = c("average",
                                         "best",
                                         "frequency",
                                         "weighted_average",
                                         "median",
                                         "pca",
                                         "consensus"),
                      consensus_level = 0.5,
                      which_final = "raw_mean",
                      models_dir = test_folder,
                      overwrite = TRUE) #argument from writeRaster
#> [1] "Thu Aug  3 11:36:24 2023"
#> [1] "DONE!"
#> [1] "Thu Aug  3 11:36:36 2023"
plot(ens)

Workflows with multiple species

Our example_occs dataset has data for four species. An option to do the several models is to use a for loop

args(do_many)
args(setup_sdmdata)

for (i in 1:length(example_occs)) {
  sp <- species[i]
  occs <- example_occs[[i]]
  setup_sdmdata(species_name = sp,
                models_dir = "~/modleR_test/forlooptest",
                occurrences = occs,
                predictors = example_vars,
                buffer_type = "distance",
                dist_buf = 4,
                write_buffer = TRUE,
                clean_dupl = TRUE,
                clean_nas = TRUE,
                clean_uni = TRUE,
                png_sdmdata = TRUE,
                n_back = 1000,
                partition_type = "bootstrap",
                boot_n = 5,
                boot_proportion = 0.7
  )
}

for (i in 1:length(example_occs)) {
  sp <- species[i]
  do_many(species_name = sp,
          predictors = example_vars,
          models_dir = "~/modleR_test/forlooptest",
          png_partitions = TRUE,
          bioclim = TRUE,
          maxnet = FALSE,
          rf = TRUE,
          svmk = TRUE,
          svme = TRUE,
          brt = TRUE,
          glm = TRUE,
          domain = FALSE,
          mahal = FALSE,
          equalize = TRUE,
          write_bin_cut = TRUE)
}

for (i in 1:length(example_occs)) {
  sp <- species[i]
  final_model(species_name = sp,
              consensus_level = 0.5,
              models_dir = "~/modleR_test/forlooptest",
              which_models = c("raw_mean",
                               "bin_mean",
                               "bin_consensus"),
              uncertainty = TRUE,
              overwrite = TRUE)
}

for (i in 1:length(example_occs)) {
  sp <- species[i]
  occs <- example_occs[[i]]
  ensemble_model(species_name = sp,
                 occurrences = occs,
                 which_final = "bin_consensus",
                 png_ensemble = TRUE,
                 models_dir = "~/modleR_test/forlooptest")
}

Another option is to use the purrr package (Henry and Wickham 2017).

library(purrr)

example_occs %>% purrr::map2(.x = .,
                             .y = as.list(names(.)),
                             ~ setup_sdmdata(species_name = .y,
                                             occurrences = .x,
                                             partition_type = "bootstrap",
                                             boot_n = 5,
                                             boot_proportion = 0.7,
                                             clean_nas = TRUE,
                                             clean_dupl = TRUE,
                                             clean_uni = TRUE,
                                             buffer_type = "distance",
                                             dist_buf = 4,
                                             predictors = example_vars,
                                             models_dir = "~/modleR_test/temp_purrr",
                                             n_back = 1000))

species %>%
  as.list(.) %>%
  purrr::map(~ do_many(species_name = .,
                       predictors = example_vars,
                       models_dir = "~/modleR_test/temp_purrr",
                       bioclim = TRUE,
                       maxnet = FALSE,
                       rf = TRUE,
                       svme = TRUE,
                       svmk = TRUE,
                       domain = FALSE,
                       glm = TRUE,
                       mahal = FALSE,
                       brt = TRUE,
                       equalize = TRUE))
species %>%
  as.list(.) %>%
  purrr::map(~ final_model(species_name = .,
                           consensus_level = 0.5,
                           models_dir =  "~/modleR_test/temp_purrr",
                           which_models = c("raw_mean",
                                            "bin_mean",
                                            "bin_consensus"),
                           overwrite = TRUE))
example_occs %>% purrr::map2(.x = .,
                             .y = as.list(names(.)),
                             ~ ensemble_model(species_name = .y,
                                              occurrences = .x,
                                              which_final = "raw_mean",
                                              png_ensemble = TRUE,
                                              models_dir = "~/modleR_test/temp_purrr",
                                              overwrite = TRUE))

These workflows can also be paralellized by species or species algorithms

References

Araújo, M, and M New. 2007. “Ensemble Forecasting of Species Distributions.” *Trends in Ecology & Evolution* 22 (1): 42–47. .
Elith, J., J. R. Leathwick, and T. Hastie. 2009. “A Working Guide to Boosted Regression Trees.” *Journal of Animal Ecology* 77 (4): 802–13. .
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. Springer Heidelberg.
Henry, Lionel, and Hadley Wickham. 2017. “Purrr: Functional Programming Tools. R Package Version 0.2.4.”
Hijmans, Robert J., Steven Phillips, John Leathwick, and Jane Elith. 2017. “Dismo: Species Distribution Modeling. R Package Version 1.1-4.”
Karatzoglou, Alexandros, Alex Smola, Kurt Hornik, and Achim Zeileis. 2004. “Kernlab - An S4 Package for Kernel Methods in R.” *Journal of Statistical Software* 11 (9): 1–20.
Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” *R News* 2 (3): 18–22.
Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2017. “E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien.”