biomodhub / biomod2

BIOMOD is a computer platform for ensemble forecasting of species distributions, enabling the treatment of a range of methodological uncertainties in models and the examination of species-environment relationships.

Help with BIOMOD_ModelingOptions - sampsize argument in RF #398

Closed kkougiou closed 5 months ago

kkougiou commented 9 months ago

Context and question

I am using biomod2 version 4.2.4.

When I set the 'sampsize' argument for RF in the BIOMOD_ModelingOptions function to a named numeric vector with one entry per class (0 and 1), both set to the same value (e.g., 60), and then run the BIOMOD_Modeling function, all my models fail.

The error I get is the following: 'Error in randomForest.default(m, y, ...) : sampsize can not be larger than class frequency'

That was not an issue in previous biomod2 versions (e.g., in version 4.1.2). Any help or suggestion will be greatly appreciated.

Thank you in advance for your time and effort in solving this issue, and for continuously updating and maintaining such a great and really useful R package.

Code used

I used the vignette's data and only changed the sampsize argument:

myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
                                     expl.var = myExpl,
                                     resp.xy = myRespXY,
                                     resp.name = myRespName)

# Set the spsize argument
spsize <- c("0" = sum(myResp == 1), "1" = sum(myResp == 1))

# Create default modeling options
myBiomodOptions <- BIOMOD_ModelingOptions(RF = list(ntree = 1000, sampsize = spsize))

# Model single models
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                    modeling.id = 'AllModels',
                                    models = c('RF'),
                                    bm.options = myBiomodOptions,
                                    CV.strategy = 'random',
                                    CV.nb.rep = 2,
                                    CV.perc = 0.8,
                                    metric.eval = c('TSS', 'ROC'),
                                    var.import = 1,
                                    seed.val = 42)

show(myBiomodModelOut)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.models.out -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Modeling folder : .

Species modeled : GuloGulo

Modeling id : AllModels

Considered variables : bio3 bio4 bio7 bio11 bio12

Computed Models : none

Failed Models : GuloGulo_allData_RUN1_RF GuloGulo_allData_RUN2_RF

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Environment Information

sessionInfo()

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8
[2] LC_CTYPE=English_United Kingdom.utf8
[3] LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.utf8

time zone: Europe/Athens
tzcode source: internal

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] blockCV_3.1-3 enmSdmX_1.1.2 flexsdm_1.3.3
[4] knitr_1.45 XML_3.99-0.14 rJava_1.0-6
[7] rgdal_1.6-7 vegan_2.6-4 permute_0.9-7
[10] classInt_0.4-10 doParallel_1.0.17 iterators_1.0.14
[13] foreach_1.5.2 PresenceAbsence_1.1.11 gtools_3.9.4
[16] rms_6.7-0 Hmisc_5.1-1 spatstat_3.0-6
[19] spatstat.linnet_3.1-1 spatstat.model_3.2-6 rpart_4.1.19
[22] spatstat.explore_3.2-3 nlme_3.1-162 spatstat.random_3.1-6
[25] spatstat.geom_3.2-5 spatstat.data_3.0-1 randomForest_4.7-1.1
[28] maptools_1.1-8 ecodist_2.1.3 adehabitatHR_0.4.21
[31] adehabitatLT_0.3.27 CircStats_0.2-6 boot_1.3-28.1
[34] MASS_7.3-60 adehabitatMA_0.3.16 ade4_1.7-22
[37] omnibus_1.2.7 MLmetrics_1.1.1 DescTools_0.99.50
[40] ecospat_4.0.0 modEvA_3.9.3 CalibratR_0.1.2
[43] Metrics_0.1.4 dismo_1.3-14 rasterVis_0.51.5
[46] lattice_0.21-8 biomod2_4.2-4 magrittr_2.0.3
[49] sfdep_0.2.3 lubridate_1.9.2 forcats_1.0.0
[52] stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2
[55] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
[58] ggplot2_3.4.4 tidyverse_2.0.0 sf_1.0-14
[61] terra_1.7-46 raster_3.6-23 sp_2.0-0

loaded via a namespace (and not attached):
[1] maptpx_1.9-7 spatstat.sparse_3.0-2 httr_1.4.7
[4] RColorBrewer_1.1-3 repr_1.1.6 tools_4.3.1
[7] backports_1.4.1 utf8_1.2.4 R6_2.5.1
[10] clustMixType_0.3-9 mgcv_1.8-42 maxnet_0.1.4
[13] withr_2.5.2 gridExtra_2.3 tictoc_1.2
[16] progressr_0.14.0 quantreg_5.97 cli_3.6.1
[19] pacman_0.5.1 sandwich_3.0-2 labeling_0.4.3
[22] slam_0.1-50 mvtnorm_1.2-3 polspline_1.1.23
[25] proxy_0.4-27 ggstream_0.1.0 foreign_0.8-84
[28] dichromat_2.0-0.1 plotrix_3.8-4 maps_3.4.1
[31] readxl_1.4.3 rstudioapi_0.15.0 pals_1.8
[34] generics_0.1.3 xlsx_0.6.5 spdep_1.2-8
[37] Matrix_1.6-1 interp_1.1-4 fansi_1.0.5
[40] MetBrewer_0.2.0 abind_1.4-5 lifecycle_1.0.4
[43] multcomp_1.4-25 snakecase_0.11.1 grid_4.3.1
[46] predicts_0.1-8 cowplot_1.1.2 xlsxjars_0.6.1
[49] mapproj_1.2.11 pillar_1.9.0 gld_2.6.6
[52] xgboost_1.7.5.1 codetools_0.2-19 fastmatch_1.1-4
[55] phyloregion_1.0.8 wk_0.9.0 glue_1.6.2
[58] rcartocolor_2.1.1 data.table_1.14.8 spam_2.10-0
[61] vctrs_0.6.4 png_0.1-8 cellranger_1.1.0
[64] gtable_0.3.4 kernlab_0.9-32 TeachingDemos_2.12
[67] xfun_0.41 survival_3.5-5 Kendall_2.2.1
[70] fields_15.2 units_0.8-4 fitdistrplus_1.1-11
[73] TH.data_1.1-2 Rlof_1.1.3 smoothr_1.0.1
[76] KernSmooth_2.23-21 colorspace_2.1-0 spData_2.3.0
[79] DBI_1.1.3 nnet_7.3-19 gbm_2.1.8.1
[82] phangorn_2.11.1 Exact_3.2 tidyselect_1.2.0
[85] compiler_4.3.1 mda_0.5-4 htmlTable_2.4.2
[88] SparseM_1.81 expm_0.999-7 checkmate_2.3.0
[91] scales_1.2.1 hexbin_1.28.3 quadprog_1.5-8
[94] plotmo_3.6.2 digest_0.6.33 goftest_1.2-3
[97] spatstat.utils_3.0-3 rmarkdown_2.25 spatialEco_2.0-1
[100] htmltools_0.5.7 pkgconfig_2.0.3 jpeg_0.1-10
[103] base64enc_0.1-3 fastmap_1.1.1 rlang_1.1.1
[106] htmlwidgets_1.6.2 MoMAColors_0.0.0.9000 farver_2.1.1
[109] zoo_1.8-12 jsonlite_1.8.7 Formula_1.2-5
[112] dotCall64_1.1-0 s2_1.1.4 patchwork_1.1.3
[115] munsell_0.5.0 Rcpp_1.0.11 ape_5.7-1
[118] viridis_0.6.4 spThin_0.2.0 stringi_1.7.12
[121] pROC_1.18.4 elevatr_0.99.0 rootSolve_1.8.2.3
[124] plyr_1.8.8 earth_5.3.2 lmom_3.0
[127] deldir_1.0-9 splines_4.3.1 tensor_1.5
[130] hms_1.1.3 igraph_1.5.1 reshape2_1.4.4
[133] tidyterra_0.4.0 evaluate_0.23 latticeExtra_0.6-30
[136] tzdb_0.4.0 networkD3_0.4 MatrixModels_0.5-2
[139] polyclip_1.10-4 reshape_0.8.9 skimr_2.1.5
[142] janitor_2.2.0 e1071_1.7-13 viridisLite_0.4.2
[145] class_7.3-22 cluster_2.1.4 timechange_0.2.0
[148] ENMeval_2.0.4

HeleneBlt commented 9 months ago

Hello Kostas,

Thanks for reporting and for this reproducible issue 🙏. The problem here comes from the cross-validation: with CV.perc = 0.8, each run is calibrated on only 80% of the data, so fewer presences are available than sum(myResp==1) and randomForest rejects the requested sampsize. Your sampsize should take this into account. For example:

spsize <- c("0" = round(0.8*sum(myResp==1)), "1" = round(0.8*sum(myResp==1)))

I hope it helps. Don't hesitate if you still face the error.

Hélène
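The arithmetic behind this suggestion can be checked without biomod2 or randomForest. The sketch below is only an illustration in base R: the response vector is made-up toy data (not the vignette data), and it assumes the calibration split keeps 80% of each class, which may not match exactly how biomod2 draws its random runs.

```r
# Toy response (hypothetical): 60 presences (1) and 240 absences (0)
set.seed(42)
resp <- c(rep(1, 60), rep(0, 240))
perc <- 0.8  # plays the role of CV.perc

# sampsize as in the original post: full number of presences for both classes
spsize_full <- c("0" = sum(resp == 1), "1" = sum(resp == 1))

# Assumption: the calibration split keeps `perc` of each class
keep_pres <- sample(which(resp == 1), round(perc * sum(resp == 1)))
keep_abs  <- sample(which(resp == 0), round(perc * sum(resp == 0)))
calib     <- c(keep_pres, keep_abs)

# Class frequencies actually available for calibration
class_freq <- table(factor(resp[calib], levels = c("0", "1")))

# A stratum size larger than its class frequency is what triggers the error
too_large_full <- spsize_full > as.vector(class_freq[names(spsize_full)])

# Scaling sampsize by the calibration fraction removes the conflict
spsize_cv    <- round(perc * spsize_full)
too_large_cv <- spsize_cv > as.vector(class_freq[names(spsize_cv)])

print(class_freq)
print(too_large_full)
print(too_large_cv)
```

With these toy numbers the unscaled request for class "1" exceeds the presences left in the calibration set, while the scaled one does not.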

kkougiou commented 9 months ago

Hello Hélène,

Thank you for your prompt response! 😸

How would I need to change the 'spsize' object if I used the bm_CrossValidation function with the block argument (or any other argument) instead? Would it be too much to ask for the relevant help pages in the package to be updated?

Even if you don't have the time to answer, thank you once again for your time Hélène.

HeleneBlt commented 9 months ago

Hello,

BIOMOD_Modeling does not (yet?) allow internal down-sampling. But I would suggest that you have a look at Farwe's script in issue #393 👀 There, the calibration lines are created first with the bm_CrossValidation function, so that a list of parameters (one entry per cross-validation run) can be built and given to BIOMOD_Modeling.

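# calib_summary is built in Farwe's script in issue #393 (not shown here)
RF_param_list <- list()

for (cvrun in seq_len(nrow(calib_summary))) {

  prNum <- calib_summary$Presences[cvrun]
  bgNum <- calib_summary$True_Absences[cvrun]  # not used below

  # One entry per run, named "_<PA>_<run>"
  RF_param_list[[paste0("_",
                        calib_summary$PA[[cvrun]],
                        "_",
                        calib_summary$run[[cvrun]])]] <-
    list(ntree = 1000,
         mtry = NULL,
         # same size for both classes: the number of presences in this run
         sampsize = c("0" = prNum,
                      "1" = prNum),
         replace = TRUE)
}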

But I'm afraid it will be necessary to switch to version 4.2-5 from BiomodHub. You can install it with these commands:

library(devtools)
devtools::install_github("biomodhub/biomod2", dependencies = TRUE)

We know that down-sampled Random Forests interest a lot of people, so we will try to update the help pages as soon as possible. In the meantime, I think Farwe did a great job explaining it 🌟

Don't hesitate if you need more information.

Hélène
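For block or any other cross-validation strategy, the same idea can be sketched in base R by counting the classes in each calibration run. This is only an illustration: calib_lines below is a hand-made logical matrix standing in for the calibration lines, the column names are assumed to follow a "_allData_RUNx" pattern, and the smaller class count is used for both strata (a deliberate simplification, not necessarily what Farwe's script does). How such a list is handed to BIOMOD_Modeling depends on the biomod2 version; see issue #393.

```r
# Toy response (hypothetical): 10 presences (1) followed by 30 absences (0)
resp <- c(rep(1, 10), rep(0, 30))

# Hand-made stand-in for the calibration lines: one logical column per run,
# TRUE where the row is used for calibration (column names are an assumption)
calib_lines <- cbind(
  "_allData_RUN1" = rep(c(TRUE, TRUE, TRUE, TRUE, FALSE), times = 8),
  "_allData_RUN2" = rep(c(TRUE, FALSE, rep(TRUE, 8)), times = 4)
)

RF_param_list <- list()
for (run in colnames(calib_lines)) {
  in_calib <- calib_lines[, run]

  # Class counts available in this calibration run
  n_pres <- sum(resp[in_calib] == 1)
  n_abs  <- sum(resp[in_calib] == 0)

  # Use the smaller class count so neither stratum exceeds its class frequency
  n_each <- min(n_pres, n_abs)

  RF_param_list[[run]] <- list(ntree = 1000,
                               sampsize = c("0" = n_each, "1" = n_each),
                               replace = TRUE)
}

str(RF_param_list)
```

Because the counts are taken from each run's own calibration rows, the resulting sampsize stays within the class frequencies regardless of how the runs were drawn.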