Help with BIOMOD_FormatingData - setting different number of pseudo absences for different algorithms

Context and question Hello biomod2 team. I am setting the biomod data to use for modeling a species distribution, using the following algorithms: RF, GLM, GAM, GBM, MAXENT, MAXNET. I have set some model parameters using the lists, and I want also to tune some of them with the tuning function.

I would like to ask you a couple of advices:

How can I both change some model parameters using the list as in BigBoss function and the tuning function?
How can I create different set with different number of pseudoabsences, and to give a specific set to a specific algorithm? Thank you so much for now!

Code used

myBiomodData <- BIOMOD_FormatingData(resp.name = "Capriolo",
  resp.var = Capriolo_points,
  expl.var = Variables,
  PA.nb.rep = 3,
  PA.nb.absences = c(terra::nrow(Capriolo_points), 3*terra::nrow(Capriolo_points), 10000),
  PA.strategy = "random",
  filter.raster = T)
show(myBiomodData)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.formated.data -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

dir.name =  .

sp.name =  Capriolo.COMBO

     317 presences,  0 true absences and  11037 undefined points in dataset

     10 explanatory variables

 Extensive_agriFields    Forests       Intensive_agriFields Natural_grasslands
 Min.   :  0.00       Min.   :  0.00   Min.   :  0.000      Min.   :  0.000   
 1st Qu.:  0.00       1st Qu.:  0.00   1st Qu.:  0.000      1st Qu.:  0.000   
 Median :  0.00       Median : 29.10   Median :  0.000      Median :  0.000   
 Mean   : 15.24       Mean   : 42.58   Mean   : 18.454      Mean   :  4.765   
 3rd Qu.: 14.58       3rd Qu.: 96.78   3rd Qu.:  7.709      3rd Qu.:  0.000   
 Max.   :100.00       Max.   :100.00   Max.   :100.000      Max.   :100.000   
 Non_vegetated_areas   Shrublands       Urban_areas          slope        
 Min.   :  0.000     Min.   :  0.000   Min.   :  0.000   Min.   : 0.0000  
 1st Qu.:  0.000     1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.: 0.4152  
 Median :  0.000     Median :  0.000   Median :  0.000   Median : 6.0670  
 Mean   :  3.253     Mean   :  9.198   Mean   :  6.515   Mean   : 9.6545  
 3rd Qu.:  0.000     3rd Qu.:  0.000   3rd Qu.:  0.000   3rd Qu.:17.1258  
 Max.   :100.000     Max.   :100.000   Max.   :100.000   Max.   :46.2868  
     aspect            Altitude       
 Min.   :  0.0054   Min.   :  -6.175  
 1st Qu.:120.8982   1st Qu.:  76.421  
 Median :180.8176   Median : 464.343  
 Mean   :180.0492   Mean   : 632.902  
 3rd Qu.:241.1506   3rd Qu.:1095.927  
 Max.   :359.9886   Max.   :2647.736  

 3 Pseudo Absences dataset available ( PA1, PA2, PA3 ) with  
328 (PA1), 984 (PA2), 10000 (PA3) pseudo absences

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

show(myExpl) 
class       : SpatRaster 
dimensions  : 270, 294, 10  (nrow, ncol, nlyr)
resolution  : 500, 500  (x, y)
extent      : 4490155, 4637155, 2493842, 2628842  (xmin, xmax, ymin, ymax)
coord. ref. : ETRS89-extended / LAEA Europe (EPSG:3035) 
sources     : Extensive_agriFields.tif  
              Forests.tif  
              Intensive_agriFields.tif  
              ... and 6 more source(s)
names       : Exten~ields, Forests, Inten~ields, Natur~lands, Non_v~areas, Shrublands, ... 
min values  :           0,       0,           0,           0,           0,          0, ... 
max values  :         100,     100,         100,         100,         100,        100, ...

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

user.MAXENT <- list('_allData_allRun' = list(
  path_to_maxent.jar = getwd(),
  memory_allocated = NULL
))

OptionsBigboss
user.val <- list(MAXENT.binary.MAXENT.MAXENT = user.MAXENT)

myOptions <- bm_ModelingOptions(data.type = 'binary',
                                models = Selected_algos,
                                strategy = 'user.defined',
                                user.val = user.val,
                                user.base = 'bigboss')
myOptions

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.models.options -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

>  CTA options (datatype: binary , package: rpart , function: rpart ) :
   ( dataset _allData_allRun )
    -  formula = 
    -  data = 
    -  weights = 
    -  subset = 
    -  na.action = na.rpart
    -  method = "class"
    -  model = FALSE
    -  x = FALSE
    -  y = TRUE
    -  parms = 
    -  control = $xval 5  $minbucket 5  $minsplit 5  $cp 0.001  $maxdepth 25    (default:  )
    -  cost = 

>  FDA options (datatype: binary , package: mda , function: fda ) :
   ( dataset _allData_allRun )
    -  formula = formula(data)
    -  data = sys.frame(sys.parent())
    -  weights = 
    -  theta = 
    -  eps = .Machine$double.eps
    -  method = "mars"   (default: polyreg )

>  GAM options (datatype: binary , package: mgcv , function: gam ) :
   ( dataset _allData_allRun )
    -  formula = 
    -  family =  Family: binomial  Link function: logit  
    -  data = list()
    -  na.action = 
    -  method = "GCV.Cp"
    -  optimizer = c("outer", "newton")
    -  control = $epsilon 1e-06  $trace FALSE  $maxit 100    (default: $nthreads 1  $ncv.threads 1  $irls.reg 0  $epsilon 1e-07  $maxit 200  $trace FALSE  $mgcv.tol 1e-07  $mgcv.half 15  $rank.tol 1.490116e-08  $nlm $nlm$ndigit 7  $nlm$gradtol 1e-06  $nlm$stepmax 2  $nlm$steptol 1e-04  $nlm$iterlim 200  $nlm$check.analyticals FALSE   $optim $optim$factr 1e+07   $newton $newton$conv.tol 1e-06  $newton$maxNstep 5  $newton$maxSstep 2  $newton$maxHalf 30  $newton$use.svd FALSE   $idLinksBases TRUE  $scalePenalty TRUE  $efs.lspmax 15  $efs.tol 0.1  $keepData FALSE  $scale.est "fletcher"  $edge.correct FALSE  )
    -  scale = 0
    -  select = FALSE
    -  gamma = 1
    -  fit = TRUE
    -  drop.unused.levels = TRUE
    -  discrete = FALSE

>  GBM options (datatype: binary , package: gbm , function: gbm ) :
   ( dataset _allData_allRun )
    -  formula = formula(data)
    -  distribution = "bernoulli"
    -  data = list()
    -  weights = 
    -  n.trees = 2500   (default: 100 )
    -  interaction.depth = 7   (default: 1 )
    -  n.minobsinnode = 5   (default: 10 )
    -  shrinkage = 0.001   (default: 0.1 )
    -  bag.fraction = 0.5
    -  train.fraction = 1
    -  cv.folds = 3   (default: 0 )
    -  keep.data = FALSE   (default: TRUE )
    -  verbose = FALSE
    -  n.cores = 1   (default: NULL )

>  GLM options (datatype: binary , package: stats , function: glm ) :
   ( dataset _allData_allRun )
    -  formula = 
    -  family =  Family: binomial  Link function: logit  
    -  data = 
    -  weights = 
    -  subset = 
    -  na.action = 
    -  etastart = 
    -  mustart = 0.5   (default:  )
    -  offset = 
    -  control = $epsilon 1e-08  $maxit 50  $trace FALSE    (default: list() )
    -  model = TRUE
    -  method = "glm.fit"
    -  x = FALSE
    -  y = TRUE
    -  singular.ok = TRUE

>  MAXENT options (datatype: binary , package: MAXENT , function: MAXENT ) :
   ( dataset _allData_allRun )
    -  path_to_maxent.jar = "C:/Users/Loren/Desktop/SDM_Variables_prey/Capriolo/5.NOSTRI_GBIF_INFO_tot"
    -  background_data_dir = "default"
    -  visible = FALSE
    -  linear = TRUE
    -  quadratic = TRUE
    -  product = TRUE
    -  threshold = TRUE
    -  hinge = TRUE
    -  lq2lqptthreshold = 80
    -  l2lqthreshold = 10
    -  hingethreshold = 15
    -  beta_threshold = -1
    -  beta_categorical = -1
    -  beta_lqp = -1
    -  beta_hinge = -1
    -  betamultiplier = 1
    -  defaultprevalence = 0.5

>  MAXNET options (datatype: binary , package: maxnet , function: maxnet ) :
   ( dataset _allData_allRun )
    -  p = 
    -  data = 
    -  regmult = 1
    -  regfun = maxnet.default.regularization
    -  addsamplestobackground = T

>  RF options (datatype: binary , package: randomForest , function: randomForest ) :
   ( dataset _allData_allRun )
    -  mtry = 1
    -  type = "classification"
    -  ntree = 500   (default: NULL )
    -  strata = 0 1 Levels: 0 1   (default: NULL )
    -  nodesize = 5   (default: NULL )

>  XGBOOST options (datatype: binary , package: xgboost , function: xgboost ) :
   ( dataset _allData_allRun )
    -  missing = NA
    -  params = $max_depth 2  $eta 1    (default: list() )
    -  nrounds = 4
    -  verbose = 1
    -  print_every_n = 1
    -  save_name = "xgboost.model"
    -  callbacks = list()
    -  nthread = 2   (default: NULL )
    -  objective = "binary:logistic"   (default: NULL )

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

sessionInfo()

R version 4.4.0 (2024-04-24 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=Italian_Italy.utf8  LC_CTYPE=Italian_Italy.utf8   
[3] LC_MONETARY=Italian_Italy.utf8 LC_NUMERIC=C                  
[5] LC_TIME=Italian_Italy.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] CoordinateCleaner_3.0.1 biosurvey_0.1.2         xgboost_1.7.7.1        
 [4] randomForest_4.7-1.1    maxnet_0.1.4            earth_5.3.3            
 [7] plotmo_3.6.3            plotrix_3.8-4           Formula_1.2-5          
[10] gbm_2.1.9               mgcv_1.9-1              nlme_3.1-164           
[13] gam_1.22-3              foreach_1.5.2           mda_0.5-4              
[16] class_7.3-22            rpart_4.1.23            nnet_7.3-19            
[19] biomod2_4.2-5           usdm_2.1-7              data.table_1.15.4      
[22] terra_1.7-78            rgbif_3.8.0            

loaded via a namespace (and not attached):
 [1] DBI_1.2.3              deldir_2.0-4           pROC_1.18.5           
 [4] permute_0.9-7          rlang_1.1.4            magrittr_2.0.3        
 [7] ENMeval_2.0.4          e1071_1.7-14           compiler_4.4.0        
[10] spatstat.geom_3.2-9    vctrs_0.6.5            maps_3.4.2            
[13] reshape2_1.4.4         stringr_1.5.1          pkgconfig_2.0.3       
[16] shape_1.4.6.1          utf8_1.2.4             pracma_2.4.4          
[19] glmnet_4.1-8           jsonlite_1.8.8         PresenceAbsence_1.1.11
[22] reshape_0.8.9          spatstat.utils_3.0-4   parallel_4.4.0        
[25] cluster_2.1.6          R6_2.5.1               stringi_1.8.4         
[28] spatstat.data_3.0-4    diptest_0.77-1         Rcpp_1.0.12           
[31] iterators_1.0.14       snow_0.4-4             picante_1.8.2         
[34] Matrix_1.7-0           tidyselect_1.2.1       rnaturalearth_1.0.1   
[37] rstudioapi_0.16.0      abind_1.4-5            vegan_2.6-6.1         
[40] codetools_0.2-20       lattice_0.22-6         tibble_3.2.1          
[43] plyr_1.8.9             withr_3.0.0            ks_1.14.2             
[46] geosphere_1.5-18       survival_3.6-4         sf_1.0-16             
[49] units_0.8-5            proxy_0.4-27           polyclip_1.10-6       
[52] xml2_1.3.6             mclust_6.1.1           pillar_1.9.0          
[55] whisker_0.4.1          KernSmooth_2.23-24     generics_0.1.3        
[58] sp_2.1-4               ggplot2_3.5.1          munsell_0.5.1         
[61] scales_1.3.0           glue_1.7.0             lazyeval_0.2.2        
[64] tools_4.4.0            mvtnorm_1.2-5          grid_4.4.0            
[67] ape_5.8                colorspace_2.1-0       raster_3.6-26         
[70] cli_3.6.2              fansi_1.0.6            dplyr_1.1.4           
[73] doSNOW_1.0.20          gtable_0.3.5           oai_0.4.0             
[76] digest_0.6.35          classInt_0.4-10        lifecycle_1.0.4       
[79] dismo_1.3-14           httr_1.4.7             MASS_7.3-60.2

Hello Lorenzo,

How can I both change some model parameters using the list as in BigBoss function and the tuning function?

You can check the Modeling options vignette made by Hélène : the User defined section presents you exactly what you want I think :slightly_smiling_face:

Tuning some parameters with bm_Tuning function
Setting modeling options with these tuned parameters and bigboss options as base, through bm_ModelingOptions function (or directly within BIOMOD_Modeling) and strategy = 'user.defined'

In your case, as you want also to modify some parameters of MAXENT, you can do that between steps 1 and 2, by adding your user.MAXENT element to the list of options to give to user.val parameter.

How can I create different set with different number of pseudoabsences, and to give a specific set to a specific algorithm?

You did a great job so far within the BIOMOD_FormatingData function ! Now, the last step is to use the models.pa parameter within the BIOMOD_Modeling function. So in your case, you created 3 sets of pseudo-absences, and you want to attribute them to each algorithm, and here is a random example :

models.pa <- list(RF = "PA1"
                  , GLM = c("PA2", "PA3")
                  , GAM= c("PA2", "PA3")
                  , GBM = "PA1"
                  , MAXENT = "PA3"
                  , MAXNET = "PA3")

Please, do not hesitate if you need more details :slightly_smiling_face:

Maya

PS : this issue post was really perfect and well organized :star_struck: :heart:

Dear @MayaGueguen , Thanks for your answer, that's really helpful!

I updated my script to change some parameter of other models and to assign different sets of PAs to different algorithms. At the end of this issue, you can find the updated version of my script. However, when I run the BIOMOD_EnsembleModeling function I get the following error:

Evaluating Model stuff...obs and fit are not the same length  => model evaluation skipped !obs and fit are not the same length  => model evaluation skipped !
Errore in { : 
  task 1 failed - "task 1 failed - "task 1 failed - "argomento di lunghezza 0"""
In aggiunta: Messaggi di avvertimento:
1: In BIOMOD_EnsembleModeling(bm.mod = Capreolus_single_models, models.chosen = "all",  :
  Parallelisation with `foreach` is not available for Windows. Sorry.
2: In cross.validation$validation <- NA :
  Trasformo il membro di sinistra in una lista

Selected_algos <- c("GAM", "GBM", "GLM",
                    "MAXENT", "MAXNET", "RF", "XGBOOST")

myBiomodData <- BIOMOD_FormatingData(
  resp.name = "Capriolo_COMBO_",
  resp.var = Capriolo_points,
  expl.var = Variables,
  PA.nb.rep = 3,
  PA.nb.absences = c(terra::nrow(Capriolo_points), 3*terra::nrow(Capriolo_points), 10000),
  PA.strategy = "random",
  filter.raster = T
)
myBiomodData
myBiomodData@PA.table
show(Variables)

# Selected_algos <- "MAXENT"
user.RF <- list('_allData_allRun' = list(
  mtry = "default",
  ntree = 1000,
  nodesize = 10,
  maxnodes = 5
))
user.MAXENT <- list('_allData_allRun' = list(
  path_to_maxent.jar = getwd()
))
user.GAM <- list('_allData_allRun' = list(
  algo = "GAM_mgcv"
))

OptionsBigboss
ModelsTable
user.val <- list(MAXENT.binary.MAXENT.MAXENT = user.MAXENT,
                 RF.binary.randomForest.randomForest = user.RF,
                 GAM.binary.mgcv.gam = user.GAM)

myOptions <- bm_ModelingOptions(data.type = 'binary',
                                models = Selected_algos,
                                strategy = 'user.defined',
                                user.val = user.val,
                                user.base = 'bigboss')
myOptions
# biomod2::plot(BIOMOD_data)

Capreolus_single_models <- BIOMOD_Modeling(
  bm.format = myBiomodData,
  modeling.id = "Single.models",
  models = Selected_algos,
  models.pa = list(
    GAM = c("PA2","PA3"),
    GLM = c("PA2","PA3"),
    GBM = c("PA1","PA2"),
    RF = c("PA1","PA2"),
    MAXENT = c("PA2","PA3"),
    MAXNET = c("PA2","PA3"),
    XGBOOST = c("PA2","PA3")),
  CV.strategy = "block",
  # CV.nb.rep = 10,
  # CV.perc = 0.65,
  CV.do.full.models = F,
  bm.options = myOptions,
  metric.eval = c("ROC", "TSS"),
  var.import = 1,
  nb.cpu = 4,
  do.progress = T
)

Capreolus_ensemble_models <- BIOMOD_EnsembleModeling(
  bm.mod = Capreolus_single_models,
  models.chosen = "all",
  em.by = "all",
  em.algo = "EMwmean",
  metric.select = c("ROC"),
  metric.select.thresh = c(0.7),
  metric.eval = c("ROC", "TSS"),
  var.import = 3,
  nb.cpu = 432,
  do.progress = T
)

Glad that you are coming to the next step !

I'm going to need a bit more insight for this one. Could you send me also all the prints you get when running the BIOMOD_EnsembleModeling function please ? (all text that is printed into your R command) :eyes:

Here you can find the prints I get when running BIOMOD_EnsembleModeling

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Build Ensemble Models -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

   ! all models available will be included in ensemble.modeling
  ! Ensemble Models will be filtered and/or weighted using validation dataset (if possible). Please use `metric.select.dataset` for alternative options.
   > Evaluation & Weighting methods summary :
      ROC over 0.7

  > mergedData_mergedRun_mergedAlgo ensemble modeling
   ! Additional projection required for ensemble models merging several pseudo-absence dataset...

          original models scores =  0.714 0.71
          final models weights =  0.501 0.499
   > Probabilities weighting mean by ROC ...
            Evaluating Model stuff...obs and fit are not the same length  => model evaluation skipped !obs and fit are not the same length  => model evaluation skipped !Errore in { : 
  task 1 failed - "task 1 failed - "task 1 failed - "argomento di lunghezza 0"""
In aggiunta: Messaggi di avvertimento:
1: In BIOMOD_EnsembleModeling(bm.mod = Capreolus_single_models, models.chosen = "all",  :
  Parallelisation with `foreach` is not available for Windows. Sorry.
2: In cross.validation$validation <- NA :
  Trasformo il membro di sinistra in una lista

Thank you for the prints !

Would you mind sending me the following objects (Capriolo_points, Variables) so I can replicate the issue ? :pray: Either here or by email to maya.gueguen [at] univ-grenoble-alpes.fr

Hi @MayaGueguen ,

I just sent you the two files you asked me. Unfortunately they are in separate e-mails (I forgot to attach a file in the first e-mail, sorry).. Thank you in advance for all your help, have a nice day!

Lorenzo Bernicchi

Thank you Lorenzo for the data ! :pray: It helped :slightly_smiling_face:

I just pushed a commit that should correct the problem. Please have a try, and do not hesitate if you encounter new issues !

Maya

Hello @MayaGueguen , thanks for correcting the problem, now it works fine!

However, I have another questions related to algorithms tuning. I want to tune the algorithms I will use, and I followed the format you suggested in the issue #415. I produce the following script. May I ask you if there are any problems or potential issues? I noticed that it takes a lot of time to run each bm_Tuning function (even hours for each algorithms, is it normal?), so I would like to know if I will face any problem before waiting for a long time.

Here the script:

Selected_algos <- c("GAM", "GBM", "GLM",
                    "MAXENT", "MAXNET", "RF", "XGBOOST")

myBiomodData <- BIOMOD_FormatingData(
  resp.name = "Capriolo",
  resp.var = Capriolo_points,
  expl.var = Variables,
  PA.nb.rep = 12,
  PA.nb.absences = c(
    rep(terra::nrow(Capriolo_points),4), 
    rep(3*terra::nrow(Capriolo_points),4), 
    rep(10000, 4)),
  PA.strategy = "random",
  filter.raster = T
)

block_CV <- bm_CrossValidation(
  bm.format = myBiomodData,
  strategy = "block")

default_options <- bm_ModelingOptions(data.type = 'binary',
                                      models = Selected_algos,
                                      strategy = 'bigboss',    
                                      bm.format = Capriolo_points,
                                      calib.lines = block_CV)
tuned.RF <- bm_Tuning(model = "RF",
                      tuning.fun = "rf",
                      do.formula = T,
                      bm.options = default_options@options$RF.binary.randomForest.randomForest,
                      bm.format = myBiomodData,
                      calib.lines = block_CV[, grep(paste0("PA", 1:8, collapse = "|"), colnames(block_CV))])

for(n in names(tuned.RF)){
  tuned.RF[[n]] <- c(tuned.RF[[n]], nodesize = 15)
}
for(n in names(tuned.RF)){
  tuned.RF[[n]] <- c(tuned.RF[[n]], ntree = 10)
}
for(n in names(tuned.RF)){
  tuned.RF[[n]] <- c(tuned.RF[[n]], maxnodes = 5)
}

tuned.GAM <- bm_Tuning(model = "GAM",
                       tuning.fun = "gam",
                       do.formula = T,
                       bm.options = default_options@options$GAM.binary.mgcv.gam,
                       bm.format = myBiomodData,
                       calib.lines = block_CV[, grep(paste0("PA", 5:12, collapse = "|"), colnames(block_CV))]
                       )

for(n in names(tuned.GAM)){
  tuned.GAM[[n]] <- c(tuned.GAM[[n]], algo = "GAM_mgcv")
}

tuned.MAXENT <- bm_Tuning(model = "MAXENT",
                          tuning.fun = "ENMevaluate",
                          bm.options = default_options@options$MAXENT.binary.MAXENT.MAXENT,
                          bm.format = myBiomodData,
                          calib.lines = block_CV[, grep(paste0("PA", 5:12, collapse = "|"), colnames(block_CV))],
                          params.train = list(
                          tune.args=list(fc = c("L","LQ","LQH","H","LQHP","LQHPT"),
                          rm = seq(1,5,0.5)),
                          partitions = "randomkfold",
                          algorithm = "maxent.jar",
                          partition.settings = list(kfolds=10),
                          parallel = F
                          ))

tuned.GBM <- bm_Tuning(model = "GBM",
                       tuning.fun = "gbm",
                       do.formula = T,
                       bm.options = default_options@options$GBM.binary.gbm.gbm,
                       bm.format = myBiomodData,
                       calib.lines = block_CV[, grep(paste0("PA", 1:8, collapse = "|"), colnames(block_CV))])

tuned.GLM <- bm_Tuning(model = "GLM",
                       tuning.fun = "glm",
                       do.formula = T,
                       bm.options = default_options@options$GLM.stats.glm.glm,
                       bm.format = myBiomodData,
                       calib.lines = block_CV[, grep(paste0("PA", 5:12, collapse = "|"), colnames(block_CV))])

tuned.MAXNET <- bm_Tuning(model = "MAXNET",
                          tuning.fun = "maxnet",
                          do.formula = T,
                          bm.options = default_options@options$MAXNET.binary.maxnet.maxnet,
                          bm.format = myBiomodData,
                          calib.lines = block_CV[, grep(paste0("PA", 5:12, collapse = "|"), colnames(block_CV))])

tuned.XGBOOST <- bm_Tuning(model = "XGBOOST",
                           tuning.fun = "xgbTree",
                           do.formula = T,
                           bm.options = default_options@options$XGBOOST.binary.xgboost.xgboost,
                           bm.format = myBiomodData,
                           calib.lines = block_CV[, grep(paste0("PA", 1:12, collapse = "|"), colnames(block_CV))])

user.val <- list(
  RF.binary.randomForest.randomForest = tuned.RF,
  MAXENT.binary.MAXENT.MAXENT = tuned.MAXENT,
  GAM.binary.mgcv.gam = tuned.GAM,
  GBM.binary.gbm.gbm = tuned.GBM,
  GLM.stats.glm.glm = tuned.GLM,
  MAXNET.binary.maxnet.maxnet = tuned.MAXNET,
  XGBOOST.binary.xgboost.xgboost =tuned.XGBOOST
)

myOptions <- bm_ModelingOptions(
  data.type = 'binary',
  models = Selected_algos,
  strategy = 'user.defined',
  user.val = user.val,
  user.base = 'bigboss',
  bm.format = myBiomodData,
  calib.lines = block_CV
)

myOptions

Capreolus_single_models <- BIOMOD_Modeling(
  bm.format = myBiomodData,
  modeling.id = "Single.models",
  models = Selected_algos,
  CV.strategy = "block",
  # CV.nb.rep = 10,
  # CV.perc = 0.65,
  CV.do.full.models = F,
  OPT.user = myOptions,
  metric.eval = c("ROC", "TSS"),
  var.import = 1,
  nb.cpu = 4,
  do.progress = T
)

Capreolus_ensemble_models <- BIOMOD_EnsembleModeling(
  bm.mod = Capreolus_single_models,
  models.chosen = "all",
  em.by = "all",
  em.algo = "EMwmean",
  metric.select = c("ROC"),
  metric.select.thresh = c(0.7),
  metric.eval = c("ROC", "TSS"),
  var.import = 3,
  nb.cpu = 432,
  do.progress = T
)

EDIT I was too curios, so I tried the code anyway. When MAXENTtuning starts, I get this warning message:

> Dataset _PA6_RUN2
            > Tuning parameters...*** Running initial checks... ***

* Variable values were input along with coordinates and not as raster data, so no raster predictions can be generated and AICc is calculated with background data for Maxent models.
* Model evaluations with random 10-fold cross validation...

*** Running ENMeval v2.0.4 with maxnet from maxnet package v0.1.4 ***

Hello Lorenzo,

Good to hear that the correction is working 👍

As for the tuning, unfortunately, it is long... And the more data and the more cross-validation sets, the longer. Even if some models are taking less time than other (GBM is very long for example), globally it is quite time consuming.

As for the warning for MAXENT, we are using the function ENMevaluate from the ENMeval package so I'm not a lot familiar with how it works.

But your code seems okay 🙂

Maya

Hi @MayaGueguen ,

Thanks for the info, I will try to have a look and see what will happen!

Just another quick question.. I would like to implement down-sampled RF (as in the issue #393). Are both my codes (with and without model tuning) doing the same thing by setting a PAs dataset with many points as the presences? Or should I use something like

RF_param_list <-
  list("_allData_allRun" =
         list(ntree = 1000,
              sampsize =  c("NA" = 333,
                            "1" = 333),
              replace = TRUE))

Using NA since I will use PAs

Thank you very much, have a nice day!!

Hello Lorenzo :wave:

For the down-sampled RF question, you must put 0 even if you don't have real absences, it will take into account pseudo-absences. Note that we will soon be ready to switch to version 4.2-6 on github which will contain a new single model named RFd computing down-sampled RF without having to specify options for basic RF like we do now :slightly_smiling_face:

Also, if you want to reduce a bit your computing time for tuning :

in your code, you use the calib.lines argument when calling to bm_Tuning, which leads to tuning all combination of pseudo-absence dataset and calibration sets
if you don't use the calib.lines, tuning will be made only for each of your pseudo-absence dataset (_PA1_allRun, _PA2_allRun, etc), and when giving it to bm_ModelingOptions, parameters tuned for a specific PA dataset (for example PA1) will be attributed to all related calibration sets (_PA1_RUN1, _PA1_RUN2, etc)

Maya

Hello @MayaGueguen ,

good to know for the 4.2-6 version, I will look forward for it!

About the suggestion you gave me: should I drop this code line calib.lines = block_CV[, grep(paste0("PA", 1:8, collapse = "|"), colnames(block_CV))] within each bm_Tuning function? I don't know if I correctly understood your question! I inserted this code line to specify which PAs sets to use in each algorithm, but I could specify it also with the models.pa parameter, am I wrong?

So, in the end, each of my bm_Tuning function will look like this:

tuned.RF <- bm_Tuning(model = "RF",
                      tuning.fun = "rf",
                      do.formula = T,
                      bm.options = default_options@options$RF.binary.randomForest.randomForest,
                      bm.format = myBiomodData,
                      params.train = list(
                        nodesize = 25,
                        ntree = 2500,
                        maxnodes = 10,
                        sampsize = c("1" = min(calib.summary$Presences),
                                     "0" = min(calib.summary$Pseudo_Absences)) #not NA given what you specified above
                      ))

while the BIOMOD_Modeling will be like this to assign specific PA datasets to a specific agorithm:

Capreolus_single_models <- BIOMOD_Modeling(
  bm.format = myBiomodData,
  modeling.id = "Single.models",
  models = Selected_algos,
  models.pa = list(
  RF = c("PA1","PA2")),
  CV.strategy = "block",
  # CV.nb.rep = 10,
  # CV.perc = 0.65,
  CV.do.full.models = F,
  bm.options = myOptions,
  metric.eval = c("ROC", "TSS"),
  var.import = 1,
  nb.cpu = 4,
  do.progress = T
)

Please correct me if I am wrong. Thanks for now! Have a nice day

Sorry, I might have not been very clear. Here is an example forRF in your study case below :eyes:

Selected_algos <- c("GAM", "GBM", "GLM",
                    "MAXENT", "MAXNET", "RF", "XGBOOST")

myBiomodData <- BIOMOD_FormatingData(
  resp.name = "Capriolo",
  resp.var = Capriolo_points,
  expl.var = Variables,
  PA.nb.rep = 12,
  PA.nb.absences = c(
    rep(terra::nrow(Capriolo_points),4), 
    rep(3*terra::nrow(Capriolo_points),4), 
    rep(10000, 4)),
  PA.strategy = "random",
  filter.raster = T
)

calib.summary <- summary(myBiomodData) ## ADDED

block_CV <- bm_CrossValidation(
  bm.format = myBiomodData,
  strategy = "block")

default_options <- bm_ModelingOptions(data.type = 'binary',
                                      models = Selected_algos,
                                      strategy = 'bigboss',    
                                      bm.format = myBiomodData) ## CHANGED
tuned.RF <- bm_Tuning(model = "RF",
                      tuning.fun = "rf",
                      do.formula = TRUE,
                      bm.options = default_options@options$RF.binary.randomForest.randomForest,
                      bm.format = myBiomodData) ## REMOVED calib.lines

## CHANGED
for(n in names(tuned.RF)){
  tuned.RF[[n]][["nodesize"]] = 25
  tuned.RF[[n]][["ntree"]] = 2500
  tuned.RF[[n]][["maxnodes"]] = 10
  tuned.RF[[n]][["sampsize"]] = c("1" = min(calib.summary$Presences),
                                  "0" = min(calib.summary$Pseudo_Absences)) #not NA given what you specified above
}

Dear @MayaGueguen ,

I tuned all my algorithms and everything went perfect. However, when I arrived at the BIOMOD_Projection() function I get the following error:

Errore in .Call(list(name = "CppField__get", address = <pointer: (nil)>,  : 
  Valore NULL passato come indirizzo simbolo

All my code and data remained the same as before.

Thanks again!

Hello Lorenzo,

Glad that tuning went well 👍 Could you check if everything is okay with your variables you give to BIOMOD_FormatingData or BIOMOD_Projection ? It seems like it could be an error linked to terrapackage or objects 👀

Maya

Hello again @MayaGueguen ,

the problem was with the variables object, that I used to store all my raster variables. I re-created it and everything went well! I think the problem was with the saving and uploading of the environment.

Thank you so much for everything, I really appreciate your help!

Have a nice day, all the best!

biomodhub / biomod2

Help with BIOMOD_FormatingData - setting different number of pseudo absences for different algorithms #472