biomodhub / biomod2

BIOMOD is a computer platform for ensemble forecasting of species distributions, enabling the treatment of a range of methodological uncertainties in models and the examination of species-environment relationships.

Help with BIOMOD_4.2-5-2- [tune/train/test/evaluation data] #526

Open yuliaUU opened 4 days ago

yuliaUU commented 4 days ago

Hi,

I have a more theoretical question. Ideally, what I would like to do is split my data into 3 parts: one used for tuning, another for testing, and the last one for evaluation. To reproduce:

library(terra)
library(tidyverse)
library(biomod2)

data(DataSpecies)
DataSpecies <- DataSpecies |> select(X_WGS84, Y_WGS84, GuloGulo) |> mutate(GuloGulo = 1) |> drop_na()
# Get corresponding presence/absence data
myResp <- as.numeric(DataSpecies$GuloGulo)

# Get corresponding XY coordinates
myRespXY <- DataSpecies[, c('X_WGS84', 'Y_WGS84')]

# Load environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
data(bioclim_current)
myExpl <- terra::rast(bioclim_current)

Based on all the documentation and the issues posted by others, I have so far figured out the tuning part (but I am still not sure how to ensure that the data used for tuning are not part of the modelling process). So for now I am just trying to make it work on the same data:

myBiomodData <- BIOMOD_FormatingData(
  resp.name = "name",
  resp.var = myResp,
  resp.xy = myRespXY,
  expl.var = myExpl,
  PA.nb.rep = 1,
  PA.nb.absences = 1000,  # placeholder: 'nb' was not defined in this snippet
  PA.strategy = 'disk',
  PA.dist.min = 80000,
  PA.dist.max = 1000000,
  filter.raster = TRUE,
  dir.name = 'SDM'
)
# Do CV
cv.k <- bm_CrossValidation(bm.format = myBiomodData,
                           strategy = 'kfold',
                           nb.rep = 2,
                           k = 3)
### Tune the models ###
d.opt <- bm_ModelingOptions(
  data.type = "binary",
  models = c("GLM", "GAM.mgcv.gam", "RF", "MAXENT", "MAXNET", "XGBOOST"),
  strategy = "default",
  calib.lines = cv.k,
  bm.format = myBiomodData
)
# Tuning for RF model
tuned.rf <- bm_Tuning(model = 'RF',
                      tuning.fun = 'rf',
                      do.formula = TRUE,
                      bm.options = d.opt@options$RF.binary.randomForest.randomForest,
                      bm.format = myBiomodData)

# Tuning for XGBOOST
tuned.xgboost <- bm_Tuning(model = "XGBOOST",
                           tuning.fun = "xgbTree",
                           do.formula = TRUE,
                           bm.options = d.opt@options$XGBOOST.binary.xgboost.xgboost,
                           bm.format = myBiomodData)

# Tuning for MAXENT model
tuned.maxent <- bm_Tuning(model = "MAXENT",
                          tuning.fun = "maxnet",  # Use the correct tuning function
                          do.formula = TRUE,
                          bm.options = d.opt@options$MAXENT.binary.MAXENT.MAXENT,
                          bm.format = myBiomodData,
                          metric.eval = "auc.val.avg"  # Specify a valid evaluation metric
)
user.val <- list(
  RF.binary.randomForest.randomForest = tuned.rf,
  MAXENT.binary.MAXENT.MAXENT = tuned.maxent,  
  XGBOOST.binary.xgboost.xgboost = tuned.xgboost  
)
# Define the native models list
NATIVERG <- c("GLM", "GAM.mgcv.gam", "RF", "MAXENT", "MAXNET", "XGBOOST")

# Create the BIOMOD options
myBiomodOption <- bm_ModelingOptions(data.type = 'binary',
                                      models = NATIVERG,
                                      strategy = 'user.defined',
                                      user.val = user.val,
                                      user.base = 'bigboss',
                                      calib.lines = cv.k,
                                      bm.format = myBiomodData)
### Individual Model Creation ###
myBiomodModelOut <- BIOMOD_Modeling(
  bm.format = myBiomodData,
  modeling.id = "TST",
  models = NATIVERG,
  bm.options = myBiomodOption,
  CV.strategy = 'random',
  CV.nb.rep = 3,
  CV.perc = 0.7,
  metric.eval = c('TSS', 'ROC'),
  var.import = 0,
  do.progress = FALSE,
  do.full.models = FALSE,
  nb.cpu = 2
)
EVALMODEL <- get_evaluations(myBiomodModelOut)

thank you a lot for your help!

MayaGueguen commented 18 hours ago

Hello Yulia,

Thank you for your detailed question. It is indeed a use case that does not come up often, so I'll try to suggest a way to make it work 🙂

I would say that the simplest way to do that is to split your dataset into 3 parts (which I will call hereafter tuning, modeling, and evaluation) before giving it to any biomod2 function, and then to plan your modeling scheme in two steps:

  1. Create a BIOMOD.formated.data object with only your tuning data, use this object to tune your models, and retrieve the resulting modeling options.
  2. Create a second BIOMOD.formated.data object, giving it the modeling data on one side and the evaluation data on the other, and use the user.defined strategy to define the modeling options.

⚠️ Please consider that it means: pseudo-absences will be different between tuning and modeling datasets, and cross-validation selection as well.

⚠️ So I would advise you to give it another thorough thought, to be sure you understand all your simulation choices!

If it is still something you want to do, here is how it could be done in practice:
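
As a minimal sketch (assuming a simple random split of the presence data from your example, with arbitrary 40/40/20 proportions), the three subsets used below could be built like this:

## ------------- DATA SPLIT (sketch) -------------  ##
# Randomly assign each observation to the tuning / modeling / evaluation subset
set.seed(42)
part <- sample(c("tun", "mod", "eval"), size = length(myResp),
               replace = TRUE, prob = c(0.4, 0.4, 0.2))

myResp.tun    <- myResp[part == "tun"]
myRespXY.tun  <- myRespXY[part == "tun", ]
myResp.mod    <- myResp[part == "mod"]
myRespXY.mod  <- myRespXY[part == "mod", ]
myResp.eval   <- myResp[part == "eval"]
myRespXY.eval <- myRespXY[part == "eval", ]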

## ------------- TUNING -------------  ## 
myDataTun <- BIOMOD_FormatingData(
  resp.name = "name",
  resp.var = myResp.tun,
  resp.xy = myRespXY.tun,
  expl.var = myExpl,
  PA.nb.rep = 1,
  PA.nb.absences = 1000,
  PA.strategy = 'random',
  filter.raster = TRUE,
  dir.name = 'SDM'
)

# Tune models
optTun <- bm_ModelingOptions(
  data.type = "binary",
  models = c("RF", "MAXENT", "XGBOOST"),
  strategy = "tuned",
  bm.format = myDataTun
)

# Get other models options
optBB <- bm_ModelingOptions(
  data.type = "binary",
  models = c("GLM", "GAM.mgcv.gam", "MAXNET"),
  strategy = "bigboss",
  bm.format = myDataTun
)

# Gather options
user.val <- c(optTun@options, optBB@options)

## ------------- MODELING -------------  ## 
myDataMod <- BIOMOD_FormatingData(
  resp.name = "name",
  resp.var = myResp.mod,
  resp.xy = myRespXY.mod,
  expl.var = myExpl,
  eval.resp.var = myResp.eval,
  eval.resp.xy = myRespXY.eval,
  eval.expl.var = myExpl,
  PA.nb.rep = 1,
  PA.nb.absences = 1000,
  PA.strategy = 'random',
  filter.raster = TRUE,
  dir.name = 'SDM'
)

# Define the native models list
NATIVERG <- c("GLM", "GAM", "RF", "MAXENT", "MAXNET", "XGBOOST")

# Individual Model Creation
myBiomodModelOut <- BIOMOD_Modeling(
  bm.format = myDataMod,
  modeling.id = "TST",
  models = NATIVERG,
  CV.strategy = 'random',
  CV.nb.rep = 3,
  CV.perc = 0.7,
  CV.do.full.models = FALSE,
  OPT.data.type = 'binary',
  OPT.strategy = 'user.defined',
  OPT.user.val = user.val,
  metric.eval = c('TSS', 'ROC'),
  var.import = 0,
  do.progress = FALSE,
  nb.cpu = 2
)

EVALMODEL <- get_evaluations(myBiomodModelOut)
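
Since you provided eval.resp.var, the returned table should also contain an evaluation column (computed on your held-out evaluation part), next to calibration and validation. A quick way to look at it (assuming the long-format column names of recent biomod2 4.x versions):

# Compare calibration / validation / evaluation scores for one metric
EVALMODEL[EVALMODEL$metric.eval == "TSS",
          c("algo", "run", "calibration", "validation", "evaluation")]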

But once again, I advise you to think carefully about your modeling plan beforehand 👀

As for the fact that you still get the allData_allRun dataset in your evaluation table: it is because the do.full.models parameter of the BIOMOD_Modeling function is now called CV.do.full.models. In your call, replace do.full.models = FALSE with CV.do.full.models = FALSE.

Hope it helps, Maya

yuliaUU commented 12 hours ago

Hi Maya, thanks a lot for the detailed explanations! And thank you for showing the optBB part: it solved the issue I had been having.

Regarding your warning "pseudo-absences will be different between tuning and modeling datasets, and cross-validation selection as well": why would it be an issue? The data used for tuning are not associated with the modeling data, and I want to have those datasets "independent" of each other, so my hyperparameter tuning is not biased. Or am I missing something?

thank you a lot for all your help!

MayaGueguen commented 12 hours ago

Glad that it helped!

Regarding the pseudo-absences and cross-validation datasets:

I was mentioning it because you tune 3 models (RF, MAXENT, XGBOOST) over your data, meaning that you will have a specific set of parameters for each PA x CV combination. What do you intend to do with them afterwards?

Let's say you have 2 PA datasets and 3 CV datasets, leading to 2 * 3 = 6 datasets in total. They have been selected over the tuning part of your data. You will have specific parameter values for PA1_RUN1 that might differ from those for PA1_RUN2.

Then you want to give these values to the modeling part of your data, for which the PA and CV selections will be different. This means you would be matching parameter values calibrated on a selection of background points and data that are different from the ones you will use to run your models...

ℹ️ Note that in my code example, I removed the CV part. It is possible to make the pseudo-absence selection once and keep it through your tuning and modeling datasets (⚠️ which I have not done here!), but this gets complicated for CV, as you are changing your observation data...
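
(For reference, the mechanism for fixing a pseudo-absence selection is PA.strategy = 'user.defined' together with PA.user.table in BIOMOD_FormatingData. A very rough sketch, assuming you build the logical table myPAtable yourself:)

# Very rough sketch of a fixed, user-defined pseudo-absence selection.
# myPAtable is assumed to be a logical data.frame built beforehand, with
# one row per row of resp.var and one column (PA1, PA2, ...) per PA dataset.
# Candidate pseudo-absence points must appear in resp.var as NA rows
# so that the table can select them.
myDataFixedPA <- BIOMOD_FormatingData(
  resp.name = "name",
  resp.var = myResp.mod,
  resp.xy = myRespXY.mod,
  expl.var = myExpl,
  PA.strategy = 'user.defined',
  PA.user.table = myPAtable,
  dir.name = 'SDM'
)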

Do you see what I mean ?

Maya

yuliaUU commented 12 hours ago

Yes, yes, I see the issue. I am doing 1 PA set and no CV. I guess what I was trying to achieve is to do hyperparameter tuning on a separate dataset, so I get unbiased estimates, and then apply those tuned hyperparameters to my actual models for calibration/validation.