biomodhub / biomod2

BIOMOD is a computer platform for ensemble forecasting of species distributions, enabling the treatment of a range of methodological uncertainties in models and the examination of species-environment relationships.

Help with BIOMOD_4.2-5-2- [tune/train/test/evaluation data] #526

Open yuliaUU opened 3 weeks ago

yuliaUU commented 3 weeks ago

Hi

I have a more theoretical question. Ideally, I would like to split my data into three parts: one used for tuning, another for testing, and the last one for evaluation. To reproduce:

library(terra)
library(tidyverse)
library(biomod2)

data(DataSpecies)
DataSpecies <- DataSpecies |> select(X_WGS84, Y_WGS84, GuloGulo) |> mutate(GuloGulo = 1) |> drop_na()
# Get corresponding presence/absence data
myResp <- as.numeric(DataSpecies$GuloGulo)

# Get corresponding XY coordinates
myRespXY <- DataSpecies[, c('X_WGS84', 'Y_WGS84')]

# Load environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
data(bioclim_current)
myExpl <- terra::rast(bioclim_current)

Based on all the documentation and issues posted by others, I have so far figured out the tuning part (but I am still not sure how to ensure that the data used for tuning is not part of the modelling process). So for now I am just trying to make it work on the same data:

myBiomodData <- BIOMOD_FormatingData(
  resp.name = "name",
  resp.var = myResp,
  resp.xy = myRespXY,
  expl.var = myExpl,
  PA.nb.rep = 1,
  PA.nb.absences = nb,   # nb: number of pseudo-absences to select (define beforehand, e.g. nb <- 1000)
  PA.strategy = 'disk',
  PA.dist.min = 80000,
  PA.dist.max = 1000000,
  filter.raster = TRUE,
  dir.name = 'SDM'
)
# Do CV
cv.k <- bm_CrossValidation(bm.format = myBiomodData,
                           strategy = 'kfold',
                           nb.rep = 2,
                           k = 3)
### Tune the models ####
d.opt <- bm_ModelingOptions(
  data.type = "binary",
  models = c("GLM", "GAM.mgcv.gam", "RF", "MAXENT", "MAXNET", "XGBOOST"),
  strategy = "default",
  calib.lines = cv.k,
  bm.format = myBiomodData
)
# Tuning for RF model
tuned.rf <- bm_Tuning(model = 'RF',
                      tuning.fun = 'rf',
                      do.formula = TRUE,
                      bm.options = d.opt@options$RF.binary.randomForest.randomForest,
                      bm.format = myBiomodData)

# Tuning for XGBOOST
tuned.xgboost <- bm_Tuning(model = "XGBOOST",
                           tuning.fun = "xgbTree",
                           do.formula = TRUE,
                           bm.options = d.opt@options$XGBOOST.binary.xgboost,
                           bm.format = myBiomodData)

# Tuning for MAXENT model
tuned.maxent <- bm_Tuning(model = "MAXENT",
                          tuning.fun = "maxnet",  # Use the correct tuning function
                          do.formula = TRUE,
                          bm.options = d.opt@options$MAXENT.binary.MAXENT,
                          bm.format = myBiomodData,
                          metric.eval = "auc.val.avg"  # Specify a valid evaluation metric
)
user.val <- list(
  RF.binary.randomForest.randomForest = tuned.rf,
  MAXENT.binary.MAXENT.MAXENT = tuned.maxent,  
  XGBOOST.binary.xgboost.xgboost = tuned.xgboost  
)
# Define the native models list
NATIVERG <- c("GLM", "GAM.mgcv.gam", "RF", "MAXENT", "MAXNET", "XGBOOST")

# Create the BIOMOD options
myBiomodOption <-  bm_ModelingOptions(data.type = 'binary',
                                      models = NATIVERG,
                                      strategy = 'user.defined',
                                      user.val = user.val,
                                      user.base = 'bigboss',
                                      calib.lines = cv.k,
                                      bm.format = myBiomodData)
### Individual Model Creation ###
myBiomodModelOut <- BIOMOD_Modeling(
  bm.format = myBiomodData,
  modeling.id = "TST",
  models = NATIVERG,
  bm.options = myBiomodOption,
  CV.strategy = 'random',
  CV.nb.rep = 3,
  CV.perc = 0.7,
  metric.eval = c('TSS', 'ROC'),
  var.import = 0,
  do.progress = FALSE,
  do.full.models=FALSE,
  nb.cpu = 2
)
EVALMODEL <- get_evaluations(myBiomodModelOut)

thank you a lot for your help!

MayaGueguen commented 3 weeks ago

Hello Yulia,

Thank you for your detailed question. It is indeed a case that does not come up often, so I'll try to suggest a way to make it work 🙂

I would say that the simplest way to do this is to split your dataset into 3 parts (which I will refer to as tuning, modeling and evaluation) before giving it to any biomod2 function, and then to plan your modelling scheme in two steps:

  1. Create a BIOMOD.formated.data object with only your tuning data, use this object to tune your models and retrieve the resulting modeling options.
  2. Create a second BIOMOD.formated.data object, giving it the modeling data on one side and the evaluation data on the other, and use the user.defined strategy to define the modeling options.

⚠️ Please consider that it means: the pseudo-absences will be different between the tuning and modeling datasets, and the cross-validation selection as well.

⚠️ So I would advise giving this another thorough thought to be sure you understand all your simulation choices!

If this is still something you want to do, here is how it could be done in practice:
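
For example, the three subsets used below (myResp.tun / myRespXY.tun, myResp.mod / myRespXY.mod, myResp.eval / myRespXY.eval) could come from a simple random split of the objects in your first post; here is a minimal sketch, with arbitrary proportions:

## ------------- SPLIT (sketch) -------------  ## 
# Randomly assign each observation to one of the three subsets
# (proportions and seed are arbitrary and should be adapted to your data)
set.seed(42)
split.id <- sample(c("tun", "mod", "eval"), size = length(myResp),
                   replace = TRUE, prob = c(0.3, 0.5, 0.2))

myResp.tun    <- myResp[split.id == "tun"]
myRespXY.tun  <- myRespXY[split.id == "tun", ]
myResp.mod    <- myResp[split.id == "mod"]
myRespXY.mod  <- myRespXY[split.id == "mod", ]
myResp.eval   <- myResp[split.id == "eval"]
myRespXY.eval <- myRespXY[split.id == "eval", ]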

## ------------- TUNING -------------  ## 
myDataTun <- BIOMOD_FormatingData(
  resp.name = "name",
  resp.var = myResp.tun,
  resp.xy = myRespXY.tun,
  expl.var = myExpl,
  PA.nb.rep = 1,
  PA.nb.absences = 1000,
  PA.strategy = 'random',
  filter.raster = TRUE,
  dir.name = 'SDM'
)

# Tune models
optTun <- bm_ModelingOptions(
  data.type = "binary",
  models = c("RF", "MAXENT", "XGBOOST"),
  strategy = "tuned",
  bm.format = myDataTun
)

# Get other models options
optBB <- bm_ModelingOptions(
  data.type = "binary",
  models = c("GLM", "GAM.mgcv.gam", "MAXNET"),
  strategy = "bigboss",
  bm.format = myDataTun
)

# Gather options
user.val <- c(optTun@options, optBB@options)

## ------------- MODELING -------------  ## 
myDataMod <- BIOMOD_FormatingData(
  resp.name = "name",
  resp.var = myResp.mod,
  resp.xy = myRespXY.mod,
  expl.var = myExpl,
  eval.resp.var = myResp.eval,
  eval.resp.xy = myRespXY.eval,
  eval.expl.var = myExpl,
  PA.nb.rep = 1,
  PA.nb.absences = 1000,
  PA.strategy = 'random',
  filter.raster = TRUE,
  dir.name = 'SDM'
)

# Define the native models list
NATIVERG <-  c("GLM", "GAM", "RF", "MAXENT", "MAXNET", "XGBOOST")

# Individual Model Creation
myBiomodModelOut <- BIOMOD_Modeling(
  bm.format = myDataMod,
  modeling.id = "TST",
  models = NATIVERG,
  CV.strategy = 'random',
  CV.nb.rep = 3,
  CV.perc = 0.7,
  CV.do.full.models = FALSE,
  OPT.data.type = 'binary',
  OPT.strategy = 'user.defined',
  OPT.user.val = user.val,
  metric.eval = c('TSS', 'ROC'),
  var.import = 0,
  do.progress = FALSE,
  nb.cpu = 2
)

EVALMODEL <- get_evaluations(myBiomodModelOut)

But once again, I advise you to think carefully about your modeling plan beforehand 👀

As for the fact that you still get the allData_allRun dataset in your evaluation table, it is because the do.full.models parameter within the BIOMOD_Modeling function is now called CV.do.full.models.

Hope it helps, Maya

yuliaUU commented 3 weeks ago

Hi Maya, thanks a lot for the detailed explanations! And thank you for showing the optBB part: it solved the issue I have been having.

Regarding your warning "pseudo-absences will be different between tuning and modeling datasets, and cross-validation selection as well": why would that be an issue? The data used for tuning are not associated with the modeling data, and I want those datasets to be "independent" of each other, so that my hyperparameter tuning is not biased. Or am I missing something?

thank you a lot for all your help!

MayaGueguen commented 3 weeks ago

Glad that it helped !

Regarding the pseudo-absences and cross-validation datasets :

I was mentioning it because you tune 3 models (RF, MAXENT, XGBOOST) over your data, meaning that you will have a specific set of parameters for each PA x CV combination. What do you intend to do with it afterwards?

Let's say you have 2 PA datasets and 3 CV datasets, leading to 2 * 3 = 6 datasets in total. They have been selected over the tuning part of your data. You will have specific parameter values for PA1_RUN1, which might differ from those for PA1_RUN2.

Then you want to give these values to the modeling part of your data, for which the selection of PA and CV will be different. Meaning that you are matching parameter values calibrated over a selection of background points and data that are different from the ones you will use to run your models...

:information_source: Note that in my code example, I removed the CV part. It is possible to make the selection of pseudo-absences and keep it through your tuning and modeling datasets ( :warning: which I haven't done here!), but this gets complicated for CV as you are changing your observation data...

Do you see what I mean ?

Maya

yuliaUU commented 3 weeks ago

Yes, yes, I see the issue. I am doing 1 PA set and no CV. I guess what I was trying to achieve is to do the hyperparameter tuning on a separate dataset, so I get unbiased estimates, and then apply those tuned hyperparameters to my actual models to do calibration/validation.

Cam-in commented 1 week ago

Hi ladies. Thank you both for this explanation, because it helps others understand the new biomod2 version (which I am still adapting my old scripts to). My question in this direction is: when evaluating a model, e.g. RF, under a TSS metric, which value should be reported: calibration, validation, or evaluation?

In such a case, I would say evaluation, but what if there is no independent, separate dataset for evaluation? I would consider the TSS value given for calibration to be more critical, since we also set aside a higher % of the data for it.

Thank you =) Cami

MayaGueguen commented 1 week ago

Hello there :wave:

I'll try to give feedback to both of you :slightly_smiling_face:

@yuliaUU : I think I get your idea of having parameters tuned once and for all, but I'm not sure this is realistic here. I guess this would work in two cases:

  1. you know the full, complete niche of your species at some point in time. You parameterize your model over this data, and can use it over new data, for example from a new sampling time window.
  2. you have a big dataset, and even if it does not represent the full niche, you think it is unbiased (or at least you know how it is biased). Then you run the tuning several times over it to get average parameters, which could be used on other data as "what works best on average for this species".

But here you do not have your full niche, and you tune over different cross-validation datasets without summarizing the information. So you want to use parameters tuned over 1 specific part of your data; I'm not sure these can be referred to as unbiased estimates.

@Cam-in : actually, I'm not sure I understand your question :see_no_evil: Are you just talking about wording, i.e. what to call the different datasets you use depending on how you split or sample your data? Or is it something else?

Maya

Cam-in commented 1 week ago

Hi, oops!! 🙈 I will try to explain myself better, but I hope this will not interrupt @yuliaUU's issue: when I run biomod2 with split data and check myBiomodModelEval, the results are the values of the chosen metric in different columns: calibration, validation, and evaluation. If there is no separate evaluation dataset, then that column is NA. Previously, in biomod2 3.5, there was only one metric value.

Hence, my concern is how to evaluate the different runs for a chosen metric: by the calibration value, the validation value, or the evaluation value? In my case, I cannot have an evaluation dataset since I do not have real absences.

I hope this clarifies things and that using the same issue does not cause chaos. 🙂 Thanks, Cami

MayaGueguen commented 1 week ago

Hello @Cam-in,

No worries, this is fine :slightly_smiling_face:

The old version of biomod2 indeed showed only one value for each evaluation metric, which corresponds to the current validation column. Meaning that the dataset was split in 2 and the returned value was the one computed on the part NOT used to calibrate the models. However, it is just as informative to also have the metric calculated over the calibration part, as it already shows whether the model manages to work from the beginning. Finally, though it is rare to have the data to do so, the evaluation column allows you to see how your model transfers onto new data.
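
For example, assuming the column names returned by get_evaluations() in recent biomod2 versions (metric.eval, algo, calibration, validation, evaluation), you could compare the three values like this:

# Mean score per algorithm for a given metric, on each of the three datasets
library(dplyr)
get_evaluations(myBiomodModelOut) %>%
  filter(metric.eval == "TSS") %>%
  group_by(algo) %>%
  summarise(mean.calib = mean(calibration, na.rm = TRUE),
            mean.valid = mean(validation, na.rm = TRUE),
            mean.eval  = mean(evaluation, na.rm = TRUE))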

In summary:

  * calibration: metric computed on the data used to calibrate the models
  * validation: metric computed on the part of the data NOT used for calibration (the single value returned by older biomod2 versions)
  * evaluation: metric computed on an independent evaluation dataset, showing how the model transfers onto new data

Hope it helps, Maya

Cam-in commented 1 week ago

Super @MayaGueguen, thanks! Knowing that the old value is the new validation clarifies everything. But then, I would think it is a good idea to evaluate the model through the validation value (e.g. to decide whether a single-algorithm model should be included in the ensemble after a threshold), yet bm_PlotEvalMean plots the calibration value. Isn't that confusing?
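
(If I understand the documentation correctly, bm_PlotEvalMean has a dataset argument, so I assume something like this would plot the validation values instead:)

# Assuming 'dataset' accepts 'calibration', 'validation' or 'evaluation'
bm_PlotEvalMean(bm.out = myBiomodModelOut,
                metric.eval = c('TSS', 'ROC'),
                dataset = 'validation')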

I think in the summary you mean "evaluation" in the last point ;)

Cheers! Cami

MayaGueguen commented 1 week ago

Thank you, I corrected it! :blush:

MayaGueguen commented 1 week ago

@Cam-in

If you want an illustration of the difference in biomod2 between calibration, validation and evaluation datasets, we just published a new video :movie_camera: on our website. You might want to check the 02. Datasets part, at 4:14 :eyes: