yuliaUU opened this issue 3 weeks ago

Hi,

I have a more theoretical question. Ideally, what I would like to do is to split my data in 3 parts: one used for tuning, another for testing and the last one for evaluation. To reproduce:

- Based on all the documentation and issues posted by others, I have so far figured out the tuning part with the `bm_Tuning` and `bm_ModelingOptions` functions (but I am still not sure how I can ensure that the data I used in tuning is not part of the modelling process). So for now I am just trying to make it work on the same data.
- `_allData` and `all_Runs`: I don't want those in my models. Even though I set `do.full.models = FALSE`, my EVALMODEL table still has the rows for them.

Thank you a lot for your help!
Hello Yulia,
Thank you for your detailed question. It is indeed a case that does not come up often, so I'll try to suggest a way to make it work 🙂
I would say that the simplest way to do that is to split your dataset in 3 parts (which I will hereafter call tuning, modeling and evaluation) before giving it to any biomod2 functions, and then to plan your modelling scheme in two steps :
1. Create a first `BIOMOD.formated.data` object with your tuning data only, use this object to tune your models and retrieve the modeling options obtained.
2. Create a second `BIOMOD.formated.data` object, giving it on one side the data for modeling and on the other side the data for evaluating, and use the `user.defined` strategy to define modeling options.

⚠️ Please consider that it means : pseudo-absences will be different between tuning and modeling datasets, and cross-validation selection as well.
⚠️ So I would advise giving it another thorough thought to be sure you understand all your simulation choices !
If it is still something you are considering, here is how it could be done in practice :
```r
## ------------- TUNING ------------- ##
myDataTun <- BIOMOD_FormatingData(
  resp.name = "name",
  resp.var = myResp.tun,
  resp.xy = myRespXY.tun,
  expl.var = myExpl,
  PA.nb.rep = 1,
  PA.nb.absences = 1000,
  PA.strategy = 'random',
  filter.raster = TRUE,
  dir.name = 'SDM'
)

# Tune models
optTun <- bm_ModelingOptions(
  data.type = "binary",
  models = c("RF", "MAXENT", "XGBOOST"),
  strategy = "tuned",
  bm.format = myDataTun
)

# Get other models options
optBB <- bm_ModelingOptions(
  data.type = "binary",
  models = c("GLM", "GAM.mgcv.gam", "MAXNET"),
  strategy = "bigboss",
  bm.format = myDataTun
)

# Gather options
user.val <- c(optTun@options, optBB@options)
```
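Before moving on, it can help to check what actually ended up in `user.val`. This is just base R inspection (nothing biomod2-specific), assuming the two option objects above were created without error:

```r
# quick sanity check: one options entry per model we intend to run,
# named after the corresponding biomod2 model
names(user.val)
length(user.val)  # should match the number of models passed to BIOMOD_Modeling
```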
## ------------- MODELING ------------- ##
myDataMod<- BIOMOD_FormatingData(
resp.name = "name",
resp.var = myResp.mod,
resp.xy = myRespXY.mod,
expl.var = myExpl,
eval.resp.var = myResp.eval,
eval.resp.xy = myRespXY.eval,
eval.expl.var = myExpl,
PA.nb.rep = 1,
PA.nb.absences = 1000,
PA.strategy = 'random',
filter.raster = TRUE,
dir.name = 'SDM'
)
# Define the native models list
NATIVERG <- c("GLM", "GAM", "RF", "MAXENT", "MAXNET", "XGBOOST")
# Individual Model Creation
myBiomodModelOut <- BIOMOD_Modeling(
bm.format = myDataMod,
modeling.id = "TST",
models = NATIVERG,
CV.strategy = 'random',
CV.nb.rep = 3,
CV.perc = 0.7,
CV.do.full.models = FALSE,
OPT.data.type = 'binary',
OPT.strategy = 'user.defined',
OPT.user.val = user.val,
metric.eval = c('TSS', 'ROC'),
var.import = 0,
do.progress = FALSE,
nb.cpu = 2
)
EVALMODEL <- get_evaluations(myBiomodModelOut)
But once again, I advise you to think carefully about your modeling plan beforehand 👀
As for the fact that you still get the `allData_allRun` dataset in your evaluation table, it is because the `do.full.models` parameter within the `BIOMOD_Modeling` function is now called `CV.do.full.models`.
Hope it helps, Maya
Hi Maya, thanks a lot for the detailed explanations! And thank you for showing the `optBB` part: it solved the issue I have been having.
Regarding your warning "pseudo-absences will be different between tuning and modeling datasets, and cross-validation selection as well": why will it be an issue? The data used for tuning are not associated with the modeling data, and I want to have those datasets "independent" of each other, so my hyperparameter tuning is not biased. Or am I missing something?
Thank you a lot for all your help!
Glad that it helped !
I was mentioning it because you tune 3 models (`RF`, `MAXENT`, `XGBOOST`) over your data, meaning that you will have a specific set of parameters for each PA x CV combination.
What do you intend to do with it afterwards ?
Let's say you have 2 PA datasets and 3 CV datasets, leading to 2 * 3 = 6 datasets in total.
They have been selected over the tuning part of your data.
You will have specific parameter values for `PA1_RUN1` that might differ from those for `PA1_RUN2`.
Then you want to give these values to the modeling part of your data, for which the selection of PA and CV will be different. Meaning that you are matching parameter values calibrated over a selection of background points and data that are different from the ones you will use to run your models...
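Just to make the combinatorics concrete, here is a tiny base R illustration (not a biomod2 call) of the dataset labels that such a 2 PA x 3 CV setting produces, matching the `PA1_RUN1` naming above:

```r
# enumerate the PA x CV combinations: each one gets its own tuned parameter set
combos <- expand.grid(PA = paste0("PA", 1:2), RUN = paste0("RUN", 1:3))
paste(combos$PA, combos$RUN, sep = "_")
#> "PA1_RUN1" "PA2_RUN1" "PA1_RUN2" "PA2_RUN2" "PA1_RUN3" "PA2_RUN3"
```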
:information_source: Note that in my code example, I removed the CV part. It is possible to make the selection of pseudo-absences once and keep it through your tuning and modeling datasets ( :warning: which I have not done here !), but this gets complicated for CV as you are changing your observation data...
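For illustration only, here is a rough sketch of how the pseudo-absence selection could be shared between the two `BIOMOD_FormatingData` calls, using the `user.defined` PA strategy. The names `bg.xy`, `pres.xy.tun`, `resp.tun`, etc. are placeholders (not biomod2 objects), and the exact arguments should be checked against your biomod2 version:

```r
library(biomod2)
library(terra)

# sample one common set of background points over the explanatory rasters
set.seed(42)
bg.xy <- terra::spatSample(myExpl[[1]], size = 1000, xy = TRUE, na.rm = TRUE)[, c("x", "y")]

# tuning dataset = tuning presences + the shared background points
# (presences coded 1, candidate pseudo-absences coded NA)
resp.tun <- c(rep(1, nrow(pres.xy.tun)), rep(NA, nrow(bg.xy)))
xy.tun <- rbind(pres.xy.tun[, c("x", "y")], bg.xy)
pa.tab.tun <- data.frame(PA1 = rep(TRUE, length(resp.tun)))

myDataTun <- BIOMOD_FormatingData(
  resp.name = "name",
  resp.var = resp.tun,
  resp.xy = xy.tun,
  expl.var = myExpl,
  PA.strategy = 'user.defined',
  PA.user.table = pa.tab.tun,
  dir.name = 'SDM'
)

# ... then build the modeling dataset the same way, with its own presences
# (e.g. pres.xy.mod) but reusing the very same bg.xy points ...
```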
Do you see what I mean ?
Maya
Yes, yes, I see the issue. I am doing 1 PA set and no CV. I guess what I was trying to achieve is that I do hyperparameter tuning on a separate dataset, so I get unbiased estimates, and then apply those tuned hyperparameters to my actual models to do calibration/validation.
Hi ladies, thank you both for this explanation because it helps others understand the new biomod2 version (which I am still adapting my old scripts to). My question in this direction is: when evaluating a model, e.g. RF, in a testing (or calibration), modeling, or evaluation setting under a TSS metric, which one should be reported: calibration, validation or evaluation?
In such a case, I would say evaluation, but what if there is no independent, separate dataset for evaluation? I consider that the TSS value given for calibration is more critical, since we also set aside a higher % of the data.
Thank you =) Cami
Hello there :wave:
I'll try to give feedback to both of you :slightly_smiling_face:
@yuliaUU : I think I get your idea of having parameters tuned once and for all, but I'm not sure this is realistic here. I guess that this would work in two cases : if your tuning data covered the full niche of your species, or if you summarized the values tuned over all your cross-validation datasets into a single set of parameters.
But here you do not have your full niche, and you tune over different cross-validation datasets without summarizing the information. So you want to use parameters tuned over 1 specific part of your data. I'm not sure these can be referred to as unbiased estimates.
@Cam-in : actually, I'm not sure I understand your question :see_no_evil: Are you just talking about wording, i.e. how to call the different datasets you use depending on how you split or sample your data ? Or is it something else ?
Maya
Hi, oops!! 🙈 I will try to explain myself better, but I hope this will not interrupt @yuliaUU's issue: when I run biomod2 with split data and check `myBiomodModelEval`, the results are the values of the chosen metric in different columns: calibration, validation, and evaluation. If there is no separate evaluation dataset, then that column comes out as NA. Previously, in biomod2 3.5, there was only one metric value.
Hence, my concern is how to evaluate the different runs for a chosen metric: by the calibration value, the validation or the evaluation? In my case, I cannot have an evaluation dataset since I do not have real absences.
Hope this clarifies things and does not cause chaos by using the same issue. 🙂 Thanks, Cami
Hello @Cam-in,
No worries, this is fine :slightly_smiling_face:
The old version of biomod2 was indeed showing only one value for each evaluation metric, which corresponds to the current `validation` column. Meaning that the dataset was split in 2 and the returned value was the one computed on the part NOT used to calibrate the models.
However, it is just as informative to also have the metric calculated over the `calibration` part, as it allows you to see right away whether the model manages to fit your data from the beginning.
Finally, although it is rare to have data to do so, the `evaluation` column allows you to see how your model transfers onto new data.
In summary :

- `calibration` : it is calculated over the data used to calibrate the model. It should have good values, otherwise it means that you struggle to fit your data from the beginning.
- `validation` : it is calculated over the other part of the data, not used in the calibration. It allows you to see both whether your model is stable and whether your cross-validation splitting is efficient.
- `evaluation` : it is the holy grail :crossed_swords: if you have an independent dataset with both presences and absences, AND your evaluation is also good for this dataset : :champagne:

Hope it helps, Maya
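To make that concrete with the example above: assuming `EVALMODEL` comes from `get_evaluations()` as shown earlier, and that the long-format column names (`metric.eval`, `validation`, `full.name`) match your biomod2 version, a filter on the validation score could look like this sketch:

```r
# keep only models whose TSS computed on the validation part is >= 0.6
# (0.6 is an arbitrary example threshold, not a biomod2 default)
good.models <- EVALMODEL[EVALMODEL$metric.eval == "TSS" &
                           !is.na(EVALMODEL$validation) &
                           EVALMODEL$validation >= 0.6, "full.name"]
good.models
```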
Super @MayaGueguen, thanks! Knowing that the old value is the new validation clarifies everything. But then, I would think it is a good idea to evaluate the model through the validation value (e.g. to decide whether a single-algorithm model should be included in the ensemble after applying a threshold), yet `bm_PlotEvalMean` plots the calibration. And that is confusing?
I think in the summary you mean "evaluation" in the last point ;)
Cheers! Cami
Thank you, I corrected ! :blush:
@Cam-in
If you want an illustration of the difference between calibration, validation and evaluation datasets in biomod2, we just published a new video :movie_camera: on our website. You might want to check the `02. Datasets` part, at 4:14 :eyes:
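On the `bm_PlotEvalMean` point raised above: recent biomod2 versions expose a `dataset` argument for that function (defaulting to the calibration values, if I remember correctly), so a hedged sketch to plot the validation means instead would be:

```r
# plot mean evaluation scores computed on the validation part
# (check ?bm_PlotEvalMean in your biomod2 version for the exact arguments)
bm_PlotEvalMean(
  bm.out = myBiomodModelOut,
  metric.eval = c('TSS', 'ROC'),
  dataset = 'validation'
)
```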