biomodhub / biomod2

BIOMOD is a computer platform for ensemble forecasting of species distributions, enabling the treatment of a range of methodological uncertainties in models and the examination of species-environment relationships.
77 stars 21 forks source link

Unusual model evaluation (TSS/ROC) scores for biomod2...RF overfit ? #460

Closed joshikp01 closed 1 month ago

joshikp01 commented 1 month ago

Hello !!

I am trying to use the BIOMOD2 modeling but I am getting very weird model evaluation scores.... when using calibration data the RF is getting 1 every time (ROC/TSS) but when using the validation dataset the RF is getting the lowest TSS scores. I had used calibrated the RF code before but right now after the update (4.2-5) it is back to getting inflated TSS/ROC scores. Please find the code below: (Note: total presence location =94)

Load required packages

library("raster") library("ggplot2") library("gridExtra") library("rasterVis") library("sf") library("usdm") library("biomod2") library("dismo") library("mgcv") library("gam")

variables for current projection:

DEM<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/dem.asc") slope<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/slope.asc") aspect<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/aspect.asc") bio1<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio1.asc") bio2<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio2.asc") bio3<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio3.asc") bio4<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio4.asc") bio5<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio5.asc") bio6<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio6.asc") bio7<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio7.asc") bio8<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio8.asc") bio9<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio9.asc") bio10<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio10.asc") bio11<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio11.asc") bio12<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio12.asc") bio13<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio13.asc") bio14<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio14.asc") bio15<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio15.asc") bio16<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio16.asc") bio17<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio17.asc") bio18<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio18.asc") bio19<- raster ("F:/Khagendra (Ensemble analysis)/Worldclim data/midhills/ASCII/bio19.asc")

stackthe variables

myExpl <- raster::stack (DEM,slope,aspect, bio1,bio2, bio3, bio4, bio5, bio6, bio7, bio8, bio9, bio10, bio11, bio12, bio13,bio14, bio15, bio16, bio17, bio18, bio19) plot(myExpl)

Convert the raster stack to a data frame

myExpl_df <- as.data.frame(myExpl, xy = TRUE) myExpl_df

Calculate VIF

vif_result <- vifcor(myExpl_df, 0.7)

Print VIF results

print(vif_result)

rasterstack of layers with below 5 VIF and 0.7 Corr

myExpl1<- raster::stack (DEM, slope, aspect, bio3, bio4, bio12, bio14, bio15)

plot(myExpl1)

thinned_sal<- read.csv("F:/Khagendra (Ensemble analysis)/Towa data/Tree location/Final location species wise/GBIF and other secondary sourcer/thinned/thinned_sal.csv")

Species data

myRespName <- 'Sal' myRespName

My Resp Data

myResp <- as.numeric(thinned_sal [,myRespName]) myResp

head(myResp)

The XY coordinates of species data

myRespXY <- thinned_sal [,c("Longitude", "Latitude")]

myRespXY

Formating biomod data

myBiomodData <- BIOMOD_FormatingData(resp.var = myResp, expl.var = myExpl1, resp.xy = myRespXY, resp.name = myRespName, PA.nb.rep = 3, PA.nb.absences = 2000, PA.strategy = 'disk', PA.dist.min =10000) plot(myBiomodData)

Define MAXENT options

user.MAXENT <- list('_allData_allRun'= list(maximumiterations = 10,

                     threshold = FALSE, 
                     path_to_maxent.jar = "E:/software/maxent/maxent.jar"))

Define RF options

user.RF <- list('_allData_allRun'= list(nodesize = 5, maxnodes = 10))

Define user values

user.val <- list(RF.binary.randomForest.randomForest = user.RF) #,MAXENT.binary.MAXENT.MAXENT= user.MAXENT)

all.models<- c("ANN", "CTA", "FDA", "GAM", "GBM", "GLM", "MARS", "MAXENT", "RF", "SRE")

?bm_ModelingOptions

Set up modeling options

myBiomodOption <- bm_ModelingOptions( data.type = "binary", models = all.models, strategy = 'user.defined', user.val= user.val, bm.format = myBiomodData, calib.lines = NULL )

?BIOMOD_Modeling #OPT.user = myBiomodOption,

myBiomodModelOut <- BIOMODModeling( bm.format = myBiomodData, modeling.id = paste(myRespName, "Sal", sep = ""), models = all.models, OPT.user.val=user.val, CV.strategy = "random", #random for normal setting CV.nb.rep = 3, CV.perc = 0.8, # Specify data split percentage for random cross-validation CV.balance = 'presences', # Balance strategy for cross-validation OPT.strategy= 'user.defined', weights = NULL, # Observation weights (optional) prevalence = NULL, # Species prevalence (optional) metric.eval = c("ROC", "TSS"), # Additional evaluation metrics var.import = 0, # Number of permutations for variable importance,..3 scale.models = FALSE, # Disable model scaling nb.cpu = 6, # Number of CPU cores do.progress = TRUE, # Show progress bar seed.val = 42, ) myBiomodModelOut

Get all models evaluation

myBiomodModelEval <- get_evaluations(myBiomodModelOut) myBiomodModelEval export<- as.data.frame(myBiomodModelEval)

write.csv(export, "F:/Khagendra (Ensemble analysis)/fig/ensembelresults.csv")

Plot model evaluation scores

plot<- bm_PlotEvalMean(myBiomodModelOut, metric.eval = c("ROC", "TSS"), dataset = "calibration", group.by = "algo", do.plot = TRUE) #if you wanna assign to plot

plot

The two figures are uploaded: caliberated validation

Another issue is when i ensemble (TSS above 0.7), the TSS score of the ensemble model is less than RF every time.

I have provided a link for my code and all the files !! here, https://drive.google.com/file/d/1GdrzfXUYNKXneW5JICrx2l4EhrcJqufy/view?usp=sharing

HeleneBlt commented 1 month ago

Hello there,

Thanks a lot for all this information: it's really easier to help you ! Indeed, you have an overfitting for RF.

Here, you only change the options for "_allData_allRun". You could have a look at the options with the line : print(myBiomodOption,dataset = "_PA1_allRun")

You can change for all the different PA datasets like this :

 #Define MAXENT options
myMAXENToptions <- list(threshold = FALSE, path_to_maxent.jar = ".")
user.MAXENT <- list('_allData_allRun'= myMAXENToptions,
                    '_PA1_allRun' = myMAXENToptions,
                    '_PA2_allRun' = myMAXENToptions,
                    '_PA3_allRun' = myMAXENToptions)

#Define RF options
myRFoptions <- list(nodesize = 5, maxnodes = 10)
user.RF <- rep( list(myRFoptions), (ncol(myBiomodData@PA.table) + 1 ))
names(user.RF) <- c( paste0("_", names(myBiomodData@PA.table), "_allRun"), "_allData_allRun")

#Define user values
user.val <- list(RF.binary.randomForest.randomForest = user.RF ,MAXENT.binary.MAXENT.MAXENT= user.MAXENT)

#Set up modeling options
myBiomodOption <- bm_ModelingOptions(
  data.type = "binary",
  models = all.models,
  strategy = 'user.defined',
  user.val= user.val,
  bm.format = myBiomodData,
  calib.lines = NULL
)

myBiomodModelOut <- BIOMOD_Modeling(
  bm.format = myBiomodData,
  modeling.id = paste(myRespName, "Sal", sep = "_"),
  models = all.models,
  OPT.strategy= 'user.defined'
  #OPT.user.val=user.val, 
  OPT.user = myBiomodOption,  #!!
  CV.strategy = "random", 
  CV.nb.rep = 3,
  CV.perc = 0.8, 
  metric.eval = c("ROC", "TSS"), 
  var.import = 0, 
  scale.models = FALSE, 
  nb.cpu = 6, 
  do.progress = TRUE, 
  seed.val = 42,
)

If you give OPT.user = myBiomodOption, BIOMOD_Modeling will automatically give the options of '_PAx_allRun' to all runs with PAx (_PAx_RUN1, _PAx_RUN2,...)

I hope it helps! Don't hesitate if you have any other questions about this new version of biomod2.

Hélène

joshikp01 commented 1 month ago

Dear @HeleneBlt

Thank you very much for the clear explanation and code. I have another question:

Does having an ensemble model with less TSS score than the individual score mean any problem or it is okay? I am asking this because my final ensemble model is getting a lower TSS score than the RF/GBM. What do you suggest in such cases? Ensemble or just go ahead with the individual model?

Thank you in advance :)

HeleneBlt commented 1 month ago

Hi again,

Yes, actually it is quite normal. As you probably already read in the issue #447, the Ensemble Model is an average model. As you suggest, it is important to look at the calibration and validation scores (as it can show an overfitting for example :wink: ). I would also strongly advise looking at EMcv or EMci and EMci.alpha to see the variability between the models. With this exploration and depending on your goals, you can choose if a single model will be better or if the Ensemble Model will give you more realistic results on the whole study area. It really depends on your case.

Hélène

joshikp01 commented 1 month ago

Dear Helene,

Thank you and grateful to you for assisting me.