biomodhub / biomod2

BIOMOD is a computer platform for ensemble forecasting of species distributions, enabling the treatment of a range of methodological uncertainties in models and the examination of species-environment relationships.

Issue with Low TSS Scores in Ensemble Modeling of Australian Bustard using Biomod2 (lamichhanesaurav2078@gmail.com) #474

Open Saurav1227 opened 2 weeks ago

Saurav1227 commented 2 weeks ago

Context and question

I am working on an ensemble modeling project for the Australian Bustard in Australia using the biomod2 package. The code runs without errors, but I consistently get low True Skill Statistic (TSS) scores for every algorithm except Random Forest:

- High TSS score (> 0.8): Random Forest only
- Low TSS score (0.3 or less): all other algorithms

Attempts to troubleshoot:

- Spatial autocorrelation: tried grids of 1 × 1 km and 5 × 5 km.
- Pseudo-absence points: used 10,000, 15,000, and 30,000 points outside a 1 km buffer around the presence points.
- Geographic scale: applied the model at the state level (e.g., Western Australia) but got similarly low TSS scores.
- Data distribution: my data is homogeneously distributed, which I suspect might be causing the issue.
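For reference, TSS is sensitivity + specificity - 1, so it ranges from -1 to 1, and values near 0 mean the model separates presences from (pseudo-)absences little better than chance. A minimal base-R sketch of that arithmetic follows; the helper name `tss_from_counts` is hypothetical (not a biomod2 function) and the counts are invented for illustration, not taken from the Bustard data.

```r
# Hypothetical helper (not part of biomod2): TSS from confusion-matrix counts.
tss_from_counts <- function(tp, fn, tn, fp) {
  sensitivity <- tp / (tp + fn)  # share of presences predicted as presence
  specificity <- tn / (tn + fp)  # share of absences predicted as absence
  sensitivity + specificity - 1
}

# Invented counts for illustration only:
# sensitivity = 0.60, specificity = 0.65, so TSS = 0.25
tss_example <- tss_from_counts(tp = 60, fn = 40, tn = 65, fp = 35)
print(tss_example)
```

A model can therefore show a TSS below 0.3 while still classifying well over half of both classes correctly.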

Related code

# Load required packages
library("raster")
library("ggplot2")
library("gridExtra")
library("rasterVis")
library("sf")
library("usdm")
library("biomod2")
library("dismo")
library("mgcv")
library("gam")
library("spThin")  # provides thin(), which the script calls on the occurrence data

# Load rasters
Dem <- raster("D:/Saurav/Ascii/dem.asc")
aspect <- raster("D:/Saurav/Ascii/aspect.asc")
slope <- raster("D:/Saurav/Ascii/slope.asc")
bio1 <- raster("D:/Saurav/Ascii/bio1.asc")
bio2 <- raster("D:/Saurav/Ascii/bio2.asc")
bio3 <- raster("D:/Saurav/Ascii/bio3.asc")
bio4 <- raster("D:/Saurav/Ascii/bio4.asc")
bio5 <- raster("D:/Saurav/Ascii/bio5.asc")
bio6 <- raster("D:/Saurav/Ascii/bio6.asc")
bio7 <- raster("D:/Saurav/Ascii/bio7.asc")
bio8 <- raster("D:/Saurav/Ascii/bio8.asc")
bio9 <- raster("D:/Saurav/Ascii/bio9.asc")
bio10 <- raster("D:/Saurav/Ascii/bio10.asc")
bio11 <- raster("D:/Saurav/Ascii/bio11.asc")
bio12 <- raster("D:/Saurav/Ascii/bio12.asc")
bio13 <- raster("D:/Saurav/Ascii/bio13.asc")
bio14 <- raster("D:/Saurav/Ascii/bio14.asc")
bio15 <- raster("D:/Saurav/Ascii/bio15.asc")
bio16 <- raster("D:/Saurav/Ascii/bio16.asc")
bio17 <- raster("D:/Saurav/Ascii/bio17.asc")
bio18 <- raster("D:/Saurav/Ascii/bio18.asc")
bio19 <- raster("D:/Saurav/Ascii/bio19.asc")

# Stack the variables
myExpl <- raster::stack(Dem, slope, aspect, bio1, bio2, bio3, bio4, bio5, bio6, bio7, bio8, bio9, bio10,
                        bio11, bio12, bio13, bio14, bio15, bio16, bio17, bio18, bio19)
plot(myExpl)

# Convert the raster stack to a data frame
myExpl_df <- as.data.frame(myExpl, xy = TRUE)
myExpl_df

# Calculate VIF
vif_result <- vifcor(myExpl_df, 0.7)

# Print VIF results
print(vif_result)

# Raster stack of layers with VIF below 5 and correlation below 0.7
myExpl1 <- raster::stack(Dem, slope, aspect, bio2, bio3, bio9, bio15, bio18, bio19)

plot(myExpl1)

# Read the CSV file
Bustard <- read.csv("D:/Saurav/Ascii2/Bustard.csv", header = TRUE)

ThinnedBustard <- thin(loc.data = Bustard,
                       lat.col = "Latitude",
                       long.col = "Longitude",
                       spec.col = "Bustard",
                       thin.par = 1,
                       reps = 1,
                       locs.thinned.list.return = TRUE,
                       write.log.file = FALSE,
                       out.dir = "D:/Saurav2/Ascii/thinnedBustard")

ThinnedBustard <- read.csv("D:/Saurav2/Ascii/thinnedbustard/thinned_data_thin1.csv", header = TRUE)

# Define your response variable
myRespName <- 'Bustard'
myResp <- as.numeric(ThinnedBustard[, myRespName])

# The XY coordinates of species data
myRespXY <- ThinnedBustard[, c("Longitude", "Latitude")]

plot(ThinnedBustard)

myRespXY
?BIOMOD_FormatingData

# Formatting biomod data
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
                                     expl.var = myExpl1,
                                     resp.xy = myRespXY,
                                     resp.name = myRespName,
                                     PA.nb.rep = 3,
                                     PA.nb.absences = 15000,
                                     PA.strategy = 'disk',
                                     PA.dist.min = 1000)  # 1 km buffer
myRespXY
?BIOMOD_FormatingData

# Count the rows of the thinned data to feed into the down-sampled RF
n <- round(nrow(ThinnedBustard) * 0.7 - 1)

# Define RF options
myRFoptions <- list(ntree = 500, sampsize = c("0" = n, "1" = n), replace = TRUE, nodesize = 5)
user.RF <- rep(list(myRFoptions), (ncol(myBiomodData@PA.table) + 1))
names(user.RF) <- c(paste0("_", names(myBiomodData@PA.table), "_allRun"), "_allData_allRun")

# Define user values
user.val <- list(RF.binary.randomForest.randomForest = user.RF)

all.models <- c("RF", "ANN", "CTA", "FDA", "GAM", "GBM", "GLM", "MARS", "MAXNET", "SRE")

# Set up modeling options
myBiomodOption <- bm_ModelingOptions(data.type = "binary",
                                     models = all.models,
                                     strategy = 'user.defined',
                                     user.val = user.val,
                                     bm.format = myBiomodData,
                                     calib.lines = NULL)

myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                    modeling.id = paste(myRespName, "Bustard", sep = ""),
                                    models = all.models,
                                    OPT.strategy = 'user.defined',
                                    OPT.user = myBiomodOption,  #!!
                                    CV.strategy = "random",
                                    CV.nb.rep = 3,
                                    CV.perc = 0.7,
                                    metric.eval = c("ROC", "TSS"),
                                    var.import = 0,
                                    scale.models = FALSE,
                                    nb.cpu = 8,
                                    do.progress = TRUE,
                                    seed.val = 42)

myCalibLines <- get_calib_lines(myBiomodModelOut)
plot(myBiomodData, calib.lines = myCalibLines)

# Get all models evaluation
myBiomodModelEval <- get_evaluations(myBiomodModelOut)
myBiomodModelEval
export <- as.data.frame(myBiomodModelEval)

write.csv(export, "D:/Saurav2/Ascii/ensembelresults.csv")

# Plot model evaluation scores
plot <- bm_PlotEvalMean(myBiomodModelOut,
                        metric.eval = c("ROC", "TSS"),
                        dataset = "calibration",
                        group.by = "algo",
                        do.plot = TRUE)  # if you wanna assign to plot

plot

https://drive.google.com/file/d/16N770JVTS4uH0Fn2cRNFYEa-IPFEsI6J/view?usp=sharing

Data: https://drive.google.com/file/d/1_BNgL6VUd_kpvIVlfyC7RaOYbrj3HRnN/view?usp=sharing

Variables: https://drive.google.com/file/d/1CK0w3blkw-Oz9R-TWm--zjQ0l7vZrFGQ/view?usp=sharing

HeleneBlt commented 2 weeks ago

Hello there,

I see nothing wrong here 👀 You do indeed have (relatively) low TSS scores and some overfitting for RF (even with the down-sampling method!). However, if you look at the validation scores, the different algorithms perform similarly (except SRE).
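One way to check the calibration/validation comparison described above is to average both scores per algorithm and look at the gap. The sketch below uses a toy data frame standing in for the output of `get_evaluations()`; the column names (`algo`, `metric.eval`, `calibration`, `validation`) are assumed from biomod2 4.2.x and the numbers are invented, so check them against your own object.

```r
# Toy stand-in for get_evaluations(myBiomodModelOut).
# Column names are assumed from biomod2 4.2.x; all values are invented.
eval_df <- data.frame(
  algo        = c("RF", "RF", "GLM", "GLM", "GBM", "GBM"),
  metric.eval = "TSS",
  calibration = c(0.85, 0.83, 0.28, 0.30, 0.35, 0.33),
  validation  = c(0.27, 0.25, 0.26, 0.28, 0.27, 0.29)
)

# Keep the TSS rows only
tss <- eval_df[eval_df$metric.eval == "TSS", ]

# Mean calibration and validation TSS per algorithm
mean_scores <- aggregate(cbind(calibration, validation) ~ algo, data = tss, FUN = mean)

# A large positive gap (calibration much higher than validation) suggests overfitting
mean_scores$gap <- mean_scores$calibration - mean_scores$validation
print(mean_scores[order(-mean_scores$gap), ])
```

With real results, the toy frame would be replaced by the data frame returned by `get_evaluations()`; an algorithm whose calibration score is far above its validation score is the one to suspect of overfitting.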

To correct the overfitting of RF, you can also try increasing nodesize and testing different maxnodes values.
(Note that we will soon be ready to switch to version 4.2-6 on GitHub, which will contain a new single model named RFd that computes a down-sampled RF without having to specify options for basic RF as we do now. It will be easier for you.)
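As a rough sketch of that suggestion, the block below builds a small set of candidate `randomForest` argument lists in the same shape as the `myRFoptions` list from the question, varying `nodesize` and `maxnodes`. The value of `n` and the candidate numbers are arbitrary placeholders, not recommendations.

```r
# Sketch only: n stands in for the per-class sample size computed in the script.
n <- 100  # hypothetical value

# Candidate settings to try; the specific numbers are arbitrary starting points.
rf_grid <- expand.grid(nodesize = c(5, 10, 20), maxnodes = c(8, 16, 32))

# One argument list per candidate, shaped like myRFoptions in the question
rf_candidates <- lapply(seq_len(nrow(rf_grid)), function(i) {
  list(ntree    = 500,
       sampsize = c("0" = n, "1" = n),
       replace  = TRUE,
       nodesize = rf_grid$nodesize[i],  # larger values give shallower trees
       maxnodes = rf_grid$maxnodes[i])  # caps the number of terminal nodes
})

length(rf_candidates)
```

Each candidate could be swapped in for `myRFoptions` before rebuilding `user.RF` and rerunning the models, and the validation TSS compared across runs.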

As you already suggested, your data seem homogeneously distributed, so trying different explanatory variables or another scale could improve the model. I hope you'll find a solution soon!

Hélène

Saurav1227 commented 2 weeks ago

Hi there,

Thank you for your response and checking my code and data. I am working on it.

Regards! Saurav

On Mon, 10 Jun 2024 at 8:54 pm, HBlancheteau @.***> wrote:
