biomodhub / biomod2

BIOMOD is a computer platform for ensemble forecasting of species distributions, enabling the treatment of a range of methodological uncertainties in models and the examination of species-environment relationships.
83 stars 22 forks source link

Understanding the cutoff ensemble calculation #506

Open jacksonwr5798 opened 2 days ago

jacksonwr5798 commented 2 days ago

I was wondering if you could provide further explanation of the calculation for the cutoff values generated for the ensemble models. I saw the explanation in a previous issue for the calculation of the cutoff values for each individual model, but it does not explain how the value is calculated for the ensemble model. I have attached csv files with the output for each individual model as well as the cutoff values for the ensemble. Based on the kept models, I would have expected the cutoff values for the ensemble to higher. Any information or guidance on this topic would be greatly appreciated.

I am happy to provide further data, code, and output if that would be helpful to answering my question.

Thanks, Jackson cutoff_20240919.csv cutoff_ensemble_20240916.csv ensemble_kept_20240919.csv

 Amphi_data <- BIOMOD_FormatingData(
    resp.var = Amphi_occ['Anaxyrus.quercicus'], #presence data and update for new species ---
    resp.xy = Amphi_occ[, c('longitude', 'latitude')], #location data
    expl.var = bioclim_SE_US_sub, #climate variables
    resp.name = "Anaxyrus.quercicus", ##update when looking at different species ---
    PA.nb.rep = 11, #number of sets of pseudo absences selected
    PA.nb.absences = c(rep(790,10), rep(10000,1)), #, rep(1000,10)#number pseudo absences selected --> match number of obs. from thinned data
    PA.strategy = 'random' #heard that the random/disk approaches are the best
    )
Amphi_data

allModels <- c( 'FDA', 'GAM', 'GBM', 'GLM',
                'MARS', 'MAXNET', 'RF', 'XGBOOST', 'MAXENT', 
                'ANN', 'CTA', 'SRE')
SelectModels <- c('GBM','XGBOOST', 'GLM')

#load default model options to tune the selected models
Amphi_Options <- bm_ModelingOptions(
  data.type = "binary",
  models = allModels, #SelectModels,
  strategy = "default",
  user.val = NULL,
  user.base = "bigboss",
  bm.format = Amphi_data,
  calib.lines = NULL)
Amphi_Options

#Tune GBM
tuned.gbm <- bm_Tuning(model = 'GBM',
                       tuning.fun = 'gbm', ## see in ModelsTable
                       do.formula = FALSE,
                       #do.stepAIC = FALSE,
                       bm.options = Amphi_Options@options$GBM.binary.gbm,
                       metric.eval = "ROC",
                       metric.AIC = "AIC",
                       bm.format = Amphi_data)
tuned.gbm

#Tune xgboost
tuned.xgboost <- bm_Tuning(model = 'XGBOOST',
                                    tuning.fun = 'xgbTree', ## see in ModelsTable
                                    do.formula = FALSE,
                                    #do.stepAIC = FALSE,
                                    bm.options = Amphi_Options@options$XGBOOST.binary.xgboost,
                                    metric.eval = "ROC",
                                    metric.AIC = "AIC",
                                    bm.format = Amphi_data)

tuned.xgboost

#updated 8/5/24
#Define user settings based on tuning output for each run
user.GBM <- list('_PA1_allRun' = list(n.trees = 1000,
                                 interaction.depth = 8,
                                 n.minobsinnode = 10,
                                 shrinkage = 0.01),
                 '_PA2_allRun' = list(n.trees = 500,
                                 interaction.depth = 8,
                                 n.minobsinnode = 10,
                                 shrinkage = 0.01),
                 '_PA3_allRun' = list(n.trees = 1000,
                                 interaction.depth = 8,
                                 n.minobsinnode = 10,
                                 shrinkage = 0.01),
                 '_PA4_allRun' = list(n.trees = 2500,
                                 interaction.depth = 8,
                                 n.minobsinnode = 10,
                                 shrinkage = 0.01),
                 '_PA5_allRun' = list(n.trees = 2500,
                                 interaction.depth = 8,
                                 n.minobsinnode = 10,
                                 shrinkage = 0.01),
                 '_PA6_allRun' = list(n.trees = 1000,
                                 interaction.depth = 8,
                                 n.minobsinnode = 10,
                                 shrinkage = 0.01),
                 '_PA7_allRun' = list(n.trees = 500,
                                 interaction.depth = 8,
                                 n.minobsinnode = 10,
                                 shrinkage = 0.001),
                 '_PA8_allRun' = list(n.trees = 500,
                                 interaction.depth = 8,
                                 n.minobsinnode = 10,
                                 shrinkage = 0.01),
                 '_PA9_allRun' = list(n.trees = 500,
                                 interaction.depth = 5,
                                 n.minobsinnode = 10,
                                 shrinkage = 0.01),
                 '_PA10_allRun' = list(n.trees = 500,
                                  interaction.depth = 8,
                                  n.minobsinnode = 10,
                                  shrinkage = 0.001))#,
# '_allData_allRun' = list(n.trees = 2500,
  #                     interaction.depth = 7,
  ##                   n.minobsinnode = 5,
  #                 shrinkage = 0.001))

#updated 8/5/24
user.XGBOOST <- list('_PA1_allRun' = list(nrounds = 50,
                                     max_depth = 1,
                                     eta = 0.3,
                                     gamma = 0,
                                     colsample_bytree = 0.8,
                                     min_child_weight = 1,
                                     subsample = 0.5),
                     '_PA2_allRun' = list(nrounds = 50,
                                     max_depth = 1,
                                     eta = 0.4,
                                     gamma = 0,
                                     colsample_bytree = 0.8,
                                     min_child_weight = 1,
                                     subsample = 0.5),
                     '_PA3_allRun' = list(nrounds = 50,
                                     max_depth = 1,
                                     eta = 0.3,
                                     gamma = 0,
                                     colsample_bytree = 0.8,
                                     min_child_weight = 1,
                                     subsample = 0.5),
                     '_PA4_allRun' = list(nrounds = 50,
                                     max_depth = 1,
                                     eta = 0.4,
                                     gamma = 0,
                                     colsample_bytree = 0.8,
                                     min_child_weight = 1,
                                     subsample = 0.5),
                      '_PA5_allRun' = list(nrounds = 50,
                                           max_depth = 1,
                                           eta = 0.3,
                                           gamma = 0,
                                           colsample_bytree = 0.8,
                                           min_child_weight = 1,
                                           subsample = 0.5),
                      '_PA6_allRun' = list(nrounds = 50,
                                           max_depth = 1,
                                           eta = 0.4,
                                           gamma = 0,
                                           colsample_bytree = 0.8,
                                           min_child_weight = 1,
                                           subsample = 0.5),
                       '_PA7_allRun' = list(nrounds = 50,
                                       max_depth = 1,
                                       eta = 0.3,
                                       gamma = 0,
                                       colsample_bytree = 0.8,
                                       min_child_weight = 1,
                                       subsample = 0.5),
                        '_PA8_allRun' = list(nrounds = 50,
                                       max_depth = 1,
                                       eta = 0.3,
                                       gamma = 0,
                                       colsample_bytree = 0.6,
                                       min_child_weight = 1,
                                       subsample = 0.5),
                         '_PA9_allRun' = list(nrounds = 50,
                                       max_depth = 1,
                                         eta = 0.3,
                                         gamma = 0,
                                         colsample_bytree = 0.8,
                                         min_child_weight = 1,
                                         subsample = 0.5),
                          '_PA10_allRun' = list(nrounds = 50,
                                           max_depth = 1,
                                           eta = 0.4,
                                           gamma = 0,
                                           colsample_bytree = 0.8,
                                           min_child_weight = 1,
                                           subsample = 0.5))

user.GLM <- list('_PA11_allRun' = list())#family = binomial(link = 'logit'),

#compile user validated model settings
user.val <- list(GBM.binary.gbm.gbm = user.GBM,
                 XGBOOST.binary.xgboost.xgboost = user.XGBOOST,
                 GLM.binary.stats.glm = user.GLM)

#Update model options with user defined model settings
Amphi_Options <- bm_ModelingOptions(data.type = 'binary',
                                    models = c( "GBM", "XGBOOST", "GLM"),
                                    strategy = "user.defined",
                                    user.val = user.val,
                                    #user.base = "bigboss",
                                    bm.format = Amphi_data)
Amphi_Options

AmphiModelOut <- BIOMOD_Modeling(bm.format = Amphi_data,
                                             modeling.id = 'AllModels',
                                             models = c("GLM",
                                                        "GBM",
                                                        #"RF", #select final algo options based on ROC/TSS plot
                                                        #"GAM",
                                                        #"CTA",
                                                        #"ANN", #ANN models fail every time
                                                        #"FDA",
                                                        #"SRE",
                                                        #"MARS",
                                                        #"MAXENT", 
                                                        #"MAXNET",
                                                        "XGBOOST"
                                                        ),
                                              bm.options = Amphi_Options, #model settings --> select tuned or default settings above
                                              models.pa = list(
                                                 "GBM" = c("PA1","PA2","PA3","PA4","PA5","PA6","PA7","PA8","PA9","PA10"),
                                                 "XGBOOST" = c("PA1","PA2","PA3","PA4","PA5","PA6","PA7","PA8","PA9","PA10"),
                                                 #"ANN" = c("PA1","PA2","PA3","PA4","PA5","PA6","PA7","PA8","PA9","PA10"),
                                                 #"RF" = c("PA1","PA2","PA3","PA4","PA5","PA6","PA7","PA8","PA9","PA10"),
                                                 #"CTA" = c("PA1","PA2","PA3","PA4","PA5","PA6","PA7","PA8","PA9","PA10"),
                                                 #"FDA" = c("PA11","PA12","PA13","PA14","PA15","PA16","PA17","PA18","PA19","PA20"),
                                                 #"SRE" = c("PA11","PA12","PA13","PA14","PA15","PA16","PA17","PA18","PA19","PA20"),
                                                 #"MARS" = c("PA11","PA12","PA13","PA14","PA15","PA16","PA17","PA18","PA19","PA20"),
                                                 "GLM" = c("PA11")), 
                                                 #"GAM" = c("PA21"),
                                                 #"MAXNET" = c("PA21")),
                                                 #"MAXENT" = c("PA21")), #Removed MAXENT because it was lowering the cutoff significantly
                                               CV.strategy = 'env',
                                               CV.k = 2,
                                               CV.balance = "presences",
                                               CV.strat = "both",
                                               var.import = 1,
                                               metric.eval = c('TSS','ROC'))  #'KAPPA'
AmphiModelOut #check for failed model

cutoff <- get_evaluations(AmphiModelOut)
write.csv(cutoff, file = "cutoff_20240919.csv")

AmphiModelEM <- BIOMOD_EnsembleModeling(bm.mod = AmphiModelOut,
                                                 models.chosen = 'all',
                                                 em.by = 'all',
                                                 em.algo = c(#'EMmean', #'EMcv', 'EMci', 'EMmedian', 'EMca',
                                                 'EMwmean'),
                                                 metric.select = c('TSS'),
                                                 metric.select.thresh = c(0.7),
                                                 metric.eval = c('TSS', 'ROC'),
                                                 var.import = 1,
                                                 EMci.alpha = 0.05, #changed back to 0.05 on 7/25/24
                                                 EMwmean.decay = 'proportional')
AmphiModelEM
get_built_models(AmphiModelEM)
ensemble_kept <- get_kept_models(AmphiModelEM)
write.csv(ensemble_kept, file = "ensemble_kept_20240919.csv")

cutoff_ensemble <- get_evaluations(AmphiModelEM) #summarizes evaluation results
write.csv(cutoff_ensemble, file = "cutoff_ensemble_20240916.csv")

cutoff_20240919.csv cutoff_ensemble_20240916.csv ensemble_kept_20240919.csv

MayaGueguen commented 1 day ago

Hello Jackson 👋

Thank you for posting with lot of details and informations 🙏

Here is a little recap of what's going on when doing ensemble and evaluating them :

Hope it helps, and please do not hesitate if some things still need to be clarified 👀

Maya

jacksonwr5798 commented 1 day ago

Thank you for getting back so quickly! Should I be concerned that the cutoffs are so low? With a cutoff that low, my initial thought is that future projections are predicting widespread species distribution when that may not be the case. Does a lower cutoff indicate greater uncertainty in the models? I used this code/method with a few other species and got varying cutoff values for the ensemble (100-700), so I was just curious if that is what was driving some of this.

I am going to rerun a few things based on some of the parameters that you discussed above. I may have additional follow up questions but this was very helpful!

Jackson