RF downsample with k-fold spatial block cross-validation

BRamiroSanchez commented 1 year ago

Hi

I'm also trying to implement RF down-sample following best guidelines by Valavi et al. (2021), with a user-defined k-fold spatial block cross-validation strategy.

I have built my own CV.user.table for cross-validation using the blockCV R package. To match the resulting object from blockCV::cv_spatial() with the calib.lines object from biomod, I have made sure to rename the column names accordingly (i.e., _allData_RUNx) . When I try to run biomod_Modeling() with the modified RF down-sample options (i.e., sampsize argument), I get an error error regarding the sampsize argument:

-=-=-=--=-=-=- cluster1_allData_RUN4_RF 
Model=Breiman and Cutler's random forests for classification and regression
    > RF modeling...Error in randomForest.default(m, y, ...) : 
  sampsize can not be larger than class frequency
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'object' in selecting a method for function 'predict': object 'model.bm' not found

My guess, after reading thread #172 and thread #267, is that I might not be specifying the correct number of samples (too many) after having split my data into folds for cross-validation. My intention is to always downsample to the minimum of presence/absence occurrences available in the k-fold. Below is an extract of my code:

# 4. k-fold spatial block CV
cur.blocks <- cv_spatial(x = pa_data,
                         column = "Observed", # the response column (binary or multi-class)
                         r = baseline[["BO21_dissoxmean_bdmean"]],
                         k = nb.of.folds, # number of folds: 4
                         size = blocksize, # size of the blocks in metres
                         hexagon = TRUE,
                         selection = "random", # random blocks-to-fold
                         iteration = 100, # find evenly dispersed folds
                         biomod2 = TRUE) # also create folds for biomod2

colnames(cur.blocks$biomod_table) <-  paste0("_allData","_RUN", 1:nb.of.folds)
saveRDS(cur.blocks$biomod_table, paste0(initial.wd, "/data/", sp, "_blocks.RDS"))

# E.g output cur.blocks$records for one species
# train_0 train_1 test_0 test_1
# 1      95     219     33     63
# 2     102     200     26     82
# 3      76     209     52     73
# 4     111     218     17     64
# E.g output cur.blocks$records for another species
# train_0 train_1 test_0 test_1
# 1     262      31     81     36
# 2     262      51     81     16
# 3     250      59     93      8
# 4     255      60     88      7

# head(cur.blocks$biomod_table)
# _allData_RUN1 _allData_RUN2 _allData_RUN3 _allData_RUN4
# [1,]          TRUE         FALSE          TRUE          TRUE
# [2,]          TRUE         FALSE          TRUE          TRUE
# [3,]          TRUE         FALSE          TRUE          TRUE
# [4,]          TRUE         FALSE          TRUE          TRUE
# [5,]          TRUE         FALSE          TRUE          TRUE
# [6,]          TRUE         FALSE          TRUE          TRUE

# 5. Biomod data preparation
run.data <- BIOMOD_FormatingData(resp.var = P.points, 
                                 expl.var = calib.env.data,
                                 resp.name = sp, 
                                 resp.xy = coorxy,
                                 PA.nb.rep = runs.PA, # runs.PA = 0 (I have presence-absence data)
                                 PA.nb.absences = nb.PA, # nb.PA = 0 (I have presence-absence data)
                                 PA.strategy = NULL,# Strategy of pseudoabsences selection
                                 PA.user.table = PA.table) # PA.table = NULL (I have presence-absence data)

# 6. Models calibration 
# Load cross-validation blocks
cur.blocks <- readRDS(paste0(initial.wd, "/data/", sp, "_blocks.RDS")) # object cur.blocks$biomod_table

model.runs <- BIOMOD_Modeling(bm.format = run.data, 
                              models =  c("GLM", "ANN","MARS", "GBM","FDA","RF","MAXNET"),
                              CV.strategy = "user.defined",
                              CV.user.table = cur.blocks, # For the k-fold CV
                              var.import = 3, 
                              metric.eval = c("TSS", "ROC"), 
                              CV.do.full.models = FALSE, 
                              prevalence = NULL, 
                              scale.models = FALSE,
                              bm.options = BIOMOD_ModelingOptions(
                                GLM = list(myFormula = as.formula(paste(sp, "~" , paste(paste("poly(", colnames(run.data@data.env.var), ",2)", sep=""),
                                                                                        collapse = "+"), sep=""))
                                ),
                                GAM = list(algo= "GAM_mgcv",
                                           type = 's_smoother',
                                           method = "REML",  # recommended: might be the default already in gam
                                           k = 20
                                ),
                                GBM = list(n.trees = 50,
                                           bag.fraction = 0.75),
                                RF = list(myFormula = as.formula(as.factor(sp) ~ .),
                                          do.classif = TRUE,
                                          ntree = 1000,
                                          mtry = "default",
                                          sampsize =  c("0" = min(as.numeric(table(run.data@data.species))), "1" = min(as.numeric(table(run.data@data.species)))),
                                          replace = TRUE)
                              )
)

How could I specify in the sampsize argument the fact that the "total data" to calculate the minimum of 0/1 should actually be in relation to how the training data (0/1s) is distributed in each fold? Currently, as the sampsize argument is, it doesn't take into account the data in each fold.

Thank you so much in advance for any help.

Kind regards, Berta

data_biomod.zip

Originally posted by @BRamiroSanchez in https://github.com/biomodhub/biomod2/issues/172#issuecomment-1616775960

rpatin commented 1 year ago

Hi Berta, Thank you for reporting and submitting and nicely formatted issue with a reproducible example :pray:

First concerning the error you get, the issue is that when you do:

sampsize =  c("0" = min(as.numeric(table(run.data@data.species))), "1" = min(as.numeric(table(run.data@data.species)))),

You calculate the number based on the full dataset, instead of the crossvalidated one. If you want an approach specifying sampsize based on the smallest number of 0 or 1 present in the dataset, you can use summary as follow:

library(dplyr)
calib.summary <- 
  summary(run.data, calib.lines =  cur.blocks$biomod_table) %>% 
  filter(dataset == "calibration")

The table contains the summary of all calibration and validation dataset in terms of presences and absences. You can then use it to feed argument sampsize in BIOMOD_ModelingOptions as follows:

sampsize =  c("0" = min(calib.summary$True_Absences),
              "1" = min(calib.summary$Presences))

Now concerning sampsize, I am not sure to understand what you are trying to achieve. By default, if you do not provide sampsize argument, RF in biomod2 will use all presences and absences available in each fold. So no need to specify sampsize if that is what you are trying to achieve.

Alternatively if you want RF to use only a fraction of the data for each calibration. You can set sampsize to lower values, but the values are common accross all cross-validation fold, so you need to use values lower than the minimum number of presences or absences in all folds (which you can get using summary as shown above).

Finally, if you want to have specific sampsize for each one of your fold (e.g. 50% of the data available in the fold), this cannot be achieved for now, but should be possible with next biomod2 release (eta end of summer).

Best regards, Rémi

BRamiroSanchez commented 1 year ago

Hi @rpatin

Thank you so much for your detailed answer and code hints :)

Yes, as in your last suggestion, what I would like is to set the sampsize specific to the calibration data available in each fold. So, each calibrated model from each fold (a total of 4 RF models in my example) would have the sample of both "0s" and "1s" down sized to the minimum available in that fold.

If:

 calib.summary <- 
     summary(run.data, calib.lines = cur.blocks$biomod_table) %>%  
     filter(dataset == "calibration")

 unique(calib.summary)
       dataset  run      PA Presences True_Absences Pseudo_Absences Undefined
1  calibration RUN1 allData       221           103               0        NA
5  calibration RUN2 allData       192            91               0        NA
9  calibration RUN3 allData       196           102               0        NA
13 calibration RUN4 allData       237            88               0        NA

Then, the sampsize for the RF model from fold1 ("cluster1_allData_RUN1_RF") would be:

sampsize =  c("0" = 103 , # E.g. RUN1: "0" = min(calib.summary$True_Absences, calib.summary$Presences)
              "1" = 103)  # E.g. RUN1: "1" = min(calib.summary$True_Absences, calib.summary$Presences)

Looking forward to the next release :)

Cheers, Berta

rpatin commented 1 year ago

Hi Berta, Thank you for the update :pray: I now fully understand your issue, unfortunately it pertains to the third part of my points (have specific sampsize for each one of your fold), which is not possible yet, but should be in a near future. Meanwhile If your partition are not too much imbalanced (similar number of presences/absences, you can still go with the minimum number of presences/absences and set the same value for all folds. This will ensure a prevalence of 50% in your tree building and should not be that much different from the ideal solution (given that imbalance is limited).

Cheers, Rémi

BRamiroSanchez commented 1 year ago

Hi Rémi,

Thank you for confirming; I might then proceed with your suggestion in the meantime. Thank you again for being so responsive! :)

Cheers, Berta

Farewe commented 9 months ago

Hi @rpatin , I see that this issue has been resolved and implemented, which is great!

However I am not sure how to do that. Could you give here an example of how to define the argument sampsize such that we have a balanced sample size for each fold, as in your message here?

Finally, if you want to have specific sampsize for each one of your fold (e.g. 50% of the data available in the fold), this cannot be achieved for now, but should be possible with next biomod2 release (eta end of summer).

Many thanks! Boris

Farewe commented 8 months ago

For anyone interested, example code can be found in #393.

biomodhub / biomod2

RF downsample with k-fold spatial block cross-validation #297