biomodhub / biomod2

BIOMOD is a computer platform for ensemble forecasting of species distributions, enabling the treatment of a range of methodological uncertainties in models and the examination of species-environment relationships.
87 stars 22 forks source link

Error in BIOMOD2 4.2-3 [CV.strategy = "user.defined"] #264

Closed vanderleidebastiani closed 1 year ago

vanderleidebastiani commented 1 year ago

Error and context

I am using blockCV package to create spatial blocks to separate train and test folds. I am sampling pseudo-absences in BIOMOD_FormatingData function and then using cv_spatial function (package blockCV) to create spatial blocks. Everything was running fine in the previous version of the biomod2 (4.2-2), but the new version (4.2-3) report an error. I think that it is an inconsistency in the steps related to checking names.

Partial console outputs

CV.user.table
head(spatialBlocks$biomod_table)
      RUN1 RUN2 RUN3 RUN4 RUN5 RUN6  RUN7 RUN8  RUN9 RUN10
[1,] FALSE TRUE TRUE TRUE TRUE TRUE  TRUE TRUE  TRUE  TRUE
[2,]  TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE  TRUE  TRUE
[3,]  TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE  TRUE  TRUE
[4,]  TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE  TRUE  TRUE
[5,]  TRUE TRUE TRUE TRUE TRUE TRUE  TRUE TRUE FALSE  TRUE
[6,]  TRUE TRUE TRUE TRUE TRUE TRUE  TRUE TRUE FALSE  TRUE

Model single models

myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                     modeling.id = 'AllModels',
                                     models = c('RF'),
                                     bm.options = myBiomodOptions,
                                     CV.strategy = "user.defined",
                                     CV.user.table = spatialBlocks$biomod_table,
                                     metric.eval = c('TSS','ROC'),
                                     var.import = 2,
                                     seed.val = 42)

-=-=-=-=-=-=-=-=-=-=-=-=-=-= Build Single Models -=-=-=-=-=-=-=-=-=-=-=-=-=-=

Checking Models arguments...

Automatic weights creation to rise a 0.5 prevalence
Creating suitable Workdir...

Checking Cross-Validation arguments...

User defined cross-validation selection 

Error in .check_calib.lines_names(user.table, expected_PA.names = colnames(bm.format@PA.table)) : 
  colnames(calib.lines) must follow the following format: '_PAx_RUNy' with x and y integer

Code used to get the error

Example from BIOMOD_Modeling function (modified)

require(terra) # 1.7-29
require(biomod2) # 4.2-3
require(blockCV) # 3.1-1
require(sf) # 1.0-12

# Load species occurrences (6 species available)
data(DataSpecies)
head(DataSpecies)

# Select the name of the studied species
myRespName <- 'GuloGulo'

# Get corresponding presence/absence data
myResp <- as.numeric(DataSpecies[, myRespName])

# Get corresponding XY coordinates
myRespXY <- DataSpecies[, c('X_WGS84', 'Y_WGS84')]

# Load environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
data(bioclim_current)
myExpl <- terra::rast(bioclim_current)

# Remove absences (transform to NA)
myResp <- ifelse(myResp==0, NA, myResp)
sum(myResp, na.rm = TRUE)

# Format Data
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
                                     expl.var = myExpl,
                                     resp.xy = myRespXY,
                                     resp.name = myRespName,
                                     PA.nb.rep = 10,
                                     PA.nb.absences = sum(myResp, na.rm = TRUE),
                                     PA.strategy = "random")
class(myBiomodData)

# Create default modeling options
myBiomodOptions <- BIOMOD_ModelingOptions()

# blockCV steps - Extract presence+pseudo-absence from BIOMOD_FormatingData object
myPoints <- cbind(myBiomodData@data.species, myBiomodData@coord)
colnames(myPoints) <- c("ocor","x","y")
pa_data <- sf::st_as_sf(myPoints, 
                        coords = c("x", "y"), 
                        crs = 4326)

spatialBlocks <- blockCV::cv_spatial(x = pa_data,
                  column = "ocor",
                  r = myExpl,
                  hexagon = FALSE,
                  biomod2 = TRUE, 
                  size = 1000000,
                  k = 10,
                  selection = "random")
# CV.user.table
head(spatialBlocks$biomod_table)

# Model single models
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                    modeling.id = 'AllModels',
                                    models = c('RF'),
                                    bm.options = myBiomodOptions,
                                    CV.strategy = "user.defined",
                                    CV.user.table = spatialBlocks$biomod_table,
                                    metric.eval = c('TSS','ROC'),
                                    var.import = 2,
                                    seed.val = 42)
myBiomodModelOut

# Work fine in biomod2 v4.2-2
# myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
#                                     modeling.id = 'AllModels',
#                                     models = c('RF'),
#                                     bm.options = myBiomodOptions,
#                                     data.split.table = spatialBlocks$biomod_table,
#                                     metric.eval = c('TSS','ROC'),
#                                     var.import = 2,
#                                     seed.val = 42)

Environment Information

sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sf_1.0-12     blockCV_3.1-1 biomod2_4.2-3 terra_1.7-29 

loaded via a namespace (and not attached):
 [1] s2_1.1.3               utf8_1.2.3             generics_0.1.3        
 [4] class_7.3-22           KernSmooth_2.23-21     stringi_1.7.12        
 [7] lattice_0.21-8         plotmo_3.6.2           magrittr_2.0.3        
[10] pROC_1.18.2            grid_4.3.0             iterators_1.0.14      
[13] gbm_2.1.8.1            foreach_1.5.2          plyr_1.8.8            
[16] Matrix_1.5-4           e1071_1.7-13           nnet_7.3-19           
[19] DBI_1.1.3              reshape_0.8.9          Formula_1.2-5         
[22] survival_3.5-5         mgcv_1.8-42            fansi_1.0.4           
[25] scales_1.2.1           codetools_0.2-19       PresenceAbsence_1.1.11
[28] abind_1.4-5            cli_3.6.1              rlang_1.1.1           
[31] units_0.8-2            munsell_0.5.0          splines_4.3.0         
[34] withr_2.5.0            plotrix_3.8-2          tools_4.3.0           
[37] reshape2_1.4.4         earth_5.3.2            dplyr_1.1.2           
[40] colorspace_2.1-0       ggplot2_3.4.2          vctrs_0.6.2           
[43] R6_2.5.1               rpart_4.1.19           proxy_0.4-27          
[46] classInt_0.4-9         lifecycle_1.0.3        stringr_1.5.0         
[49] randomForest_4.7-1.1   MASS_7.3-60            pkgconfig_2.0.3       
[52] pillar_1.9.0           gtable_0.3.3           glue_1.6.2            
[55] Rcpp_1.0.10            TeachingDemos_2.12     tibble_3.2.1          
[58] tidyselect_1.2.0       rstudioapi_0.14        mda_0.5-3             
[61] farver_2.1.1           nlme_3.1-162           wk_0.7.3              
[64] maxnet_0.1.4           compiler_4.3.0         sp_1.6-0     
rpatin commented 1 year ago

Hello @vanderleidebastiani Thank you for using our new issue template and providing a well-formated issue :pray: We have updated our management of pseudo-absences and cross-validation in our latest biomod2 release and now when using cross-validation with multiple pseudo-absences dataset you need to provide the cross-validation information (as TRUE/FALSE) for each one of your pseudo-absence (PA) dataset (more info here). And the column names must follow some rules as indicated by the error:

colnames(calib.lines) must follow the following format: '_PAx_RUNy' with x and y integer

Therefore:

  1. If you have 10 PA dataset and want 10 cross-validation per PA dataset, you need a data.frame with 10x10 columns properly named: _PA1_RUN1, _PA1_RUN2, ... , _PA1_RUN10, _PA2_RUN1, ...
  2. If you wanted only one cross-validation per PA dataset you can then use a a data.frame with only 10 columns named: _PA1_RUN1, _PA2_RUN1, ... , _PA10_RUN1

Here you have a data.frame that have column names RUN1, RUN2, ... that does not match the new format we ask for. I hope this is clearer. If not, feel free to ask additonal questions.

Best, Rémi

vanderleidebastiani commented 1 year ago

Hi @rpatin

Thanks for your help. I want to run with the first options (10 PA datasets and want 10 cross-validations). I tried to make use of the same spatial block with all pseudo-absences sets. The function passes across the previous error, however how all models fail (the models worked fine in biomod2 4.2-2). Am I making some formatting mistake in CV.user.table or would it be another problem?

I have an additional question unrelated. How to force the random forest function to perform classification or regression options?

Error

Model=Breiman and Cutler's random forests for classification and regression
    RF modeling...Error in na.fail.default(structure(list(GuloGulo = structure(c(2L, 2L,  : 
  missing values in object
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'object' in selecting a method for function 'predict': object 'model.bm' not found
head(CVTable)

_PA1_RUN1 _PA1_RUN2 ... _PA10_RUN9 _PA10_RUN10
[1,]      TRUE      TRUE       ...      TRUE       FALSE
[2,]      TRUE      TRUE       ...      TRUE        TRUE
[3,]      TRUE      TRUE       ...      TRUE        TRUE
[4,]      TRUE      TRUE       ...      TRUE        TRUE

Code used to get the error (continuation)

nK <- 10
nPseudo <- 10

CVTable <- matrix(NA, nrow = nrow(spatialBlocks$biomod_table), ncol = nK*nPseudo)
colnames(CVTable) <- paste0("x", seq_len(nK*nPseudo))
for(i in 1:nPseudo){
  CVTable[,c(i*nK-(nK-1)):c(i*nK)] <- spatialBlocks$biomod_table
  colnames(CVTable)[c(i*nK-(nK-1)):c(i*nK)] <- paste0("_PA",i, "_RUN", seq_len(nK))
}
head(CVTable)

myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                    modeling.id = 'AllModels',
                                    models = c('RF'),
                                    bm.options = myBiomodOptions,
                                    CV.strategy = "user.defined",
                                    CV.user.table = CVTable,
                                    CV.do.full.models = FALSE,
                                    metric.eval = c('TSS','ROC'),
                                    var.import = 2,
                                    seed.val = 42)
myBiomodModelOut
rpatin commented 1 year ago

Hi @vanderleidebastiani, Thank you for the additional information and sorry for this oversight in our code :pray: The proper way to format your CV table should have NA in cells where the point is not included in the given PA dataset. However we intended to have a check that could correct such table (as the info can be found internally in the PA table) but a small coding oversight made it fail as you observed ... You can:

  1. update to the github version that I just corrected: devtools::install_github('biomodhub/biomod2')
  2. alternatively if you want to keep the current CRAN version you can reformat your CV table to have NA when the point is not included in the given PA dataset.

On a side note, did you know that you can do block cross-validation within biomod2 as well ? For sure using external BlockCV may have additionnal flexibility though.

Best, Rémi

vanderleidebastiani commented 1 year ago

Hi @rpatin

Thanks so much. I installed the development version and work fine. This improvement makes it much easier to use.

About my question, I can force regression or classification in BIOMOD_ModelingOptions function (argument do.classif = TRUE/FALSE. Just to register.

Best, Vanderlei

rpatin commented 1 year ago

Hi Vanderlei, Indeed I forgot the other question, but you found the answer :+1: Best, Rémi