ConsBiol-unibern / SDMtune

Performs variable selection and model tuning for Species Distribution Models (SDMs). It also provides several utilities to display results.
https://consbiol-unibern.github.io/SDMtune/

[BUG] modelReport() using RF method: type of predictors in new data do not match that of the training data. #11

Closed: scrameri closed this issue 3 years ago

scrameri commented 3 years ago

Dear Sergio et al.,

I've been trying out SDMtune, and I really like the streamlined analysis approach, visual feedback, and the genetic algorithm for reducing the hyperparameter search space. Good job!

Today I experimented with different model methods, and everything works fine so far with the Maxnet, Maxent, BRT and ANN methods. However, there is an issue with the RF method; see the bug report below.

The same error appears using my own data, after variable selection, hyperparameter tuning and model parsimony optimization. The error message suggests that predict.randomForest() cannot handle the passed argument newdata, but I couldn't figure out what happens.

Am I doing something wrong? Any help would be warmly appreciated.

Many thanks and best wishes from Zurich, Simon

Describe the bug

modelReport() with the RF method cannot write predicted distribution map using the default virtualSp dataset.

To Reproduce

library(SDMtune)

# Acquire environmental variables
files <- list.files(path = file.path(system.file(package = "dismo"), "ex"),
                    pattern = "grd", full.names = TRUE)
predictors <- raster::stack(files)

# Prepare presence and background locations
p_coords <- virtualSp$presence
bg_coords <- virtualSp$background

# Create SWD object
data <- prepareSWD(species = "Virtual species", p = p_coords, a = bg_coords,
                   env = predictors, categorical = "biome")

# Split presence locations in training (80%) and testing (20%) datasets
datasets <- trainValTest(data, test = 0.2, only_presence = TRUE)
train <- datasets[[1]]
test <- datasets[[2]]

# Train a model using the RF method
model <- train(method = "RF", data = train)

# Create the report
modelReport(model, type = "cloglog", folder = "testfolder", test = test,
            response_curves = FALSE, only_presence = TRUE, jk = TRUE,
            env = predictors, permut = 2)

── Model Report - method: RF ──────────────────────────────── Virtual species ──
✓ Saving files...
✓ Plotting ROC curve...
✓ Computing thresholds...
- Predicting distribution map...Quitting from lines 113-121 (modelReport.Rmd) 
Error in predict.randomForest(object@model, data, type = "prob") : 
  Type of predictors in new data do not match that of the training data.

Expected behavior

modelReport() should run to completion with every supported model method.

Additional Context

> model
Object of class SDMmodel 
Method: RF 

Species: Virtual species 
Presence locations: 320 
Absence locations: 5000 

Model configurations:
--------------------
mtry: 3
ntree: 500
nodesize: 1

Variables:
---------
Continuous: bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 
Categorical: biome

> model@model@model

Call:
 randomForest(x = x, y = as.factor(p), ntree = ntree, mtry = mtry) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 9.3%
Confusion matrix:
     0   1 class.error
0 4825 175       0.035
1  320   0       1.000
> test
Object of class SWD 

Species: Virtual species 
Presence locations: 80 
Absence locations: 5000 

Variables:
---------
Continuous: bio1 bio12 bio16 bio17 bio5 bio6 bio7 bio8 
Categorical: biome 
> predictors
class      : RasterStack 
dimensions : 192, 186, 35712, 9  (nrow, ncol, ncell, nlayers)
resolution : 0.5, 0.5  (x, y)
extent     : -125, -32, -56, 40  (xmin, xmax, ymin, ymax)
crs        : +proj=longlat +datum=WGS84 +no_defs 
names      : bio1, bio12, bio16, bio17, bio5, bio6, bio7, bio8, biome 
min values :  -23,     0,     0,     0,   61, -212,   60,  -66,     1 
max values :  289,  7682,  2458,  1496,  422,  242,  461,  323,    14 
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] kableExtra_1.3.1 SDMtune_1.1.3   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5          highr_0.8           plyr_1.8.6          pillar_1.4.7        compiler_4.0.2      plotROC_2.2.1      
 [7] tools_4.0.2         digest_0.6.27       viridisLite_0.3.0   evaluate_0.14       lifecycle_0.2.0     tibble_3.0.4       
[13] gtable_0.3.0        lattice_0.20-41     pkgconfig_2.0.3     rlang_0.4.9         cli_2.2.0           rstudioapi_0.13    
[19] yaml_2.2.1          rgdal_1.5-18        xfun_0.19           dismo_1.3-3         dplyr_1.0.2         httr_1.4.2         
[25] stringr_1.4.0       raster_3.4-5        knitr_1.30          xml2_1.3.2          generics_0.1.0      vctrs_0.3.5        
[31] webshot_0.5.2       grid_4.0.2          tidyselect_1.1.0    glue_1.4.2          R6_2.5.0            fansi_0.4.1        
[37] rmarkdown_2.5       sp_1.4-4            farver_2.0.3        ggplot2_3.3.2       purrr_0.3.4         magrittr_2.0.1     
[43] scales_1.1.1        codetools_0.2-18    ellipsis_0.3.1      htmltools_0.5.0     assertthat_0.2.1    randomForest_4.6-14
[49] rvest_0.3.6         colorspace_2.0-0    labeling_0.4.2      stringi_1.5.3       munsell_0.5.0       crayon_1.3.4       

sgvignali commented 3 years ago

Hi Simon, thanks for reporting the problem. The error occurs because predict.randomForest() expects a factor for the variable biome but in the raster stack object the variable is numeric. Please refer to a previous issue for the explanation: https://github.com/ConsBiol-unibern/SDMtune/issues/8.

As explained in that issue, the problem can be solved by passing the argument factors to the predict() function. However, this was not possible with modelReport(), so I have added a new factors argument to it.

Please install the GitHub version and let me know if this solves the problem.
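
The type mismatch described above can be illustrated in base R, without SDMtune or randomForest. This is only a sketch with made-up toy values: train_df stands in for training data where biome is stored as a factor, and new_df stands in for values read from a raster layer, where biome arrives as plain numbers. Both data frames and their values are assumptions for illustration.

```r
# Toy training data: biome is a factor, as it would be for a categorical variable
train_df <- data.frame(bio1 = c(120, 250, 80, 190),
                       biome = factor(c(1, 7, 13, 7)))

# Toy "new data": values taken from a raster layer are plain numeric
new_df <- data.frame(bio1 = c(100, 220),
                     biome = c(7, 13))

# Compare which columns are factors in each data frame
same_type <- identical(sapply(train_df, is.factor),
                       sapply(new_df, is.factor))

# Re-encode the numeric column with the levels seen during training
new_df$biome <- factor(new_df$biome, levels = levels(train_df$biome))
fixed_type <- identical(sapply(train_df, is.factor),
                        sapply(new_df, is.factor))
```

Before the conversion the biome column is a factor in one data frame and numeric in the other, which is the kind of difference the error message complains about. After re-encoding with the training levels, the column types agree.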

scrameri commented 3 years ago

Hi Sergio,

Thanks very much for implementing the factors argument in modelReport(), it works!

By sampling many background points, I made sure that all factor levels of categorical.predictors are represented in the training dataset and the RF model. Using SDMtune version 1.1.3.9000 and the code below, all predictor types and all factor levels match. The list passed as factors must contain only the variables (with their factor levels) that are used in the model. It also works with an empty named list (i.e. when no categorical predictors are used in the model).

Best wishes, Simon

> categorical.predictors
[1] "eco2017"    "geology"    "vegetation"
> used <- names(model@data@data)
> f <- lapply(as.list(data@data[,categorical.predictors]), levels)[used[used %in% categorical.predictors]]
> f # here, geology was not used in the model, and was removed from the list before executing modelReport()
$eco2017
[1] "1" "2" "3" "4" "5" "6" "7"

$vegetation
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "9"  "10" "11" "12" "13" "14" "15" "16" "19" "22" "23" "25"
> modelReport(model = model, folder = folder, test = test, type = NULL,
                          response_curves = FALSE, only_presence = TRUE, 
                          jk = FALSE, env = predictors[[used]], clamp = TRUE, permut = 10,
                          factors = f)
── Model Report - method: RF ──────────────────────────────────── chermezonii ──
✓ Saving files...
✓ Plotting ROC curve...
✓ Computing thresholds...
✓ Predicting distribution map...
✓ Computing variable importance...
✓ Writing model settings...
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] leaflet_2.0.3      ConR_1.3.0         plotROC_2.2.1      ggplot2_3.3.3      kableExtra_1.3.1  
[6] raster_3.4-5       SDMtune_1.1.3.9000 sp_1.4-4
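
The one-liner used above to build the factors list can also be wrapped in a small function. The sketch below is a hypothetical helper, not part of SDMtune: factor_levels_used() and the toy data frame swd_data are assumptions for illustration, with swd_data standing in for data@data. It keeps only the categorical variables that are both used in the model and stored as factors.

```r
# Hypothetical helper (not part of SDMtune): return a named list of factor
# levels for the categorical variables that are actually used in the model.
factor_levels_used <- function(df, categorical, used = names(df)) {
  # Keep variables that are categorical, used, and present in the data frame
  keep <- intersect(intersect(used, categorical), names(df))
  # Drop anything that is not stored as a factor
  keep <- keep[vapply(df[keep], is.factor, logical(1))]
  lapply(df[keep], levels)
}

# Toy data standing in for data@data (values are made up)
swd_data <- data.frame(
  bio1 = c(12.5, 20.1, 8.3),
  eco2017 = factor(c(1, 3, 3)),
  geology = factor(c("a", "b", "a")),
  vegetation = factor(c(2, 10, 2))
)
categorical <- c("eco2017", "geology", "vegetation")
used <- c("bio1", "eco2017", "vegetation")  # geology is not used in the model

f <- factor_levels_used(swd_data, categorical, used)
```

Here f contains entries for eco2017 and vegetation only, and the function returns an empty list when none of the used variables is categorical. A list of this shape is what was passed as factors = f in the modelReport() call above.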