biomodhub / biomod2

BIOMOD is a computer platform for ensemble forecasting of species distributions, enabling the treatment of a range of methodological uncertainties in models and the examination of species-environment relationships.
85 stars 22 forks source link

Error when using factor variables #268

Closed ZoGo94 closed 1 year ago

ZoGo94 commented 1 year ago

I have an issue when using factor data in the modelling process – when I use the same environmental data layers (as a spatraster or as a rasterstack, ive tried both) for modelling and generating response curves and when projecting the results using the same environmental variables, it thinks there are new levels in the categorical variables when the exact same files have been used.

Error when plotting response curves and also when projecting:

Error in { : 
  task 1 failed - "task 1 failed - "factor Aspect has new levels 0""

Code run:

> DataSpecies <- read.csv("directory”)
> myRespName <- 'W7'
> myResp <- as.numeric(DataSpecies[, myRespName])
> myRespXY <- DataSpecies[,c("X","Y")]

> path <- "directory"
> lst <- list.files(path=path, pattern='.asc$', full.names = T)
> myExpl <- terra::rast(lst)

> myExpl$SoilDepth = as.factor(myExpl$SoilDepth)
> myExpl$SoilType = as.factor(myExpl$SoilType)
> myExpl$Aspect = as.factor(myExpl$Aspect)

> is.factor(myExpl$SoilDepth)
[1] TRUE
> is.factor(myExpl$SoilType)
[1] TRUE
> is.factor(myExpl$Aspect)
[1] TRUE

myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
                                     expl.var = myExpl,
                                     resp.xy = myRespXY,
                                     resp.name = myRespName,
                                     PA.nb.rep = 1,
                                     PA.strategy = "random",
                                     PA.nb.absences = 300)

> show(myBiomodData)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.formated.data -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

dir.name =  .

sp.name =  W7

     85 presences,  0 true absences and  113 undefined points in dataset

     15 explanatory variables

     Aspect     Elevation       ElevationRoughness    GDD5Mean      MeanMinFebTemp   
 5      :36   Min.   :  2.522   Min.   : 1.319     Min.   : 693.5   Min.   :-2.5901  
 9      :26   1st Qu.: 67.901   1st Qu.: 3.691     1st Qu.:1360.8   1st Qu.: 0.2811  
 7      :24   Median :133.132   Median : 6.464     Median :1664.7   Median : 0.8392  
 6      :23   Mean   :163.532   Mean   : 7.679     Mean   :1595.1   Mean   : 0.7732  
 2      :22   3rd Qu.:217.984   3rd Qu.:10.212     3rd Qu.:1841.3   3rd Qu.: 1.3441  
 4      :22   Max.   :649.869   Max.   :35.414     Max.   :2301.3   Max.   : 3.3463  
 (Other):45                                                                          
 PminusPETInterVarSum PminusPETInterVarTrend PminusPETmean         Slope        
 Min.   :-217.5       Min.   :-9.357         Min.   : -85.21   Min.   : 0.4813  
 1st Qu.: 130.3       1st Qu.: 5.245         1st Qu.: 224.57   1st Qu.: 2.7964  
 Median : 218.3       Median : 7.277         Median : 434.16   Median : 4.5782  
 Mean   : 219.9       Mean   : 7.042         Mean   : 605.47   Mean   : 6.6768  
 3rd Qu.: 291.8       3rd Qu.: 9.670         3rd Qu.: 875.40   3rd Qu.: 8.7441  
 Max.   : 651.2       Max.   :21.251         Max.   :2989.75   Max.   :28.5348  

  SoilCNratio     SoilDepth   SoilNconc        SoilPconc         SoilpH      SoilType
 Min.   : 9.612   0:  0     Min.   :0.1842   Min.   :26.01   Min.   :4.325   0: 0    
 1st Qu.:11.568   1: 24     1st Qu.:0.3202   1st Qu.:39.39   1st Qu.:5.191   1:51    
 Median :11.851   2: 22     Median :0.4767   Median :43.66   Median :5.881   2:16    
 Mean   :13.651   3: 28     Mean   :0.6195   Mean   :44.09   Mean   :5.899   3:15    
 3rd Qu.:14.529   4: 14     3rd Qu.:0.7062   3rd Qu.:47.44   3rd Qu.:6.629   4:86    
 Max.   :30.152   5:110     Max.   :1.8358   Max.   :84.17   Max.   :7.816   5:17    
                                                                             6:13    

 1 Pseudo Absences dataset available ( PA1 ) with  113 (PA1) pseudo absences

myBiomodOption <- BIOMOD_ModelingOptions()

myBiomodCV <- bm_CrossValidation(bm.format = myBiomodData,
                                 strategy = "block",
                                 do.full.models = FALSE)

myBiomodModelOut <- BIOMOD_Modeling(myBiomodData,
                                    models = c('RF', 'MAXNET', 'GLM', 'MARS', 'GBM', 'FDA', 'CTA', 'ANN'),
                                    bm.options = myBiomodOption,
                                    CV.strategy = 'block',
                                    CV.user.table = myBiomodCV,
                                    var.import = 10,
                                    metric.eval = c("KAPPA", "TSS", "ROC"),
                                    CV.do.full.models = FALSE,
                                    seed.val = 42)

> show(myBiomodModelOut)

Modeling folder : .

Species modeled : W7

Modeling id : 1685612630

Considered variables : Aspect Elevation ElevationRoughness GDD5Mean MeanMinFebTemp 
PminusPETInterVarSum PminusPETInterVarTrend PminusPETmean Slope SoilCNratio SoilDepth SoilNconc 
SoilPconc SoilpH SoilType

Computed Models :  W7_PA1_RUN1_RF W7_PA1_RUN1_MAXNET W7_PA1_RUN1_GLM W7_PA1_RUN1_MARS 
W7_PA1_RUN1_GBM W7_PA1_RUN1_FDA W7_PA1_RUN1_CTA W7_PA1_RUN1_ANN W7_PA1_RUN2_RF 
W7_PA1_RUN2_MAXNET W7_PA1_RUN2_GLM W7_PA1_RUN2_MARS W7_PA1_RUN2_GBM W7_PA1_RUN2_FDA 
W7_PA1_RUN2_CTA W7_PA1_RUN2_ANN W7_PA1_RUN3_RF W7_PA1_RUN3_MAXNET W7_PA1_RUN3_GLM 
W7_PA1_RUN3_MARS W7_PA1_RUN3_GBM W7_PA1_RUN3_FDA W7_PA1_RUN3_CTA W7_PA1_RUN3_ANN W7_PA1_RUN4_RF 
W7_PA1_RUN4_MAXNET W7_PA1_RUN4_GLM W7_PA1_RUN4_MARS W7_PA1_RUN4_GBM W7_PA1_RUN4_FDA 
W7_PA1_RUN4_CTA W7_PA1_RUN4_ANN

Failed Models :  none

myBiomodEM <- BIOMOD_EnsembleModeling(myBiomodModelOut,
                                      models.chosen = 'all',
                                      em.by = 'all',
                                      metric.select = c('ROC'),
                                      metric.select.thresh = c(0.7),
                                      metric.select.dataset = 'validation',
                                      metric.eval = c('TSS', 'ROC', 'KAPPA'),
                                      var.import = 10,
                                      em.algo = c('EMwmean'),
                                      seed.val = 42)

> show(myBiomodEM)

sp.name : W7

expl.var.names : Aspect Elevation ElevationRoughness GDD5Mean MeanMinFebTemp 
PminusPETInterVarSum PminusPETInterVarTrend PminusPETmean Slope SoilCNratio SoilDepth SoilNconc 
SoilPconc SoilpH SoilType

models computed: W7_EMwmeanByROC_mergedData_mergedRun_mergedAlgo

models failed: none

responseplots <- bm_PlotResponseCurves(myBiomodEM,
                                       models.chosen = 'all',
                                       fixed.var = 'mean')

Error in { : 
  task 1 failed - "task 1 failed - "factor Aspect has new levels 0""

I get the same error when using this code:

responseplots <- bm_PlotResponseCurves(myBiomodEM,
                                       models.chosen = 'all',
                                       new.env = get_formal_data(myBiomodEM, "expl.var"),
                                       fixed.var = 'mean')

And the same error when trying to project the ensemble with the same data.

Also I'm only getting this error since the May 2023 package update - it ran fine before. I've also noticed that the validation evaluation results indicate my models are poorer than they were before the package update

sessionInfo()

> sessionInfo()
R version 4.3.0 (2023-04-21 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8    LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.utf8    

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gbm_2.1.8.1            tidyterra_0.4.0        ENMeval_2.0.4          magrittr_2.0.3         terra_1.7-29           spatstat_3.0-6        
 [7] spatstat.linnet_3.1-1  spatstat.model_3.2-4   rpart_4.1.19           spatstat.explore_3.2-1 nlme_3.1-162           spatstat.random_3.1-5 
[13] spatstat.geom_3.2-1    spatstat.data_3.0-1    dplyr_1.1.2            tidyr_1.3.0            ggtext_0.1.2           usdm_1.1-18           
[19] gridExtra_2.3          ggplot2_3.4.2          pROC_1.18.2            raster_3.6-20          sp_1.6-0               biomod2_4.2-3         

loaded via a namespace (and not attached):
 [1] shape_1.4.6            gtable_0.3.3           maxnet_0.1.4           spatstat.sparse_3.0-1  lattice_0.21-8         vctrs_0.6.2           
 [7] tools_4.3.0            spatstat.utils_3.0-3   generics_0.1.3         goftest_1.2-3          tibble_3.2.1           PresenceAbsence_1.1.11
[13] fansi_1.0.4            pkgconfig_2.0.3        Matrix_1.5-4           lifecycle_1.0.3        farver_2.1.1           compiler_4.3.0        
[19] stringr_1.5.0          deldir_1.0-9           munsell_0.5.0          dismo_1.3-14           codetools_0.2-19       class_7.3-21          
[25] glmnet_4.1-7           Formula_1.2-5          pillar_1.9.0           MASS_7.3-58.4          iterators_1.0.14       plotmo_3.6.2          
[31] earth_5.3.2            abind_1.4-5            foreach_1.5.2          mda_0.5-3              TeachingDemos_2.12     tidyselect_1.2.0      
[37] stringi_1.7.12         reshape2_1.4.4         purrr_1.0.1            labeling_0.4.2         splines_4.3.0          polyclip_1.10-4       
[43] grid_4.3.0             colorspace_2.1-0       cli_3.6.1              randomForest_4.7-1.1   survival_3.5-5         utf8_1.2.3            
[49] withr_2.5.0            tensor_1.5             scales_1.2.1           plotrix_3.8-2          nnet_7.3-18            mgcv_1.8-42           
[55] rlang_1.1.1            gridtext_0.1.5         Rcpp_1.0.10            glue_1.6.2             xml2_1.3.4             reshape_0.8.9         
[61] R6_2.5.1               plyr_1.8.8       
rpatin commented 1 year ago

Hi @ZoGo94 Thank you for reporting and having filled the issue template :pray:

The modeling and projection with categorical variable had many issues before and we tried to improve its management with the last update. Apparently it also broke some good behaviour, sorry for that :confused:

However it will be quite hard to debug without a reproducible example, or with access to your data (raster and points). Indeed it is generally tightly linked to the specific formatting of the data. If possible could you send the data in question ? Here is my mail: remi.patin@univ-grenoble-alpes.fr.

Concerning the evaluations, we had an error in previous version where validation metric were calculated with a threshold optimized over validation data, instead of the threshold optimized over calibration data. In short it is therefore expected to have lower validation metric. The point was raised in the changelog:

validation metric calculation now properly use the calibration threshold (i.e a threshold optimized on calibration data instead of validation data). This can lead to less optimistic threshold-dependent validation metric.

Best regards, Rémi

ZoGo94 commented 1 year ago

Hi Rémi,

Many thanks for your very fast response and information!

I would be happy to send you the data, however my email provider says I am unable to send to your address and says it is no longer valid?

All the best,

Zoe

I will also add that the error seems to be intermittent - sometimes it works, sometimes it doesn't :/

rpatin commented 1 year ago

Hi Zoe, It is surprising indeed as the mail address had no mistake. You can also try to send them through our university cloud (https://nextcloud.osug.fr/index.php/s/Wk4HybpZ7b4yRTM) - it is set up as upload only, so only I can read the data sent there.

If the error is intermittent it may be linked to your data (presence/pseudo-absence) not sampling all factor levels. Although it is supposed to be done properly internally. But I can imagine edge case where all factor levels are initially sampled but then discarded due to NA in other variables that would then lead to error when projecting with BIOMOD_Projection as some factor levels will be unknown.

If it worked in previous version, it is also possible that the variable was not accounted as a categorical variable at that time.

Best, Rémi

rpatin commented 1 year ago

Hi Zoe, Thanks a lot for the data you sent, it makes it much easier for us to provide help :pray: I corrected some additional internal issues with the management of categorical variables. If you update to the new github version with devtools::install_github('biomodhub/biomod2') it should hopefully work.

  1. Note however that your environmental data had a lot of NA across layers, so some levels disappeared and projection will be restricted to area without any NA in the data. In your situation this exclude most urban center. If you want projection on those area you have to restrict yourself with variables having values in those place as well. Here is what the mask looks like: image
  2. Some of the levels in your data are really poorly represented (e.g. levels 0 in SoilType). They will be modeled but you should be careful with them and it may be better to remove them for the data.
  3. You can derive the mask in question with:
    new.env.mask <-   classify(as.numeric(!any(is.na(myExpl))),
                           matrix(c(0, NA, 
                                    1, 1), 
                                  ncol = 2, byrow = TRUE)) 

    and you can apply it on your Environmental raster with:

    myExpl.masked <- mask(myExpl, new.env.mask)

    With that you can evaluate the frequency of all factor levels for the data that will effectively be used in the model:

    freq(myExpl.masked$SoilType)

    Best, Rémi

ZoGo94 commented 1 year ago

Hi Rémi,Thanks so so much for your time and feedback on this! Really can’t thank you enough!All the best,Zoe