Help with BIOMOD_FormatingData.

mengjiezhang4ds commented 2 months ago

Context and question When I used the BIOMOD_FormatingData of biomod2 to generate several numbers of pseudo absence points, I found that the number of pseudo absence points displayed in the return code did not match the number I set. I am looking forward to your answer very much, thank you very much!

Code used I am using the official case code, and I have pasted the code and output results below.

library(terra)
# Load species occurrences (6 species available)
data(DataSpecies)
# Select the name of the studied species
myRespName <- 'GuloGulo'
# Get corresponding presence/absence data
myResp <- as.numeric(DataSpecies[, myRespName])
# Get corresponding XY coordinates
myRespXY <- DataSpecies[, c('X_WGS84', 'Y_WGS84')]
# Load environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
data(bioclim_current)
myExpl <- terra::rast(bioclim_current)

# # Transform true absences into potential pseudo-absences
myResp.PA <- ifelse(myResp == 1, 1, NA)

# Format Data with pseudo-absences : random method
myBiomodData.r <- BIOMOD_FormatingData(resp.var = myResp.PA,
                                       expl.var = myExpl,
                                       resp.xy = myRespXY,
                                       resp.name = myRespName,
                                       PA.nb.rep = 4,
                                       PA.nb.absences = 1000,
                                       PA.strategy = 'random')
myBiomodData.r
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.formated.data -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

dir.name =  .

sp.name =  GuloGulo

     661 presences,  0 true absences and  1827 undefined points in dataset

     5 explanatory variables

      bio3            bio4            bio7           bio11             bio12         
 Min.   :10.19   Min.   :   72   Min.   : 54.5   Min.   :-447.75   Min.   :   0.028  
 1st Qu.:21.22   1st Qu.: 2641   1st Qu.:186.0   1st Qu.:-184.32   1st Qu.: 276.493  
 Median :35.00   Median : 6682   Median :306.2   Median :  24.23   Median : 562.931  
 Mean   :40.29   Mean   : 7358   Mean   :310.9   Mean   :  -2.64   Mean   : 853.516  
 3rd Qu.:56.35   3rd Qu.:11752   3rd Qu.:424.6   3rd Qu.: 196.30   3rd Qu.:1200.592  
 Max.   :92.00   Max.   :22314   Max.   :718.0   Max.   : 283.00   Max.   :5431.002  

 4 Pseudo Absences dataset available ( PA1, PA2, PA3, PA4 ) with  1000 (PA1, PA2, PA3, PA4) pseudo absences

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Here, I have set the generation of 1000 pseudo absence points in PA.nb.absences, and I have 661 real presence points. Therefore, the input model should have 661+1000=1661 points. However, the information returned by the code shows that 1827 pseudo absence points have been generated. So, is the parameter PA.nb.absences useless? Or is there another reason why the number of points doesn't match?

# Format Data with pseudo-absences : disk method
myBiomodData.d <- BIOMOD_FormatingData(resp.var = myResp.PA,
                                       expl.var = myExpl,
                                       resp.xy = myRespXY,
                                       resp.name = myRespName,
                                       PA.nb.rep = 4,
                                       PA.nb.absences = 500,
                                       PA.strategy = 'disk',
                                       PA.dist.min = 5,
                                       PA.dist.max = 35)
myBiomodData.d
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.formated.data -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

dir.name =  .

sp.name =  GuloGulo

     661 presences,  0 true absences and  776 undefined points in dataset

     5 explanatory variables

      bio3            bio4            bio7           bio11            bio12         
 Min.   :10.19   Min.   :  373   Min.   : 60.0   Min.   :-447.7   Min.   :   2.944  
 1st Qu.:18.33   1st Qu.: 6730   1st Qu.:294.1   1st Qu.:-254.0   1st Qu.: 259.500  
 Median :24.03   Median : 9819   Median :380.2   Median :-123.3   Median : 468.362  
 Mean   :27.80   Mean   :10015   Mean   :379.9   Mean   :-103.3   Mean   : 608.646  
 3rd Qu.:35.08   3rd Qu.:13468   3rd Qu.:470.9   3rd Qu.:  39.0   3rd Qu.: 796.305  
 Max.   :78.00   Max.   :22314   Max.   :718.0   Max.   : 264.0   Max.   :4698.995  

 4 Pseudo Absences dataset available ( PA1, PA2, PA3, PA4 ) with  500 (PA1, PA2, PA3, PA4) pseudo absences

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

I am also puzzled by the results of other generation methods such as disk, but the number of points still does not match. Why is this?

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8  LC_CTYPE=Chinese (Simplified)_China.utf8   
[3] LC_MONETARY=Chinese (Simplified)_China.utf8 LC_NUMERIC=C                               
[5] LC_TIME=Chinese (Simplified)_China.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] biomod2_4.2-4       gganimate_1.0.9     rnaturalearth_1.0.1 tidyterra_0.4.0     terra_1.7-46        rgdal_1.6-7        
 [7] raster_3.6-23       sp_2.1-0            sf_1.0-14           googlesheets4_1.1.1 rgbif_3.7.8         lubridate_1.9.3    
[13] forcats_1.0.0       stringr_1.5.1       dplyr_1.1.4         purrr_1.0.2         readr_2.1.4         tidyr_1.3.1        
[19] tibble_3.2.1        ggplot2_3.5.0       tidyverse_2.0.0    

loaded via a namespace (and not attached):
 [1] googledrive_2.1.1       colorspace_2.1-0        class_7.3-20            markdown_1.9            fs_1.6.3               
 [6] gridtext_0.1.5          ggtext_0.1.2            httpcode_0.3.0          rstudioapi_0.15.0       proxy_0.4-27           
[11] farver_2.1.1            urltools_1.7.3          earth_5.3.2             fansi_1.0.6             xml2_1.3.5             
[16] codetools_0.2-18        splines_4.2.0           Formula_1.2-5           jsonlite_1.8.8          mda_0.5-4              
[21] pROC_1.18.4             oai_0.4.0               compiler_4.2.0          httr_1.4.7              Matrix_1.6-1.1         
[26] lazyeval_0.2.2          gargle_1.5.2            cli_3.6.1               later_1.3.1             tweenr_2.0.3           
[31] prettyunits_1.1.1       tools_4.2.0             gtable_0.3.4            glue_1.6.2              rnaturalearthdata_1.0.0
[36] reshape2_1.4.4          maps_3.4.1              rappdirs_0.3.3          Rcpp_1.0.11             PresenceAbsence_1.1.11 
[41] cellranger_1.1.0        vctrs_0.6.5             crul_1.4.0              transformr_0.1.5        iterators_1.0.14       
[46] xfun_0.42               maxnet_0.1.4            lpSolve_5.6.20          timechange_0.2.0        lifecycle_1.0.4        
[51] MASS_7.3-56             scales_1.3.0            hms_1.1.3               promises_1.2.1          curl_5.1.0             
[56] TeachingDemos_2.12      triebeard_0.4.1         rpart_4.1.16            reshape_0.8.9           stringi_1.7.12         
[61] foreach_1.5.2           plotrix_3.8-2           randomForest_4.7-1.1    e1071_1.7-13            commonmark_1.9.1       
[66] rlang_1.1.1             pkgconfig_2.0.3         lattice_0.20-45         labeling_0.4.3          tidyselect_1.2.0       
[71] gbm_2.1.8.1             plyr_1.8.9              magrittr_2.0.3          R6_2.5.1                magick_2.8.0           
[76] generics_0.1.3          DBI_1.1.3               pillar_1.9.0            whisker_0.4.1           withr_3.0.0            
[81] units_0.8-4             survival_3.3-1          abind_1.4-5             nnet_7.3-17             crayon_1.5.2           
[86] xgboost_1.7.5.1         KernSmooth_2.23-20      utf8_1.2.4              tzdb_0.4.0              progress_1.2.2         
[91] grid_4.2.0              data.table_1.14.8       plotmo_3.6.2            classInt_0.4-10         mapchina_0.1.0         
[96] httpuv_1.6.11           openssl_2.1.1           munsell_0.5.0           askpass_1.2.0

HeleneBlt commented 2 months ago

Hello there,

In the sentence 661 presences, 0 true absences and 1827 undefined points in dataset, all the different pseudo-absences are taken into account.

Here, you take PA.nb.rep = 4 with PA.nb.absences = 1000 : four PA datasets will be selected but the PA points will not be identical. So you have 1827 PA points but they are not all present in the four PA datasets. This number can change as they are selected randomly.

You can check with: summary(myBiomodData.r@PA.table) You will have 661 (presences) + 1000 (PA points) = 1661 TRUE for each datasets. TheFALSE points correspond to the PA points chosen for other PA datasets but are not present in this dataset.

It is the same logic with the other PA strategies.

Hope it's clearer this way, Hélène

mengjiezhang4ds commented 2 months ago

Now I have set the number of PA to 2000 and set the pseudo absence point dataset to 4, but the result shows only 1 PA dataset. Why is this?

myBiomodData.r <- BIOMOD_FormatingData(resp.var = myResp.PA,
                                       expl.var = myExpl,
                                       resp.xy = myRespXY,
                                       resp.name = myRespName,
                                       PA.nb.rep = 4,
                                       PA.nb.absences = 2000,
                                       PA.strategy = 'random')
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= GuloGulo Data Formating -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

      ! No data has been set aside for modeling evaluation
      ! No data has been set aside for modeling evaluation

Checking Pseudo-absence selection arguments...

      ! No data has been set aside for modeling evaluation
   > random pseudo absences selection
   > Pseudo absences are selected in explanatory variables
   > All available cells have been selected ( 1827 cells )

      ! No data has been set aside for modeling evaluation
      ! No data has been set aside for modeling evaluation
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Done -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
as_tibble(myBiomodData.r@PA.table) %>% 
  pivot_longer(cols = everything(),names_to = "PA",values_to = "TF") %>% 
  group_by(PA,TF) %>% 
  summarise(n = n())
`summarise()` has grouped output by 'PA'. You can override using the `.groups` argument.
# A tibble: 1 × 3
# Groups:   PA [1]
  PA    TF        n
  <chr> <lgl> <int>
1 PA1   TRUE   2488

HeleneBlt commented 2 months ago

Hi !

You should pay attention to this message : > All available cells have been selected ( 1827 cells )

It shows that it is not possible to select 2000 PA points because there are only 1827 possible points. In this case, there is only one possible PA dataset with all the points ( 661 + 1827 =2488 ). So biomod2 will only create this dataset: a way of avoiding confusion with four identical data sets.

Hope it helps, Hélène

mengjiezhang4ds commented 2 months ago

Okay, I noticed that I seem to have entered a dead end, so I've been keeping my eyes on the number of my PA points. Thank you very much for your answer.

biomodhub / biomod2

Help with BIOMOD_FormatingData. #450