biomodhub / biomod2

BIOMOD is a computer platform for ensemble forecasting of species distributions, enabling the treatment of a range of methodological uncertainties in models and the examination of species-environment relationships.
91 stars 22 forks source link

R session crashed without any warnings or errors being reported #288

Closed tongruiju closed 1 year ago

tongruiju commented 1 year ago

Dear the Team,

I use R 4.2.3 and have updated biomod2. The code became stuck at “SP Data Formatting” for hours, and then the R session crashed without any warnings or errors being reported.

Here comes the model run. Thanks a lot in advance.


library("sp")
The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
which was just loaded, will retire in October 2023.
Please refer to R-spatial evolution reports for details, especially
https://r-spatial.org/r/2023/05/15/evolution4.html.
It may be desirable to make the sf package available;
package maintainers should consider adding sf to Suggests:.
The sp package is now running under evolution status 2
     (status 2 uses the sf package in place of rgdal)

> library("raster")
> library("biomod2")
biomod2 4.2-4-1 loaded.
 /!\ Since version 4.2 biomod2 relies on terra and may thus return SpatRaster that can easily be converted in RasterStack with `stack()`. We apologize for the trouble @('_')@
> library("iterators") 
> library("sf")
Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1;
sf_use_s2() is TRUE
> library("foreach") 
> library("splines")
> library("gam") 
Loaded gam 1.22-2

> #library("ecospat") 
> #library("maxnet")
> #library("doParallel")
> 
> 
> DataSpecies <- read.csv("E:\\PAPER9\\species\\DP_ex.csv")
> 
> attach(DataSpecies)
> head(DataSpecies)
  SP         x        y
1  1 -17303203 -5459405
2  1 -17065528 -1586116
3  1 -17045267 -1588608
4  1 -17044784 -1587985
5  1 -17044382 -1586774
6  1 -17044301 -1587362
> myRespName <- 'SP'
> myResp <- as.numeric(DataSpecies[,myRespName])
> myRespXY <- DataSpecies[,c("x", "y")]
> 
> list <- list.files(path='E:\\PAPER9\\image', full.names=TRUE)
> rasterData <- stack(list)
> myExpl<-raster::stack(rasterData)
> 
> myBiomodData <- BIOMOD_FormatingData(resp.var=myResp,
                                     expl.var=myExpl, 
                                     resp.xy=myRespXY,
                                     resp.name=myRespName,
                                     na.rm=TRUE) 

-=-=-=-=-=-=-=-= SP Data Formating -=-=-=-=-=-=-=-=

      ! No data has been set aside for modeling evaluation
=====================---------|---------|deling evaluation

Best regards, Ruiju

rpatin commented 1 year ago

Dear Ruiju, Thank you for reporting :pray: If the R session crashed, it is likely a memory issue. Did it used to work with the same data and previous biomod2 version ? Which quantity of data do you used ? Can you share the output of:

  1. table(DataSpecies$SP, useNA = "ifany")
  2. show(myExpl)

Thanks in advance, Best regards, Rémi

tongruiju commented 1 year ago

Dear Rémi,

Sorry, there is actually a pop-up window that says “R Session Aborted. R has encountered a fatal error, and the session has been terminated.”

Previously, sometimes the process would crash during the projection section using GBM with the previous version of biomod2, but the SP Data Formatting step completed quickly. Currently, it is stuck in that step. The amount of raster data is 110GB, with a maximum of approximately 6000 points containing presence and background data for each species. The process used up to 73GB of memory, which is the maximum for this step. The total available memory is 512GB.

> table(DataSpecies$SP, useNA = "ifany")

   0    1
3108 3108
> show(myExpl)
class      : RasterStack
dimensions : 29369, 69470, 2040264430, 15  (nrow, ncol, ncell, nlayers)
resolution : 500, 500  (x, y)
extent     : -17367530, 17367470, -7342270, 7342230  (xmin, xmax, ymin, ymax)
crs        : +proj=cea +lat_ts=30 +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs
names      :          bpi9,    curvature,   gebco2022PJ, gebco2022PJ_TRI, gebglo_oaWGS84PJ, gebglo_phWGS84PJ, gebphosWGS84PJ, gebposWGS84PJ, gebsalWGS84PJ, gebsilWGS84PJ, gebtempWGS84PJ, pro_ChlorophyllMean, pro_CurrentVelocityMean, Reproject//51_to_2000,        slope
min values : -8.443740e+03,            ?, -1.091900e+04,               ?,                ?,                ?,              ?,  5.448590e-01,             ?,  3.321393e-01,  -2.080507e+00,                   ?,            2.233892e-11,                     ?,            ?
max values :    3717.81470,            ?,    8520.00000,               ?,                ?,                ?,              ?,     140.64447,             ?,     247.57033,       29.89326,                   ?,                 1.55857,                     ?,            ?

Best, Ruiju

tongruiju commented 1 year ago

Dear Rémi,

I tested the codes on another workstation. The SP Data Formating still takes 1-2hours, which is quite diffrent from the previous biomod2 version, but it does not crash. However, a new error occured in the modeling process, which has not happened before. I used the spatial cross-vaidation. Thank you very much.

> DataSpecies <- read.csv("D:\\PAPER9\\species\\PA_ex.csv")
> 
> attach(DataSpecies)
> head(DataSpecies)
  SP         x        y
1  1 -17363266 -5129765
2  1 -17309639 -5125333
3  1 -17304814 -5125333
4  1 -17303179 -5125333
5  1 -17242098 -5134500
6  1 -17237274 -5129918
> myRespName <- 'SP'
> myResp <- as.numeric(DataSpecies[,myRespName])
> myRespXY <- DataSpecies[,c("x", "y")]
> list <- list.files(path='D:\\PAPER9\\image', full.names=TRUE)
> rasterData <- stack(list)
> myExpl<-raster::stack(rasterData)
> myBiomodData <- BIOMOD_FormatingData(resp.var=myResp,
                                       expl.var=myExpl, 
                                       resp.xy=myRespXY, 
                                       resp.name=myRespName, 
                                       na.rm=TRUE) 

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= SP Data Formating -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

      ! No data has been set aside for modeling evaluation
                                          eling evaluation
 !!! Some data are located in the same raster cell. 
          Please set `filter.raster = TRUE` if you want an automatic filtering.
      ! No data has been set aside for modeling evaluation
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Done -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> myBiomodCV <- read.csv("D:\\PAPER9\\species\\PA_exB.csv",header=FALSE)
> myBiomodCV  <- as.matrix(myBiomodCV )
> myBiomodOptions <- BIOMOD_ModelingOptions()
> myBiomodModelOut<-BIOMOD_Modeling(bm.format = myBiomodData, bm.options = myBiomodOptions,models=c('GBM', 'RF'), data.split.table = myBiomodCV,  prevalence=0.5, var.import=1, metric.eval=c('TSS', 'KAPPA', 'ROC'),  do.full.models=FALSE, modeling.id=paste(myRespName, "FirstModeling", sep=""))  

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Build Single Models -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Checking Models arguments...

    > Automatic weights creation to rise a 0.5 prevalence
! ignored obsolete argument 'do.full.models' as 'CV.do.full.models' was also given
! ignored obsolete argument 'data.split.table' as 'CV.strategy' was not set to 'user.defined'
Creating suitable Workdir...

Checking Cross-Validation arguments...
Error in .bm_CrossValidation.check.args(bm.format = bm.format, strategy = strategy,  : 
  perc (or CV.perc) is required when strategy = 'random'

Best, Ruiju

rpatin commented 1 year ago

Dear Ruiju, Thank you for the update.

  1. Concerning the crash, it is rather difficult to understand what is happening. I can advise:

    • try using library(terra) ; terraOptions(todisk = TRUE) before running your script, this may help prevent crash from handling very large raster. Beware that if the disk is not an SSD this can then be quite slower.
    • If that does not help and for identifying why the formatting is running so slowly, I would need the data (occurrences + climatic raster) to understand what is happening. You can send them to remi.patin@univ-grenoble-alpes.fr, I can setup a cloud repo if need be.
  2. Concerning the argument error, the cross-validation module was updated in a recent version and argument changed. Here you need to adjust slightly your code (as advised by the error encountered):

    myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData, 
                                    bm.options = myBiomodOptions,
                                    models=c('GBM', 'RF'), 
                                    prevalence=0.5, var.import=1, 
                                    metric.eval=c('TSS', 'KAPPA', 'ROC'),  
                                    CV.user.table = myBiomodCV,  
                                    CV.strategy = "user.defined",
                                    modeling.id=paste(myRespName, "FirstModeling", sep=""))  

Beware, however that the formatting of user provided CV table may have changed slightly and you may need to adapt the formatting of myBiomodCV.

Best, Rémi

tongruiju commented 1 year ago

Dear Rémi,

I added the headline of myBiomodCV as following,

_allData_RUN1   _allData_RUN2   _allData_RUN3   _allData_RUN4       _allData_RUN5
FALSE   TRUE    TRUE    TRUE    TRUE
FALSE   TRUE    TRUE    TRUE    TRUE

The species data as following,

SP  x   y
1   -17363265.75    -5129765.334
1   -17309638.68    -5125332.961

And i got the error below,

> myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData, 
+                                     bm.options = myBiomodOptions,
+                                     models=c('GBM', 'RF'), 
+                                     prevalence=0.5, var.import=1, 
+                                     metric.eval=c('TSS', 'KAPPA', 'ROC'),  
+                                     CV.user.table = myBiomodCV,  
+                                     CV.strategy = "user.defined",
+                                     modeling.id=paste(myRespName, "FirstModeling", sep="")) 

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Build Single Models -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Checking Models arguments...

    > Automatic weights creation to rise a 0.5 prevalence
Creating suitable Workdir...

Checking Cross-Validation arguments...
Error in .bm_CrossValidation.check.args(bm.format = bm.format, strategy = strategy,  : 
  user.table must have as many rows (dim1) than your species as data

I wonder if the deletion of multiple points within a single cell is done automatically and therefore causes the error. Is it possible to retain the multipoints?

Best, Ruiju

rpatin commented 1 year ago

Dear Ruiju, By default option filter.raster in BIOMOD_FormattingData is set to FALSE so this should not be the issue here. Maybe you can try with na.rm = FALSE, this could be the issue here. (This may make BIOMOD_FormattingData fail, in which case it means that you have to remove points with NA in environmental variables beforehand) Can you share the output of:

Did you have a specific method for cross-validation ? If not you can also use the method integrated within biomod2 which will function more easily with NA and/or filter.raster.

Best, Rémi

tongruiju commented 1 year ago

Dear Rémi,

Thanks a lot. Using the setting na.rm=FALSE still does not work. The rows of myBiomodCV are shown as 3898, as seen below. however, there are actually 3897 lines of records in the data, and one headline. s 3898. You can see below. But actually it is 3897 lines of records, and one being the header. Yes,we do have specific method of spatial cross vallidation. BTW, is the difference between trained models predicted using different versions biomod2 4.2-1 and biomod2 4.2-4-1, larger than the difference between models predicted using same version biomod2 4.2-1 ?

myBiomodData <- BIOMOD_FormatingData(resp.var=myResp,expl.var=myExpl, resp.xy=myRespXY, resp.name=myRespName, na.rm=FALSE) 

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= SP Data Formating -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

      ! No data has been set aside for modeling evaluation
                                          eling evaluation
 !!! Some data are located in the same raster cell. 
          Please set `filter.raster = TRUE` if you want an automatic filtering.
      ! No data has been set aside for modeling evaluation
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Done -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> myBiomodCV <- read.csv("D:\\PAPER9\\species\\PA_exB.csv",header=FALSE)
> myBiomodCV  <- as.matrix(myBiomodCV )
> myBiomodOptions <- BIOMOD_ModelingOptions()
> myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData, 
+                                     bm.options = myBiomodOptions,
+                                     models=c('GBM', 'RF'), 
+                                     prevalence=0.5, var.import=1, 
+                                     metric.eval=c('TSS', 'KAPPA', 'ROC'),  
+                                     CV.user.table = myBiomodCV,  
+                                     CV.strategy = "user.defined",
+                                     modeling.id=paste(myRespName, "FirstModeling", sep="")) 

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Build Single Models -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Checking Models arguments...

    > Automatic weights creation to rise a 0.5 prevalence
Creating suitable Workdir...

Checking Cross-Validation arguments...
Error in .bm_CrossValidation.check.args(bm.format = bm.format, strategy = strategy,  : 
  user.table must have as many rows (dim1) than your species as data
> 
> nrow(myBiomodCV)
[1] 3898
> show(myBiomodData)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.formated.data -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

dir.name =  .

sp.name =  SP

     789 presences,  3108 true absences and  0 undefined points in dataset

     15 explanatory variables

      bpi9             curvature          gebco2022PJ    gebco2022PJ_TRI  
 Min.   :-257.1729   Min.   :-1.26e-01   Min.   :-8905   Min.   :  0.000  
 1st Qu.:  -3.7407   1st Qu.:-1.20e-03   1st Qu.:-2946   1st Qu.:  6.708  
 Median :   0.0864   Median : 0.00e+00   Median :-1144   Median : 17.464  
 Mean   :  -1.0328   Mean   : 1.64e-06   Mean   :-1761   Mean   : 45.285  
 3rd Qu.:   3.5555   3rd Qu.: 1.20e-03   3rd Qu.: -313   3rd Qu.: 45.133  
 Max.   : 442.3704   Max.   : 1.36e-01   Max.   :   -1   Max.   :795.627  
 gebglo_ocWGS84PJ gebglo_phWGS84PJ gebphosWGS84PJ    gebposWGS84PJ    
 Min.   :0.5354   Min.   :7.560    Min.   :0.04364   Min.   :  5.731  
 1st Qu.:1.1274   1st Qu.:7.857    1st Qu.:0.85518   1st Qu.: 60.251  
 Median :1.8251   Median :7.974    Median :1.16032   Median : 80.841  
 Mean   :2.0208   Mean   :7.945    Mean   :1.37619   Mean   : 75.198  
 3rd Qu.:2.6954   3rd Qu.:8.035    3rd Qu.:2.10400   3rd Qu.: 92.027  
 Max.   :6.2228   Max.   :8.167    Max.   :3.20775   Max.   :104.556  
 gebsalWGS84PJ    gebsilWGS84PJ      gebtempWGS84PJ   pro_ChlorophyllMean
 Min.   : 5.558   Min.   :  0.9359   Min.   :-1.532   Min.   :0.004262   
 1st Qu.:34.686   1st Qu.:  6.4019   1st Qu.: 1.818   1st Qu.:0.004363   
 Median :34.911   Median : 13.4248   Median : 3.644   Median :0.006174   
 Mean   :34.783   Mean   : 39.8260   Mean   : 5.150   Mean   :0.055474   
 3rd Qu.:35.099   3rd Qu.: 56.5542   3rd Qu.: 7.141   3rd Qu.:0.056123   
 Max.   :38.742   Max.   :174.8513   Max.   :28.940   Max.   :1.499119   
 pro_CurrentVelocityMean Reproject_epc_av_1951_to_2000     slope        
 Min.   :0.00314         Min.   : 0.2631               Min.   : 0.0000  
 1st Qu.:0.03791         1st Qu.: 3.4568               1st Qu.: 0.2641  
 Median :0.05817         Median : 8.1798               Median : 0.7193  
 Mean   :0.07654         Mean   :12.3095               Mean   : 1.9046  
 3rd Qu.:0.09491         3rd Qu.:17.9012               3rd Qu.: 1.9449  
 Max.   :0.77660         Max.   :75.7245               Max.   :31.3036  

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Best regards, Ruiju

rpatin commented 1 year ago

Dear Ruiju, Thank you for the additional explanation :pray: Can you try by changing:

> myBiomodCV <- read.csv("D:\\PAPER9\\species\\PA_exB.csv",header=FALSE)

into

> myBiomodCV <- read.csv("D:\\PAPER9\\species\\PA_exB.csv",header=TRUE)

I fear that your header was understood as data hence the mismatch in dimension. If you call colnames(myBiomodCV) this should give you:

_allData_RUN1   _allData_RUN2   _allData_RUN3   _allData_RUN4       _allData_RUN5

Concerning your other question:

BTW, is the difference between trained models predicted using different versions biomod2 4.2-1 and biomod2 4.2-4-1, larger than the difference between models predicted using same version biomod2 4.2-1 ?

I am not sure to understand what you mean. There has been quite a lots of changes since 4.2-1 (see here). Among the changes, one may be impacting your perception of model performance ( bugfix in 4.2-3):

validation metric calculation now properly use the calibration threshold (i.e a threshold optimized on calibration data instead of validation data). This can lead to less optimistic threshold-dependent validation metric.

which can affect perceived model performance, but the less optimistic results in current biomod2 version are more reliable than the old one.

Best regards, Rémi