biomodhub / biomod2

BIOMOD is a computer platform for ensemble forecasting of species distributions, enabling the treatment of a range of methodological uncertainties in models and the examination of species-environment relationships.

Help with BIOMOD_FormatingData - [Problem with the number of presence after PA generation] #466

Closed VDjianBiogeo closed 4 months ago

VDjianBiogeo commented 4 months ago

Hello,

I'm trying to extract the pseudo-absences generated by the BIOMOD_FormatingData function in order to tune some RF and GBM models with another package. However, I have some issues with the number of pseudo-absences and presences after the PA generation process.

My original dataset has 649 presences at a global scale. I asked for as many pseudo-absences as presences in each of 10 PA datasets. My guess was that each dataset would end up with 1298 points: my original 649 presences plus 649 pseudo-absences.

  data <- mycto.occ[,c("x", "y", sp)] %>%
    na.omit()
  myRespName <- paste(substr(sp, 1, 1), 
                      str_split(sp, pattern = "_", simplify = T)[,2], sep=".")

  bm_data <- BIOMOD_FormatingData(resp.var= data[,3],
                                  expl.var= env_filtered,
                                  resp.xy = data[,1:2],
                                  resp.name = myRespName,
                                  PA.nb.rep = 10, 
                                  PA.nb.absences = as.integer(nrow(data)), 
                                  PA.strategy = 'sre',
                                  PA.sre.quant = 0.025, 
                                  na.rm = TRUE)

However, when I look at the number of rows in bm_data@PA.table, I get 6846 rows. And when I look at the number of presences using bm_data@PA.table$PA1[bm_data@PA.table$PA1 == T], I get 1058 presences, which is more than my original 649 presences.

And when I run summary on my BIOMOD formatted data, I get this result:

summary(bm_data)

       dataset run   PA Presences True_Absences Pseudo_Absences Undefined
1      initial  NA <NA>       409             0               0      6455
2  calibration  NA  PA1       409             0             649        NA
3  calibration  NA  PA2       409             0             649        NA
4  calibration  NA  PA3       409             0             649        NA
5  calibration  NA  PA4       409             0             649        NA
6  calibration  NA  PA5       409             0             649        NA
7  calibration  NA  PA6       409             0             649        NA
8  calibration  NA  PA7       409             0             649        NA
9  calibration  NA  PA8       409             0             649        NA
10 calibration  NA  PA9       409             0             649        NA
11 calibration  NA PA10       409             0             649        NA 

What happened to my presences? How can I extract all PA datasets with the right spatial coordinates associated with the right spatial cell?

Here is my species occurrence file: Data.csv

I cannot attach my environmental file, unfortunately; it is too large.

Thank you in advance for your help! Valentin

HeleneBlt commented 4 months ago

Hello Valentin,

Indeed, it can be confusing at first !

So :

So, if you want to extract the points of PA1, for example, bm_data@data.species[bm_data@PA.table$PA1] will give you a vector with 1 for the presence points and NA for the pseudo-absence points. You can get the matching coordinates with bm_data@coord[bm_data@PA.table$PA1, ]
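
As a minimal sketch of that indexing, here is the same pattern on a small mock data set in base R. The objects `data_species`, `coord`, and `pa_table` are hypothetical stand-ins for the `@data.species`, `@coord`, and `@PA.table` slots, and the helper `extract_pa()` is not part of biomod2; the coding of presences as 1 and pseudo-absences as NA follows the description above.

```r
# Mock stand-ins for the slots of the formatted data object (assumption:
# presences are coded 1, pseudo-absence candidates are NA, and each PA
# column is a logical vector flagging the rows used in that PA dataset).
data_species <- c(1, 1, NA, NA, NA, NA)
coord <- data.frame(x = c(10, 11, 12, 13, 14, 15),
                    y = c(50, 51, 52, 53, 54, 55))
pa_table <- data.frame(PA1 = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE),
                       PA2 = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE))

# Hypothetical helper: return coordinates plus a 0/1 response for one PA dataset.
extract_pa <- function(pa_name) {
  keep <- pa_table[[pa_name]]            # rows belonging to this PA dataset
  resp <- data_species[keep]             # 1 = presence, NA = pseudo-absence
  out <- coord[keep, , drop = FALSE]     # coordinates of the same rows
  out$occ <- ifelse(is.na(resp), 0, 1)   # recode pseudo-absences as 0
  out
}

pa1 <- extract_pa("PA1")
n_selected <- sum(pa_table$PA1)          # presences AND pseudo-absences
```

Note that counting the TRUE values of a PA column counts presences and pseudo-absences together, which would be consistent with the 1058 reported earlier in the thread (409 + 649 in the summary table).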

I hope it's clearer this way. Don't hesitate if you have other questions!

Hélène

VDjianBiogeo commented 4 months ago

Hello Hélène,

Thank you for the fast reply! I looked at any messages I have during PA generation and I have this one:

-=-=-=-=-=-=-=-=-=-=-=-=-=-= E.antarctica Data Formating -=-=-=-=-=-=-=-=-=-=-=-=-=-=

      ! No data has been set aside for modeling evaluation
      ! No data has been set aside for modeling evaluation

Checking Pseudo-absence selection arguments...

      ! No data has been set aside for modeling evaluation
   > SRE pseudo absences selection

 ! Some NAs have been automatically removed from your data
      ! No data has been set aside for modeling evaluation
      ! No data has been set aside for modeling evaluation
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Done -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

I think it's because some of my presences don't have values for all of my environmental parameters and, as such, they are excluded. Am I right? Or is it something else?

I need to extract them in order to tune some random forest and GBM models with another package. I was wondering why we can only tune mtry for random forest, and not the number of trees or the node complexity, while so many more parameters can be tuned for GBM? Are you planning to add this to the package in the future? And is it possible to train models outside of biomod2 and then use the resulting hyperparameter values in biomod2 (for example, n.trees or node complexity for RF models)?

Thank you again for your help! Valentin

HeleneBlt commented 4 months ago

Hi again,

Yes, by default biomod2 removes points for which not all of the environmental variables are available. You can set the argument na.rm to FALSE if you want to avoid that.
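
To see how many presences would be dropped this way before formatting the data, a check along these lines can help. This is only a base R sketch on made-up values: the data frame `occ` and the predictor names `temp` and `sal` are hypothetical, and `complete.cases()` mimics the effect of the filtering rather than reproducing biomod2's internal code.

```r
# Hypothetical presences with extracted environmental values; NA marks a
# point where a predictor layer has no value.
occ <- data.frame(x    = c(1, 2, 3, 4),
                  y    = c(5, 6, 7, 8),
                  temp = c(12.1, NA, 9.8, 11.0),
                  sal  = c(34.5, 35.0, NA, 34.9))

env_cols <- c("temp", "sal")

# TRUE only for rows where every environmental variable is available.
complete <- complete.cases(occ[, env_cols])

occ_kept  <- occ[complete, ]   # points that would survive the filtering
n_dropped <- sum(!complete)    # points lost because of at least one NA
```

Applied to the real data, a drop from 649 to 409 presences would correspond to 240 points with at least one missing predictor.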

For the tuning, we use the caret package, and only mtry is available for tuning RF with randomForest.
If there are any updates to the caret package, we'll add them quickly. But we have no immediate plan to switch to another package, sorry 😅 However, you can change all the parameters of RF with bm_ModelingOptions. More info here.

Hope it helps ! Hélène

VDjianBiogeo commented 4 months ago

Hello Hélène,

Ok, that's what I thought, thank you for confirming! Is it better to remove them or to leave them in the dataset?

I have another question about the use of bm_ModelingOptions. I have trained GBM and RF models on each PA dataset generated with BIOMOD_FormatingData in order to define the optimal hyperparameters for each. Is it possible to specify them for each PA dataset with bm_ModelingOptions (just like when I select "tuned" as argument in BIOMOD_Modeling), or are the same hyperparameters used for all PA dataset runs?

If the latter, is there a way to work around this? For example, by going back to BIOMOD_FormatingData and passing in each of the previously generated PA datasets individually, in order to specify the hyperparameters for each one of them? Not sure if I'm making sense, sorry if it's confusing.

Thanks anyways! Valentin

HeleneBlt commented 4 months ago

Hello Valentin,

It depends on which model you use. From memory, CTA accepts NA without any problem, for example, but RF does not. Generally, I would advise removing them or using an imputation method to get a complete dataset.

In fact, Maya has created bm_ModelingOptions specifically to be able to modify the parameters of each dataset 🥳

If you give myBiomodData to bm_ModelingOptions, you will be able to detail the options for all the different PA datasets.

# RF.options is a named list of randomForest arguments defined beforehand
user.val <- list(RF.binary.randomForest.randomForest = list("_PA1_allRun" = RF.options,
                                                            "_PA2_allRun" = RF.options,
                                                            "_allData_allRun" = RF.options))

myOpt <- bm_ModelingOptions(data.type = 'binary',
                            models = c('RF','GBM'),
                            strategy = "user.defined",
                            user.val = user.val,
                            user.base = "bigboss",
                            bm.format = myBiomodData)

BIOMOD_Modeling will automatically apply the options of "_PAx_allRun" to all runs with PAx.

If you give calib.lines to bm_ModelingOptions, you can also detail the parameters for each run ("_PA1_RUN1", "_PA1_RUN2", ...)

Don't hesitate to ask if anything is unclear!
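
With many PA datasets, the nested list can be built programmatically instead of typed by hand. The following is a base R sketch only: `rf_params` holds hypothetical tuning results (the first entry reuses values shown later in this thread, the second is made up), and reusing the PA1 settings for "_allData_allRun" is an arbitrary choice for illustration. The resulting `user_val` has the same shape as the `user.val` list above.

```r
# Hypothetical tuning results, one entry per PA dataset.
rf_params <- list(
  list(mtry = 4, ntree = 2730, nodesize = 1),
  list(mtry = 3, ntree = 1000, nodesize = 5)
)

# Dataset keys follow the "_PAx_allRun" naming pattern.
pa_keys <- paste0("_PA", seq_along(rf_params), "_allRun")
rf_by_dataset <- setNames(rf_params, pa_keys)

# Also fill the "_allData_allRun" entry (assumption: reuse the PA1 settings).
rf_by_dataset[["_allData_allRun"]] <- rf_params[[1]]

# Same structure as the user.val argument in the example above.
user_val <- list(RF.binary.randomForest.randomForest = rf_by_dataset)

# Per-run keys ("_PA1_RUN1", ...) can be generated the same way.
run_keys <- as.vector(outer(paste0("_PA", 1:2), paste0("_RUN", 1:2), paste0))
```

The list can then be passed as user.val to bm_ModelingOptions exactly as in the call above.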

Hélène

VDjianBiogeo commented 4 months ago

Hello Hélène,

Thank you again for your reply. I'll go without the NA I guess!

Woah, that's amazing! It solves the problems I had with model tuning. Thank you!

However, I have a question about this. When I change the parameters for "_allData_allRun" and look at the bm_ModelingOptions object that is created, the hyperparameters I chose are there.

user.val <- list(RF.binary.randomForest.randomForest = list("_allData_allRun" = list("mtry"=rf.param.list[[1]]$mtries,
                                                                                   "ntree"=rf.param.list[[1]]$ntrees,
                                                                                   "nodesize"=rf.param.list[[1]]$min_rows)))

myOpt <- bm_ModelingOptions(data.type = 'binary',
                              models = c("RF"),
                              strategy = "user.defined",
                              user.val = user.val, 
                              user.base = "bigboss",
                              bm.format = bm_data)

myOpt

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.models.options -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

    >  RF options (datatype: binary , package: randomForest , function: randomForest ) :
       ( dataset _allData_allRun )
        -  mtry = 4   (default: 1 )
        -  type = "classification"
        -  ntree = 2730   (default: NULL )
        -  strata = 0 1 Levels: 0 1   (default: NULL )
        -  nodesize = 1   (default: NULL )

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

But when I specify them for each of my five PA datasets with "_PAx_allRun" and look at the same object, I can only see the values for "_allData_allRun":

user.val <- list(RF.binary.randomForest.randomForest = list("_PA1_allRun" = list("mtry"=rf.param.list[[1]]$mtries,
                                                                                  "ntree"=rf.param.list[[1]]$ntrees,
                                                                                  "nodesize"=rf.param.list[[1]]$min_rows),
                                                             "_PA2_allRun" = list("mtry"=rf.param.list[[2]]$mtries,
                                                                                  "ntree"=rf.param.list[[2]]$ntrees,
                                                                                  "nodesize"=rf.param.list[[2]]$min_rows),
                                                             "_PA3_allRun" = list("mtry"=rf.param.list[[3]]$mtries,
                                                                                  "ntree"=rf.param.list[[3]]$ntrees,
                                                                                  "nodesize"=rf.param.list[[3]]$min_rows),
                                                             "_PA4_allRun" = list("mtry"=rf.param.list[[4]]$mtries,
                                                                                  "ntree"=rf.param.list[[4]]$ntrees,
                                                                                  "nodesize"=rf.param.list[[4]]$min_rows),
                                                             "_PA5_allRun" = list("mtry"=rf.param.list[[5]]$mtries,
                                                                                  "ntree"=rf.param.list[[5]]$ntrees,
                                                                                  "nodesize"=rf.param.list[[5]]$min_rows)))

myOpt <- bm_ModelingOptions(data.type = 'binary',
                              models = c("RF"),
                              strategy = "user.defined",
                              user.val = user.val, 
                              user.base = "bigboss",
                              bm.format = bm_data)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Build Modeling Options -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

    >  RF options (datatype: binary , package: randomForest , function: randomForest )...

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Done -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Warning message:
In .BIOMOD.options.dataset.check.args(strategy = strategy, user.val = user.val,  :
  Options will be changed only for a subset of datasets (_PA1_allRun, _PA2_allRun, _PA3_allRun, _PA4_allRun, _PA5_allRun) and not the others (_allData_allRun). 
Please update 'user.val' argument if this is not wanted.

myOpt

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.models.options -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

    >  RF options (datatype: binary , package: randomForest , function: randomForest ) :
       ( dataset _allData_allRun )
        -  mtry = 1
        -  type = "classification"
        -  ntree = 500   (default: NULL )
        -  strata = 0 1 Levels: 0 1   (default: NULL )
        -  nodesize = 5   (default: NULL )

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Why are the parameters I chose for each PA dataset not shown? Is it because of how I declared them for each PA dataset?

Thank you again! Valentin

HeleneBlt commented 4 months ago

Hello,

Glad you found it useful!

By default, the print method for the options shows the '_allData_allRun' options, but you can choose which dataset you want to see: print(myOpt, dataset = '_PA1_allRun'), for example.

Hélène

VDjianBiogeo commented 4 months ago

Hello Hélène,

Again, so fast to reply! Thank you very much for everything :) I'll close the issue with this last comment, you answered all of my questions (for now!) and helped me a lot. Thank you again!

Until next time Valentin