Help with BIOMOD_EnsembleModeling - use the k-fold cross validation and sum all k-binary maps, restricting the final area of occupancy to pixels having sum equal to k

biomodhub / biomod2

BIOMOD is a computer platform for ensemble forecasting of species distributions, enabling the treatment of a range of methodological uncertainties in models and the examination of species-environment relationships.


Issue #536

Open chenyongpeng1 opened 1 week ago

chenyongpeng1 commented 1 week ago

Dear biomod2 Development Team,

Thank you for developing and maintaining the biomod2 package, which has been an invaluable tool for species distribution modeling.

myBiomodEM <- BIOMOD_EnsembleModeling(bm.mod = myBiomodModelOut,
                                      models.chosen = 'all',
                                      em.by = 'all',
                                      metric.select = c('TSS'),
                                      metric.select.thresh = c(0.8),
                                      var.import = 3,
                                      metric.eval = c('TSS'),
                                      #em.algo = c('EMmean', 'EMcv', 'EMci', 'EMmedian', 'EMca', 'EMwmean'),
                                      em.algo = c('EMca', 'EMwmean'),
                                      EMci.alpha = 0.05,
                                      EMwmean.decay = 'proportional')

I have a technical question regarding the functionality of the package. Specifically, I am interested in implementing k-fold cross-validation within ensemble models in a way that sums all k binary prediction maps and restricts the final area of occupancy to only those pixels with a sum equal to k. For instance, with k = 10, this approach would retain only those areas classified as 1 by each of the model repetitions (i.e., pixels with a sum of 10).

This approach could provide a more stringent criterion for the final predictions, potentially reducing uncertainty by limiting the final occupied area to consistently predicted pixels. Is it correct to change em.by = 'all' to em.by = 'PA+run'? Or does em.by = 'all' in biomod2 already behave this way?

Thank you very much for your time and assistance. I look forward to your response.

MayaGueguen commented 4 days ago

Hello Chenyongpeng,

Thank you for your detailed question 🙏

However, I'm still unsure how you want to combine your single models 👀 Do you want to merge only the single models from the same cross-validation dataset? Or do you want to merge all k datasets, but by PA?

Please have a look at this presentation, slides 35 to 40, for an example of how single models are combined depending on the value of the em.by parameter. You can also look at the following slides, especially slide 43, which presents the committee averaging method. I think that is exactly what you want to do, retaining only the pixels where you get 1 at the end?

Maya

chenyongpeng1 commented 3 days ago

Hello Maya,

Thank you for your patient response.

To clarify, I want to merge together only the single models from the same cross-validation dataset, specifically using MAXENT. Given the importance of the binary maps for the subsequent steps of the analyses, I believe we should assess the variability of these binary maps across repeated model runs. For instance, we could use k-fold cross-validation and sum all k binary maps, then restrict the final area of occupancy to pixels having a sum equal to k. For k = 10, pixels with a sum of 10 indicate areas that were consistently classified as '1' in each of the model repetitions. This method ensures that only the most reliable areas are considered, as they have been validated across all folds. This approach is similar to the committee averaging method you mentioned in the slides and aligns with the methodology used in Polce et al. (2013).
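
The summing rule described above can be sketched in base R. This is only a toy illustration under stated assumptions: the three small matrices are made-up binary maps standing in for real rasters, and the names binary_maps, agreement and strict_presence are hypothetical, not part of biomod2.

```r
# Toy illustration: k = 3 binary maps on a 2 x 3 grid (values are made up).
# Each matrix stands in for one binary prediction map (1 = presence, 0 = absence).
binary_maps <- list(
  matrix(c(1, 1, 0, 1, 0, 1), nrow = 2),
  matrix(c(1, 0, 0, 1, 1, 1), nrow = 2),
  matrix(c(1, 1, 0, 1, 0, 0), nrow = 2)
)
k <- length(binary_maps)

# Pixel-wise sum across the k maps: values range from 0 to k.
agreement <- Reduce(`+`, binary_maps)

# Keep only pixels predicted as presence in every one of the k maps.
strict_presence <- (agreement == k) * 1

print(agreement)
print(strict_presence)
```

With real data the same two steps (sum, then compare to k) would be applied to raster layers instead of matrices.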

Best regards,

Chenyongpeng

MayaGueguen commented 3 days ago

Hello Chenyongpeng,

No problem, it is important to be sure we understand each other 🙂

I'll try to summarise with a made-up example:

you want to use only MAXENT
let's say you select 2 PA datasets
and k = 3 for simplification

So you get 6 single models :

MAXENT_PA1_RUN1
MAXENT_PA1_RUN2
MAXENT_PA1_RUN3

MAXENT_PA2_RUN1
MAXENT_PA2_RUN2
MAXENT_PA2_RUN3

And you want to obtain, at the end, 2 predictions, merging

MAXENT_PA1_RUN1
MAXENT_PA1_RUN2
MAXENT_PA1_RUN3

on one side and

MAXENT_PA2_RUN1
MAXENT_PA2_RUN2
MAXENT_PA2_RUN3

on the other side, right? That is, using the committee averaging method and, after ensembling, keeping only the pixels equal to 1.

If we are on the same page, you should use em.by = 'PA' and em.algo = 'EMca' when calling BIOMOD_EnsembleModeling. ⚠️ If you use other algorithms in addition to MAXENT, em.by = 'PA' will also merge the algorithms together; to keep them separate, use em.by = 'PA+algo' instead.
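
To make the grouping concrete, here is a minimal base R sketch of committee averaging per PA dataset for the six example models. It does not call biomod2: the binary votes are made up, and the names preds, pa_group, committee and unanimous are hypothetical. It also assumes votes on a 0/1 scale, so the scale of an actual ensemble output should be checked before applying the "equal to 1" rule.

```r
# Made-up binary predictions (1 = presence) for 4 pixels,
# one column per single model from the example above.
preds <- cbind(
  MAXENT_PA1_RUN1 = c(1, 1, 0, 1),
  MAXENT_PA1_RUN2 = c(1, 0, 0, 1),
  MAXENT_PA1_RUN3 = c(1, 1, 0, 1),
  MAXENT_PA2_RUN1 = c(1, 1, 0, 0),
  MAXENT_PA2_RUN2 = c(1, 1, 0, 1),
  MAXENT_PA2_RUN3 = c(1, 1, 1, 1)
)

# Extract the PA dataset label ("PA1", "PA2") from each model name.
pa_group <- sub("^.*_(PA[0-9]+)_.*$", "\\1", colnames(preds))

# Committee averaging per PA dataset: mean of the binary votes for each pixel.
committee <- sapply(
  split(seq_len(ncol(preds)), pa_group),
  function(cols) rowMeans(preds[, cols, drop = FALSE])
)

# Keep only pixels where every run of a PA dataset voted presence.
unanimous <- (committee == 1) * 1

print(committee)
print(unanimous)
```

Each column of unanimous corresponds to one PA dataset, which mirrors the two predictions expected from grouping by PA.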

But please tell me if I misunderstood something 👀

Maya