Closed BRamiroSanchez closed 1 year ago
Hi Berta, Thank you for reporting this and for submitting a nicely formatted issue with a reproducible example :pray:
First concerning the error you get, the issue is that when you do:
sampsize = c("0" = min(as.numeric(table(run.data@data.species))), "1" = min(as.numeric(table(run.data@data.species)))),
You calculate the sizes from the full dataset instead of the cross-validated calibration data, so they can exceed what is available in a given fold. If you want to set sampsize from the smallest number of 0 or 1 present in the calibration datasets, you can use summary as follows:
library(dplyr)
calib.summary <-
  summary(run.data, calib.lines = cur.blocks$biomod_table) %>%
  filter(dataset == "calibration")
The resulting table lists the number of presences and absences in each calibration dataset (without the filter, summary also reports the validation datasets).
You can then use it to feed the sampsize argument in BIOMOD_ModelingOptions as follows:
sampsize = c("0" = min(calib.summary$True_Absences),
"1" = min(calib.summary$Presences))
Now concerning sampsize, I am not sure I understand what you are trying to achieve. By default, if you do not provide the sampsize argument, RF in biomod2 will use all presences and absences available in each fold. So there is no need to specify sampsize if that is what you are trying to achieve.
Alternatively, if you want RF to use only a fraction of the data for each calibration, you can set sampsize to lower values. These values are common across all cross-validation folds, so they must not exceed the minimum number of presences or absences across all folds (which you can get using summary as shown above).
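As a rough illustration of that constraint, the sketch below checks a candidate sampsize against every fold. It assumes the calibration counts posted later in this thread, copied by hand into a data frame; fits_all_folds is a hypothetical helper, not a biomod2 function.

```r
# Calibration counts per fold, copied by hand from the summary output
# posted later in this thread (assumption: not produced by biomod2 here).
calib.summary <- data.frame(
  run = c("RUN1", "RUN2", "RUN3", "RUN4"),
  Presences = c(221, 192, 196, 237),
  True_Absences = c(103, 91, 102, 88)
)

# Largest values that are still available in every fold.
max.sampsize <- c("0" = min(calib.summary$True_Absences),
                  "1" = min(calib.summary$Presences))

# Hypothetical helper: TRUE if a candidate sampsize fits in all folds.
fits_all_folds <- function(sampsize, counts) {
  all(sampsize[["0"]] <= counts$True_Absences) &&
    all(sampsize[["1"]] <= counts$Presences)
}

print(max.sampsize)
print(fits_all_folds(c("0" = 88, "1" = 88), calib.summary))    # fits every fold
print(fits_all_folds(c("0" = 103, "1" = 103), calib.summary))  # too large for RUN2, RUN3 and RUN4
```

With these counts, any value above 88 for "0" or above 192 for "1" would ask for more rows than at least one fold contains.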
Finally, if you want a specific sampsize for each of your folds (e.g. 50% of the data available in the fold), this cannot be achieved for now, but it should be possible with the next biomod2 release (ETA end of summer).
Best regards, Rémi
Hi @rpatin
Thank you so much for your detailed answer and code hints :)
Yes, as in your last suggestion, what I would like is to set a sampsize specific to the calibration data available in each fold. So, each calibrated model from each fold (a total of 4 RF models in my example) would have both the "0s" and the "1s" downsampled to the minimum available in that fold.
If:
calib.summary <-
  summary(run.data, calib.lines = cur.blocks$biomod_table) %>%
  filter(dataset == "calibration")
unique(calib.summary)
       dataset  run      PA Presences True_Absences Pseudo_Absences Undefined
 1 calibration RUN1 allData       221           103               0        NA
 5 calibration RUN2 allData       192            91               0        NA
 9 calibration RUN3 allData       196           102               0        NA
13 calibration RUN4 allData       237            88               0        NA
Then, the sampsize for the RF model from fold1 ("cluster1_allData_RUN1_RF") would be:
sampsize = c("0" = 103, # RUN1: min(Presences, True_Absences) within that fold, i.e. min(221, 103)
             "1" = 103) # RUN1: same value, so that both classes are balanced
Looking forward to the next release :)
Cheers, Berta
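The per-fold values described in the message above can be worked out with a short sketch. It only computes the numbers from the posted table (copied by hand); it does not show how to pass them to biomod2, which the thread says was not supported at the time. The names per.fold.n and per.fold.sampsize are illustrative.

```r
# Calibration counts per fold, copied by hand from the table above.
calib.summary <- data.frame(
  run = c("RUN1", "RUN2", "RUN3", "RUN4"),
  Presences = c(221, 192, 196, 237),
  True_Absences = c(103, 91, 102, 88)
)

# Row-wise minimum: the size of the smaller class within each fold.
per.fold.n <- pmin(calib.summary$Presences, calib.summary$True_Absences)
names(per.fold.n) <- calib.summary$run

# One balanced sampsize vector per fold (illustrative structure only).
per.fold.sampsize <- lapply(per.fold.n, function(n) c("0" = n, "1" = n))

print(per.fold.n)
print(per.fold.sampsize$RUN1)
```

For the posted table this gives 103, 91, 102 and 88 for RUN1 to RUN4, whereas a single min() over both columns would return 88 for every fold.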
Hi Berta,
Thank you for the update :pray:
I now fully understand your issue. Unfortunately, it corresponds to my third point (a specific sampsize for each of your folds), which is not possible yet but should be in the near future.
Meanwhile, if your partitions are not too imbalanced (similar numbers of presences/absences), you can still take the minimum number of presences/absences and set the same value for all folds. This ensures a prevalence of 50% in tree building and should not differ much from the ideal solution (given that the imbalance is limited).
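A minimal sketch of this workaround, assuming the calibration counts posted earlier in the thread (copied by hand): a single common value is taken for both classes, and the share of each fold's data that it represents gives a rough sense of how limited the imbalance is. The names n.common and usage are illustrative.

```r
# Calibration counts per fold, copied by hand from the table posted above.
calib.summary <- data.frame(
  run = c("RUN1", "RUN2", "RUN3", "RUN4"),
  Presences = c(221, 192, 196, 237),
  True_Absences = c(103, 91, 102, 88)
)

# Smallest class size over all folds, used for both classes in every fold.
n.common <- min(calib.summary$Presences, calib.summary$True_Absences)
sampsize <- c("0" = n.common, "1" = n.common)

# Share of each fold's absences and presences drawn per tree.
usage <- data.frame(
  run = calib.summary$run,
  absences_used = n.common / calib.summary$True_Absences,
  presences_used = n.common / calib.summary$Presences
)

print(sampsize)
print(usage)
```

The closer absences_used stays to 1 in every fold, the less this common value differs from a fold-specific one.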
Cheers, Rémi
Hi Rémi,
Thank you for confirming; I might then proceed with your suggestion in the meantime. Thank you again for being so responsive! :)
Cheers, Berta
Hi @rpatin, I see that this issue has been resolved and implemented, which is great!
However, I am not sure how to do that.
Could you give an example here of how to define the sampsize argument so that we get a balanced sample size for each fold, as in your message quoted here?
"Finally, if you want to have specific sampsize for each one of your fold (e.g. 50% of the data available in the fold), this cannot be achieved for now, but should be possible with next biomod2 release (eta end of summer)."
Many thanks! Boris
For anyone interested, example code can be found in #393.
Hi
I'm also trying to implement RF down-sampling following the guidelines of Valavi et al. (2021), with a user-defined k-fold spatial block cross-validation strategy.
I have built my own CV.user.table for cross-validation using the blockCV R package. To match the object returned by blockCV::cv_spatial() with the calib.lines object from biomod, I have renamed the columns accordingly (i.e., _allData_RUNx). When I try to run biomod_Modeling() with the modified RF down-sample options (i.e., the sampsize argument), I get an error regarding the sampsize argument:
My guess, after reading thread #172 and thread #267, is that I might not be specifying the correct number of samples (too many) after having split my data into folds for cross-validation. My intention is to always downsample to the minimum of presence/absence occurrences available in the k-fold. Below is an extract of my code:
How could I specify in the sampsize argument the fact that the "total data" to calculate the minimum of 0/1 should actually be in relation to how the training data (0/1s) is distributed in each fold? Currently, as the sampsize argument is, it doesn't take into account the data in each fold.
Thank you so much in advance for any help.
Kind regards, Berta
data_biomod.zip
Originally posted by @BRamiroSanchez in https://github.com/biomodhub/biomod2/issues/172#issuecomment-1616775960