GEO-BON / bon-in-a-box-pipelines

BON in a Box 2.0 - Sampling optimisation and indicator pipelines

Validation statistics for SDMs #138

Open tpoisot opened 8 months ago

tpoisot commented 8 months ago

I was going to bring this up at the next meeting in 2024, but we should indeed make sure that there is a systematic validation of the SDMs with a series of reliable measures. There has been recent literature in ML showing that AUC (ROC-AUC, at any rate) can be high even for "bad" models (a small synthetic illustration is sketched after this comment).

(cc. @glaroc -- #137)
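As a quick, hedged illustration of the ROC-AUC point (not from the cited literature; the data and settings below are entirely made up): on a strongly imbalanced synthetic presence/absence dataset, a model can score well on ROC-AUC while looking much weaker on precision-recall AUC or MCC.

```python
# Illustrative sketch only: synthetic, strongly imbalanced presence/absence data.
# Shows that ROC-AUC can look "good" while threshold-dependent or
# imbalance-sensitive measures look poor.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             matthews_corrcoef)
from sklearn.model_selection import train_test_split

# Hypothetical data: ~1% presences plus label noise
X, y = make_classification(n_samples=20000, n_features=10, weights=[0.99],
                           flip_y=0.05, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

print("ROC-AUC :", roc_auc_score(y_te, scores))            # often looks "good"
print("PR-AUC  :", average_precision_score(y_te, scores))  # often much lower
print("MCC     :", matthews_corrcoef(y_te, (scores > 0.5).astype(int)))  # often near zero
```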

frousseu commented 8 months ago

@glaroc While validation statistics would be cool, I don't think there is currently any reliable way to validate models built with biased data when no external standardized data, or data not subject to the same bias, are available. I see a lot of models with better AUC when bias is ignored than when it is taken into account (even though the former may look like complete crap). As such, I think that performance measures may be misleading and give a false sense of confidence in the outputs. Perhaps with more severe block CV the performance measures are less misleading (maybe Laura would have something to say about this)? I may be wrong, but I don't think that other measures would be any better if the validation is done on the same biased data. Perhaps https://doi.org/10.1016/j.ecolind.2022.109487 offers some ideas.
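As a rough sketch of the spatially blocked CV idea (coordinates, grid size, predictors, and model below are all hypothetical; dedicated tools such as the blockCV R package handle block construction far more carefully), the key point is simply that folds are split by spatial block rather than at random:

```python
# Minimal sketch of spatially blocked CV, assuming each record has x/y coordinates.
# Block ids are coarse grid cells; everything here is invented for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 2000
coords = rng.uniform(0, 100, size=(n, 2))   # hypothetical x/y coordinates
X = rng.normal(size=(n, 5))                 # hypothetical predictors
y = rng.binomial(1, 0.3, size=n)            # hypothetical presence/absence

# Assign each record to a 20 x 20 unit grid cell; folds never split a cell
block_id = (coords[:, 0] // 20).astype(int) * 100 + (coords[:, 1] // 20).astype(int)

aucs = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=block_id):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

print("Blocked CV ROC-AUC per fold:", np.round(aucs, 3))
```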

tpoisot commented 8 months ago

I think we are talking about two different purposes for a measure of model performance here. What I have in mind is a measure of how well the model learned from the data, and I see it more as "absolutely mandatory" than "cool", but that's because I'm looking at this problem (in part) from an applied ML point of view. This requires some sort of validation, which can be done using crossCV or other packages.

Maybe a script is not the correct way to approach this problem; instead, we could mandate a series of statistics reported as an average + 95% CI over some form of cross-validation, along the lines of the sketch below.
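A minimal sketch of that reporting idea, with an arbitrary choice of metrics and a simple normal-approximation CI (neither is a prescription, just one way it could look):

```python
# Sketch: mean and rough 95% CI for several validation statistics across CV folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Hypothetical data and model, stand-ins for whatever the SDM step actually fits
X, y = make_classification(n_samples=5000, n_features=10, random_state=1)
metrics = ["roc_auc", "average_precision", "matthews_corrcoef", "balanced_accuracy"]

cv = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=10, scoring=metrics)

for name in metrics:
    folds = cv[f"test_{name}"]
    mean = folds.mean()
    half = 1.96 * folds.std(ddof=1) / np.sqrt(len(folds))  # normal-approx 95% CI
    print(f"{name}: {mean:.3f} ± {half:.3f}")
```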

frousseu commented 8 months ago

Ok, what I had in mind for a performance measure was more in the sense of measuring how well a model likely represents the actual distribution of a species.

tpoisot commented 8 months ago

Agreed, but this is a different (second?) step. Ideally, before having this discussion, it's important to know whether the training of the model went well at all, and that's what a standard set of performance measures would indicate. I don't think it's an either/or situation at all, and both can be built in parallel.

And we can inform pipeline users / developers that if the model doesn't have good performance, the question of its fit to the actual distribution shouldn't even be considered.

frousseu commented 8 months ago

Ok, I understand now that what you refer to is how well the model (and its predictors) is able to explain/predict the patterns in the data. My concern here would be whether good performance in this sense is positively correlated with good performance in representing the true distribution of the species.

glaroc commented 8 months ago

In any case, I think the SDM pipelines should provide the fit statistics as outputs. Whether or not we use them for selecting the right model is an open debate.
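Purely as an illustrative sketch (the file name, keys, and numbers below are placeholders, not the repository's actual output convention), the fit statistics could travel alongside the prediction as a small structured output that downstream steps or users can inspect:

```python
# Sketch only: one way an SDM pipeline step could expose fit statistics.
# All names and values are placeholders for illustration.
import json

fit_statistics = {
    "cv_folds": 10,                                          # placeholder
    "roc_auc": {"mean": 0.87, "ci95": [0.84, 0.90]},         # placeholder values
    "matthews_corrcoef": {"mean": 0.41, "ci95": [0.35, 0.47]},
    "validation_scheme": "spatially blocked cross-validation",
}

# Write the statistics next to the model's other outputs
with open("fit_statistics.json", "w") as handle:
    json.dump(fit_statistics, handle, indent=2)
```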