Closed casadoj closed 1 year ago
I have created a simple cross-validation approach in the optimization of the notification criteria. It is implemented in the notebook 7_skill.ipynb.
First, I divide the set of reporting points into a training and a test sample. Then, I subdivide the training sample into $kfold$ subsets and compute the skill in each of them. I average the skill over the subsets, and on that average I apply the optimization of the criteria (find_best_criteria).
The generation of samples is random and stratified, so there is no geographic connection between samples and the proportion of observed events is preserved.
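The split-and-average step above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `skill_fn` stands in for whatever computes the skill table per subset, and the stratification key is the binary "event observed" label mentioned above.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k, seed=0):
    """Split point indices into k stratified folds.

    `labels[i]` is 1 if an event was observed at reporting point i, else 0,
    so each fold keeps roughly the same proportion of observed events.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_label.values():
        rng.shuffle(idxs)
        # deal indices round-robin so strata stay balanced across folds
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

def mean_skill_over_folds(folds, skill_fn):
    """Average a per-fold skill table (dict of metric -> value) before
    handing the average to the criteria optimization."""
    tables = [skill_fn(fold) for fold in folds]
    return {key: sum(t[key] for t in tables) / len(tables)
            for key in tables[0]}
```

The averaged table returned by `mean_skill_over_folds` would then be the input to the optimization step (find_best_criteria in the notebook).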
The implementation of cross-validation in the optimization by catchment area and lead time is still missing.
I have implemented the cross-validation option in the optimization of the criteria for both catchment area and lead time. Now the results for 2000 km² and 60 h lead time match between the fixed and the varying-probability optimization.
The optimal criteria seem very sensitive to the stations included in the analysis.
To reduce this sensitivity, the optimization could be wrapped in a cross-validation approach: select a large set of stations to fit the criteria, test them on a smaller held-out set of stations, and repeat the whole process several times.
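The repeated fit/test idea could look something like the sketch below. The hooks `fit_fn` and `test_fn` are placeholders (e.g. the notebook's find_best_criteria and a skill evaluation on the held-out stations); the repeat count and test fraction are arbitrary choices for illustration.

```python
import random

def repeated_split_fit(stations, fit_fn, test_fn,
                       n_repeats=20, test_frac=0.2, seed=0):
    """Repeat a random train/test split of stations.

    Each iteration fits the criteria on the large training set and
    evaluates them on the smaller held-out set; the spread of the
    returned (criteria, test_score) pairs shows how sensitive the
    optimum is to the stations included.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_repeats):
        pool = list(stations)
        rng.shuffle(pool)
        n_test = max(1, int(len(pool) * test_frac))
        test, train = pool[:n_test], pool[n_test:]
        criteria = fit_fn(train)           # e.g. find_best_criteria
        results.append((criteria, test_fn(criteria, test)))
    return results
```

Inspecting the variability of the fitted criteria across repeats would directly quantify the sensitivity described above.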