WorldCereal / presto-worldcereal


Tune thresholds on validation set #16

Closed: rubencart closed this issue 4 months ago

rubencart commented 7 months ago

We should tune a threshold per model (finetune, RF, regression) for a positive vs negative prediction on the val set instead of using 0.5.

kvantricht commented 7 months ago

How do you do that? And is it even a good idea to tune based on validation data? Intuitively I would think a threshold of 0.5 gives a balanced precision/recall. In the global production of V1 we played a bit with this: for irrigation, for example, we set the threshold at 0.6 because we were afraid of too much overdetection of irrigation. But in the end it turned out we now miss too much, so 0.5, which is what the algorithm effectively uses during training anyway, would likely have been the better choice.

rubencart commented 7 months ago

I agree with your intuition, but (tell me if you disagree) I think class imbalance and differences in input distributions can, in unpredictable ways, make the network more inclined to output one class than the other. We could choose a metric (or a couple) that we care about, compute it for thresholds = [0.1, 0.2, ..., 0.9] (once per model, not every time we train/evaluate), and see which threshold optimizes it. We can do this for both CatBoost and Presto of course.

In other words, if you decide to mainly look at F1 because you care roughly equally about precision and recall, there is no guarantee that F1 will be highest at threshold = 0.5, not even if you train with a loss that incentivises this. On the other hand, if we care about a good precision/recall tradeoff across different threshold values, we could look at metrics that take this into account, like AUC or AP.
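Something like this is what I have in mind, as a rough sketch (`y_val` and `probs` stand for the validation labels and positive-class probabilities of whichever model we evaluate; not tied to our actual code):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

def sweep_thresholds(y_val, probs, thresholds=np.arange(0.1, 1.0, 0.1)):
    """Compute F1 on the validation set for a grid of decision thresholds."""
    scores = {float(th): f1_score(y_val, probs >= th) for th in thresholds}
    best_th = max(scores, key=scores.get)
    return best_th, scores

# Threshold-free alternatives that summarise the precision/recall trade-off:
# roc_auc_score(y_val, probs), average_precision_score(y_val, probs)
```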

Is it even a good idea to tune based on validation data?

Maybe we're used to different terminology but I thought that's what validation data is for? We shouldn't tune hyperparameters on test data of course.

This is not a big effort so I'll come up with something and then we can still see what we do with it.

kvantricht commented 7 months ago

Yeah, so we usually work with sample-based loss weighting in CatBoost, where the weight is based on the class weight (to handle the imbalance) and corrected by the quality score of the dataset the sample comes from. So in essence we try to mitigate the class imbalance through loss weighting, so that the 0.5 threshold more or less works. We can discuss later! And indeed, parameter tuning based on validation data should be fine. I got confused 😉
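My rough understanding of that weighting, as a sketch (the balanced class weights and the `dataset_score` correction are assumptions about the setup, not the actual CatBoost training code):

```python
import numpy as np
from catboost import CatBoostClassifier
from sklearn.utils.class_weight import compute_class_weight

def fit_weighted_catboost(X_train, y_train, dataset_score, **cb_params):
    """Fit CatBoost with per-sample weights = class weight x per-dataset quality score."""
    classes = np.unique(y_train)
    weights = compute_class_weight("balanced", classes=classes, y=y_train)
    class_weight = dict(zip(classes, weights))
    sample_weight = np.array([class_weight[y] for y in y_train]) * np.asarray(dataset_score)
    model = CatBoostClassifier(**cb_params)
    model.fit(X_train, y_train, sample_weight=sample_weight, verbose=False)
    return model
```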

rubencart commented 7 months ago

Yes, let's discuss! Besides optimizing performance, we also want a fair comparison of course :).

We can also finetune Presto with a weighted loss (and fit the sklearn classifiers with class balancing), so maybe it's fairer to do that and not tune the threshold, as you did for CatBoost.
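A minimal sketch of what I mean on our side (the `pos_weight` computation assumes a binary finetuning head trained with BCE; names are placeholders, not the actual finetuning code):

```python
import torch
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# sklearn downstream classifiers: re-weight samples inversely to class frequency
rf = RandomForestClassifier(class_weight="balanced")
logreg = LogisticRegression(class_weight="balanced", max_iter=1000)

def make_weighted_bce(y_train: torch.Tensor) -> torch.nn.BCEWithLogitsLoss:
    """BCE loss for the Presto finetuning head, up-weighting the positive class."""
    n_pos = y_train.float().sum()
    n_neg = y_train.numel() - n_pos
    return torch.nn.BCEWithLogitsLoss(pos_weight=n_neg / n_pos)
```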

rubencart commented 7 months ago

I was assuming val_df.catboost_prediction == (val_df.catboost_confidence > th) for some th, but that does not seem to be the case, right?
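(The quick check I mean, for reference; assumes `val_df` holds the exported CatBoost outputs:)

```python
# Check whether the stored prediction is just the confidence thresholded at some value
for th in [0.4, 0.5, 0.6]:
    match = (val_df.catboost_prediction == (val_df.catboost_confidence > th)).mean()
    print(f"th={th}: matches for {match:.1%} of validation samples")
```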

kvantricht commented 7 months ago

No, confidence is computed differently. I can't look it up properly on my phone, but the equation is in the paper: https://essd.copernicus.org/preprints/essd-2023-184/

kvantricht commented 7 months ago

So like this:

[screenshot of the confidence equation from the paper]

rubencart commented 7 months ago

Ok cool thanks!

kvantricht commented 4 months ago

Closed for now.