Lightning-AI / torchmetrics

Machine learning metrics for distributed, scalable PyTorch applications.
https://lightning.ai/docs/torchmetrics/
Apache License 2.0
2.15k stars 409 forks

Implement the Brier score and its decomposition into resolution, reliability and uncertainty. #2196

Open konstantinos-p opened 1 year ago

konstantinos-p commented 1 year ago

🚀 Feature

I would like to contribute to torchmetrics, by implementing the Brier score and its associated decomposition.

Motivation

The Brier score is widely used for measuring the calibration of machine learning methods; see:

https://arxiv.org/abs/2302.04019
https://arxiv.org/abs/2002.06470

It is also a proper scoring rule, as opposed to the Expected Calibration Error (ECE) and the Thresholded Adaptive Calibration Error (TACE). The ECE and the TACE have trivial minima, where a classifier can be perfectly calibrated while having zero test accuracy (https://arxiv.org/abs/1906.02530). Being a proper scoring rule, the Brier score does not exhibit this pathological behaviour.

The Brier score coincides with the mean squared error for common use cases. However, its decomposition into resolution, reliability and uncertainty (see https://en.wikipedia.org/wiki/Brier_score) is a unique and useful feature. Roughly speaking, resolution captures a notion of accuracy and reliability a notion of calibration, so both have to be good for the Brier score to be low.
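To make the MSE connection concrete, here is a minimal sketch of the multiclass Brier score as the mean squared error between predicted probabilities and one-hot targets; the function name is my own for illustration, not an existing torchmetrics API:

```python
import torch


def brier_score(probs: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Multiclass Brier score: squared error between the predicted
    probability vector and the one-hot target, summed over classes
    and averaged over samples (hypothetical helper, not a torchmetrics API)."""
    num_classes = probs.shape[-1]
    onehot = torch.nn.functional.one_hot(target, num_classes).to(probs.dtype)
    return ((probs - onehot) ** 2).sum(dim=-1).mean()


probs = torch.tensor([[0.9, 0.1], [0.3, 0.7]])
target = torch.tensor([0, 0])
score = brier_score(probs, target)  # lower is better; 0.0 for perfect confident predictions
```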

Finally, to the best of my knowledge, no standard implementation of this decomposition exists in common packages.

Pitch

I plan to follow the original paper describing the decomposition of the Brier score into resolution, reliability and uncertainty

https://journals.ametsoc.org/view/journals/apme/12/4/1520-0450_1973_012_0595_anvpot_2_0_co_2.xml

and specifically the implementation found in

https://github.com/google-research/google-research/blob/master/uq_benchmark_2019/metrics_lib.py and the paper https://arxiv.org/abs/1906.02530

The decomposition into uncertainty, resolution and reliability was originally formulated for predictions that take a finite set of values. This is in contrast with most deep neural network classifiers, which output a vector of per-class probabilities taking continuous values. Thus we need to bin the output vectors. In this implementation, the bins are defined with respect to the most probable (top-1) class for each input, so we create C bins, where C is the number of classes. For example, the two prediction vectors [0, 0.9, 0.1] and [0.2, 0.6, 0.2] fall in the same bin, the bin of class 2. The derivation of resolution, reliability and uncertainty is then relatively straightforward.
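The binning scheme above can be sketched roughly as follows; the helper name and details here are my own assumptions, not the final implementation. Note that the identity Brier = uncertainty - resolution + reliability holds exactly only when all forecasts within a bin are identical; with top-1-class binning it is in general an approximation.

```python
import torch


def brier_decomposition(probs: torch.Tensor, target: torch.Tensor):
    """Sketch of the Murphy-style decomposition, binning predictions by
    their most probable class (hypothetical helper, not a torchmetrics API)."""
    n, num_classes = probs.shape
    onehot = torch.nn.functional.one_hot(target, num_classes).to(probs.dtype)
    bins = probs.argmax(dim=-1)  # bin index = top-1 predicted class
    base_rate = onehot.mean(dim=0)  # overall class frequencies
    # uncertainty: variance of the outcomes, independent of the forecasts
    uncertainty = (base_rate * (1 - base_rate)).sum()
    reliability, resolution = 0.0, 0.0
    for k in range(num_classes):
        mask = bins == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        mean_prob = probs[mask].mean(dim=0)  # average forecast in bin k
        mean_outcome = onehot[mask].mean(dim=0)  # observed frequencies in bin k
        # reliability: weighted squared gap between forecast and outcome per bin
        reliability = reliability + n_k / n * ((mean_prob - mean_outcome) ** 2).sum()
        # resolution: weighted squared gap between bin outcome and base rate
        resolution = resolution + n_k / n * ((mean_outcome - base_rate) ** 2).sum()
    return uncertainty, resolution, reliability


# forecasts are identical within each bin, so the decomposition is exact here
probs = torch.tensor([[0.8, 0.2], [0.8, 0.2], [0.2, 0.8], [0.2, 0.8]])
target = torch.tensor([0, 1, 1, 1])
unc, res, rel = brier_decomposition(probs, target)
```

In this example the recombined score `unc - res + rel` matches the directly computed Brier score, since every bin contains only one distinct forecast vector.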

github-actions[bot] commented 1 year ago

Hi! Thanks for your contribution, great first issue!

SkafteNicki commented 1 year ago

Hi @konstantinos-p, thanks for proposing adding this metric to torchmetrics :) It definitely sounds like a good idea to have this metric and its decomposition. Feel free to send a pull request whenever you have something that seems to be working, and we can help you from there.

manavkulshrestha commented 3 months ago

Hi, is there any update on this? I would like to use the Brier score for my current project.