konstantinos-p opened this issue 1 year ago
Hi! Thanks for your contribution! Great first issue!
Hi @konstantinos-p, thanks for proposing adding this metric to torchmetrics :) It definitely sounds like a good idea to have this metric and its decompositions. Feel free to send a pull request whenever you have something that seems to be working, then we can help you from there.
Hi, is there any update on this? I would like to use the Brier score for my current project.
🚀 Feature
I would like to contribute to torchmetrics by implementing the Brier score and its associated decomposition.
Motivation
The Brier score is widely used for measuring the calibration of machine learning methods; see:
https://arxiv.org/abs/2302.04019
https://arxiv.org/abs/2002.06470
It is also a proper scoring rule, unlike the Expected Calibration Error (ECE) and the Thresholded Adaptive Calibration Error (TACE). This means that ECE and TACE have trivial minima where a classifier with zero test accuracy is nevertheless perfectly calibrated (https://arxiv.org/abs/1906.02530); being a proper scoring rule, the Brier score does not have this pathological behaviour.
The Brier score coincides with the mean squared error for common use cases. However, its decomposition into resolution, reliability and uncertainty (see https://en.wikipedia.org/wiki/Brier_score) is a unique and useful feature. Roughly speaking, `resolution` captures a notion of accuracy and `reliability` a notion of calibration, so both have to be optimized for the Brier score to be low. Finally, no standard implementation in common packages exists, to the best of my knowledge.
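For reference, the classical binned (binary) form of that decomposition, as given on the Wikipedia page, looks roughly as follows; here there are $K$ bins with $n_k$ forecasts each, $f_k$ is the forecast probability of bin $k$, $\bar{o}_k$ the observed event frequency in bin $k$, and $\bar{o}$ the overall event frequency:

```math
\mathrm{BS} \;=\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(f_k - \bar{o}_k)^2}_{\text{reliability}}
\;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(\bar{o}_k - \bar{o})^2}_{\text{resolution}}
\;+\; \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}
```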
Pitch
I plan to follow the original paper describing the decomposition of the Brier score into resolution, reliability and uncertainty
https://journals.ametsoc.org/view/journals/apme/12/4/1520-0450_1973_012_0595_anvpot_2_0_co_2.xml
and specifically the implementation found in
https://github.com/google-research/google-research/blob/master/uq_benchmark_2019/metrics_lib.py and the paper https://arxiv.org/abs/1906.02530
The decomposition into uncertainty, resolution and reliability was originally formulated for predictions that take a finite set of values. This is in contrast with most deep neural network classifiers, which output a vector of per-class probabilities taking continuous values. We therefore need to bin the output vectors. In this implementation the bins are defined by the most probable class for each input, giving C bins where C is the number of classes. For example, the two prediction vectors [0, 0.9, 0.1] and [0.2, 0.6, 0.2] fall in the same bin, the bin of class 2. The derivation of resolution, reliability and uncertainty is then relatively straightforward.
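To make this concrete, below is a minimal sketch in plain PyTorch of the computation described above. The function name `brier_decomposition` and its signature are just illustrative, and it is not meant to reproduce the google-research reference implementation line by line. One caveat: the identity Brier = reliability − resolution + uncertainty is exact only when all forecasts inside a bin are identical, so with continuous network outputs it holds only approximately.

```python
import torch


def brier_decomposition(probs: torch.Tensor, target: torch.Tensor):
    """Brier score and its (approximate) reliability/resolution/uncertainty split.

    Sketch only: `probs` is an (N, C) tensor of predicted class probabilities,
    `target` an (N,) tensor of integer labels. Bins are the C top-class bins
    described above.
    """
    n, c = probs.shape
    onehot = torch.nn.functional.one_hot(target, num_classes=c).to(probs.dtype)

    # Multi-class Brier score: squared error between the probability vector and
    # the one-hot label, summed over classes and averaged over samples.
    brier = ((probs - onehot) ** 2).sum(dim=1).mean()

    # Bin every sample by its most probable class -> C bins.
    bins = probs.argmax(dim=1)
    pbar = onehot.mean(dim=0)  # marginal label distribution over all samples

    reliability = probs.new_zeros(())
    resolution = probs.new_zeros(())
    for k in range(c):
        mask = bins == k
        if not mask.any():
            continue  # empty bin contributes nothing
        weight = mask.to(probs.dtype).mean()   # n_k / N
        f_k = probs[mask].mean(dim=0)          # mean forecast in bin k
        o_k = onehot[mask].mean(dim=0)         # empirical label distribution in bin k
        reliability = reliability + weight * ((f_k - o_k) ** 2).sum()
        resolution = resolution + weight * ((o_k - pbar) ** 2).sum()

    uncertainty = (pbar * (1.0 - pbar)).sum()
    return brier, reliability, resolution, uncertainty
```

For the actual contribution this would presumably be wrapped in a torchmetrics `Metric` that accumulates per-bin counts and sums in `update()` and performs the arithmetic above in `compute()`.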