eg-nlp-community / nlp-reading-group


[20/06/2020] Saturday 9:00 PM GMT+2 Calibration of Pre-trained Transformers #18

Closed hadyelsahar closed 4 years ago

hadyelsahar commented 4 years ago

Calibration of Pre-trained Transformers https://arxiv.org/abs/2003.07892

The participation link is available on the Slack channel.

Abstract:

Pre-trained Transformers are now ubiquitous in natural language processing, but despite their high end-task performance, little is known empirically about whether they are calibrated. Specifically, do these models' posterior probabilities provide an accurate empirical measure of how likely the model is to be correct on a given example? We focus on BERT and RoBERTa in this work, and analyze their calibration across three tasks: natural language inference, paraphrase detection, and commonsense reasoning. For each task, we consider in-domain as well as challenging out-of-domain settings, where models face more examples they should be uncertain about. We show that: (1) when used out-of-the-box, pre-trained models are calibrated in-domain, and compared to baselines, their calibration error out-of-domain can be as much as 3.5x lower; (2) temperature scaling is effective at further reducing calibration error in-domain, and using label smoothing to deliberately increase empirical uncertainty helps calibrate posteriors out-of-domain.
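Since the abstract mentions label smoothing, here is a minimal sketch of the smoothed targets it trains against (the `epsilon` value is illustrative, not the paper's):

```python
import numpy as np

def smooth_labels(y, n_classes, epsilon=0.1):
    """Replace hard one-hot targets with softened ones: keep 1 - epsilon
    on the true class and spread epsilon uniformly over all classes,
    so the model is trained to leave some probability mass elsewhere."""
    targets = np.full((len(y), n_classes), epsilon / n_classes)
    targets[np.arange(len(y)), y] += 1.0 - epsilon
    return targets

# smooth_labels([0, 2], n_classes=3) ->
# [[0.93333333, 0.03333333, 0.03333333],
#  [0.03333333, 0.03333333, 0.93333333]]
```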

AMR-KELEG commented 4 years ago

A toy example of how softmax scores are affected by a temperature hyperparameter:

```python
import numpy as np

def softmax(X, temp):
    # Scale each logit by the temperature before exponentiating
    scores = np.exp([x / temp for x in X])
    return scores / sum(scores)
```

Assuming the logits for a 3-class classification problem were: [10, 1, 1]

```python
for temp in [1, 10, 100, 1000]:
    print(f'For temp={temp} the scores from softmax are {softmax([10, 1, 1], temp)}')
```


The output is:

```
For temp=1 the scores from softmax are [9.99753241e-01 1.23379352e-04 1.23379352e-04]
For temp=10 the scores from softmax are [0.5515296 0.2242352 0.2242352]
For temp=100 the scores from softmax are [0.353624 0.323188 0.323188]
For temp=1000 the scores from softmax are [0.33533632 0.33233184 0.33233184]
```
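As the temperature grows, every exponent x/temp shrinks toward zero, so the softmax output flattens toward the uniform distribution [1/3, 1/3, 1/3]; a temperature between 0 and 1 would instead sharpen it.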

hadyelsahar commented 4 years ago

sorry for missing the discussion but here's my take on the paper:

1- Calibration has a long history and is currently an active research area. You can find a collection of references on the last slide here: https://docs.google.com/presentation/d/1KMP5Aptu6we42GCRMcHbvhOUlq1vYE4SqZnCujMMPTU/edit?usp=sharing

Calibration also has connections with Bayesian DL in terms of uncertainty estimation, and it is mainly used for applications such as out-of-distribution (OOD) detection and robustness against adversarial attacks.
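A common baseline that connects the two: use the maximum softmax probability as an OOD score, so better calibration directly makes detection more reliable. A minimal sketch (toy logits and threshold, not from the slides):

```python
import numpy as np

def msp_ood_score(logits):
    """Maximum softmax probability (MSP) baseline for OOD detection:
    a low top-class probability suggests the example is out-of-distribution.
    The better calibrated the model, the more trustworthy this score is."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize exp
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return probs.max(axis=-1)

scores = msp_ood_score(np.array([[10.0, 1.0, 1.0], [0.2, 0.1, 0.1]]))
is_ood = scores < 0.5  # threshold would be tuned on validation data
```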

2- Note: temperature scaling doesn't change the accuracy of models, e.g. it will not increase robustness, since the argmax of the logits stays the same when they are all divided by the same constant. Other forms like Platt scaling can affect the argmax since they apply an affine transformation, but temperature scaling cannot. For a good reference check the Guo et al. paper: https://arxiv.org/pdf/1706.04599.pdf
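A quick sanity check of that invariance (toy logits, not from the paper):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])

# Dividing all logits by one positive constant preserves their ranking,
# so the predicted class never changes under temperature scaling.
for temp in [0.5, 1, 10]:
    assert np.argmax(logits / temp) == np.argmax(logits)

# An affine map (as in Platt/matrix scaling, W @ z + b) can reorder classes:
W = np.diag([0.1, 2.0, 1.0])
b = np.zeros(3)
print(np.argmax(logits), np.argmax(W @ logits + b))  # prints: 0 1
```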

3- Reliability diagrams can give a poor impression of ECE when the bins are not balanced, since ECE is sample-weighted and reliability diagrams aren't.

That is why it's common practice to pair it with a second diagram that shows the number of samples in each bin. Check this scikit-learn tutorial: https://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html#sphx-glr-auto-examples-calibration-plot-calibration-curve-py
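For concreteness, here is a minimal sketch of sample-weighted ECE with equal-width bins as in Guo et al. (binning details are illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then sum the |accuracy - confidence|
    gap per bin, weighted by the fraction of samples in that bin.
    A nearly-empty bin can look dramatic on a reliability diagram
    yet contribute almost nothing to ECE."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight = fraction of all samples
    return ece

# e.g. expected_calibration_error(np.array([0.9, 0.8, 0.95, 0.4]),
#                                 np.array([1, 1, 0, 0]))
```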


4- For follow-up work, check "Calibration of Encoder Decoder Models for Neural Machine Translation": https://arxiv.org/pdf/1903.00802.pdf

5- There is some recent work showing that modeling the generative and discriminative processes together in a classifier can help the calibration of the resulting discriminative model a lot; the combination can be seen as an "energy-based model". There's an interesting paper about that by Will Grathwohl: "Your Classifier Is Secretly an Energy Based Model and You Should Treat It Like One" (video).
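The core reinterpretation in that paper: the same logits that give p(y|x) via softmax also define an unnormalized log-density over inputs via logsumexp. A rough sketch of the idea (any model producing logits would do; names here are illustrative):

```python
import numpy as np

def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def jem_view(logits):
    """Grathwohl et al.'s reading of a classifier:
    - p(y|x) is the usual softmax over the logits;
    - logsumexp(logits) is log p(x) up to the (intractable) log-partition
      constant, so the classifier doubles as an energy-based model of x."""
    log_p_y_given_x = logits - logsumexp(logits)  # discriminative view
    unnormalized_log_p_x = logsumexp(logits)      # generative view
    return log_p_y_given_x, unnormalized_log_p_x
```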