devmotion / Calibration_ICLR2021

Repository of ICLR 2021 paper "Calibration tests beyond classification"
https://devmotion.github.io/Calibration_ICLR2021/dev/
MIT License

Fix comments about kernel choice #6

Closed devmotion closed 1 year ago

devmotion commented 2 years ago

Summary

I noticed that some comments in our paper regarding the kernel choice are incorrect. My intuition was wrong, and unfortunately it seemed so obvious that I did not write down a proper proof (which, of course, would not have been possible, but attempting one would probably have prevented this error). Fortunately, this issue affects neither the experiments nor any other results in the paper.

More concretely, the issue is that in general the characteristic property of the tensor product kernel requires stronger assumptions: it is not sufficient that the kernel on the targets is characteristic and the kernel on the predictions is non-zero almost surely.

The intuitive reason is that generally the difference of the two distributions of interest that we compare in the calibration test can't be factorized into a (signed) measure on the target space and the measure on the space of predictions - even though the structure of the random variables $(P_X, Y)$ and $(P_X, Z_X)$ might lead one to believe such incorrect claims.
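
To make the obstruction explicit, the difference of the two laws can be written as a disintegration over the predictions. This is only a sketch: it assumes, as in the definition of the KCE, that $Z_X \mid P_X \sim P_X$, and the symbol $\mu$ is introduced here purely for illustration:

$$ \mu(\mathrm{d}p, \mathrm{d}y) = \mathbb{P}_{P_X}(\mathrm{d}p) \left( \mathbb{P}(Y \in \mathrm{d}y \mid P_X = p) - p(\mathrm{d}y) \right). $$

The second factor depends on $p$, so $\mu$ is in general not a product of a measure on the predictions and a signed measure on the targets. The model is calibrated exactly when $\mu = 0$.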


Example

More specifically, a counter-example to the claim is the tensor product kernel $k((p, y), (p', y')) = \delta_{y,y'}$ with $\mathcal{Y} = \{1, \ldots, n\}$ (classification with $n > 1$ classes) and $\mathcal{P} = \Delta^{n-1}$ (corresponding probability simplex):

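The kernel satisfies the incorrect requirements in the paper: the kernel $\delta_{y,y'}$ on the targets is characteristic, and the kernel on the predictions is constant and equal to 1, hence non-zero.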

But there are uncalibrated models with zero kernel calibration error:

$$
\begin{split}
KCE_k^2 &= E[k((P_X, Y), (P_{X'}, Y'))] - 2 E[k((P_X, Y), (P_{X'}, Z_{X'}))] + E[k((P_X, Z_X), (P_{X'}, Z_{X'}))] \\
&= E[\delta_{Y,Y'}] - 2 E[\delta_{Y,Z_{X'}}] + E[\delta_{Z_X,Z_{X'}}] \\
&= \sum_{y,y'=1}^n \mathbb{P}(Y = y) \mathbb{P}(Y' = y') \delta_{y,y'} - 2 \sum_{y=1}^n \mathbb{P}(Y = y) E[\delta_{y,Z_{X}}] + \sum_{y=1}^n E[\delta_{y,Z_X}\delta_{y,Z_{X'}}] \\
&= \sum_{y=1}^n \mathbb{P}(Y = y)^2 - 2 \sum_{y=1}^n \mathbb{P}(Y = y) E[\delta_{y,Z_{X}}] + \sum_{y=1}^n E[\delta_{y,Z_X}]^2 \\
&= \sum_{y=1}^n (\mathbb{P}(Y = y) - E[\delta_{y,Z_{X}}])^2 \\
&= \sum_{y=1}^n (\mathbb{P}(Y = y) - E[E[\delta_{y,Z_{X}} | P_X]])^2 \\
&= \sum_{y=1}^n (\mathbb{P}(Y = y) - E[P_{X}(\{y\})])^2.
\end{split}
$$
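
The last expression vanishes as soon as the average prediction matches the marginal class distribution, i.e. $E[P_X(\{y\})] = \mathbb{P}(Y = y)$ for all $y$, which is much weaker than calibration. The following Python sketch checks this numerically on a toy model. Everything in it is an assumption made for illustration and not part of the paper or the repository: the two-input model, the helper `kce_squared`, and the Gaussian kernel used for comparison. Classes are indexed from 0 instead of 1.

```python
import math
from fractions import Fraction as F
from itertools import product


def kce_squared(model, kernel):
    """Exact squared KCE of a model with finitely many distinct inputs.

    `model` is a list of (weight, p, q): the probability of an input, the
    predicted distribution p = P_X, and the true conditional law q of Y.
    """
    n = len(model[0][1])
    # Atoms (probability, prediction, label) of the laws of (P_X, Y) and (P_X, Z_X).
    real = [(w * q[y], p, y) for w, p, q in model for y in range(n)]
    ideal = [(w * p[y], p, y) for w, p, q in model for y in range(n)]

    def expect(a, b):
        # E[k(U, V)] for independent U ~ a and V ~ b.
        return sum(wa * wb * kernel((pa, ya), (pb, yb))
                   for (wa, pa, ya), (wb, pb, yb) in product(a, b))

    return expect(real, real) - 2 * expect(real, ideal) + expect(ideal, ideal)


def delta_kernel(a, b):
    # Counter-example kernel: ignores the predictions entirely.
    return 1 if a[1] == b[1] else 0


def gaussian_delta_kernel(a, b):
    # Illustrative alternative: Gaussian kernel on predictions times delta on targets.
    sq_dist = sum((s - t) ** 2 for s, t in zip(a[0], b[0]))
    return math.exp(-float(sq_dist)) * delta_kernel(a, b)


# Two equally likely inputs; the model swaps the true class probabilities,
# so it is uncalibrated although its average prediction matches P(Y = y).
p_a, p_b = (F(4, 5), F(1, 5)), (F(1, 5), F(4, 5))
model = [(F(1, 2), p_a, p_b), (F(1, 2), p_b, p_a)]

kce_delta = kce_squared(model, delta_kernel)
kce_gauss = kce_squared(model, gaussian_delta_kernel)
print(kce_delta)  # exactly zero although the model is uncalibrated
print(kce_gauss)  # strictly positive
```

Exact rational arithmetic is used for the delta kernel so that the zero is not an artifact of floating-point rounding. With the non-constant kernel on the predictions, the same model is detected as uncalibrated.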

More generally, in classification a tensor product kernel $k = k_{\mathcal{P}} \otimes k_{\mathcal{Y}}$ is characteristic if and only if $k_{\mathcal{P}}$ and $k_{\mathcal{Y}}$ are universal (see Corollary 3.15 in Steinwart and Ziegel (2021)). However, in the example here only $k_{\mathcal{Y}}$ is universal but $k_{\mathcal{P}}$ is not since it is constant (see, e.g., Micchelli et al. (2006)).