Fix comments about kernel choice

Summary

I noticed that some comments in our paper regarding the kernel choice are incorrect. My intuition was wrong, and unfortunately the claim seemed so obvious that I did not write down a proper proof (which, of course, would not have been possible, but attempting one would probably have prevented this error). Fortunately, this issue affects neither the experiments nor any other results in the paper.

More concretely, the issue is that in general the characteristic property of the tensor product kernel requires stronger assumptions: it is not sufficient that the kernel on the targets is characteristic and the kernel on the predictions is non-zero almost surely.

The intuitive reason is that, in general, the difference of the two distributions that we compare in the calibration test cannot be factorized into a (signed) measure on the target space and a measure on the space of predictions, even though the structure of the random variables $(P_X, Y)$ and $(P_X, Z_X)$ might lead one to believe such incorrect claims.

Example

More specifically, a counter-example to the claim is the tensor product kernel $k((p, y), (p', y')) = \delta_{y,y'}$ with $\mathcal{Y} = \{1, \ldots, n\}$ (classification with $n > 1$ classes) and $\mathcal{P} = \Delta^{n-1}$ (corresponding probability simplex):

The kernel satisfies the incorrect requirements in the paper:

$k_{\mathcal{Y}}(y, y') = \delta_{y,y'}$ is a universal, and hence also characteristic, kernel since it is strictly positive definite (see, e.g., section 3.3 in Sriperumbudur et al. (2011)).
$k_{\mathcal{P}}(p, p') = 1$ is non-zero almost surely (regardless of the distribution of $P_X$).

But there are uncalibrated models with zero kernel calibration error:

We have

$$
\begin{split}
\operatorname{KCE}_k^2 &= E[k((P_X, Y), (P_{X'}, Y'))] - 2 E[k((P_X, Y), (P_{X'}, Z_{X'}))] + E[k((P_X, Z_X), (P_{X'}, Z_{X'}))] \\
&= E[\delta_{Y,Y'}] - 2 E[\delta_{Y,Z_{X'}}] + E[\delta_{Z_X,Z_{X'}}] \\
&= \sum_{y,y'=1}^n \mathbb{P}(Y = y) \mathbb{P}(Y' = y') \delta_{y,y'} - 2 \sum_{y=1}^n \mathbb{P}(Y = y) E[\delta_{y,Z_{X}}] + \sum_{y=1}^n E[\delta_{y,Z_X}\delta_{y,Z_{X'}}] \\
&= \sum_{y=1}^n \mathbb{P}(Y = y)^2 - 2 \sum_{y=1}^n \mathbb{P}(Y = y) E[\delta_{y,Z_{X}}] + \sum_{y=1}^n E[\delta_{y,Z_X}]^2 \\
&= \sum_{y=1}^n (\mathbb{P}(Y = y) - E[\delta_{y,Z_{X}}])^2 \\
&= \sum_{y=1}^n (\mathbb{P}(Y = y) - E[E[\delta_{y,Z_{X}} | P_X]])^2 \\
&= \sum_{y=1}^n (\mathbb{P}(Y = y) - E[P_{X}(\{y\})])^2.
\end{split}
$$

Consider a model with uniformly distributed predictions $$P_X \sim \operatorname{Dirichlet}(1, 1)$$ and targets $$Y | P_X = \operatorname{Categorical}(p_1, p_2) \sim \operatorname{Categorical}(1(p_1 < 0.5), 1(p_2 < 0.5)).$$
That model is uncalibrated since almost surely $\mathbb{P}(Y | P_X) \neq P_X$.
For $y = 1,2$ we have $P_X(\{y\}) \sim \operatorname{Beta}(1, 1) = \operatorname{U}(0, 1)$ and hence $E[P_{X}(\{y\})] = 0.5$.
Additionally, for $y = 1,2$, due to the uniform distribution of $P_X(\{y\})$ we have $$\mathbb{P}(Y = y) = \int_{0}^1 \mathbb{P}(Y = y | P_X(\{y\}) = p) \mathrm{d}p = \int_0^1 1(p < 0.5) \mathrm{d}p = 0.5.$$
Thus $\operatorname{KCE}_k^2 = (0.5 - 0.5)^2 + (0.5 - 0.5)^2 = 0$.
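
As a sanity check, the counter-example can be simulated. The following is only a minimal sketch and not code from the paper or the repository: the function name `simulate` and its parameters are made up for illustration, and the squared KCE is approximated by plugging sample means into the closed-form expression $\sum_y (\mathbb{P}(Y = y) - E[P_X(\{y\})])^2$ derived above. It also reports the average gap $|\mathbb{P}(Y = 1 | P_X) - P_X(\{1\})|$ as a rough indication that the model is far from calibrated.

```python
import random


def simulate(num_samples=100_000, seed=0):
    """Monte Carlo check of the binary counter-example (illustrative only)."""
    rng = random.Random(seed)
    count_y1 = 0   # number of samples with Y = 1
    sum_p1 = 0.0   # running sum of P_X({1})
    sum_gap = 0.0  # running sum of |P(Y = 1 | P_X) - P_X({1})|
    for _ in range(num_samples):
        p1 = rng.random()    # P_X({1}) ~ U(0, 1), and P_X({2}) = 1 - p1
        y_is_1 = p1 < 0.5    # Y is deterministic given P_X: Y = 1 iff p1 < 0.5
        count_y1 += y_is_1
        sum_p1 += p1
        sum_gap += abs(float(y_is_1) - p1)
    prob_y1 = count_y1 / num_samples
    mean_p1 = sum_p1 / num_samples
    # Both classes contribute the same squared difference because the
    # probabilities of the two classes sum to one.
    kce_squared = 2 * (prob_y1 - mean_p1) ** 2
    mean_gap = sum_gap / num_samples
    return kce_squared, mean_gap
```

Calling `simulate()` should return a squared KCE close to zero together with an average gap that is clearly bounded away from zero, in line with the calculation above.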

More generally, in classification a tensor product kernel $k = k_{\mathcal{P}} \otimes k_{\mathcal{Y}}$ is characteristic if and only if $k_{\mathcal{P}}$ and $k_{\mathcal{Y}}$ are universal (see Corollary 3.15 in Steinwart and Ziegel (2021)). However, in the example here only $k_{\mathcal{Y}}$ is universal but $k_{\mathcal{P}}$ is not since it is constant (see, e.g., Micchelli et al. (2006)).
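
To illustrate the role of the kernel on the predictions, the sketch below compares the constant kernel with a Gaussian kernel on the same binary model. It is an assumption-laden illustration rather than the estimator used in the paper: all function names are hypothetical, the bandwidth of 0.25 is arbitrary, and the estimate is a simple biased plug-in average over all pairs of samples. It uses that, for $k = k_{\mathcal{P}} \otimes \delta$, integrating out $Z_X$ and $Z_{X'}$ in the first line of the derivation above yields the summand $k_{\mathcal{P}}(p, p') (e_y - p)^\top (e_{y'} - p')$, where $e_y$ denotes the one-hot encoding of $y$.

```python
import math
import random


def sample_model(num_samples, seed=0):
    """Draw (prediction, target) pairs from the binary counter-example."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_samples):
        p1 = rng.random()            # P_X({1}) ~ U(0, 1)
        y = 0 if p1 < 0.5 else 1     # 0-based class index, deterministic given P_X
        data.append(((p1, 1.0 - p1), y))
    return data


def constant_kernel(p, q):
    """Kernel on predictions from the counter-example (not universal)."""
    return 1.0


def gaussian_kernel(p, q, bandwidth=0.25):
    """Gaussian kernel on predictions; the bandwidth is an arbitrary choice."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-squared_distance / (2 * bandwidth**2))


def kce_squared(data, kernel_p):
    """Biased plug-in estimate of the squared KCE for k = kernel_p (x) delta."""
    # Residuals e_y - p: one-hot encoded target minus predicted probabilities.
    residuals = [
        tuple((1.0 if c == y else 0.0) - pc for c, pc in enumerate(p))
        for p, y in data
    ]
    total = 0.0
    for (p, _), r in zip(data, residuals):
        for (q, _), s in zip(data, residuals):
            total += kernel_p(p, q) * sum(a * b for a, b in zip(r, s))
    return total / len(data) ** 2
```

With `data = sample_model(600)`, one would expect `kce_squared(data, constant_kernel)` to be close to zero, whereas `kce_squared(data, gaussian_kernel)` should be clearly positive, so the miscalibration is detected once the kernel on the predictions is no longer constant. The double loop is quadratic in the number of samples, so the sample size is kept small here.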
