deeplearning-wisc / gradnorm_ood

On the Importance of Gradients for Detecting Distributional Shifts in the Wild

[Question] Overconfidence on OOD data and the assumption on uniform distribution? #5

Closed KacperKubara closed 2 years ago

KacperKubara commented 2 years ago

Hi,

In the paper you make a statement on the ID and OOD data:

Gradients are backpropagated from the Kullback–Leibler (KL) divergence [22] between the softmax output and a uniform distribution. ID data is expected to have larger KL divergence because the prediction tends to concentrate on one of the ground-truth classes and is therefore less uniformly distributed. As depicted in Figure 1, our key idea is that the gradient norm of the KL divergence is higher for ID data than that for OOD data, making it informative for OOD uncertainty estimation.

However, it is also well known that models tend to be overconfident on OOD data, in which case I would expect the softmax distribution for OOD inputs to be 'spiky' as well. I see that you use temperature scaling on the softmax output, which can calibrate the network, but in most of the experiments this parameter is set to 1. So I was wondering what your take is on the assumption that ID data is expected to have a larger KL divergence than OOD data. The gradient-based score seems to work quite well, so I suspect I don't fully understand the problem. If you could help me clarify that, it would be great!
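For concreteness, here is my rough sketch of the scoring procedure the quoted passage describes, including the temperature parameter I mentioned (this is not the repository's actual code, and the `model.fc` classifier head is my assumption):

```python
import torch
import torch.nn.functional as F

def gradnorm_score(model, x, num_classes, temperature=1.0):
    """Gradient-norm score for a single input x (batch size 1)."""
    model.zero_grad()
    logits = model(x)                                  # [1, num_classes]
    # Uniform target distribution over the classes
    u = torch.ones_like(logits) / num_classes
    # KL(u || softmax(logits / T)) equals cross-entropy with a uniform target
    # up to an additive constant, so the gradients are identical.
    log_p = F.log_softmax(logits / temperature, dim=-1)
    loss = torch.sum(-u * log_p)
    loss.backward()
    # Norm of the gradient w.r.t. the classifier head's weights; a larger score
    # is taken as evidence that the input is in-distribution.
    return model.fc.weight.grad.abs().sum().item()     # L1 norm as one choice
```

With temperature = 1 this is the setting from most of your experiments that I was asking about.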

References on model overconfidence: http://arxiv.org/abs/1610.02136 http://mi.eng.cam.ac.uk/reports/svr-ftp/evermann_stw00.pdf https://arxiv.org/abs/1906.02530

Thanks, Kacper

iurgnauh commented 2 years ago

Hi Kacper,

Thanks for your interest in our work! This is a great question! You are absolutely correct that overconfidence in neural networks is a huge issue, and that is exactly one of our motivations for resorting to the gradient space for OOD detection. As for your question, I have the following three points that might be relevant:

  1. One advantage of the KL divergence is that it accounts for the probability distribution over all classes rather than the dominant class alone. Suppose 90% of the confidence is concentrated on a dominant class; how the remaining 10% is distributed over the rest of the classes can still differ between ID and OOD data.
  2. This is empirically confirmed to some extent in Figure 2, where we plot the GradNorm distribution using both a uniform vector and a one-hot vector as the target. Interestingly, with a one-hot target (the label of the dominant class), ID and OOD data are hard to separate. However, when we switch to the uniform target (as in the KL divergence), ID and OOD data become more separable, which demonstrates the benefit of utilizing information from all labels.
  3. Last but not least, in Section 5 we showed that GradNorm can be decomposed into the product of two terms, capturing joint information from both the output space and the feature space. Therefore, another advantage of our method is that it also extracts useful information from the feature space to mitigate the overconfidence issue in the output space; see the sketch after this list.
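To make the decomposition in point 3 concrete, here is a toy check (a minimal sketch, not our released code; the shapes and variable names are illustrative assumptions): for a linear classifier head, the gradient of the uniform-target loss w.r.t. the head's weights is the outer product of (softmax - u) and the feature vector, so its L1 norm factors into an output-space term times a feature-space term.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, feat_dim = 10, 512
W = torch.randn(num_classes, feat_dim, requires_grad=True)  # linear head weights
feat = torch.relu(torch.randn(feat_dim))                     # penultimate features

logits = W @ feat
u = torch.full((num_classes,), 1.0 / num_classes)            # uniform target
loss = torch.sum(-u * F.log_softmax(logits, dim=-1))         # uniform-target cross-entropy
loss.backward()

# The gradient w.r.t. W is the outer product of (softmax - u) and feat, so its
# L1 norm splits into an output-space factor times a feature-space factor.
lhs = W.grad.abs().sum()
rhs = ((F.softmax(logits, dim=-1) - u).abs().sum() * feat.abs().sum()).detach()
print(torch.allclose(lhs, rhs))  # True
```

The output-space factor is small when the softmax is close to uniform, while the feature-space factor carries information that the softmax alone does not, which is what helps with the overconfidence issue.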

Hope this answers your question!

Best, Rui

KacperKubara commented 2 years ago

Hi Rui, Thanks for the great answer, that makes it much clearer now!