Abstract
Reliably estimating the uncertainty of a prediction throughout the model lifecycle is crucial in many safety-critical applications. The most common way to measure this uncertainty is via the predicted confidence. While this tends to work well for in-domain samples, these estimates are unreliable under domain drift. Alternatively, a bias-variance decomposition allows to directly measure the predictive uncertainty across the entire input space. But, such a decomposition for proper scores does not exist in current literature, and for exponential families it is convoluted. In this work, we introduce a general bias-variance decomposition for proper scores and reformulate the exponential family case, giving rise to the Bregman Information as the variance term in both cases. This allows us to prove that the Bregman Information for classification measures the uncertainty in the logit space. We showcase the practical relevance of this decomposition on two downstream tasks. First, we show how to construct confidence intervals for predictions on the instance-level based on the Bregman Information. Second, we demonstrate how different approximations of the instance-level Bregman Information allow reliable out-of-distribution detection for all degrees of domain drift.
Keyword: expected calibration error
There is no result
Keyword: overconfident
Augmentation by Counterfactual Explanation -- Fixing an Overconfident Classifier
Abstract
A highly accurate but overconfident model is ill-suited for deployment in critical applications such as healthcare and autonomous driving. The classification outcome should reflect a high uncertainty on ambiguous in-distribution samples that lie close to the decision boundary. The model should also refrain from making overconfident decisions on samples that lie far outside its training distribution, far-out-of-distribution (far-OOD), or on unseen samples from novel classes that lie near its training distribution (near-OOD). This paper proposes an application of counterfactual explanations in fixing an over-confident classifier. Specifically, we propose to fine-tune a given pre-trained classifier using augmentations from a counterfactual explainer (ACE) to fix its uncertainty characteristics while retaining its predictive performance. We perform extensive experiments with detecting far-OOD, near-OOD, and ambiguous samples. Our empirical results show that the revised model have improved uncertainty measures, and its performance is competitive to the state-of-the-art methods.
A study of uncertainty quantification in overparametrized high-dimensional models
Authors: Lucas Clarté, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová
Abstract
Uncertainty quantification is a central challenge in reliable and trustworthy machine learning. Naive measures such as last-layer scores are well-known to yield overconfident estimates in the context of overparametrized neural networks. Several methods, ranging from temperature scaling to different Bayesian treatments of neural networks, have been proposed to mitigate overconfidence, most often supported by the numerical observation that they yield better calibrated uncertainty measures. In this work, we provide a sharp comparison between popular uncertainty measures for binary classification in a mathematically tractable model for overparametrized neural networks: the random features model. We discuss a trade-off between classification accuracy and calibration, unveiling a double descent like behavior in the calibration curve of optimally regularized estimators as a function of overparametrization. This is in contrast with the empirical Bayes method, which we show to be well calibrated in our setting despite the higher generalization error and overparametrization.
Keyword: overconfidence
A study of uncertainty quantification in overparametrized high-dimensional models
Authors: Lucas Clarté, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová
Abstract
Uncertainty quantification is a central challenge in reliable and trustworthy machine learning. Naive measures such as last-layer scores are well-known to yield overconfident estimates in the context of overparametrized neural networks. Several methods, ranging from temperature scaling to different Bayesian treatments of neural networks, have been proposed to mitigate overconfidence, most often supported by the numerical observation that they yield better calibrated uncertainty measures. In this work, we provide a sharp comparison between popular uncertainty measures for binary classification in a mathematically tractable model for overparametrized neural networks: the random features model. We discuss a trade-off between classification accuracy and calibration, unveiling a double descent like behavior in the calibration curve of optimally regularized estimators as a function of overparametrization. This is in contrast with the empirical Bayes method, which we show to be well calibrated in our setting despite the higher generalization error and overparametrization.
Keyword: confidence
Probing with Noise: Unpicking the Warp and Weft of Embeddings
Authors: Filip Klubička, John D. Kelleher
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Abstract
Improving our understanding of how information is encoded in vector space can yield valuable interpretability insights. Alongside vector dimensions, we argue that it is possible for the vector norm to also carry linguistic information. We develop a method to test this: an extension of the probing framework which allows for relative intrinsic interpretations of probing results. It relies on introducing noise that ablates information encoded in embeddings, grounded in random baselines and confidence intervals. We apply the method to well-established probing tasks and find evidence that confirms the existence of separate information containers in English GloVe and BERT embeddings. Our correlation analysis aligns with the experimental findings that different encoders use the norm to encode different kinds of information: GloVe stores syntactic and sentence length information in the vector norm, while BERT uses it to encode contextual incongruity.
Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition
Abstract
Reliably estimating the uncertainty of a prediction throughout the model lifecycle is crucial in many safety-critical applications. The most common way to measure this uncertainty is via the predicted confidence. While this tends to work well for in-domain samples, these estimates are unreliable under domain drift. Alternatively, a bias-variance decomposition allows to directly measure the predictive uncertainty across the entire input space. But, such a decomposition for proper scores does not exist in current literature, and for exponential families it is convoluted. In this work, we introduce a general bias-variance decomposition for proper scores and reformulate the exponential family case, giving rise to the Bregman Information as the variance term in both cases. This allows us to prove that the Bregman Information for classification measures the uncertainty in the logit space. We showcase the practical relevance of this decomposition on two downstream tasks. First, we show how to construct confidence intervals for predictions on the instance-level based on the Bregman Information. Second, we demonstrate how different approximations of the instance-level Bregman Information allow reliable out-of-distribution detection for all degrees of domain drift.
On the Calibration of Massively Multilingual Language Models
Abstract
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. While there has been much work in evaluating these models for their performance on a variety of tasks and languages, little attention has been paid on how well calibrated these models are with respect to the confidence in their predictions. We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages or those which are typologically diverse from English. Next, we empirically show that calibration methods like temperature scaling and label smoothing do reasonably well towards improving calibration in the zero-shot scenario. We also find that few-shot examples in the language can further help reduce the calibration errors, often substantially. Overall, our work contributes towards building more reliable multilingual models by highlighting the issue of their miscalibration, understanding what language and model specific factors influence it, and pointing out the strategies to improve the same.
HuPR: A Benchmark for Human Pose Estimation Using Millimeter Wave Radar
Abstract
This paper introduces a novel human pose estimation benchmark, Human Pose with Millimeter Wave Radar (HuPR), that includes synchronized vision and radio signal components. This dataset is created using cross-calibrated mmWave radar sensors and a monocular RGB camera for cross-modality training of radar-based human pose estimation. There are two advantages of using mmWave radar to perform human pose estimation. First, it is robust to dark and low-light conditions. Second, it is not visually perceivable by humans and thus, can be widely applied to applications with privacy concerns, e.g., surveillance systems in patient rooms. In addition to the benchmark, we propose a cross-modality training framework that leverages the ground-truth 2D keypoints representing human body joints for training, which are systematically generated from the pre-trained 2D pose estimation network based on a monocular camera input image, avoiding laborious manual label annotation efforts. The framework consists of a new radar pre-processing method that better extracts the velocity information from radar data, Cross- and Self-Attention Module (CSAM), to fuse multi-scale radar features, and Pose Refinement Graph Convolutional Networks (PRGCN), to refine the predicted keypoint confidence heatmaps. Our intensive experiments on the HuPR benchmark show that the proposed scheme achieves better human pose estimation performance with only radar data, as compared to traditional pre-processing solutions and previous radio-frequency-based methods.
Fast Beam Alignment via Pure Exploration in Multi-armed Bandits
Authors: Yi Wei, Zixin Zhong, Vincent Y. F. Tan
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract
The beam alignment (BA) problem consists in accurately aligning the transmitter and receiver beams to establish a reliable communication link in wireless communication systems. Existing BA methods search the entire beam space to identify the optimal transmit-receive beam pair. This incurs a significant latency when the number of antennas is large. In this work, we develop a bandit-based fast BA algorithm to reduce BA latency for millimeter-wave (mmWave) communications. Our algorithm is named Two-Phase Heteroscedastic Track-and-Stop (2PHT\&S). We first formulate the BA problem as a pure exploration problem in multi-armed bandits in which the objective is to minimize the required number of time steps given a certain fixed confidence level. By taking advantage of the correlation structure among beams that the information from nearby beams is similar and the heteroscedastic property that the variance of the reward of an arm (beam) is related to its mean, the proposed algorithm groups all beams into several beam sets such that the optimal beam set is first selected and the optimal beam is identified in this set after that. Theoretical analysis and simulation results on synthetic and semi-practical channel data demonstrate the clear superiority of the proposed algorithm vis-`a-vis other baseline competitors.
A study of uncertainty quantification in overparametrized high-dimensional models
Authors: Lucas Clarté, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová
Abstract
Uncertainty quantification is a central challenge in reliable and trustworthy machine learning. Naive measures such as last-layer scores are well-known to yield overconfident estimates in the context of overparametrized neural networks. Several methods, ranging from temperature scaling to different Bayesian treatments of neural networks, have been proposed to mitigate overconfidence, most often supported by the numerical observation that they yield better calibrated uncertainty measures. In this work, we provide a sharp comparison between popular uncertainty measures for binary classification in a mathematically tractable model for overparametrized neural networks: the random features model. We discuss a trade-off between classification accuracy and calibration, unveiling a double descent like behavior in the calibration curve of optimally regularized estimators as a function of overparametrization. This is in contrast with the empirical Bayes method, which we show to be well calibrated in our setting despite the higher generalization error and overparametrization.
Learning Neural Radiance Fields from Multi-View Geometry
Authors: Marco Orsingher, Paolo Zani, Paolo Medici, Massimo Bertozzi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
We present a framework, called MVG-NeRF, that combines classical Multi-View Geometry algorithms and Neural Radiance Fields (NeRF) for image-based 3D reconstruction. NeRF has revolutionized the field of implicit 3D representations, mainly due to a differentiable volumetric rendering formulation that enables high-quality and geometry-aware novel view synthesis. However, the underlying geometry of the scene is not explicitly constrained during training, thus leading to noisy and incorrect results when extracting a mesh with marching cubes. To this end, we propose to leverage pixelwise depths and normals from a classical 3D reconstruction pipeline as geometric priors to guide NeRF optimization. Such priors are used as pseudo-ground truth during training in order to improve the quality of the estimated underlying surface. Moreover, each pixel is weighted by a confidence value based on the forward-backward reprojection error for additional robustness. Experimental results on real-world data demonstrate the effectiveness of this approach in obtaining clean 3D meshes from images, while maintaining competitive performances in novel view synthesis.
Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data
Authors: Nabeel Seedat, Jonathan Crabbé, Ioana Bica, Mihaela van der Schaar
Abstract
High model performance, on average, can hide that models may systematically underperform on subgroups of the data. We consider the tabular setting, which surfaces the unique issue of outcome heterogeneity - this is prevalent in areas such as healthcare, where patients with similar features can have different outcomes, thus making reliable predictions challenging. To tackle this, we propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We do this by analyzing the behavior of individual examples during training, based on their predictive confidence and, importantly, the aleatoric (data) uncertainty. Capturing the aleatoric uncertainty permits a principled characterization and then subsequent stratification of data examples into three distinct subgroups (Easy, Ambiguous, Hard). We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets. We show that Data-IQ's characterization of examples is most robust to variation across similarly performant (yet different) models, compared to baselines. Since Data-IQ can be used with any ML model (including neural networks, gradient boosting etc.), this property ensures consistency of data characterization, while allowing flexible model selection. Taking this a step further, we demonstrate that the subgroups enable us to construct new approaches to both feature acquisition and dataset selection. Furthermore, we highlight how the subgroups can inform reliable model usage, noting the significant impact of the Ambiguous subgroup on model generalization.
ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation
Authors: Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen
Abstract
We study the text generation task under the approach of pre-trained language models (PLMs). Typically, an auto-regressive (AR) method is adopted for generating texts in a token-by-token manner. Despite many advantages of AR generation, it usually suffers from inefficient inference. Therefore, non-autoregressive (NAR) models are proposed to generate all target tokens simultaneously. However, NAR models usually generate texts of lower quality due to the absence of token dependency in the output text. In this paper, we propose ELMER: an efficient and effective PLM for NAR text generation to explicitly model the token dependency during NAR generation. By leveraging the early exit technique, ELMER enables the token generations at different layers, according to their prediction confidence (a more confident token will exit at a lower layer). Besides, we propose a novel pre-training objective, Layer Permutation Language Modeling, to pre-train ELMER by permuting the exit layer for each token in sequences. Experiments on three text generation tasks show that ELMER significantly outperforms NAR models and further narrows the performance gap with AR PLMs (\eg ELMER (29.92) vs BART (30.61) ROUGE-L in XSUM) while achieving over 10 times inference speedup.
Reliability-Aware Prediction via Uncertainty Learning for Person Image Retrieval
Abstract
Current person image retrieval methods have achieved great improvements in accuracy metrics. However, they rarely describe the reliability of the prediction. In this paper, we propose an Uncertainty-Aware Learning (UAL) method to remedy this issue. UAL aims at providing reliability-aware predictions by considering data uncertainty and model uncertainty simultaneously. Data uncertainty captures the ``noise" inherent in the sample, while model uncertainty depicts the model's confidence in the sample's prediction. Specifically, in UAL, (1) we propose a sampling-free data uncertainty learning method to adaptively assign weights to different samples during training, down-weighting the low-quality ambiguous samples. (2) we leverage the Bayesian framework to model the model uncertainty by assuming the parameters of the network follow a Bernoulli distribution. (3) the data uncertainty and the model uncertainty are jointly learned in a unified network, and they serve as two fundamental criteria for the reliability assessment: if a probe is high-quality (low data uncertainty) and the model is confident in the prediction of the probe (low model uncertainty), the final ranking will be assessed as reliable. Experiments under the risk-controlled settings and the multi-query settings show the proposed reliability assessment is effective. Our method also shows superior performance on three challenging benchmarks under the vanilla single query settings.
Keyword: scaling
A New Perspective for Understanding Generalization Gap of Deep Neural Networks Trained with Large Batch Sizes
Authors: Oyebade K. Oyedotun, Konstantinos Papadopoulos, Djamila Aouada
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract
Deep neural networks (DNNs) are typically optimized using various forms of mini-batch gradient descent algorithm. A major motivation for mini-batch gradient descent is that with a suitably chosen batch size, available computing resources can be optimally utilized (including parallelization) for fast model training. However, many works report the progressive loss of model generalization when the training batch size is increased beyond some limits. This is a scenario commonly referred to as generalization gap. Although several works have proposed different methods for alleviating the generalization gap problem, a unanimous account for understanding generalization gap is still lacking in the literature. This is especially important given that recent works have observed that several proposed solutions for generalization gap problem such learning rate scaling and increased training budget do not indeed resolve it. As such, our main exposition in this paper is to investigate and provide new perspectives for the source of generalization loss for DNNs trained with a large batch size. Our analysis suggests that large training batch size results in increased near-rank loss of units' activation (i.e. output) tensors, which consequently impacts model optimization and generalization. Extensive experiments are performed for validation on popular DNN models such as VGG-16, residual network (ResNet-56) and LeNet-5 using CIFAR-10, CIFAR-100, Fashion-MNIST and MNIST datasets.
On the Calibration of Massively Multilingual Language Models
Abstract
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. While there has been much work in evaluating these models for their performance on a variety of tasks and languages, little attention has been paid on how well calibrated these models are with respect to the confidence in their predictions. We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages or those which are typologically diverse from English. Next, we empirically show that calibration methods like temperature scaling and label smoothing do reasonably well towards improving calibration in the zero-shot scenario. We also find that few-shot examples in the language can further help reduce the calibration errors, often substantially. Overall, our work contributes towards building more reliable multilingual models by highlighting the issue of their miscalibration, understanding what language and model specific factors influence it, and pointing out the strategies to improve the same.
Exploring Representation-Level Augmentation for Code Search
Abstract
Code search, which aims at retrieving the most relevant code fragment for a given natural language query, is a common activity in software development practice. Recently, contrastive learning is widely used in code search research, where many data augmentation approaches for source code (e.g., semantic-preserving program transformation) are proposed to learn better representations. However, these augmentations are at the raw-data level, which requires additional code analysis in the preprocessing stage and additional training costs in the training stage. In this paper, we explore augmentation methods that augment data (both code and query) at representation level which does not require additional data processing and training, and based on this we propose a general format of representation-level augmentation that unifies existing methods. Then, we propose three new augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on the general format. Furthermore, we theoretically analyze the advantages of the proposed augmentation methods over traditional contrastive learning methods on code search. We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. The experimental results show that our approach can consistently boost the performance of the studied code search models. Our source code is available at https://github.com/Alex-HaochenLi/RACS.
A study of uncertainty quantification in overparametrized high-dimensional models
Authors: Lucas Clarté, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová
Abstract
Uncertainty quantification is a central challenge in reliable and trustworthy machine learning. Naive measures such as last-layer scores are well-known to yield overconfident estimates in the context of overparametrized neural networks. Several methods, ranging from temperature scaling to different Bayesian treatments of neural networks, have been proposed to mitigate overconfidence, most often supported by the numerical observation that they yield better calibrated uncertainty measures. In this work, we provide a sharp comparison between popular uncertainty measures for binary classification in a mathematically tractable model for overparametrized neural networks: the random features model. We discuss a trade-off between classification accuracy and calibration, unveiling a double descent like behavior in the calibration curve of optimally regularized estimators as a function of overparametrization. This is in contrast with the empirical Bayes method, which we show to be well calibrated in our setting despite the higher generalization error and overparametrization.
To Signal or Not to Signal? Layering Traffic Analysis Resistance on Secure Instant Messaging
Abstract
Traffic analysis for instant messaging (IM) applications continues to pose an important privacy challenge. In particular, transport-level data can leak unintentional information about IM -- such as who communicates with whom. Existing tools for metadata privacy have adoption obstacles, including the risks of being scrutinized for having a particular app installed, and performance overheads incompatible with mobile devices. We posit that resilience to traffic analysis must be directly supported by major IM services themselves, and must be done in a low-latency manner without breaking existing features. As a first step in this direction, we propose a hybrid model that combines regular network traffic and deniable messages. We present a novel protocol for deniable instant messaging that we call DenIM that is a variant of the Signal protocol. DenIM is built on the principle that deniable messages can be incorporated as part of padding in regular traffic. By padding traffic, DenIM achieves bandwidth overhead that scales with the volume of regular traffic, as opposed to scaling with time or the number of users. To show the effectiveness of DenIM, we construct a formal model and prove that DenIM's deniability guarantees hold against strong adversaries such as internet service providers, and implement and empirically evaluate a proof-of-concept version of DenIM.
GFlowOut: Dropout with Generative Flow Networks
Authors: Dianbo Liu, Moksh Jain, Bonaventure Dossou, Qianli Shen, Salem Lahlou, Anirudh Goyal, Nikolay Malkin, Chris Emezue, Dinghuai Zhang, Nadhir Hassen, Xu Ji, Kenji Kawaguchi, Yoshua Bengio
Abstract
Bayesian Inference offers principled tools to tackle many critical problems with modern neural networks such as poor calibration and generalization, and data inefficiency. However, scaling Bayesian inference to large architectures is challenging and requires restrictive approximations. Monte Carlo Dropout has been widely used as a relatively cheap way for approximate Inference and to estimate uncertainty with deep neural networks. Traditionally, the dropout mask is sampled independently from a fixed distribution. Recent works show that the dropout mask can be viewed as a latent variable, which can be inferred with variational inference. These methods face two important challenges: (a) the posterior distribution over masks can be highly multi-modal which can be difficult to approximate with standard variational inference and (b) it is not trivial to fully utilize sample-dependent information and correlation among dropout masks to improve posterior estimation. In this work, we propose GFlowOut to address these issues. GFlowOut leverages the recently proposed probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks. We empirically demonstrate that GFlowOut results in predictive distributions that generalize better to out-of-distribution data, and provide uncertainty estimates which lead to better performance in downstream tasks.
Reachability-Aware Laplacian Representation in Reinforcement Learning
Abstract
In Reinforcement Learning (RL), Laplacian Representation (LapRep) is a task-agnostic state representation that encodes the geometry of the environment. A desirable property of LapRep stated in prior works is that the Euclidean distance in the LapRep space roughly reflects the reachability between states, which motivates the usage of this distance for reward shaping. However, we find that LapRep does not necessarily have this property in general: two states having small distance under LapRep can actually be far away in the environment. Such mismatch would impede the learning process in reward shaping. To fix this issue, we introduce a Reachability-Aware Laplacian Representation (RA-LapRep), by properly scaling each dimension of LapRep. Despite the simplicity, we demonstrate that RA-LapRep can better capture the inter-state reachability as compared to LapRep, through both theoretical explanations and experimental results. Additionally, we show that this improvement yields a significant boost in reward shaping performance and also benefits bottleneck state discovery.
Keyword: calibration
On the Calibration of Massively Multilingual Language Models
Abstract
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. While there has been much work in evaluating these models for their performance on a variety of tasks and languages, little attention has been paid on how well calibrated these models are with respect to the confidence in their predictions. We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages or those which are typologically diverse from English. Next, we empirically show that calibration methods like temperature scaling and label smoothing do reasonably well towards improving calibration in the zero-shot scenario. We also find that few-shot examples in the language can further help reduce the calibration errors, often substantially. Overall, our work contributes towards building more reliable multilingual models by highlighting the issue of their miscalibration, understanding what language and model specific factors influence it, and pointing out the strategies to improve the same.
Whisker-Inspired Tactile Sensing for Contact Localization on Robot Manipulators
Authors: Michael A. Lin, Emilio Reyes, Jeannette Bohg, Mark R. Cutkosky
Abstract
Perceiving the environment through touch is important for robots to reach in cluttered environments, but devising a way to sense without disturbing objects is challenging. This work presents the design and modelling of whisker-inspired sensors that attach to the surface of a robot manipulator to sense its surrounding through light contacts. We obtain a sensor model using a calibration process that applies to straight and curved whiskers. We then propose a sensing algorithm using Bayesian filtering to localize contact points. The algorithm combines the accurate proprioceptive sensing of the robot and sensor readings from the deflections of the whiskers. Our results show that our algorithm is able to track contact points with sub-millimeter accuracy, outperforming a baseline method. Finally, we demonstrate our sensor and perception method in a real-world system where a robot moves in between free-standing objects and uses the whisker sensors to track contacts tracing object contours.
Neural Distortion Fields for Spatial Calibration of Wide Field-of-View Near-Eye Displays
Authors: Yuichi Hiroi, Kiyosato Someya, Yuta Itoh
Subjects: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
Abstract
We propose a spatial calibration method for wide Field-of-View (FoV) Near-Eye Displays (NEDs) with complex image distortions. Image distortions in NEDs can destroy the reality of the virtual object and cause sickness. To achieve distortion-free images in NEDs, it is necessary to establish a pixel-by-pixel correspondence between the viewpoint and the displayed image. Designing compact and wide-FoV NEDs requires complex optical designs. In such designs, the displayed images are subject to gaze-contingent, non-linear geometric distortions, which explicit geometric models can be difficult to represent or computationally intensive to optimize. To solve these problems, we propose Neural Distortion Field (NDF), a fully-connected deep neural network that implicitly represents display surfaces complexly distorted in spaces. NDF takes spatial position and gaze direction as input and outputs the display pixel coordinate and its intensity as perceived in the input gaze direction. We synthesize the distortion map from a novel viewpoint by querying points on the ray from the viewpoint and computing a weighted sum to project output display coordinates into an image. Experiments showed that NDF calibrates an augmented reality NED with 90$^{\circ}$ FoV with about 3.23 pixel (5.8 arcmin) median error using only 8 training viewpoints. Additionally, we confirmed that NDF calibrates more accurately than the non-linear polynomial fitting, especially around the center of the FoV.
Hard Gate Knowledge Distillation -- Leverage Calibration for Robust and Reliable Language Model
Authors: Dongkyu Lee, Zhiliang Tian, Yingxiu Zhao, Ka Chun Cheung, Nevin L. Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Abstract
In knowledge distillation, a student model is trained with supervisions from both knowledge from a teacher and observations drawn from a training data distribution. Knowledge of a teacher is considered a subject that holds inter-class relations which send a meaningful supervision to a student; hence, much effort has been put to find such knowledge to be distilled. In this paper, we explore a question that has been given little attention: "when to distill such knowledge." The question is answered in our work with the concept of model calibration; we view a teacher model not only as a source of knowledge but also as a gauge to detect miscalibration of a student. This simple and yet novel view leads to a hard gate knowledge distillation scheme that switches between learning from a teacher model and training data. We verify the gating mechanism in the context of natural language generation at both the token-level and the sentence-level. Empirical comparisons with strong baselines show that hard gate knowledge distillation not only improves model generalization, but also significantly lowers model calibration error.
Federated Calibration and Evaluation of Binary Classifiers
Authors: Graham Cormode, Igor Markov
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract
We address two major obstacles to practical use of supervised classifiers on distributed private data. Whether a classifier was trained by a federation of cooperating clients or trained centrally out of distribution, (1) the output scores must be calibrated, and (2) performance metrics must be evaluated -- all without assembling labels in one place. In particular, we show how to perform calibration and compute precision, recall, accuracy and ROC-AUC in the federated setting under three privacy models (i) secure aggregation, (ii) distributed differential privacy, (iii) local differential privacy. Our theorems and experiments clarify tradeoffs between privacy, accuracy, and data efficiency. They also help decide whether a given application has sufficient data to support federated calibration and evaluation.
A study of uncertainty quantification in overparametrized high-dimensional models
Authors: Lucas Clarté, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová
Abstract
Uncertainty quantification is a central challenge in reliable and trustworthy machine learning. Naive measures such as last-layer scores are well-known to yield overconfident estimates in the context of overparametrized neural networks. Several methods, ranging from temperature scaling to different Bayesian treatments of neural networks, have been proposed to mitigate overconfidence, most often supported by the numerical observation that they yield better calibrated uncertainty measures. In this work, we provide a sharp comparison between popular uncertainty measures for binary classification in a mathematically tractable model for overparametrized neural networks: the random features model. We discuss a trade-off between classification accuracy and calibration, unveiling a double descent like behavior in the calibration curve of optimally regularized estimators as a function of overparametrization. This is in contrast with the empirical Bayes method, which we show to be well calibrated in our setting despite the higher generalization error and overparametrization.
GFlowOut: Dropout with Generative Flow Networks
Authors: Dianbo Liu, Moksh Jain, Bonaventure Dossou, Qianli Shen, Salem Lahlou, Anirudh Goyal, Nikolay Malkin, Chris Emezue, Dinghuai Zhang, Nadhir Hassen, Xu Ji, Kenji Kawaguchi, Yoshua Bengio
Abstract
Bayesian Inference offers principled tools to tackle many critical problems with modern neural networks such as poor calibration and generalization, and data inefficiency. However, scaling Bayesian inference to large architectures is challenging and requires restrictive approximations. Monte Carlo Dropout has been widely used as a relatively cheap way for approximate Inference and to estimate uncertainty with deep neural networks. Traditionally, the dropout mask is sampled independently from a fixed distribution. Recent works show that the dropout mask can be viewed as a latent variable, which can be inferred with variational inference. These methods face two important challenges: (a) the posterior distribution over masks can be highly multi-modal which can be difficult to approximate with standard variational inference and (b) it is not trivial to fully utilize sample-dependent information and correlation among dropout masks to improve posterior estimation. In this work, we propose GFlowOut to address these issues. GFlowOut leverages the recently proposed probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks. We empirically demonstrate that GFlowOut results in predictive distributions that generalize better to out-of-distribution data, and provide uncertainty estimates which lead to better performance in downstream tasks.
Are Deep Sequence Classifiers Good at Non-Trivial Generalization?
Authors: Francesco Cazzaro, Ariadna Quattoni, Xavier Carreras
Abstract
Recent advances in deep learning models for sequence classification have greatly improved their classification accuracy, specially when large training sets are available. However, several works have suggested that under some settings the predictions made by these models are poorly calibrated. In this work we study binary sequence classification problems and we look at model calibration from a different perspective by asking the question: Are deep learning models capable of learning the underlying target class distribution? We focus on sparse sequence classification, that is problems in which the target class is rare and compare three deep learning sequence classification models. We develop an evaluation that measures how well a classifier is learning the target class distribution. In addition, our evaluation disentangles good performance achieved by mere compression of the training sequences versus performance achieved by proper model generalization. Our results suggest that in this binary setting the deep-learning models are indeed able to learn the underlying class distribution in a non-trivial manner, i.e. by proper generalization beyond data compression.
Keyword: out of distribution detection
There is no result
Keyword: out-of-distribution detection
Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition
Keyword: expected calibration error
There is no result
Keyword: overconfident
Augmentation by Counterfactual Explanation -- Fixing an Overconfident Classifier
A study of uncertainty quantification in overparametrized high-dimensional models
Keyword: overconfidence
A study of uncertainty quantification in overparametrized high-dimensional models
Keyword: confidence
Probing with Noise: Unpicking the Warp and Weft of Embeddings
Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition
On the Calibration of Massively Multilingual Language Models
HuPR: A Benchmark for Human Pose Estimation Using Millimeter Wave Radar
Fast Beam Alignment via Pure Exploration in Multi-armed Bandits
A study of uncertainty quantification in overparametrized high-dimensional models
Learning Neural Radiance Fields from Multi-View Geometry
Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data
ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation
Reliability-Aware Prediction via Uncertainty Learning for Person Image Retrieval
Keyword: scaling
A New Perspective for Understanding Generalization Gap of Deep Neural Networks Trained with Large Batch Sizes
On the Calibration of Massively Multilingual Language Models
Exploring Representation-Level Augmentation for Code Search
A study of uncertainty quantification in overparametrized high-dimensional models
To Signal or Not to Signal? Layering Traffic Analysis Resistance on Secure Instant Messaging
GFlowOut: Dropout with Generative Flow Networks
Reachability-Aware Laplacian Representation in Reinforcement Learning
Keyword: calibration
On the Calibration of Massively Multilingual Language Models
Whisker-Inspired Tactile Sensing for Contact Localization on Robot Manipulators
Neural Distortion Fields for Spatial Calibration of Wide Field-of-View Near-Eye Displays
Hard Gate Knowledge Distillation -- Leverage Calibration for Robust and Reliable Language Model
Federated Calibration and Evaluation of Binary Classifiers
A study of uncertainty quantification in overparametrized high-dimensional models
GFlowOut: Dropout with Generative Flow Networks
Are Deep Sequence Classifiers Good at Non-Trivial Generalization?