Abstract
We propose to utilize gradients for detecting adversarial and out-of-distribution samples. We introduce confounding labels -- labels that differ from normal labels seen during training -- in gradient generation to probe the effective expressivity of neural networks. Gradients depict the amount of change required for a model to properly represent given inputs, providing insight into the representational power of the model established by network architectural properties as well as training data. By introducing a label of different design, we remove the dependency on ground truth labels for gradient generation during inference. We show that our gradient-based approach allows for capturing the anomaly in inputs based on the effective expressivity of the models with no hyperparameter tuning or additional processing, and outperforms state-of-the-art methods for adversarial and out-of-distribution detection.
Keyword: expected calibration error
There is no result
Keyword: overconfident
On Calibrated Model Uncertainty in Deep Learning
Authors: Biraja Ghoshal, Allan Tucker
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract
Estimated uncertainty by approximate posteriors in Bayesian neural networks are prone to miscalibration, which leads to overconfident predictions in critical tasks that have a clear asymmetric cost or significant losses. Here, we extend the approximate inference for the loss-calibrated Bayesian framework to dropweights based Bayesian neural networks by maximising expected utility over a model posterior to calibrate uncertainty in deep learning. Furthermore, we show that decisions informed by loss-calibrated uncertainty can improve diagnostic performance to a greater extent than straightforward alternatives. We propose Maximum Uncertainty Calibration Error (MUCE) as a metric to measure calibrated confidence, in addition to its prediction especially for high-risk applications, where the goal is to minimise the worst-case deviation between error and estimated uncertainty. In experiments, we show the correlation between error in prediction and estimated uncertainty by interpreting Wasserstein distance as the accuracy of prediction. We evaluated the effectiveness of our approach to detecting Covid-19 from X-Ray images. Experimental results show that our method reduces miscalibration considerably, without impacting the models accuracy and improves reliability of computer-based diagnostics.
Keyword: overconfidence
There is no result
Keyword: confidence
On Calibrated Model Uncertainty in Deep Learning
Authors: Biraja Ghoshal, Allan Tucker
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract
Estimated uncertainty by approximate posteriors in Bayesian neural networks are prone to miscalibration, which leads to overconfident predictions in critical tasks that have a clear asymmetric cost or significant losses. Here, we extend the approximate inference for the loss-calibrated Bayesian framework to dropweights based Bayesian neural networks by maximising expected utility over a model posterior to calibrate uncertainty in deep learning. Furthermore, we show that decisions informed by loss-calibrated uncertainty can improve diagnostic performance to a greater extent than straightforward alternatives. We propose Maximum Uncertainty Calibration Error (MUCE) as a metric to measure calibrated confidence, in addition to its prediction especially for high-risk applications, where the goal is to minimise the worst-case deviation between error and estimated uncertainty. In experiments, we show the correlation between error in prediction and estimated uncertainty by interpreting Wasserstein distance as the accuracy of prediction. We evaluated the effectiveness of our approach to detecting Covid-19 from X-Ray images. Experimental results show that our method reduces miscalibration considerably, without impacting the models accuracy and improves reliability of computer-based diagnostics.
High-Resolution Bathymetric Reconstruction From Sidescan Sonar With Deep Neural Networks
Abstract
We propose a novel data-driven approach for high-resolution bathymetric reconstruction from sidescan. Sidescan sonar (SSS) intensities as a function of range do contain some information about the slope of the seabed. However, that information must be inferred. Additionally, the navigation system provides the estimated trajectory, and normally the altitude along this trajectory is also available. From these we obtain a very coarse seabed bathymetry as an input. This is then combined with the indirect but high-resolution seabed slope information from the sidescan to estimate the full bathymetry. This sparse depth could be acquired by single-beam echo sounder, Doppler Velocity Log (DVL), other bottom tracking sensors or bottom tracking algorithm from sidescan itself. In our work, a fully convolutional network is used to estimate the depth contour and its aleatoric uncertainty from the sidescan images and sparse depth in an end-to-end fashion. The estimated depth is then used together with the range to calculate the point's 3D location on the seafloor. A high-quality bathymetric map can be reconstructed after fusing the depth predictions and the corresponding confidence measures from the neural networks. We show the improvement of the bathymetric map gained by using sparse depths with sidescan over estimates with sidescan alone. We also show the benefit of confidence weighting when fusing multiple bathymetric estimates into a single map.
Abstract
Causal bandit problem integrates causal inference with multi-armed bandits. The pure exploration of causal bandits is the following online learning task: given a causal graph with unknown causal inference distributions, in each round we can choose to either intervene one variable or do no intervention, and observe the random outcomes of all random variables, with the goal that using as few rounds as possible, we can output an intervention that gives the best (or almost best) expected outcome on the reward variable $Y$ with probability at least $1-\delta$, where $\delta$ is a given confidence level. We provide first gap-dependent fully adaptive pure exploration algorithms on three types of causal models including parallel graphs, general graphs with small number of backdoor parents, and binary generalized linear models. Our algorithms improve both prior causal bandit algorithms, which are not adaptive to reward gaps, and prior adaptive pure exploration algorithms, which do not utilize the special features of causal bandits.
Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination
Abstract
The learned policy of model-free offline reinforcement learning (RL) methods is often constrained to stay within the support of datasets to avoid possible dangerous out-of-distribution actions or states, making it challenging to handle out-of-support region. Model-based RL methods offer a richer dataset and benefit generalization by generating imaginary trajectories with either trained forward or reverse dynamics model. However, the imagined transitions may be inaccurate, thus downgrading the performance of the underlying offline RL method. In this paper, we propose to augment the offline dataset by using trained bidirectional dynamics models and rollout policies with double check. We introduce conservatism by trusting samples that the forward model and backward model agree on. Our method, confidence-aware bidirectional offline model-based imagination, generates reliable samples and can be combined with any model-free offline RL method. Experimental results on the D4RL benchmarks demonstrate that our method significantly boosts the performance of existing model-free offline RL algorithms and achieves competitive or better scores against baseline methods.
Abstract
This paper studies formal synthesis of controllers for continuous-space systems with unknown dynamics to satisfy requirements expressed as linear temporal logic formulas. Formal abstraction-based synthesis schemes rely on a precise mathematical model of the system to build a finite abstract model, which is then used to design a controller. The abstraction-based schemes are not applicable when the dynamics of the system are unknown. We propose a data-driven approach that computes the growth bound of the system using a finite number of trajectories. The growth bound together with the sampled trajectories are then used to construct the abstraction and synthesise a controller. Our approach casts the computation of the growth bound as a robust convex optimisation program (RCP). Since the unknown dynamics appear in the optimisation, we formulate a scenario convex program (SCP) corresponding to the RCP using a finite number of sampled trajectories. We establish a sample complexity result that gives a lower bound for the number of sampled trajectories to guarantee the correctness of the growth bound computed from the SCP with a given confidence. We also provide a sample complexity result for the satisfaction of the specification on the system in closed loop with the designed controller for a given confidence. Our results are founded on estimating a bound on the Lipschitz constant of the system and provide guarantees on satisfaction of both finite and infinite-horizon specifications. We show that our data-driven approach can be readily used as a model-free abstraction refinement scheme by modifying the formulation of the growth bound and providing similar sample complexity results. The performance of our approach is shown on three case studies.
Self-Adaptive Label Augmentation for Semi-supervised Few-shot Classification
Authors: Xueliang Wang, Jianyu Cai, Shuiwang Ji, Houqiang Li, Feng Wu, Jie Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Abstract
Few-shot classification aims to learn a model that can generalize well to new tasks when only a few labeled samples are available. To make use of unlabeled data that are more abundantly available in real applications, Ren et al. \shortcite{ren2018meta} propose a semi-supervised few-shot classification method that assigns an appropriate label to each unlabeled sample by a manually defined metric. However, the manually defined metric fails to capture the intrinsic property in data. In this paper, we propose a \textbf{S}elf-\textbf{A}daptive \textbf{L}abel \textbf{A}ugmentation approach, called \textbf{SALA}, for semi-supervised few-shot classification. A major novelty of SALA is the task-adaptive metric, which can learn the metric adaptively for different tasks in an end-to-end fashion. Another appealing feature of SALA is a progressive neighbor selection strategy, which selects unlabeled data with high confidence progressively through the training phase. Experiments demonstrate that SALA outperforms several state-of-the-art methods for semi-supervised few-shot classification on benchmark datasets.
Censer: Curriculum Semi-supervised Learning for Speech Recognition Based on Self-supervised Pre-training
Abstract
Recent studies have shown that the benefits provided by self-supervised pre-training and self-training (pseudo-labeling) are complementary. Semi-supervised fine-tuning strategies under the pre-training framework, however, remain insufficiently studied. Besides, modern semi-supervised speech recognition algorithms either treat unlabeled data indiscriminately or filter out noisy samples with a confidence threshold. The dissimilarities among different unlabeled data are often ignored. In this paper, we propose Censer, a semi-supervised speech recognition algorithm based on self-supervised pre-training to maximize the utilization of unlabeled data. The pre-training stage of Censer adopts wav2vec2.0 and the fine-tuning stage employs an improved semi-supervised learning algorithm from slimIPL, which leverages unlabeled data progressively according to their pseudo labels' qualities. We also incorporate a temporal pseudo label pool and an exponential moving average to control the pseudo labels' update frequency and to avoid model divergence. Experimental results on Libri-Light and LibriSpeech datasets manifest our proposed method achieves better performance compared to existing approaches while being more unified.
Balancing Cost and Quality: An Exploration of Human-in-the-loop Frameworks for Automated Short Answer Scoring
Abstract
Short answer scoring (SAS) is the task of grading short text written by a learner. In recent years, deep-learning-based approaches have substantially improved the performance of SAS models, but how to guarantee high-quality predictions still remains a critical issue when applying such models to the education field. Towards guaranteeing high-quality predictions, we present the first study of exploring the use of human-in-the-loop framework for minimizing the grading cost while guaranteeing the grading quality by allowing a SAS model to share the grading task with a human grader. Specifically, by introducing a confidence estimation method for indicating the reliability of the model predictions, one can guarantee the scoring quality by utilizing only predictions with high reliability for the scoring results and casting predictions with low reliability to human graders. In our experiments, we investigate the feasibility of the proposed framework using multiple confidence estimation methods and multiple SAS datasets. We find that our human-in-the-loop framework allows automatic scoring models and human graders to achieve the target scoring quality.
Keyword: scaling
Edge Inference with Fully Differentiable Quantized Mixed Precision Neural Networks
Authors: Clemens JS Schaefer, Siddharth Joshi, Shan Li, Raul Blazquez
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Abstract
The large computing and memory cost of deep neural networks (DNNs) often precludes their use in resource-constrained devices. Quantizing the parameters and operations to lower bit-precision offers substantial memory and energy savings for neural network inference, facilitating the use of DNNs on edge computing platforms. Recent efforts at quantizing DNNs have employed a range of techniques encompassing progressive quantization, step-size adaptation, and gradient scaling. This paper proposes a new quantization approach for mixed precision convolutional neural networks (CNNs) targeting edge-computing. Our method establishes a new pareto frontier in model accuracy and memory footprint demonstrating a range of quantized models, delivering best-in-class accuracy below 4.3 MB of weights (wgts.) and activations (acts.). Our main contributions are: (i) hardware-aware heterogeneous differentiable quantization with tensor-sliced learned precision, (ii) targeted gradient modification for wgts. and acts. to mitigate quantization errors, and (iii) a multi-phase learning schedule to address instability in learning arising from updates to the learned quantizer and model parameters. We demonstrate the effectiveness of our techniques on the ImageNet dataset across a range of models including EfficientNet-Lite0 (e.g., 4.14MB of wgts. and acts. at 67.66% accuracy) and MobileNetV2 (e.g., 3.51MB wgts. and acts. at 65.39% accuracy).
SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
Authors: Gamaleldin F. Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste, Klaus Greff, Michael C. Mozer, Thomas Kipf
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Abstract
The visual world can be parsimoniously characterized in terms of distinct entities with sparse interactions. Discovering this compositional structure in dynamic visual scenes has proven challenging for end-to-end computer vision approaches unless explicit instance-level supervision is provided. Slot-based models leveraging motion cues have recently shown great promise in learning to represent, segment, and track objects without direct supervision, but they still fail to scale to complex real-world multi-object videos. In an effort to bridge this gap, we take inspiration from human development and hypothesize that information about scene geometry in the form of depth signals can facilitate object-centric learning. We introduce SAVi++, an object-centric video model which is trained to predict depth signals from a slot-based video representation. By further leveraging best practices for model scaling, we are able to train SAVi++ to segment complex dynamic scenes recorded with moving cameras, containing both static and moving objects of diverse appearance on naturalistic backgrounds, without the need for segmentation supervision. Finally, we demonstrate that by using sparse depth signals obtained from LiDAR, SAVi++ is able to learn emergent object segmentation and tracking from videos in the real-world Waymo Open dataset.
Double Sampling Randomized Smoothing
Authors: Linyi Li, Jiawei Zhang, Tao Xie, Bo Li
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Statistics Theory (math.ST)
Abstract
Neural networks (NNs) are known to be vulnerable against adversarial perturbations, and thus there is a line of work aiming to provide robustness certification for NNs, such as randomized smoothing, which samples smoothing noises from a certain distribution to certify the robustness for a smoothed classifier. However, as previous work shows, the certified robust radius in randomized smoothing suffers from scaling to large datasets ("curse of dimensionality"). To overcome this hurdle, we propose a Double Sampling Randomized Smoothing (DSRS) framework, which exploits the sampled probability from an additional smoothing distribution to tighten the robustness certification of the previous smoothed classifier. Theoretically, under mild assumptions, we prove that DSRS can certify $\Theta(\sqrt d)$ robust radius under $\ell_2$ norm where $d$ is the input dimension, which implies that DSRS may be able to break the curse of dimensionality of randomized smoothing. We instantiate DSRS for a generalized family of Gaussian smoothing and propose an efficient and sound computing method based on customized dual optimization considering sampling error. Extensive experiments on MNIST, CIFAR-10, and ImageNet verify our theory and show that DSRS certifies larger robust radii than existing baselines consistently under different settings. Code is available at https://github.com/llylly/DSRS.
Neural tangent kernel analysis of shallow $α$-Stable ReLU neural networks
Abstract
There is a recent literature on large-width properties of Gaussian neural networks (NNs), i.e. NNs whose weights are distributed according to Gaussian distributions. Two popular problems are: i) the study of the large-width behaviour of NNs, which provided a characterization of the infinitely wide limit of a rescaled NN in terms of a Gaussian process; ii) the study of the large-width training dynamics of NNs, which set forth an equivalence between training the rescaled NN and performing a kernel regression with a deterministic kernel referred to as the neural tangent kernel (NTK). In this paper, we consider these problems for $\alpha$-Stable NNs, which generalize Gaussian NNs by assuming that the NN's weights are distributed as $\alpha$-Stable distributions with $\alpha\in(0,2]$, i.e. distributions with heavy tails. For shallow $\alpha$-Stable NNs with a ReLU activation function, we show that if the NN's width goes to infinity then a rescaled NN converges weakly to an $\alpha$-Stable process, i.e. a stochastic process with $\alpha$-Stable finite-dimensional distributions. As a novelty with respect to the Gaussian setting, in the $\alpha$-Stable setting the choice of the activation function affects the scaling of the NN, that is: to achieve the infinitely wide $\alpha$-Stable process, the ReLU function requires an additional logarithmic scaling with respect to sub-linear functions. Then, our main contribution is the NTK analysis of shallow $\alpha$-Stable ReLU-NNs, which leads to an equivalence between training a rescaled NN and performing a kernel regression with an $(\alpha/2)$-Stable random kernel. The randomness of such a kernel is a further novelty with respect to the Gaussian setting, that is: in the $\alpha$-Stable setting the randomness of the NN at initialization does not vanish in the NTK analysis, thus inducing a distribution for the kernel of the underlying kernel regression.
Abstract
Multilinear algebra kernel performance on modern massively-parallel systems is determined mainly by data movement. However, deriving data movement-optimal distributed schedules for programs with many high-dimensional inputs is a notoriously hard problem. State-of-the-art libraries rely on heuristics and often fall back to suboptimal tensor folding and BLAS calls. We present Deinsum, an automated framework for distributed multilinear algebra computations expressed in Einstein notation, based on rigorous mathematical tools to address this problem. Our framework automatically derives data movement-optimal tiling and generates corresponding distributed schedules, further optimizing the performance of local computations by increasing their arithmetic intensity. To show the benefits of our approach, we test it on two important tensor kernel classes: Matricized Tensor Times Khatri-Rao Products and Tensor Times Matrix chains. We show performance results and scaling on the Piz Daint supercomputer, with up to 19x speedup over state-of-the-art solutions on 512 nodes.
On Scaled Methods for Saddle Point Problems
Authors: Aleksandr Beznosikov, Aibek Alanov, Dmitry Kovalev, Martin Takáč, Alexander Gasnikov
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Abstract
Methods with adaptive scaling of different features play a key role in solving saddle point problems, primarily due to Adam's popularity for solving adversarial machine learning problems, including GANS training. This paper carries out a theoretical analysis of the following scaling techniques for solving SPPs: the well-known Adam and RmsProp scaling and the newer AdaHessian and OASIS based on Hutchison approximation. We use the Extra Gradient and its improved version with negative momentum as the basic method. Experimental studies on GANs show good applicability not only for Adam, but also for other less popular methods.
Scalable First-Order Bayesian Optimization via Structured Automatic Differentiation
Authors: Sebastian Ament, Carla Gomes
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Mathematical Software (cs.MS); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract
Bayesian Optimization (BO) has shown great promise for the global optimization of functions that are expensive to evaluate, but despite many successes, standard approaches can struggle in high dimensions. To improve the performance of BO, prior work suggested incorporating gradient information into a Gaussian process surrogate of the objective, giving rise to kernel matrices of size $nd \times nd$ for $n$ observations in $d$ dimensions. Na\"ively multiplying with (resp. inverting) these matrices requires $\mathcal{O}(n^2d^2)$ (resp. $\mathcal{O}(n^3d^3$)) operations, which becomes infeasible for moderate dimensions and sample sizes. Here, we observe that a wide range of kernels gives rise to structured matrices, enabling an exact $\mathcal{O}(n^2d)$ matrix-vector multiply for gradient observations and $\mathcal{O}(n^2d^2)$ for Hessian observations. Beyond canonical kernel classes, we derive a programmatic approach to leveraging this type of structure for transformations and combinations of the discussed kernel classes, which constitutes a structure-aware automatic differentiation algorithm. Our methods apply to virtually all canonical kernels and automatically extend to complex kernels, like the neural network, radial basis function network, and spectral mixture kernels without any additional derivations, enabling flexible, problem-dependent modeling while scaling first-order BO to high $d$.
Keyword: calibration
On Calibrated Model Uncertainty in Deep Learning
Authors: Biraja Ghoshal, Allan Tucker
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Abstract
Estimated uncertainty by approximate posteriors in Bayesian neural networks are prone to miscalibration, which leads to overconfident predictions in critical tasks that have a clear asymmetric cost or significant losses. Here, we extend the approximate inference for the loss-calibrated Bayesian framework to dropweights based Bayesian neural networks by maximising expected utility over a model posterior to calibrate uncertainty in deep learning. Furthermore, we show that decisions informed by loss-calibrated uncertainty can improve diagnostic performance to a greater extent than straightforward alternatives. We propose Maximum Uncertainty Calibration Error (MUCE) as a metric to measure calibrated confidence, in addition to its prediction especially for high-risk applications, where the goal is to minimise the worst-case deviation between error and estimated uncertainty. In experiments, we show the correlation between error in prediction and estimated uncertainty by interpreting Wasserstein distance as the accuracy of prediction. We evaluated the effectiveness of our approach to detecting Covid-19 from X-Ray images. Experimental results show that our method reduces miscalibration considerably, without impacting the models accuracy and improves reliability of computer-based diagnostics.
Domain Generalization via Selective Consistency Regularization for Time Series Classification
Abstract
Domain generalization methods aim to learn models robust to domain shift with data from a limited number of source domains and without access to target domain samples during training. Popular domain alignment methods for domain generalization seek to extract domain-invariant features by minimizing the discrepancy between feature distributions across all domains, disregarding inter-domain relationships. In this paper, we instead propose a novel representation learning methodology that selectively enforces prediction consistency between source domains estimated to be closely-related. Specifically, we hypothesize that domains share different class-informative representations, so instead of aligning all domains which can cause negative transfer, we only regularize the discrepancy between closely-related domains. We apply our method to time-series classification tasks and conduct comprehensive experiments on three public real-world datasets. Our method significantly improves over the baseline and achieves better or competitive performance in comparison with state-of-the-art methods in terms of both accuracy and model calibration.
PROFHIT: Probabilistic Robust Forecasting for Hierarchical Time-series
Authors: Harshavardhan Kamarthi, Lingkai Kong, Alexander Rodríguez, Chao Zhang, B. Aditya Prakash
Abstract
Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting, where the goal is to model and forecast multivariate time-series that have underlying hierarchical relations. Most methods focus on point predictions and do not provide well-calibrated probabilistic forecasts distributions. Recent state-of-art probabilistic forecasting methods also impose hierarchical relations on point predictions and samples of distribution which does not account for coherency of forecast distributions. Previous works also silently assume that datasets are always consistent with given hierarchical relations and do not adapt to real-world datasets that show deviation from this assumption. We close both these gaps and propose PROFHIT, which is a fully probabilistic hierarchical forecasting model that jointly models forecast distribution of entire hierarchy. PROFHIT uses a flexible probabilistic Bayesian approach and introduces a novel Distributional Coherency regularization to learn from hierarchical relations for entire forecast distribution that enables robust and calibrated forecasts as well as adapt to datasets of varying hierarchical consistency. On evaluating PROFHIT over wide range of datasets, we observed 41-88% better performance in accuracy and calibration. Due to modeling the coherency over full distribution, we observed that PROFHIT can robustly provide reliable forecasts even if up to 10% of input time-series data is missing where other methods' performance severely degrade by over 70%.
Keyword: out of distribution detection
There is no result
Keyword: out-of-distribution detection
Gradient-Based Adversarial and Out-of-Distribution Detection
Keyword: expected calibration error
There is no result
Keyword: overconfident
On Calibrated Model Uncertainty in Deep Learning
Keyword: overconfidence
There is no result
Keyword: confidence
On Calibrated Model Uncertainty in Deep Learning
High-Resolution Bathymetric Reconstruction From Sidescan Sonar With Deep Neural Networks
Pure Exploration of Causal Bandits
Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination
Data-Driven Abstraction-Based Control Synthesis
Self-Adaptive Label Augmentation for Semi-supervised Few-shot Classification
Censer: Curriculum Semi-supervised Learning for Speech Recognition Based on Self-supervised Pre-training
Balancing Cost and Quality: An Exploration of Human-in-the-loop Frameworks for Automated Short Answer Scoring
Keyword: scaling
Edge Inference with Fully Differentiable Quantized Mixed Precision Neural Networks
SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
Double Sampling Randomized Smoothing
Neural tangent kernel analysis of shallow $α$-Stable ReLU neural networks
Deinsum: Practically I/O Optimal Multilinear Algebra
On Scaled Methods for Saddle Point Problems
Scalable First-Order Bayesian Optimization via Structured Automatic Differentiation
Keyword: calibration
On Calibrated Model Uncertainty in Deep Learning
Domain Generalization via Selective Consistency Regularization for Time Series Classification
PROFHIT: Probabilistic Robust Forecasting for Hierarchical Time-series