ami-iit / element_human-action-intention-recognition


Add covariance of the prediction to RNN #11

Closed kouroshD closed 4 years ago

kouroshD commented 4 years ago

In this issue, I would like to explore the state of the art for predicting not only the future motion but also the covariance of the prediction, i.e., how much we can trust the predicted time series. The first idea comes from a talk given by Davide Scaramuzza in Erzelli some months back, in which they estimate the uncertainty of the prediction.

kouroshD commented 4 years ago

In this comment, I provide a brief description of the literature on predicting the uncertainty. The description is based on the following paper:

Consider pairs of input ($x$) and output (target $t$) vectors $(x, t)$; we have: $$ (1) \; t_i(x)= f_i(x) + e_i(x); \quad i = 1, ..., N $$ in which $t_i(x)$ is the measured target, $e_i(x)$ is the noise, and $f(x)$ is the true regression. $e_i(x)$ is independently and identically distributed. The true regression mean $\hat{y}_i$ is estimated by the NN as follows: $$ (2) \; \hat{y}_i = \phi (x_i; T) $$ where $T$ is the training set and $\phi$ is the nonlinear mapping. Using (1) and (2) we have: $$ (3) \; t_i(x)- \hat{y}_i= f_i(x) - \hat{y}_i + e_i(x) $$ The prediction interval (PI) quantifies the uncertainty associated with the difference between the measured and predicted values, i.e., it relates to the probability distribution $P(t_i \mid \hat{y}_i)$.

There are four different methods to approximate the PI:

Delta Method

Let's consider $ y_i =f(x_i, \omega^{\ast}) $ where $ \omega^{\ast} $ is the set of optimal weights.

In its neighborhood we will have

$$ (4) \hat{y}_i = f(x_i, \omega^{\ast})+ g_i^T (\hat{\omega} - \omega^{\ast}) $$

in which $g_i = \frac{d f(x_i, \omega^{\ast})}{d \omega^{\ast}}$. Using equations (3) and (4) we have: $$ t_i(x)- \hat{y}_i= (y_i + e_i(x)) -(f(x_i, \omega^{\ast})+ g_i^T (\hat{\omega} - \omega^{\ast})) = e_i(x) - g_i^T (\hat{\omega} - \omega^{\ast}) $$

There is an implicit assumption here, namely that $y_i=f(x_i, \omega^{\ast})$. To me this assumption is an approximation and may not be valid. So we will have:

$$ var(t_i(x)- \hat{y}_i)= var(e_i(x)) + var( g_i^T ( \hat{\omega} - \omega^{\ast} ) ) $$

Elaborating on this, we obtain the $100(1-\alpha)\%$ PI:

$$ \hat{y}_i { \pm } t^{n-p, 1 - \frac{\alpha}{2}} \sqrt{ 1 + g_i^T (F^T F)^{-1} g_i } $$

where $F$ is the Jacobian matrix of the NN model and $t^{n-p, 1 - \frac{\alpha}{2}}$ is the $1 - \frac{\alpha}{2}$ quantile of the cumulative t-distribution with $n-p$ degrees of freedom.
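As a minimal numpy/scipy sketch of this interval (assuming the per-sample gradient $g_i$ and the training Jacobian $F$ are already available); the `noise_std` factor is the usual noise-standard-deviation scaling of the delta method, and can be set to 1 to recover the formula exactly as written above:

```python
import numpy as np
from scipy import stats

def delta_method_pi(y_hat, g, F, n, p, alpha=0.05, noise_std=1.0):
    """Delta-method prediction interval for one sample (illustrative sketch).

    y_hat     : predicted mean for the sample
    g         : gradient of the network output w.r.t. the weights, shape (p,)
    F         : Jacobian of the outputs w.r.t. the weights over the training set, shape (n, p)
    noise_std : estimated noise standard deviation (1.0 reproduces the formula above)
    """
    t_quantile = stats.t.ppf(1.0 - alpha / 2.0, df=n - p)
    # half-width = t * s * sqrt(1 + g^T (F^T F)^{-1} g)
    half_width = t_quantile * noise_std * np.sqrt(1.0 + g @ np.linalg.solve(F.T @ F, g))
    return y_hat - half_width, y_hat + half_width
```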

Mean-Variance Estimation (MVE) Method

image

In this case, we assume a normally distributed error around $y_i$; we can identify the following cost function: $$ C_{MVE}= \frac{1}{2} \sum_{i=1}^{n} \left [ \ln( \hat{\sigma}_i^2 ) + \frac{ (t_i -\hat{y}_i)^2 }{ \hat{\sigma}_i^2 } \right ] $$

A three-phase training technique has been proposed. First, we identify the weights $\omega_y$ of the network that estimates the outputs, using an error-based cost function. Then we find $\omega_{\sigma}$ by minimizing the cost function introduced above. Finally, we resample the data and adjust both sets of network parameters simultaneously using the same cost function. The drawback is that it assumes the NN finds the true mean of the targets, i.e., $y_i$. In the end, it estimates the covariance of the noise in formula (1).
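A minimal sketch of $C_{MVE}$ as a Keras custom loss, assuming the network outputs the predicted means concatenated with the predicted log-variances (the log keeps the raw output unconstrained while $\hat{\sigma}^2$ stays positive):

```python
import tensorflow as tf

def mve_loss(y_true, y_pred):
    """C_MVE = 0.5 * [ ln(sigma^2) + (t - y_hat)^2 / sigma^2 ] (illustrative sketch).

    Assumes the network has 2*d outputs: the first d are the predicted means,
    the last d are the predicted log-variances.
    """
    d = y_pred.shape[-1] // 2
    mu = y_pred[..., :d]
    log_var = y_pred[..., d:]
    sq_err = tf.square(y_true - mu)
    return 0.5 * tf.reduce_mean(log_var + sq_err / tf.exp(log_var))
```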

Bootstrap Method

This is the most common method. Its schematic is as follows:

image

$B$ training datasets are resampled from the original dataset, and the mean and covariance of the outputs are computed, i.e., $\hat{y}_i$ and $\sigma^2_{\hat{y}_i}$ for the i-th sample.

To construct the PI, the variance of the errors is calculated using formula 1, i.e., $\sigma^2_{\hat{\epsilon}_i }$:

$$ \sigma^2_{\hat{\epsilon}} \simeq E\left[ (t-\hat{y})^2 \right] - \sigma^2_{\hat{y}} $$

Then, we define a cost function similar to that of MVE to estimate the values of $\sigma^2_{\hat{\epsilon}_i }$.
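A minimal numpy sketch of the ensemble part, assuming `models` is a list of $B$ networks trained on the resampled datasets, each exposing a `predict` method (names are illustrative):

```python
import numpy as np

def bootstrap_uncertainty(models, x, t):
    """Ensemble mean, model variance, and noise-variance target (illustrative sketch)."""
    preds = np.stack([m.predict(x) for m in models])  # shape (B, n_samples, n_outputs)
    y_hat = preds.mean(axis=0)                        # ensemble mean \hat{y}
    model_var = preds.var(axis=0, ddof=1)             # model uncertainty sigma^2_{\hat{y}}
    # residual-based target for the noise variance sigma^2_{epsilon}, clipped at zero
    noise_var_target = np.maximum((t - y_hat) ** 2 - model_var, 0.0)
    return y_hat, model_var, noise_var_target
```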

The last method to approximate is the Bayesian method, which is described in the paper.

kouroshD commented 4 years ago

Differently from the methods described in the previous comment, one point to consider is that our problem is a time-series problem. We define it as predicting the uncertainty of the predicted data with respect to their ground-truth values. The prediction is $$ \hat{y}^{t} = \phi (x^0 \mid TrainingSet); \quad 1<t<T $$ where $\hat{y}^{t}$ is the predicted target vector at time $t$.

We identify the uncertainty of the prediction at time $t$ using the MSE as follows: $$ s^t = (y^t- \hat{y}^{t})^T (y^t - \hat{y}^{t}) $$ And the objective will be: $$ \hat{s} = \arg\min \frac{1}{2} \sum_{t=1}^{T} \sum_{i=1}^{M} ( s^{t,i} - \hat{s}^{t,i} )^2 $$ where $i$ is the sample number and $t$ is the future time we want to predict. By summing over all the samples at time $t$, the output approximates the average of the MSE at each time $t$, i.e., an estimate of the covariance at each moment $t$.

However, we do not have $\hat{y}^{t}$ in advance. To resolve this problem, I was thinking of two different approaches:

The first approach, using an RNN to compute $\hat{s}^t$, is interesting, since the covariance at each moment $t$ will be a function of the hidden state at the previous time $a^{t-1}$, similar to the update of a Gauss-Markov stochastic process.

P.S. Similarly to the bootstrap method, we use an exponential activation in the output layer to enforce positive values for $\hat{s}^t$.
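A minimal Keras sketch of this idea; the GRU layer, layer sizes, and function names are illustrative assumptions, not the configuration used in the repository:

```python
import numpy as np
import tensorflow as tf

def uncertainty_targets(y_true, y_pred):
    """Per-time-step target s^t = (y^t - y_hat^t)^T (y^t - y_hat^t)."""
    return np.sum((y_true - y_pred) ** 2, axis=-1, keepdims=True)

def build_uncertainty_rnn(n_features, hidden_units=32):
    """RNN regressing \\hat{s}^t from the input sequence (illustrative sketch)."""
    model = tf.keras.Sequential([
        # the hidden state a^{t-1} carries the past, as in a Gauss-Markov update
        tf.keras.layers.GRU(hidden_units, return_sequences=True,
                            input_shape=(None, n_features)),
        # exponential activation enforces \hat{s}^t > 0, as noted in the P.S.
        tf.keras.layers.Dense(1, activation="exponential"),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```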

kouroshD commented 4 years ago

@DanielePucci @raffaello-camoriano Let me know what you think about the proposed approach w.r.t. the short literature review provided.

kouroshD commented 4 years ago

The implementation of what I have stated in the previous comment can be found in this commit https://github.com/dic-iit/element_human-action-intention-recognition/commit/3070f161837635003df2a60438905b0678391bb0 .

kouroshD commented 4 years ago

Using the test done in the paper Nix94, in this experiment I tried to predict the covariance of a dataset. I generated the following dataset using an amplitude-modulation equation: $$ f(x)=m(x) \sin(\omega_{c} x); \quad m(x)= \sin(\omega_{m} x) $$ The output $y$ is: $$ y= f(x) + n(x) $$ where $n(x)$ is zero-mean Gaussian noise with variance $\sigma^2(x)$ given by: $$ \sigma^2(x) = 0.02 + 0.02 \times (1-m(x))^2 $$ In the experiments we consider $\omega_c =5$ and $\omega_m =4$.
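A minimal numpy sketch of this dataset generation; the number of samples, input range, and random seed are assumptions for illustration:

```python
import numpy as np

def generate_dataset(n_samples=10000, omega_c=5.0, omega_m=4.0, seed=0):
    """Amplitude-modulated signal with input-dependent noise (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)      # input range is an assumption
    m = np.sin(omega_m * x)                                  # m(x) = sin(omega_m x)
    f = m * np.sin(omega_c * x)                              # f(x) = m(x) sin(omega_c x)
    sigma2 = 0.02 + 0.02 * (1.0 - m) ** 2                    # sigma^2(x)
    y = f + rng.normal(0.0, np.sqrt(sigma2))                 # y = f(x) + n(x)
    return x, f, y, sigma2
```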

kouroshD commented 4 years ago

In case the time-series prediction were perfect(!), it would predict $f(x)$. So, in this experiment, I do not consider the prediction of $y$; I only use $f(x)$ and $y$ to compute the covariance and try to predict it. In these experiments, I have considered the derivative of $y$ as well, with a Gaussian noise distribution similar to that of $y$. Here is how the dataset looks:

Figure_6

Figure_7

And the following is the input to the network:

Figure_4

Figure_5

And here are the results of the network:

Learning curve:

Figure_1

The real outputs and the estimated ones:

Figure_2

Figure_3

kouroshD commented 4 years ago

I have done a new test, this time using the predicted output and the ground-truth values to compute the uncertainty of the prediction. I used the parameters mentioned in this comment: https://github.com/dic-iit/element_human-action-intention-recognition/issues/11#issuecomment-614447712 . In the following figures, output0 is $y$ and output1 is the derivative of $y$. This time the covariance of $\dot{y}$ is half of the covariance of $y$. The network for predicting the output:

the learning curve: pred-LearningCurve

The outputs of the test set:

pred-output0 pred-output1

The network for predicting the uncertainty of the prediction:

the learning curve: uncer-LearningCurve

The output of the network to predict the uncertainty:

uncer-testSet_pred-output0 uncer-testSet_pred-output1

Here are some additional figures:

Training Dataset (original data):

TrainingSet-outpu0

TrainingSet-outpu1

Prediction of the training DataSet:

TrainingSet-prediction-outpu0 TrainingSet-prediction-outpu1

training dataset for the uncertainty, computed from the previous ones:

uncer-trainingSet-output0

uncer-trainingSet-output1

kouroshD commented 4 years ago

@DanielePucci What do you think? I think we can close this issue.

claudia-lat commented 4 years ago

CC @DanielePucci

DanielePucci commented 4 years ago

Sorry for being late on this. Please @kouroshD call for a meeting (1.5h, next week); in the meanwhile you can close it.