MengShen0709 / bmmal

[ACMMM 2023] BMMAL: Towards Balanced Active Learning for Multimodal Classification
https://arxiv.org/abs/2306.08306
Creative Commons Attribution 4.0 International

regression task #3

Open qwrazdf opened 6 months ago

qwrazdf commented 6 months ago

Great work! Can this method be used for regression tasks where the loss function is L1?

MengShen0709 commented 6 months ago

Good question. I have thought about this before. I think that with a plain L1 or MSE loss it is hard to calculate the uncertainty. To work around this, here are my ideas:

Assume you have a neural network whose last layer produces two values for your regression prediction, $\mu_i$ and $\sigma_i^2$, the mean and the variance (make sure the variance is larger than 0 by using an activation function such as sigmoid).

One could use the NLL loss function to train your network: $L = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{2}\log(2\pi \sigma_i^2) + \frac{(y_i-\mu_i)^2}{2\sigma_i^2}\right)$
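
For reference, PyTorch ships this objective as `torch.nn.GaussianNLLLoss`; a minimal sketch (the tensors here are illustrative stand-ins for the network's outputs and targets):

```python
import torch

# Illustrative tensors: mu and var are the network's two outputs, y the targets.
mu = torch.randn(8, 1)
var = torch.rand(8, 1) + 1e-6          # the variance must stay positive
y = torch.randn(8, 1)

# full=True keeps the 0.5 * log(2*pi) constant so it matches the formula above.
nll = torch.nn.GaussianNLLLoss(full=True, reduction="mean")
loss = nll(mu, y, var)

# Equivalent manual form of L:
manual = (0.5 * torch.log(2 * torch.pi * var) + (y - mu) ** 2 / (2 * var)).mean()
```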

Our last layer will be a linear layer with weight matrix $W = [W_1, W_2]$ of size (emb_size, 2) and bias $b = [b_1, b_2]$ of size (2,). The mean and the variance are calculated as $\mu_i = W_1 \cdot z(x_i) + b_1$ and $\sigma_i^2 = W_2 \cdot z(x_i) + b_2$ (followed by the positivity activation), where $z(x_i)$ is the feature embedding, and we treat $\mu_i$ as the pseudo label $\hat{y}_i$.
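
A minimal PyTorch sketch of such a head (`emb_size` and the class name are illustrative; sigmoid keeps the variance positive, as suggested above):

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    # Linear head W = [W_1, W_2]: one output column for the mean, one for the variance.
    def __init__(self, emb_size: int):
        super().__init__()
        # nn.Linear stores the weight as (2, emb_size), i.e. W^T, with bias of size (2,).
        self.fc = nn.Linear(emb_size, 2)

    def forward(self, z):                  # z: (batch, emb_size) feature embedding z(x_i)
        out = self.fc(z)
        mu = out[:, 0:1]                   # mu_i = W_1 . z(x_i) + b_1
        var = torch.sigmoid(out[:, 1:2])   # sigma_i^2 > 0 via sigmoid
        return mu, var
```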

So, substituting the pseudo label $y_i = \hat{y}_i = \mu_i$ (which makes the residual term vanish) and ignoring the Jacobian of the variance activation, the gradients of $L_i$ with respect to $W_1$ and $W_2$ will be: $\frac{\partial L_i}{\partial W_1} = \frac{\partial L_i}{\partial \mu_i}\frac{\partial \mu_i}{\partial W_1} = 0$ and $\frac{\partial L_i}{\partial W_2} = \frac{\partial L_i}{\partial \sigma_i^2}\frac{\partial \sigma_i^2}{\partial W_2} = \frac{1}{2\sigma_i^2} \cdot z(x_i)$

As you can see, the gradient embedding of $W_1$ is always zero, which is somewhat odd. However, the l2 norm of the gradient embedding of $W_2$ scales with the predicted variance (inversely, under this parameterization), so it still reflects the uncertainty of the prediction, and the gradient embedding is aligned with the feature embedding $z(x_i)$, which can be used to capture diversity. We could use the gradient embedding of $W_2$ to perform the BADGE algorithm and our proposed algorithm.
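
To make that concrete, here is a hedged sketch (function and variable names are my own assumptions, not this repo's API) that builds the closed-form $W_2$ gradient embeddings $z(x_i)/(2\sigma_i^2)$ over an unlabeled pool, which could then be fed to BADGE's k-means++ seeding or our proposed sampler:

```python
import torch

@torch.no_grad()
def w2_gradient_embeddings(backbone, head, pool_loader, device="cpu"):
    # Closed-form per-sample gradient dL_i/dW_2 = z(x_i) / (2 * sigma_i^2),
    # using the pseudo label y_i = mu_i (so the residual term vanishes) and
    # ignoring the variance activation's Jacobian, as in the derivation above.
    grads = []
    for x in pool_loader:
        z = backbone(x.to(device))     # feature embedding z(x_i): (batch, emb_size)
        _, var = head(z)               # predicted variance sigma_i^2: (batch, 1)
        grads.append(z / (2.0 * var))  # norm shrinks as the predicted variance grows
    return torch.cat(grads)            # pass to k-means++ seeding as in BADGE / BMMAL
```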

Feel free to discuss with me!