ShellingFord221 / My-implementation-of-What-Uncertainties-Do-We-Need-in-Bayesian-Deep-Learning-for-Computer-Vision

PyTorch implementation of the classification task in What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision (simple version)

the sigma in aleatoric uncertainty #2

Open ConanCui opened 4 years ago

ConanCui commented 4 years ago

Hi, I notice that you obtain mu and sigma as in https://github.com/ShellingFord221/My-implementation-of-What-Uncertainties-Do-We-Need-in-Bayesian-Deep-Learning-for-Computer-Vision/blob/e6ed204cd25ac995eb8ec8da701117dcd5aabb1d/classification_aleatoric.py#L81.

As far as I know, sigma should be larger than zero. How can the raw values coming out of logit.split be guaranteed to satisfy this condition?

whisney commented 4 years ago

In "What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?":

In practice, we train the network to predict the log variance

I think 'sigma' in the code is actually 'log_sigma', and the real sigma is exp(log_sigma). But @ShellingFord221's code seems to be missing this step.
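
For reference, a minimal sketch (with made-up shapes and names, not the repository's actual code) of what that exp step would look like if the second half of the head output is interpreted as log_sigma:

```python
import torch

num_classes = 10

# hypothetical head output: 2 * num_classes values per sample,
# interpreted here as [class logits (mu) | log_sigma]
logit = torch.randn(4, 2 * num_classes)
mu, log_sigma = logit.split(num_classes, dim=-1)

# exponentiate so that the actual sigma is strictly positive
sigma = torch.exp(log_sigma)
```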

ShellingFord221 commented 4 years ago

Sorry for the late reply. For @whisney 's question: in the regression task, the formula is (Eq. 5 in the original paper):

\mathcal{L}(\theta) = \frac{1}{N} \sum_i \frac{1}{2\sigma(\mathbf{x}_i)^2} \lVert \mathbf{y}_i - \mathbf{f}(\mathbf{x}_i) \rVert^2 + \frac{1}{2} \log \sigma(\mathbf{x}_i)^2

Since sigma appears in a denominator, the gradient can sometimes explode at the beginning of training. To avoid this, we predict alpha = log(sigma^2) in practice. But in the classification task, the formula becomes (Eq. 12 in the original paper):

\hat{\mathbf{x}}_{i,t} = \mathbf{f}_i + \sigma_i \, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, I)
\mathcal{L}_x = \sum_i \log \frac{1}{T} \sum_t \exp\!\left( \hat{x}_{i,t,c} - \log \sum_{c'} \exp \hat{x}_{i,t,c'} \right)

Now sigma is no longer in a denominator, so there is no need to predict alpha = log(sigma^2); we predict sigma directly instead.
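
A minimal sketch of the log-variance trick for the regression case, with hypothetical names and shapes (not the repository's code):

```python
import torch

def heteroscedastic_regression_loss(y_true, y_pred, log_var):
    # Eq. 5 rewritten with alpha = log(sigma^2): sigma no longer appears in a
    # denominator, so early-training gradients stay bounded.
    precision = torch.exp(-log_var)                 # 1 / sigma^2
    return (0.5 * precision * (y_true - y_pred) ** 2 + 0.5 * log_var).mean()

# toy usage with assumed shapes
y_true = torch.randn(8, 1)
y_pred = torch.randn(8, 1, requires_grad=True)
log_var = torch.zeros(8, 1, requires_grad=True)
heteroscedastic_regression_loss(y_true, y_pred, log_var).backward()
```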

For @ConanCui 's question: you can use an absolute value layer to prevent sigma from being negative. But in my experiments there seems to be no obvious difference between using this layer and not using it. It may depend on the task; I can't say for sure.
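
A hedged sketch of such a positivity constraint, assuming the head output is split into mu and a raw sigma (names are illustrative): torch.abs plays the role of the absolute value layer, and softplus is a common smooth alternative.

```python
import torch
import torch.nn.functional as F

num_classes = 10
logit = torch.randn(4, 2 * num_classes)        # assumed raw head output
mu, raw_sigma = logit.split(num_classes, dim=-1)

sigma_abs = torch.abs(raw_sigma)                # "absolute value layer"
sigma_softplus = F.softplus(raw_sigma)          # smooth alternative, also >= 0
```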

whisney commented 4 years ago


Thank you for your reply. https://github.com/tanyanair/segmentation_uncertainty/blob/master/bunet/utils/tf_metrics.py#L22 This is part of the official code of a MICCAI 2018 paper about aleatoric uncertainty (called 'prediction variance' in that paper); it is a segmentation task. In that code, the author predicts log_sigma and applies exp afterwards. So I'm not sure which of you is right.

ShellingFord221 commented 4 years ago

I think the easiest way to settle this question is to observe whether the training process is stable. If the gradient explodes at the beginning, you should predict alpha = log(sigma^2) rather than sigma. If not, I think there is no need to predict sigma in any other form.

whisney commented 4 years ago

I think the easiest way to settle this question is to observe whether the training process is stable. If the gradient explodes at the beginning, you should predict alpha = log(sigma^2) rather than sigma. If not, I think there is no need to predict sigma in any other form.

In my opinion, in terms of network structure (the output comes directly from a conv layer with no activation), we cannot guarantee that the network output is greater than 0, but sigma^2 must be >= 0. So we can only expect the network to predict log(sigma^2) and apply exp to make it > 0.

But the network has strong learning ability. This means that the network will learn to output sigma^2 if we do not apply exp (although the network structure gives no guarantee that it must be positive, the network will tend to output values >= 0). Conversely, if we apply exp, the network will learn to output log(sigma^2), which is more stable in theory.

I don't know if this description is correct, thank you.

hfutyanhuan commented 3 years ago

I think the easiest way to settle this question is to observe whether the training process is stable. If the gradient explodes at the beginning, you should predict alpha = log(sigma^2) rather than sigma. If not, I think there is no need to predict sigma in any other form.

I think that should be right. It has been a long time; are you still studying the uncertainty estimation problem? Assuming that the noise term follows a multivariate normal distribution, how do we construct a full covariance matrix to represent the latent distribution? And how is that reflected in the code?

ShellingFord221 commented 3 years ago

Hi, when we assume that the output of the network is a multivariate Gaussian distribution, we also assume that the features are independent of each other. Therefore, the covariance matrix of our multivariate Gaussian distribution is a diagonal matrix, with one variance element per feature on the diagonal. In the code, we implement it as sigma in:

https://github.com/ShellingFord221/My-implementation-of-What-Uncertainties-Do-We-Need-in-Bayesian-Deep-Learning-for-Computer-Vision/blob/61f14395b189264f276d683972dd9c5786c0d55a/classification_aleatoric.py#L102

Then we use sigma as well as mu to draw samples from this multivariate Gaussian distribution to generate multiple predictions for the input (i.e. Eq. 12 in the original paper):

https://github.com/ShellingFord221/My-implementation-of-What-Uncertainties-Do-We-Need-in-Bayesian-Deep-Learning-for-Computer-Vision/blob/61f14395b189264f276d683972dd9c5786c0d55a/classification_aleatoric.py#L108
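
For illustration, a minimal sketch of this sampling step with made-up tensors (the variable names and shapes are assumptions, not the repository's exact code):

```python
import torch
import torch.nn.functional as F

T = 100                                    # number of Monte Carlo samples
num_classes = 10
mu = torch.randn(4, num_classes)           # assumed predicted mean logits
sigma = torch.rand(4, num_classes)         # diagonal std, one value per class

# corrupt the logits with independent Gaussian noise, then average the
# softmax over the T samples (Eq. 12)
eps = torch.randn(T, *mu.shape)            # eps_t ~ N(0, I)
sampled_logits = mu.unsqueeze(0) + sigma.unsqueeze(0) * eps
probs = F.softmax(sampled_logits, dim=-1).mean(dim=0)
```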

You can also see the discussion in Issue #1. Hope that helps.

hfutyanhuan commented 3 years ago

the covariance matrix of our multivariate Gaussian distribution is a diagonal matrix

Assuming instead that the features are dependent on each other, so that the covariance matrix of the multivariate Gaussian is a full matrix, how do we get the final logits?

ShellingFord221 commented 3 years ago

Normally we do not assume that features are dependent on each other; they are treated as orthogonal. If they are correlated, I think there are two ways to tackle the problem. One is to reduce the dimensionality of the vectors so that the remaining features are independent. The other is to learn the relationship between dependent features and then make predictions according to the full covariance matrix. This can be done with a Gaussian Process, which uses kernels to model feature dependence and then optimizes the kernel hyper-parameters. This may be beyond the scope of deep learning, but it may provide a solution to your problem.
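
One way to sketch correlated logit noise with a full covariance matrix, assuming a low-rank-plus-diagonal parameterization like the one used in the papers cited below, is with torch.distributions (shapes and names here are illustrative, not the repository's code):

```python
import torch
from torch.distributions import LowRankMultivariateNormal

num_classes = 10
rank = 3
mu = torch.randn(4, num_classes)                   # assumed mean logits
cov_factor = torch.randn(4, num_classes, rank)     # low-rank factor P
cov_diag = torch.rand(4, num_classes) + 1e-3       # positive diagonal D

# covariance = P P^T + diag(D): models correlated noise between classes
dist = LowRankMultivariateNormal(mu, cov_factor, cov_diag)
sampled_logits = dist.rsample((100,))              # 100 correlated logit samples
probs = torch.softmax(sampled_logits, dim=-1).mean(dim=0)
```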

hfutyanhuan commented 3 years ago


I don't know if you have read the papers "Correlated Input-Dependent Label Noise in Large-Scale Image Classification" and "Stochastic Segmentation Networks: Modelling Spatially Correlated Aleatoric Uncertainty", which assume that the features are dependent and use a low-rank approximation. I just don't understand the logit generation, but thank you very much for your answer.

hfutyanhuan commented 3 years ago


In addition, have you tested the effectiveness of this method on other datasets? I tried other datasets, such as face data, and found that this method does not bring any performance improvement. Does it depend heavily on the choice of backbone network? I use ResNet-18 and add a dropout layer after each layer.