JavierAntoran / Bayesian-Neural-Networks

Pytorch implementations of Bayes By Backprop, MC Dropout, SGLD, the Local Reparametrization Trick, KF-Laplace, SG-HMC and more
MIT License

about the sampled output of bayes by backprop #4

Closed · ShellingFord221 closed this issue 4 years ago

ShellingFord221 commented 4 years ago

Hi, in BBB we sample the outputs several times (5, for example) when testing, due to the Monte Carlo nature of the method. But how do we use these 5 outputs for a single test input? Average them? Sum them? Or just take the best one as that input's output vector? Is there any theory about how to use these sampled results? Thanks!

stratisMarkou commented 4 years ago

Answer: For classification, we sample the weights of the whole network several times and compute the output for each weight sample given the input. We then combine the predictions by averaging in probability space (not in logit space). For regression we do the same, but now the network outputs the parameters of a distribution; we use a Gaussian, so each sample gives a mean and a standard deviation. To combine these into a single Gaussian, we average the predicted means, and for the variance we use the law of total variance: total variance = variance of the predicted means + mean of the predicted variances.
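
As an illustration (this is a sketch rather than the exact code in this repo, and it assumes `model(x)` returns logits and draws a fresh weight sample on every forward pass), the classification case looks like:

```python
import torch
import torch.nn.functional as F

def predict_classification(model, x, n_samples=5):
    """Posterior predictive by averaging softmax outputs over weight samples."""
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1)
                             for _ in range(n_samples)])  # (n_samples, batch, classes)
    return probs.mean(dim=0)                              # (batch, classes)
```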

Extra: In BBB, as in any other MC-based algorithm for BNNs, we sample the posterior over weights p(w | data) and plug these samples into the likelihood p(y | x, w) to get an MC estimate of the posterior predictive p(y | x, data) = \int p(y | x, w) p(w | data) dw, where (x, y) is a new input-label pair. We need MC because the integral is otherwise intractable. The proper way to combine the samples is to follow the MC recipe: \int p(y | x, w) p(w | data) dw \approx \frac{1}{N} \sum_{n=1}^{N} p(y | x, w_n), where the w_n are samples from p(w | data). For classification this clearly means averaging in probability space. For regression the sum is a mixture of Gaussians as a function of y. In our case, we choose to simplify this to a single Gaussian using the law of total variance.
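
And a sketch of the regression case, under the assumption that each forward pass returns a predicted mean and standard deviation:

```python
import torch

def predict_regression(model, x, n_samples=5):
    """Collapse a mixture of Gaussian predictions into a single Gaussian."""
    with torch.no_grad():
        outputs = [model(x) for _ in range(n_samples)]  # each assumed to be (mean, std)
    means = torch.stack([m for m, _ in outputs])        # (n_samples, batch, 1)
    stds = torch.stack([s for _, s in outputs])
    mean = means.mean(dim=0)
    # Law of total variance: Var[y] = Var[E[y|w]] + E[Var[y|w]]
    total_var = means.var(dim=0, unbiased=False) + stds.pow(2).mean(dim=0)
    return mean, total_var.sqrt()
```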

Theory pointers

  1. Total variance: https://en.wikipedia.org/wiki/Law_of_total_variance
  2. Uncertainty decomposition: https://arxiv.org/pdf/1710.07283.pdf
  3. General MC theory: https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf

Hope this helps and thanks for your question!

ShellingFord221 commented 4 years ago

Hi, besides averaging the sampled outputs of the network, are the loss and the KL also averaged over the 10 weight samples? Am I correct?

ShellingFord221 commented 4 years ago

Also, in BayesByBackprop_MNIST_GMM.ipynb, shouldn't out first be passed through a softmax and then averaged before counting wrong predictions during training (at test time you average the probabilities in sample_eval() with mean_out = F.softmax(out, dim=2).mean(dim=0, keepdim=False))? Thanks!

JavierAntoran commented 4 years ago

Hi,

> Hi, besides averaging the sampled outputs of the network, are the loss and the KL also averaged over the 10 weight samples? Am I correct?

Yes, that is the case. Note that a larger number of samples provides a lower-variance estimator of the ELBO.
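
As a rough sketch (not the notebook's exact code; `model.kl_divergence()` is a stand-in for however your model exposes the KL term), the per-batch estimate looks like:

```python
import torch.nn.functional as F

def elbo_loss(model, x, y, n_samples=10, kl_weight=1.0):
    """Average the data term and the KL term over several weight samples."""
    nll, kl = 0.0, 0.0
    for _ in range(n_samples):
        logits = model(x)                        # a new weight sample per forward pass
        nll = nll + F.cross_entropy(logits, y)   # mean negative log-likelihood over the batch
        kl = kl + model.kl_divergence()          # hypothetical method: KL(q(w) || p(w))
    # kl_weight handles the minibatch scaling of the KL term
    return nll / n_samples + kl_weight * kl / n_samples
```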

ShellingFord221 commented 4 years ago

Hi, I have a question about the measurement of epistemic and aleatoric uncertainty. We can decompose the predictive uncertainty (i.e. the entropy of the softmax distribution in a classification task) into uncertainty about the test input (i.e. AL uncertainty) and uncertainty about the model weights (i.e. EP uncertainty). But I have found that when uncertainties are calculated this way, the posterior distribution is usually approximated with Bernoulli variables (i.e. variational dropout). I wonder whether, if I use BBB to perform inference (i.e. the posterior is approximated by a Gaussian), I can also calculate uncertainties in this way?
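
To make sure we are talking about the same thing, here is a sketch of the entropy-based decomposition I mean, assuming `probs` holds the sampled softmax outputs with shape (n_samples, batch, n_classes); it only uses samples from the predictive distribution, so in principle it should not matter whether the weight posterior is Bernoulli or Gaussian:

```python
import torch

def entropy_decomposition(probs, eps=1e-10):
    """probs: (n_samples, batch, n_classes) softmax outputs from MC weight samples."""
    mean_probs = probs.mean(dim=0)
    predictive_entropy = -(mean_probs * (mean_probs + eps).log()).sum(dim=1)  # total
    aleatoric = -(probs * (probs + eps).log()).sum(dim=2).mean(dim=0)         # expected entropy
    epistemic = predictive_entropy - aleatoric                                # mutual information
    return predictive_entropy, aleatoric, epistemic
```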

Besides, I also found another way to calculate AL and EP uncertainties in Uncertainty quantification using Bayesian neural networks in classification: Application to ischemic stroke lesion segmentation (https://openreview.net/pdf?id=Sk_P2Q9sG), which models AL and EP uncertainties in terms of variances rather than entropies. Due to my limited knowledge, I cannot tell the difference (and connection) between these two ways of measuring uncertainty. If we can have a discussion about this, I will be very thankful!
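
My rough understanding of that paper's variance-based decomposition, with the same assumed shape for `probs` as above:

```python
import torch

def variance_decomposition(probs):
    """probs: (n_samples, batch, n_classes) softmax outputs from MC weight samples."""
    mean_probs = probs.mean(dim=0)
    # Aleatoric part: average over samples of diag(p_t) - p_t p_t^T
    aleatoric = (torch.diag_embed(probs)
                 - probs.unsqueeze(-1) * probs.unsqueeze(-2)).mean(dim=0)
    # Epistemic part: spread of the sampled probability vectors around their mean
    diff = probs - mean_probs
    epistemic = (diff.unsqueeze(-1) * diff.unsqueeze(-2)).mean(dim=0)
    return aleatoric, epistemic  # each: (batch, n_classes, n_classes)
```

If I read it correctly, both terms here are covariance matrices over the class probabilities rather than scalar entropies, which seems to be the main difference from the entropy-based decomposition above.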