JavierAntoran / Bayesian-Neural-Networks

PyTorch implementations of Bayes by Backprop, MC Dropout, SGLD, the Local Reparametrization Trick, KF-Laplace, SG-HMC, and more

UCI log likelihood calculation #16

Closed jeffwillette closed 4 years ago

jeffwillette commented 4 years ago

Hi. Great repository! I was having trouble recreating the results from some of the papers you implemented, and I found your examples to be more complete than the original authors' repositories, which is very helpful.

I have a question regarding your calculation of the test set log likelihood of the UCI datasets...

In this file (https://github.com/JavierAntoran/Bayesian-Neural-Networks/blob/master/notebooks/regression/mc_dropout_hetero.ipynb) you use the flow...

- make a tensor of (samples, mu) and a tensor of (samples, log_sigma)
- use means.mean() as mu and (means.var() + mean(sigma) ** 2) ** 0.5 as sigma
- calculate the sum of all log probabilities divided by the number of data instances
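
In code, I read those steps as roughly the following (a sketch with my own stand-in shapes and names, not the notebook's exact code):

    import torch
    from torch.distributions import Normal

    # stand-in data: 20 MC forward passes over a batch of 128 normalised targets
    means = torch.randn(20, 128)       # per-sample predicted means
    log_sigmas = torch.randn(20, 128)  # per-sample predicted log std devs
    y = torch.randn(128)

    mu = means.mean(dim=0)
    sigma = (means.var(dim=0) + log_sigmas.exp().mean(dim=0) ** 2) ** 0.5

    avg_ll = Normal(mu, sigma).log_prob(y).sum() / y.size(0)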

Then, as a last step, you add log(y_sigma), where y_sigma is the standard deviation of the y values before they were normalized. Why do you do this, and where does the need for it come from?

The original MC Dropout and Concrete Dropout repositories use some form of the logsumexp trick to calculate the log probabilities, but so far I have been unable to recreate the UCI results using their method. I get much closer with your method, but I cannot justify the last step to myself.

References:

- Concrete Dropout
- MC Dropout

stratisMarkou commented 4 years ago

Hi @jeffwillette, and thanks for the question. We add this log-stddev term to account for data normalisation: we want to evaluate the model's log-likelihood on the raw data, not on the normalised data. Here's a more detailed explanation:

  1. We could just model the raw (unnormalised) data. In this case, the log-likelihood of the model parameters for the unnormalised data will be just the output of get_loss_and_rmse, and that would be the end of the story.

  2. But we want to normalise the data to facilitate training. We choose to normalise each component independently using:

        x_train, y_train = data[train_index, :in_dim], data[train_index, in_dim:]

        x_means, x_stds = x_train.mean(axis = 0), x_train.var(axis = 0)**0.5
        y_means, y_stds = y_train.mean(axis = 0), y_train.var(axis = 0)**0.5

        x_train = (x_train - x_means)/x_stds
        y_train = (y_train - y_means)/y_stds
  3. To evaluate the log-likelihood on the raw data we should: (1) normalise the test inputs, using the means and standard deviations of the training data; (2) pass the (normalised) test inputs through the network to get predictions (in normalised y-space); (3) transform the predictions (predicted means and standard deviations) back to unnormalised space and use them to evaluate the log-likelihood.

  4. As a design choice, we decided to have the outputs of the model be in normalised (rather than unnormalised) space. If you look at the formula for a normal distribution, you can see that if we transform both the predictions and the data, the term in the exponent remains unchanged, whereas the normalising constant at the front changes by a multiplicative factor, which, when you take the log, turns up as the np.log(y_stds) term that's added on.
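
To make point 4 concrete, here is a quick numerical check (with made-up numbers) that evaluating the Gaussian log-density on the raw scale is the same as evaluating it on the normalised scale and subtracting log(y_std):

    import torch
    from torch.distributions import Normal

    y_mean, y_std = 22.5, 9.2   # made-up training-set statistics
    mu_n, sigma_n = 0.3, 0.8    # a prediction in normalised y-space
    y_n = torch.tensor(0.1)     # a normalised test target

    # log-likelihood evaluated directly on the raw (unnormalised) scale
    ll_raw = Normal(mu_n * y_std + y_mean, sigma_n * y_std).log_prob(y_n * y_std + y_mean)

    # log-likelihood in normalised space, corrected by -log(y_std)
    ll_norm = Normal(mu_n, sigma_n).log_prob(y_n) - torch.log(torch.tensor(y_std))

    print(ll_raw.item(), ll_norm.item())  # identical up to floating-point error

The exponent is identical in the two cases because (y_raw - mu_raw) / sigma_raw equals (y_n - mu_n) / sigma_n; only the 1/sigma factor in front differs, by exactly a factor of y_std.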

I hope this explanation helps. This part of the code is poorly written and not at all documented (I'm responsible for this crime), and it's far from obvious what this term is doing there. I hope we'll soon get the bandwidth to fix this and other issues with this code. Thanks for pointing it out!

jeffwillette commented 4 years ago

Thanks for the detailed explanation. I suspected this must be the reason, but I had to do the math on paper before I could believe it, since I had never seen this calculation before.

jeffwillette commented 4 years ago

One more question...

After I figured out that your method was correct, I computed the log-likelihood in two ways: with logsumexp, as in the original repository, and with your method of taking the mean of the means and the combined (aleatoric + epistemic) uncertainty over the sample dimension, then calculating the log probability of the batch...

I found the answers to be close but not equal. Do you have any insight into which one is "more" correct?
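
In symbols, the two per-point quantities I am comparing are roughly the following (S is the number of MC samples; the moment-matched variance follows your notebook's formula as I read it):

    \log \frac{1}{S} \sum_{s=1}^{S} \mathcal{N}\!\left(y \mid \mu_s, \sigma_s^2\right)
    \quad \text{vs.} \quad
    \log \mathcal{N}\!\left(y \mid \bar{\mu}, \bar{\sigma}^2\right),
    \qquad
    \bar{\mu} = \frac{1}{S} \sum_s \mu_s,
    \quad
    \bar{\sigma}^2 = \operatorname{Var}_s(\mu_s) + \Big(\frac{1}{S} \sum_s \sigma_s\Big)^{2}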

stratisMarkou commented 4 years ago

Could you give a snippet (or link) of the code you are using and the (numerical) answers you are getting?

jeffwillette commented 4 years ago

The code I am using in my training loop is as follows (this was for a single run on the UCI housing dataset):

            with torch.no_grad():
                squared_err = 0.0
                total_ll = 0.0
                aleatoric = 0.0
                epistemic = 0.0
                total_d_ll = 0.0
                for i, (x, y) in enumerate(test):
                    x, y = x.to(device), y.to(device)

                    mus = torch.zeros(args.samples, x.size(0), device=device)
                    logvars = torch.zeros(args.samples, x.size(0), device=device)
                    for j in range(args.samples):
                        mus[j], logvars[j], _ = model(x)

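                    # per-sample joint log-likelihood of the batch, combined across samples
                    # with logsumexp; log(args.samples) is subtracted after the loop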
                    ll = Normal(mus, torch.exp(logvars / 2)).log_prob(y)
                    total_ll += torch.logsumexp(ll.sum(dim=1), dim=0).item()

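                    # moment-matched single Gaussian: epistemic variance of the means
                    # plus the square of the mean aleatoric standard deviation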
                    mean = mus.mean(dim=0)
                    std = (
                        mus.var(dim=0) + torch.exp(logvars / 2).mean(dim=0) ** 2
                    ) ** 0.5
                    total_d_ll += Normal(mean, std).log_prob(y).sum()

                    epistemic += mus.var(dim=0).mean().item()
                    aleatoric += torch.exp(logvars).mean(dim=0).sum().item()

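                    # un-normalise predictions and targets to get the squared error in the original y units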
                    real_y = y * test.dataset.y_sigma + test.dataset.y_mu  # type: ignore
                    real_mu = mus.mean(dim=0) * test.dataset.y_sigma + test.dataset.y_mu  # type: ignore
                    squared_err += ((real_y - real_mu) ** 2).sum().item()

                aleatoric /= len(test.dataset)
                epistemic /= len(test.dataset)

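                # turn the logsumexp total into an average per-point log-likelihood,
                # then correct for the y-normalisation with -log(y_sigma)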
                total_ll = total_ll - np.log(args.samples)
                total_ll /= len(test.dataset)
                total_ll -= np.log(test.dataset.y_sigma.item())  # type: ignore

                total_d_ll /= len(test.dataset)
                total_d_ll -= np.log(test.dataset.y_sigma.item())  # type: ignore
                print(f"total_ll: {total_ll} total_d_ll: {total_d_ll}")

outputs:

total_ll: -2.2789058327527476 total_d_ll: -2.5311999320983887

So, unless I have made a mistake, the two approaches give different answers.