Hi @jeffwillette and thanks for the question. The reason why we add this log-stddev part is because we need to account for data normalisation: we want to evaluate the model's log-likelihood on the raw data, not on the normalised data. Here's a more detailed explanation:
We could just model the raw (unnormalised) data. In this case, the log-likelihood of the model parameters for the unnormalised data would simply be the output of `get_loss_and_rmse`, and that would be the end of the story.
But we want to normalise the data to facilitate training. We choose to normalise each component independently using:
```python
x_train, y_train = data[train_index, :in_dim], data[train_index, in_dim:]
x_means, x_stds = x_train.mean(axis=0), x_train.var(axis=0) ** 0.5
y_means, y_stds = y_train.mean(axis=0), y_train.var(axis=0) ** 0.5
x_train = (x_train - x_means) / x_stds
y_train = (y_train - y_means) / y_stds
```
To evaluate the log-likelihood on the raw data we should: (1) normalise the test inputs, using the means and standard deviations of the training data; (2) pass the normalised test inputs through the network to get predictions (in normalised y-space); (3) transform the predictions (predicted means and standard deviations) back to unnormalised space and use them to evaluate the log-likelihood.
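Here is a minimal sketch of those three steps (my own illustration, not code from the repository). The names `x_means`, `x_stds`, `y_means` and `y_stds` come from the snippet above; `model.predict`, `x_test` and `y_test` are hypothetical:

```python
import numpy as np

def raw_space_log_like(model, x_test, y_test, x_means, x_stds, y_means, y_stds):
    # (1) normalise the test inputs with the *training* statistics
    x_norm = (x_test - x_means) / x_stds
    # (2) predict in normalised y-space (hypothetical model interface)
    mu_norm, std_norm = model.predict(x_norm)
    # (3) map the predictive distribution back to raw space and
    #     evaluate the Gaussian log-density of the raw targets
    mu_raw = mu_norm * y_stds + y_means
    std_raw = std_norm * y_stds
    return (-0.5 * np.log(2 * np.pi * std_raw ** 2)
            - 0.5 * ((y_test - mu_raw) / std_raw) ** 2)
```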
As a design choice, we decided to have the outputs of the model be in normalised (and not in unnormalised) space. If you look at the formula for a normal distribution, you can see that if we transform both the predictions and the data, the term in the exponent remains unchanged, whereas the normalising constant at the front changes by a multiplicative factor, which, when you take the log, turns up as the `np.log(y_stds)` term that's added on.
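In other words, if y_hat = (y - m)/s, the change of variables gives log p(y) = log N(y_hat | mu_hat, sigma_hat) - log s. Here is a quick numerical check of that identity; all the numbers are made up for illustration:

```python
import math
import torch
from torch.distributions import Normal

y_mu, y_sigma = 3.0, 2.5        # training-set mean/std used for normalisation
mu_hat, sigma_hat = 0.1, 0.8    # a prediction in normalised y-space
y_raw = torch.tensor(4.2)       # a raw (unnormalised) test target
y_norm = (y_raw - y_mu) / y_sigma

# Log-likelihood evaluated directly in raw space...
ll_raw = Normal(y_mu + y_sigma * mu_hat, y_sigma * sigma_hat).log_prob(y_raw)
# ...equals the normalised-space log-likelihood minus log(y_sigma).
ll_norm = Normal(mu_hat, sigma_hat).log_prob(y_norm) - math.log(y_sigma)
print(ll_raw.item(), ll_norm.item())  # identical up to float error
```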
I hope this explanation helps. This part of the code is poorly written and not at all documented (I'm responsible for this crime), and it's far from obvious what this term is doing there. I hope we'll soon get the bandwidth to fix this and other issues with this code. Thanks for pointing it out!
Thanks for the detailed explanation. I knew this must have been the reason, but I had to do the math on paper before I could believe it; I had never seen this calculation before.
One more question...
After I figured out that your method was correct, I computed the log-likelihood two ways: with logsumexp, as in the original repositories, and with your method of taking the mean of the means and the total (aleatoric + epistemic) uncertainty over the sample dimension and then calculating the log probability of the batch under a single Gaussian. I found the answers to be close but not equal. Do you have any insight into which one is "more" correct?
Could you give a snippet (or link) of the code you are using and the (numerical) answers you are getting?
The code I am using in my training loop is as follows (this was for a single run on the UCI housing dataset):
```python
# Excerpt from the evaluation part of my training loop; assumes
# `model`, `test` (a DataLoader), `args` and `device` are already defined.
import numpy as np
import torch
from torch.distributions import Normal

with torch.no_grad():
    squared_err = 0.0
    total_ll = 0.0
    aleatoric = 0.0
    epistemic = 0.0
    total_d_ll = 0.0
    for i, (x, y) in enumerate(test):
        x, y = x.to(device), y.to(device)
        mus = torch.zeros(args.samples, x.size(0), device=device)
        logvars = torch.zeros(args.samples, x.size(0), device=device)
        for j in range(args.samples):
            mus[j], logvars[j], _ = model(x)

        # logsumexp over the MC samples, as in the original repositories
        ll = Normal(mus, torch.exp(logvars / 2)).log_prob(y)
        total_ll += torch.logsumexp(ll.sum(dim=1), dim=0).item()

        # single Gaussian with the mean of the means and the combined
        # (epistemic + aleatoric) uncertainty, as in your notebook
        mean = mus.mean(dim=0)
        std = (mus.var(dim=0) + torch.exp(logvars / 2).mean(dim=0) ** 2) ** 0.5
        total_d_ll += Normal(mean, std).log_prob(y).sum().item()

        epistemic += mus.var(dim=0).mean().item()
        aleatoric += torch.exp(logvars).mean(dim=0).sum().item()

        # un-normalise the predictions for the squared error
        real_y = y * test.dataset.y_sigma + test.dataset.y_mu  # type: ignore
        real_mu = mus.mean(dim=0) * test.dataset.y_sigma + test.dataset.y_mu  # type: ignore
        squared_err += ((real_y - real_mu) ** 2).sum().item()

    aleatoric /= len(test.dataset)
    epistemic /= len(test.dataset)
    total_ll = total_ll - np.log(args.samples)
    total_ll /= len(test.dataset)
    total_ll -= np.log(test.dataset.y_sigma.item())  # type: ignore
    total_d_ll /= len(test.dataset)
    total_d_ll -= np.log(test.dataset.y_sigma.item())  # type: ignore
    print(f"total_ll: {total_ll} total_d_ll: {total_d_ll}")
```
outputs:

```
total_ll: -2.2789058327527476 total_d_ll: -2.5311999320983887
```
So, if I haven't made a mistake, these two give different answers.
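For reference, here is a minimal self-contained toy comparison, with all numbers synthetic. Note it uses per-point logsumexp and the textbook moment-matched variance (the mean of the variances), so it is not an exact copy of my loop above, but it shows the two estimators differ in general:

```python
import math
import torch
from torch.distributions import Normal

torch.manual_seed(0)
S, N = 20, 5                          # MC samples, test points
mus = torch.randn(S, N)               # per-sample predictive means
stds = 0.5 + 0.5 * torch.rand(S, N)   # per-sample predictive stds
y = torch.randn(N)

# (a) mixture-of-Gaussians log-likelihood via per-point logsumexp
ll = Normal(mus, stds).log_prob(y)    # shape (S, N)
mix_ll = (torch.logsumexp(ll, dim=0) - math.log(S)).sum()

# (b) single Gaussian moment-matched to the samples
mean = mus.mean(dim=0)
std = (mus.var(dim=0) + (stds ** 2).mean(dim=0)) ** 0.5
mm_ll = Normal(mean, std).log_prob(y).sum()

print(mix_ll.item(), mm_ll.item())    # close, but not equal in general
```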
Hi, great repository! I was having trouble recreating the results from some of the papers you implemented, and I found your examples to be more complete than the original authors' repositories, which is very helpful.
I have a question regarding your calculation of the test set log likelihood of the UCI datasets...
In this file (https://github.com/JavierAntoran/Bayesian-Neural-Networks/blob/master/notebooks/regression/mc_dropout_hetero.ipynb) you use the flow...
Then, as a last step, you add `log(y_sigma)`, where `y_sigma` is the standard deviation of the y values before they were normalized. Why do you do this, and where does the necessity to do this come from? The original MC Dropout and Concrete Dropout repositories use some form of the `logsumexp` trick to calculate the log probabilities, but so far I have been unable to recreate the UCI results using their method. I get much closer with your method, but I cannot justify the last step to myself.

References:
Concrete Dropout
MC Dropout