Open bluesky314 opened 5 years ago
In general, it's because you need to be able to sample from a distribution: the logistic_p tensor/function is only an estimate of the mean, that is, it's just a parameter. The likelihood/pmf, on the other hand, is calculated using this parameter; you can find it here. In the article they express it using conditionals, but you can write it as a single equation. To estimate the optimal parameters you need to calculate the log-likelihood, which is just the sum of the logarithm of the likelihood over the data/batch, and that is what the tf.reduce_sum(rv_observed.log_prob(D)) line is doing.
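As a concrete sketch of that sum (plain NumPy with made-up toy data; the function name is mine, and `p` plays the role of the logistic_p parameter):

```python
import numpy as np

def bernoulli_log_likelihood(p, y):
    """Sum of the log Bernoulli pmf over the batch.

    Mirrors tf.reduce_sum(rv_observed.log_prob(D)): p is the
    per-example success probability produced by the logistic
    function, y is the observed 0/1 data.
    """
    return np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))

# Toy batch: probabilities from some logistic model, observed labels.
p = np.array([0.9, 0.2, 0.7])
y = np.array([1, 0, 1])
print(bernoulli_log_likelihood(p, y))  # a single scalar; higher is better
```

Note that p never needs to be 0 or 1; it only parameterizes the pmf that scores the observed 0/1 labels.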
I believe the formula for the negative log-likelihood of the Bernoulli distribution is the same as the sigmoid cross-entropy loss function, so looking at it from this perspective, this expression is returning a loss, not the predictions.
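This claim is easy to check numerically (a NumPy sketch; the function names are mine): the Bernoulli negative log-likelihood of y under p = sigmoid(z) matches the numerically stable sigmoid cross-entropy formula used by tf.nn.sigmoid_cross_entropy_with_logits:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_nll(z, y):
    """Negative log Bernoulli likelihood with p = sigmoid(z)."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log1p(-p))

def sigmoid_cross_entropy(z, y):
    """Stable sigmoid cross-entropy: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([-2.0, 0.5, 3.0])
y = np.array([0.0, 1.0, 1.0])
print(np.allclose(bernoulli_nll(z, y), sigmoid_cross_entropy(z, y)))  # True
```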
Got it! So we get the probability p from the logistic function, and since we then have to calculate the likelihood, we use the Bernoulli distribution with this p. Thanks @cgarciae
Can you please explain your second comment about sigmoid cross-entropy a bit more? Yes, the likelihood term is the same as the cross-entropy without the 1/n and without the negative sign. However, we also have a prior term which dictates, for each value of the parameter, how much the loss should be weighted, which in turn influences the effect that this likelihood has on the update. I would be interested in hearing/discussing more about this.
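One way to see the prior's effect concretely (a NumPy sketch under my own assumptions; all names and data are made up): with a Gaussian prior on the weights, the negative log posterior is just the sigmoid cross-entropy loss plus an L2 penalty, so the prior adds a term to the loss rather than reweighting individual examples:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior(w, X, y, sigma=1.0):
    """Negative log posterior for logistic regression with an
    independent N(0, sigma^2) prior on each weight (up to constants).

    = Bernoulli NLL (sigmoid cross-entropy)  +  L2 penalty from the prior
    """
    p = sigmoid(X @ w)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))
    neg_log_prior = np.sum(w ** 2) / (2.0 * sigma ** 2)
    return nll + neg_log_prior

# Toy data: a tighter prior (smaller sigma) penalizes large weights more,
# pulling the MAP estimate away from the pure maximum-likelihood solution.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([2.0, -1.0])
print(neg_log_posterior(w, X, y, sigma=1.0))
print(neg_log_posterior(w, X, y, sigma=0.1))  # larger: the prior dominates
```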
In the ch2 logistic example, when we assume our data follow a logistic model, why then use a Bernoulli after that? In previous examples we only had an assumption about the data distribution and calculated the likelihood from that. But here, why has a Bernoulli suddenly appeared when we already get a probability from the logistic function?
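To make the two roles in the question concrete (a NumPy sketch with toy data and my own names): the logistic function only maps features to a probability p in (0, 1), while the observations themselves are 0/1 values, so a Bernoulli(p) is what actually generates and scores them:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: the logistic part -- turns a linear score into a probability.
X = rng.normal(size=(5, 2))
w = np.array([1.5, -0.5])
p = sigmoid(X @ w)              # p is a parameter, not the data

# Step 2: the Bernoulli part -- the 0/1 observations live here.
y = rng.binomial(n=1, p=p)      # sampling: y ~ Bernoulli(p)
pmf = p**y * (1 - p)**(1 - y)   # scoring: Bernoulli pmf of the observed y
print(p)
print(y)
print(pmf)
```

Without the Bernoulli step there is no way to assign a probability to the discrete labels, which is exactly what the likelihood requires.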