CamDavidsonPilon / Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)
http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/
MIT License

Logistic Example #454

Open bluesky314 opened 5 years ago

bluesky314 commented 5 years ago

In the ch. 2 logistic example, when we assume our data follows a logistic curve, why do we then use a Bernoulli after that? In previous examples we only had an assumption about the data's distribution and calculated the likelihood from that. But here, why has a Bernoulli suddenly appeared when we already get a probability from the logistic?

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def challenger_joint_log_prob(D, temperature_, alpha, beta):
    """
    Joint log probability function for the Challenger defect data.

    Args:
      D: The Data from the challenger disaster representing presence or 
         absence of defect
      temperature_: The Data from the challenger disaster, specifically the temperature on 
         the days of the observation of the presence or absence of a defect
      alpha: one of the inputs of the HMC
      beta: one of the inputs of the HMC
    Returns: 
      The joint log probability as a scalar tensor.
    """
    rv_alpha = tfd.Normal(loc=0., scale=1000.)
    rv_beta = tfd.Normal(loc=0., scale=1000.)

    # map the linear term through the logistic function to get a defect probability
    logistic_p = 1.0/(1. + tf.exp(beta * tf.to_float(temperature_) + alpha)) 

    rv_observed = tfd.Bernoulli(probs=logistic_p) 

    return (
        rv_alpha.log_prob(alpha)
        + rv_beta.log_prob(beta)
        + tf.reduce_sum(rv_observed.log_prob(D))
    )
cgarciae commented 5 years ago

In general it's because you need to be able to sample from a distribution. The logistic_p tensor / function is only an estimate of the mean, that is, it's just a parameter. The likelihood / pmf, on the other hand, is calculated using this parameter; you can find it here. In the article they express it using conditionals, but you can write it as an equation. To estimate the optimal parameters you need to calculate the log-likelihood, which is just the sum of the logarithm of the likelihood over the data / batch, and that is what the tf.reduce_sum(rv_observed.log_prob(D)) line is doing.
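
To make that concrete, rv_observed.log_prob(D) is just the Bernoulli log-pmf D*log(p) + (1 - D)*log(1 - p) evaluated per observation, and summing it gives the log-likelihood of the batch. A minimal sketch (assuming TF 2.x eager mode and tensorflow_probability; the data and probabilities below are made up):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Toy stand-ins for D and logistic_p from challenger_joint_log_prob.
d = tf.constant([1., 0., 0., 1.])
p = tf.constant([0.9, 0.2, 0.4, 0.7])

# Bernoulli log-pmf written out by hand: d*log(p) + (1-d)*log(1-p)
manual = d * tf.math.log(p) + (1. - d) * tf.math.log(1. - p)

# The same quantity through the distribution object used in the chapter.
via_tfp = tfd.Bernoulli(probs=p).log_prob(d)

print(manual.numpy())                  # per-observation log probabilities
print(via_tfp.numpy())                 # matches the manual version
print(tf.reduce_sum(via_tfp).numpy())  # log-likelihood of the whole batch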

cgarciae commented 5 years ago

I believe the formula for the log-likelihood of the Bernoulli distribution is (up to sign) the same as the sigmoid cross-entropy loss function, so looking at it from this perspective, this function is returning a loss, not the predictions.
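
One way to check this (a quick sketch, assuming TF 2.x and tensorflow_probability; the labels and logits are made up and only stand in for the model's linear term):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Made-up labels and logits.
labels = tf.constant([1., 0., 1.])
logits = tf.constant([2.0, -0.5, 0.3])

# Negative Bernoulli log-likelihood, parameterised by logits...
nll = -tfd.Bernoulli(logits=logits).log_prob(labels)

# ...is the sigmoid cross-entropy loss on the same logits.
xent = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)

print(nll.numpy())
print(xent.numpy())  # identical up to floating point error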

bluesky314 commented 5 years ago

Got it! So we just get the probability from the logistic function, and since we have to calculate the likelihood, we use the Bernoulli distribution to do so with this p. Thanks @cgarciae

Can you please explain your second comment about sigmoid cross entropy a bit more? Yes, the likelihood term is the same as the cross entropy without the 1/n and without the negative sign. However, we do have a prior term which dictates, for each value of the parameters, how much the loss should be weighted, which kind of influences the effect that this likelihood will have on the update. I would be interested in hearing/discussing more about this.
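
For reference, one way to see the prior's role numerically: negating the joint log probability splits it into a summed sigmoid cross-entropy data term plus a prior penalty, and with scale=1000 priors that penalty is nearly flat, so it barely re-weights the likelihood. A sketch (made-up parameter values and data, assuming TF 2.x and tensorflow_probability):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Hypothetical parameter values and data, only to illustrate the decomposition.
alpha, beta = tf.constant(1.0), tf.constant(-0.2)
temperature = tf.constant([66., 70., 75., 53.])
D = tf.constant([0., 0., 0., 1.])

# Same construction as challenger_joint_log_prob.
rv_alpha = tfd.Normal(loc=0., scale=1000.)
rv_beta = tfd.Normal(loc=0., scale=1000.)
logistic_p = 1. / (1. + tf.exp(beta * temperature + alpha))
joint = (rv_alpha.log_prob(alpha) + rv_beta.log_prob(beta)
         + tf.reduce_sum(tfd.Bernoulli(probs=logistic_p).log_prob(D)))

# Negated, it is a cross-entropy data term plus a prior penalty.
logits = -(beta * temperature + alpha)  # because logistic_p == sigmoid(logits)
data_loss = tf.reduce_sum(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=D, logits=logits))
prior_penalty = -(rv_alpha.log_prob(alpha) + rv_beta.log_prob(beta))

print((-joint).numpy(), (data_loss + prior_penalty).numpy())  # same value, up to floating point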