blei-lab / edward

A probabilistic programming language in TensorFlow. Deep generative models, variational inference.
http://edwardlib.org

Prior draws in Your first Edward program #810

Open raulsoutelo opened 6 years ago

raulsoutelo commented 6 years ago

I am new to Edward.

In "Your first Edward program" (Jupyter notebook version), the samples described as prior draws seem to be drawn from qW_0, qW_1, qb_0 and qb_1 instead of W_0, W_1, b_0 and b_1. Shouldn't it be the other way around? If I understood correctly, the latter are the priors and the former are the distributions intended to approximate the posterior. I don't see why the initial values of q() would have anything to do with the prior.
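For reference, this is roughly how the notebook defines the two sets of objects (a sketch; the shapes are illustrative and the exact code in the tutorial may differ):

from edward.models import Normal
import tensorflow as tf

# Priors: fixed standard normals over the weights and biases.
W_0 = Normal(loc=tf.zeros([1, 2]), scale=tf.ones([1, 2]))
W_1 = Normal(loc=tf.zeros([2, 1]), scale=tf.ones([2, 1]))
b_0 = Normal(loc=tf.zeros(2), scale=tf.ones(2))
b_1 = Normal(loc=tf.zeros(1), scale=tf.ones(1))

# Variational factors: normals whose parameters are free tf.Variables,
# initialized randomly and fit during inference.
qW_0 = Normal(loc=tf.Variable(tf.random_normal([1, 2])),
              scale=tf.nn.softplus(tf.Variable(tf.random_normal([1, 2]))))
qb_0 = Normal(loc=tf.Variable(tf.random_normal([2])),
              scale=tf.nn.softplus(tf.Variable(tf.random_normal([2]))))
# (qW_1 and qb_1 are defined the same way with their own shapes.)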

Thanks

russellizadi commented 6 years ago

The initial values of q() are exactly the priors; here W_0, W_1, b_0, and b_1 are Edward model objects used as the parameters in the definition of the network. During inference, as below, the initialized q() distributions are taken as the prior:

inference = ed.KLqp({W_0: qW_0, b_0: qb_0, W_1: qW_1, b_1: qb_1}, data={y: y_train})
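Running that inference then fits the q parameters; a minimal sketch (the number of iterations is illustrative):

# KLqp pairs each latent (model) random variable with its variational factor
# and optimizes the q parameters by stochastic variational inference.
inference.run(n_iter=1000)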

dustinvtran commented 6 years ago

It's true that it's not technically the prior because it's not standard normally distributed. Rather, it's the variational distribution with parameters initialized randomly. Clarification of this in the notebook is welcome.
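For example, the notebook could initialize q at the prior so that the first figure really does show prior draws; a sketch for one factor, using the notebook's imports (the other factors would follow the same pattern):

# Start q at N(0, 1): loc initialized to 0 and the raw scale variable to
# softplus^{-1}(1) ≈ 0.5413, so that softplus maps it to an initial scale of 1.
qW_0 = Normal(loc=tf.Variable(tf.zeros([1, 2])),
              scale=tf.nn.softplus(tf.Variable(0.5413 * tf.ones([1, 2]))))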

raulsoutelo commented 6 years ago

Thanks for the answers.

Sorry, I am not sure what you meant by 'it's not technically the prior because it's not standard normally distributed'.

I have reviewed the code (the loss and gradients implemented in build_reparam_entropy_loss_and_gradients in edward/inferences/klqp.py) and, if I understood correctly, W_0, W_1, b_0 and b_1 are exactly the priors. They are used together with qW_0, qW_1, qb_0 and qb_1 in the kl_penalty term to compute the KL divergence between the prior and the distribution q() used to approximate the posterior:

kl_penalty = tf.reduce_sum([
    tf.reduce_sum(inference.kl_scaling.get(z, 1.0) * kl_divergence(qz, z))
    for z, qz in six.iteritems(inference.latent_vars)])
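Since both z and qz are normal here, each KL term has a closed form; a sketch of what kl_divergence(qz, z) should reduce to when the prior z is a standard normal and qz is N(mu, sigma) (mu and sigma are placeholders for qz's parameters):

def kl_q_to_standard_normal(mu, sigma):
    # KL( N(mu, sigma^2) || N(0, 1) ), computed elementwise and summed.
    return tf.reduce_sum(tf.log(1.0 / sigma)
                         + (tf.square(sigma) + tf.square(mu)) / 2.0
                         - 0.5)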

Mentioning the use of the mean field approximation in the notebook would also be helpful.

I'd be happy to make both changes if I'm right.

dustinvtran commented 6 years ago

Looking purely at this snippet:

mus = tf.stack(
    [neural_network(x, qW_0.sample(), qW_1.sample(),
                    qb_0.sample(), qb_1.sample())
     for _ in range(10)])

This draws functions from the predictive distribution under samples from the variational distribution. So the first visualization shows functions induced by the initialization of the variational distribution; that initialization is random rather than fixed, so q does not start out as a standard normal (i.e., as the prior).
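To visualize genuine prior draws instead, the same snippet could sample the prior variables rather than the variational factors (a sketch, reusing the notebook's neural_network helper):

prior_mus = tf.stack(
    [neural_network(x, W_0.sample(), W_1.sample(),
                    b_0.sample(), b_1.sample())
     for _ in range(10)])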