dustinvtran opened 7 years ago
Is this going to be implemented?
@naesseth is planning to, although I think he's been busy of late (pinging him so he can give his own response).
Hi, yes sadly I have become tied up with other things. But the plan is still to implement it in Edward, most likely before the camera-ready deadline for AISTATS.
Oh great, that's good news! Thank you.
The camera-ready deadline for AISTATS was a bit earlier than anticipated, so it will not be ready in time for that. I have, however, provided the code that we used for the experiments in the paper here.
Hey @dustinvtran. Is this still up for grabs? Once finished with a work project, I'd like to give it a shot.
Yep.
Hey @dustinvtran. Getting ready to implement this.
It seems that the Reparameterization*KLqp classes compute an estimate of the gradient of the ELBO via MC integration w.r.t. the latent variable z, as opposed to w.r.t. the noise variable Ɛ as done in the paper; see, for instance, build_reparam_loss_and_gradients. For RSVI, we compute this expectation via integration w.r.t. the accepted variable (in the rejection-sampling step) Ɛ. Is this what I should strive for in the implementation? It would seem inconsistent with how you've implemented the above.
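For concreteness, here is the distinction I mean, in toy NumPy code (my own sketch of the standard Gaussian reparameterization, not Edward's implementation): the noise Ɛ is what stays fixed, and the MC average is taken over draws of Ɛ rather than z.

```python
import numpy as np

# Toy sketch (not Edward code): the reparameterization gradient for a
# Gaussian q(z; mu, sigma), written as an MC average over the noise
# eps ~ N(0, 1), with z = h(eps; theta) = mu + sigma * eps.

def f(z):                       # stand-in integrand for an ELBO term
    return z ** 2               # E[f(z)] = mu**2 + sigma**2

mu, sigma = 1.0, 0.5
eps = np.random.randn(100000)   # noise draws, fixed w.r.t. (mu, sigma)
z = mu + sigma * eps            # z = h(eps; theta)

grad_mu = np.mean(2.0 * z)          # E[f'(z) * dh/dmu];    truth: 2*mu    = 2.0
grad_sigma = np.mean(2.0 * z * eps) # E[f'(z) * dh/dsigma]; truth: 2*sigma = 1.0
print(grad_mu, grad_sigma)
```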
@cavaunpeu I'm unfamiliar with the specifics of Edward, but I think it should be possible to do the same for RSVI. The transformation I propose for the Gamma special case in my paper is invertible. In fact I make use of that in my Python/autograd implementation here. The inverse of the transformation is given by the function calc_epsilon.
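Concretely, the mapping and its inverse look like this (a quick NumPy sketch; the function names are mine rather than those in the repo):

```python
import numpy as np

# The Marsaglia-Tsang mapping for Gamma(alpha, 1) and its inverse
# (calc_epsilon in the linked code computes the latter).

def h(eps, alpha):
    """z = h(eps; alpha): maps standard-normal noise to the Gamma proposal."""
    d = alpha - 1.0 / 3.0
    c = 1.0 / np.sqrt(9.0 * d)
    return d * (1.0 + c * eps) ** 3

def h_inverse(z, alpha):
    """eps = h^{-1}(z; alpha): recovers the noise from a sample z."""
    d = alpha - 1.0 / 3.0
    c = 1.0 / np.sqrt(9.0 * d)
    return (np.cbrt(z / d) - 1.0) / c

# Round trip: h_inverse(h(eps)) == eps.
alpha, eps = 2.0, 0.3
assert np.isclose(h_inverse(h(eps, alpha), alpha), eps)
```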
Hey @naesseth! What do you mean by "do the same"? Integrate w.r.t. epsilon as you do in the paper?
@cavaunpeu You mentioned that in Edward the expectation is computed w.r.t. the latent variable z, whereas in my paper I focus on formulating the problem in Ɛ. For the Gamma rejection-sampler reparameterization it is a straightforward change of variables from Ɛ to z, so you could implement RSVI with expectations w.r.t. z as well as Ɛ.
@naesseth Ah, got it. Thanks!
So, before I implement, I just want to make sure I understand everything clearly. Is the following accurate?
- For each variational parameter Θ of our model, we compute the gradient of the ELBO w.r.t. Θ, then update Θ via SGD.
- The ∇_Θ h(Ɛ; Θ) term therein, i.e. the gradient of the deterministic mapping function h through which we generate samples from our (reparameterized) variational distribution q, must by definition be differentiable w.r.t. Θ. For many variational distributions, like Dirichlet and Gamma, this is not the case.
- Dirichlet, Gamma, etc. objects in Edward could, in theory, have samplers with a mapping function h that is differentiable w.r.t. the variational parameters Θ. However, they probably don't.
- RSVI gives an estimate of this gradient for Dirichlet, Gamma, and other distributions, computed via integration w.r.t. the accepted variable Ɛ. While the g_rep term in this estimate contains ∇_Θ h(Ɛ; Θ), we took care to use a deterministic mapping function h that is indeed differentiable w.r.t. Θ in the rejection sampler (see the sketch after this list).
- Additionally, we can compute this expectation w.r.t. z ~ q(z; Θ) if h is invertible, as z = h(Ɛ; Θ). Then it's easy: just follow Algorithm 2.
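Here is a toy autograd sketch of how I currently read the estimator for the Gamma(α, 1) case (my own code and naming, following the paper's g = g_rep + g_cor decomposition; please correct me if I've mangled something):

```python
import autograd.numpy as np
from autograd import grad
from autograd.scipy.special import gammaln

# Toy sketch of g = g_rep + g_cor for d/d_alpha E_{Gamma(z; alpha, 1)}[f(z)],
# with the Marsaglia-Tsang sampler z = h(eps; alpha), eps ~ N(0, 1),
# including its accept/reject step.

def f(z):                                # toy integrand; E[f] = digamma(alpha)
    return np.log(z)

def h(eps, alpha):                       # differentiable w.r.t. alpha
    d = alpha - 1.0 / 3.0
    c = 1.0 / np.sqrt(9.0 * d)
    return d * (1.0 + c * eps) ** 3

def log_q(z, alpha):                     # Gamma(alpha, 1) log-density
    return (alpha - 1.0) * np.log(z) - z - gammaln(alpha)

def log_dh_deps(eps, alpha):             # log |dh/d_eps| = log(sqrt(d) * (1 + c*eps)^2)
    d = alpha - 1.0 / 3.0
    c = 1.0 / np.sqrt(9.0 * d)
    return 0.5 * np.log(d) + 2.0 * np.log(1.0 + c * eps)

def sample_accepted_eps(alpha, n):
    # Marsaglia-Tsang accept/reject: keep the eps that pass the test.
    d = alpha - 1.0 / 3.0
    c = 1.0 / np.sqrt(9.0 * d)
    eps, u = np.random.randn(n), np.random.rand(n)
    v = (1.0 + c * eps) ** 3
    ok = (v > 0) & (np.log(u) < 0.5 * eps**2 + d - d * v
                    + d * np.log(np.where(v > 0, v, 1.0)))
    return eps[ok]

np.random.seed(0)
alpha = 2.0
eps = sample_accepted_eps(alpha, 200000)

# g_rep: differentiate f through the sampler, holding the accepted eps fixed.
g_rep = grad(lambda a: np.mean(f(h(eps, a))))(alpha)

# g_cor: score-style correction; f(z) is held fixed, and the log-density of
# the accepted eps, log pi = log s(eps) + log q(h(eps; a); a) + log |dh/d_eps|,
# is differentiated (the log s(eps) term is constant in alpha and drops out).
z = h(eps, alpha)
g_cor = grad(lambda a: np.mean(f(z) * (log_q(h(eps, a), a)
                                       + log_dh_deps(eps, a))))(alpha)

print(g_rep + g_cor)   # should be close to trigamma(2) = pi**2/6 - 1 ~= 0.6449
```

(I've ignored the α < 1 case here, where one would need the usual shape-augmentation trick.)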
@cavaunpeu Sounds about right. The requirement for using reparameterization-type gradients is slightly more subtle, but differentiability is a sufficient condition. If you'd like to know more about these issues, it basically comes down to the circumstances under which we can interchange integration and differentiation.
Ah, cool! Could you provide a reference? I'm interested to learn more.
@naesseth So, I'm going to go ahead and implement this in two ways:
1. One where h is not invertible (and a proposal distribution r is necessarily provided).
2. One where h is invertible (as in the case of the Marsaglia and Tsang Gamma sampler); see the sketch below.

Please interject if this sounds wrong. I will continue to leave questions here. Thanks so much for the help thus far.
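For (2), the core trick as I understand it (again a toy NumPy sketch with my own names, following Algorithm 2): sample z with any exact Gamma sampler, then recover the accepted noise via the inverse mapping.

```python
import numpy as np

# Toy sketch of the invertible path (Algorithm 2): reuse any exact Gamma
# sampler as a black box, then recover the accepted noise afterwards via
# the inverse Marsaglia-Tsang mapping.

def h_inverse(z, alpha):
    d = alpha - 1.0 / 3.0
    return (np.cbrt(z / d) - 1.0) * np.sqrt(9.0 * d)

alpha = 2.0
z = np.random.gamma(alpha, 1.0, size=100000)  # existing sampler; no internals needed
eps = h_inverse(z, alpha)                     # distributed as the accepted noise
# eps can now be fed to the g_rep / g_cor estimator from my earlier sketch.
```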
arXiv paper. Looping in @naesseth.