blei-lab / edward

A probabilistic programming language in TensorFlow. Deep generative models, variational inference.
http://edwardlib.org

Implementing IWAE and VIMCO #247

Open bayerj opened 7 years ago

bayerj commented 7 years ago

Hi,

I am currently toying with the idea of implementing multi-sample Monte Carlo objectives such as the importance weighted autoencoder (IWAE) [1] and VIMCO [2] in Edward. For that purpose, I browsed the code a bit and wanted to give some feedback and gather some thoughts on my plan for how to do it.

Currently it would require subclassing ed.VariationalInference and copy/pasting quite a bit of functionality from ed.MFVI. I outline why this is the case below.

Terminology

First, I am not sure how you think about it, but it seems debatable to me whether either method is an instance of variational inference at all. From a practical perspective this may not matter, but I am unaware of any work showing that the proposal distributions used in both works minimize some divergence to the true posterior. This is of course a purely philosophical issue. During implementation, I'd refrain from any terminology implying that.

Gradient estimation

VIMCO relies on a special estimate of the gradient. Currently, Edward seems to leverage TensorFlow's automatic differentiation (AD), which will not work with VIMCO. I had a look at the API of ed.MFVI and found that it should be straightforward to add a method build_loss_and_gradient, which is used instead of build_loss and the subsequent AD step whenever it exists. Some pseudocode:

# Prefer a method that also returns custom gradients (e.g. for VIMCO);
# otherwise fall back to build_loss plus TensorFlow's automatic differentiation.
if getattr(self, 'build_loss_and_gradient', None) is not None:
  loss, grads_and_vars = self.build_loss_and_gradient(var_list)
else:
  loss = self.build_loss()
  grads_and_vars = list(zip(tf.gradients(loss, var_list), var_list))

optimizer = ...  # resolved as before, e.g. some tf.train.Optimizer
train_op = optimizer.apply_gradients(grads_and_vars)
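
For concreteness, here is my rough, untested sketch of what a VIMCO-style build_loss_and_gradient could compute, following the estimator in [2]. The method name, signature, and the assumption that the K log importance weights and log q terms are already available as tensors are all made up for illustration:

import tensorflow as tf

def build_loss_and_gradient(self, var_list):
  # Sketch only. Assumes self.log_w is a [K] tensor of log importance
  # weights log p(x, z_k) - log q(z_k | x) and self.log_q is a [K] tensor
  # of log q(z_k | x) for the K sampled latents.
  log_w, log_q = self.log_w, self.log_q
  K = tf.cast(tf.shape(log_w)[0], tf.float32)

  # Multi-sample objective: log (1/K) sum_k w_k.
  objective = tf.reduce_logsumexp(log_w) - tf.log(K)

  # Leave-one-out baseline: replace log w_k by the mean of the other log weights.
  loo_log_mean = (tf.reduce_sum(log_w) - log_w) / (K - 1.0)
  # Row k of `swapped` is log_w with entry k replaced by loo_log_mean[k].
  swapped = tf.matrix_diag(loo_log_mean - log_w) + log_w
  baseline = tf.reduce_logsumexp(swapped, axis=1) - tf.log(K)
  learning_signal = tf.stop_gradient(objective - baseline)

  # Surrogate whose AD gradient is the VIMCO estimator:
  # sum_k signal_k * grad log q(z_k) + sum_k w~_k * grad log w_k.
  normalized_w = tf.stop_gradient(tf.exp(log_w - tf.reduce_logsumexp(log_w)))
  surrogate = tf.reduce_sum(learning_signal * log_q + normalized_w * log_w)

  loss = -objective  # minimize the negative multi-sample bound
  grads = tf.gradients(-surrogate, var_list)
  return loss, list(zip(grads, var_list))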

Class hierarchy

Currently, most code would be reused by subclassing ed.MFVI, but that is probably wrong from a terminology standpoint. Further, a lot of code in that class is not specific to mean-field, such as the resolution of optimizers.

Still, I believe that a lot of code from that class will be shared: all of VI (somewhat by definition) relies on optimisation, and optimisation comes with quite an amount of bookkeeping code.

Currently, the class also internally organises two different ways of building the loss, i.e. one based on reparameterisation and one based on the score function.
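
(For concreteness, the reparameterisation-based construction looks roughly like the following for a diagonal Gaussian q; this is only a sketch with made-up names, not Edward's actual code.)

import math
import tensorflow as tf

def reparameterized_elbo_loss(mu, log_sigma, log_joint_fn):
  # Sample via reparameterisation, z = mu + sigma * eps, so AD can flow through z.
  eps = tf.random_normal(tf.shape(mu))
  z = mu + tf.exp(log_sigma) * eps
  # log q(z | x) for a diagonal Gaussian.
  log_q = -0.5 * tf.reduce_sum(
      ((z - mu) / tf.exp(log_sigma)) ** 2
      + 2.0 * log_sigma + math.log(2.0 * math.pi))
  # Negative single-sample ELBO; its AD gradient is the SGVB / pathwise estimator.
  return -(log_joint_fn(z) - log_q)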

I wonder if a pattern such as dependency injection might work better here: have an initialisation parameter "loss", which is an object that only takes care of formulating the loss and possibly its gradients (see above).

class VariationalInference(Inference):

  def __init__(self, ..., loss):
    # The injected `loss` object only knows how to formulate the objective
    # (and possibly its gradients); this class only handles optimisation.
    self.loss = loss
    ...

  def build_loss(self):
    # Delegate objective construction to the injected loss object.
    return self.loss.build_loss(...)

  # Mostly optimisation-related functionality down here
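
To make this concrete, the injected objects could expose a tiny interface like the following; the class names are made up for illustration and don't exist in Edward:

# Purely illustrative interfaces; none of these names exist in Edward.
class ReparameterizationLoss(object):
  def build_loss(self, inference):
    # Return a scalar loss whose AD gradient is the SGVB / pathwise estimator.
    raise NotImplementedError()

class VIMCOLoss(object):
  def build_loss_and_gradient(self, inference, var_list):
    # Return (loss, grads_and_vars) computed with the VIMCO estimator.
    raise NotImplementedError()

# Usage would then look roughly like
#   inference = VariationalInference(latent_vars, data, loss=VIMCOLoss())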

This might also become handy as soon as different losses have to be combined, e.g. once you want to use VIMCO for the discrete local latents, SGVB for the continuous local latents and Bayes by backprop for the continuous global latents.
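
Continuing the made-up interface from above, such a combination could be expressed as, e.g.:

# Illustrative only: one injected loss per group of latent variables.
losses = {
    'discrete_local':    VIMCOLoss(),              # multi-sample score-function estimator
    'continuous_local':  ReparameterizationLoss(), # SGVB / pathwise estimator
    'continuous_global': ReparameterizationLoss(), # e.g. Bayes by backprop on the weights
}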

My overview of the field is far too limited to comment on every possible method out there to do VI. But my feeling is (also given the huge list in issue #129 :) that a sensible way of organising these things is essential. My experience in machine learning in general is that every abstraction is sooner or later challenged by a recent research paper. ;)

[1] https://arxiv.org/abs/1509.00519
[2] https://arxiv.org/abs/1602.06725

dustinvtran commented 7 years ago

Thanks for the comprehensive comments!

Terminology. Yup, that's right. Not everything can be placed into Box's loop of model, inference, and criticism, although Edward prescribes that workflow. IWAEs (or VAEs) are an example where they can be placed into this loop: we separate out what defines the model and what defines the inference (following Rezende et al. (2014)). VIMCO can be shoehorned into this when it is applied specifically for posterior inference, but it doesn't fit this framework when you just want to optimize an expectation of a function.

Gradient estimation. As you state, it's generally useful to separate out how to take gradients and how to perform updates with an optimizer using those gradients. From my understanding, I think VIMCO could work without having to do this, though. For example, you can see how the score function gradient is implemented, which applies stop_gradient() to a set of nodes so that AD is not taken through them.
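
Roughly, the idea is something like this (a minimal sketch, not Edward's actual code; the function and argument names are made up):

import tensorflow as tf

def score_function_surrogate(f, log_q):
  # f:     per-sample learning signal, e.g. log p(x, z_s) - log q(z_s | x)
  # log_q: per-sample log q(z_s | x), differentiable w.r.t. the variational
  #        parameters
  # stop_gradient prevents AD from differentiating through the learning
  # signal; differentiating this surrogate therefore yields (the negative of)
  # the Monte Carlo score function / REINFORCE estimator,
  # mean_s[f_s * grad log q(z_s | x)].
  return -tf.reduce_mean(tf.stop_gradient(f) * log_q)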

Class hierarchy. Extracting out pieces of MFVI is much needed. As you state, it makes sense to have separate inference algorithms that solely do reparameterization gradients, score function gradients, etc.

Dependency injection is an interesting idea. The way Edward is currently formulated, you define individual inference algorithms, each for a particular set of latent variables to infer. Then you manually handle this collection of individual inference algorithms (possibly alternating updates). Dependency injection sounds useful if you want to define a single inference algorithm based on a collection of them.

bayerj commented 7 years ago

For example, you can see how the score function gradient is implemented, which applies stop_gradient() on a set of nodes not to take AD with.

I am still unsure whether this is the best road; after all, it feels a little like a hack and does not nicely separate concerns. But maybe I am too picky here.

dustinvtran commented 7 years ago

No, I think you're right. The stop_gradient() is currently used only to simplify the code. We should definitely take an approach like build_loss_and_gradient vs. build_loss if inference aims to be very general.