blei-lab / edward

A probabilistic programming language in TensorFlow. Deep generative models, variational inference.
http://edwardlib.org

think about what it means to "default" to reparameterization gradient #38

Closed akucukelbir closed 8 years ago

akucukelbir commented 8 years ago

we currently default to the reparameterization gradient if the Variational class implements reparam

however, if the Inference class does not support reparameterization gradients (e.g., KLpq), then it doesn't matter whether the Variational class implements it or not.
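To make the issue concrete, here is a minimal sketch of the selection logic being discussed: the gradient estimator should depend on *both* the variational family (does it implement reparameterization?) and the inference method (does it support reparameterization gradients at all?). The class and attribute names here are illustrative, not Edward's actual API.

```python
# Hypothetical capability flags; not Edward's real class layout.
class Variational:
    is_reparameterizable = True   # e.g., a Gaussian family

class MFVI:
    supports_reparam = True

class KLpq:
    supports_reparam = False      # KL(p||q) has no reparameterization gradient

def choose_estimator(inference, variational):
    """Fall back to the score-function estimator unless BOTH the
    inference method and the variational family support reparam."""
    if getattr(inference, "supports_reparam", False) and \
       getattr(variational, "is_reparameterizable", False):
        return "reparam"
    return "score"
```

Under this scheme, KLpq gets the score-function estimator no matter what the variational family implements.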

dustinvtran commented 8 years ago

@mariru is working on MAP, which is another case where we don't necessarily need this score vs. reparam dichotomy. We also need to think about how the class should later incorporate sampling methods (e.g., do we just treat it as an "optimization"?).

akucukelbir commented 8 years ago

how about having a hierarchical method structure, like in Stan?

dustinvtran commented 8 years ago

You mean for specifying the inference method? E.g., Inference(method="MFVI")?

akucukelbir commented 8 years ago

hmm. now that i think about it, i'm not sure.

perhaps we have some sort of added hierarchy within Inference.

i don't know how to communicate this so bear with me:

 +-----------+
 | Inference +------------+------------------+
 +-----+-----+            |                  |
       |                  |                  |
       |                  |                  |
+------+------+    +------+-------+    +-----+------+
| Variational |    | Optimization |    |  Sampling  |
+------+------+    +--------------+    +------------+
       |
       |
       +
 MFVI/KLpq/etc.

so the reparam/score loss stuff happens at the variational level (in its implementation of run). perhaps Inference doesn't even need to implement run anymore.

does that make sense?

dustinvtran commented 8 years ago

I like the ASCII! This makes sense. I would also put optimization inside variational.

akucukelbir commented 8 years ago

optimization with the score function estimator? is that useful?

dustinvtran commented 8 years ago

For example, MAP (and by extension, MLE) is variational inference with a point mass variational family. This is how Maja is currently implementing it.

akucukelbir commented 8 years ago

what does sampling from a point mass mean?

the way i view it: variational inference in this library is basically (by choice) based on stochastic optimization techniques.

MAP and MLE do not need to be based on stochastic optimization. so doesn't it make more sense to separate them?

(i could be missing something here.)

mariru commented 8 years ago

"sampling" for the point mass means simply returning its value.

If you check out the branch feature/map, I implemented a variational family PMGaussian for modeling unconstrained parameters using a point estimate. It should probably get a better name, but I wanted to make the distinction that, like MFGaussian, the transform for the mean parameter is the identity.
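A minimal sketch of the idea, assuming nothing about PMGaussian's actual implementation: a point-mass "variational family" whose sample() simply returns its stored parameters, so the generic stochastic-optimization loop can be reused unchanged for MAP/MLE.

```python
# Illustrative sketch, NOT Edward's PMGaussian.
class PointMass:
    def __init__(self, params):
        self.params = list(params)

    def sample(self, n=1):
        # "Sampling" a point mass: every draw is just the stored value.
        return [list(self.params) for _ in range(n)]
```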

So I think it can be useful to have run() in the variational/optimization parent class, but then have methods within run() that get overridden by the child classes: e.g., call build_loss() within run() in the parent class, and then override build_loss() in the child class to call one of build_score_loss(), build_reparam_loss(), or some other build_*_loss(). These method-specific loss functions can be implemented in the parent class, or, if a modification is needed, they can also be overridden for a specific inference method.

dustinvtran commented 8 years ago

Yup, that's a great idea. So right now, Inference would have build_loss(), which raises NotImplementedError(). Then MFVI would write build_loss() as an if-else chain and return the score or reparam loss. For KLpq, it would just be a single loss because there is no reparameterization gradient. For MAP, it can just return log p(x, z).

akucukelbir commented 8 years ago

so what's the full spec here? and what would be the best way of making this change? (we should be considerate of stuff happening in other branches.)

dustinvtran commented 8 years ago
class Inference:
    def __init__(self, model, data):
        self.model = model
        self.data = data

class MonteCarlo(Inference):
    def __init__(self, *args, **kwargs):
        Inference.__init__(self, *args, **kwargs)

    # not sure what will go here

class VariationalInference(Inference):
    def __init__(self, model, variational, data):
        Inference.__init__(self, model, data)
        self.variational = variational

    def run(self): pass
    def initialize(self): pass
    def update(self): pass
    def build_loss(self): raise NotImplementedError()
    def print_progress(self): pass

class MFVI(VariationalInference):
    def __init__(self, *args, **kwargs):
        VariationalInference.__init__(self, *args, **kwargs)

    def build_loss(self):
        if ...:
            return self.build_score_loss()
        else:
            return self.build_reparam_loss()

    def build_score_loss(self): pass
    def build_reparam_loss(self): pass

class KLpq(VariationalInference):
    def __init__(self, *args, **kwargs):
        VariationalInference.__init__(self, *args, **kwargs)

    def build_loss(self): pass

class MAP(VariationalInference):
    def __init__(self, model, data):
        variational = PointMass(...)
        VariationalInference.__init__(self, model, variational, data)

    def build_loss(self): pass
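To show the dispatch concretely, here's a runnable miniature of this hierarchy. The loss values are placeholder strings, not real objectives, and the reparameterizable flag stands in for whatever condition the real if-else would check.

```python
# Runnable miniature of the proposed template-method dispatch:
# run() lives in the parent; subclasses only swap out build_loss().
class VariationalInference:
    def run(self):
        return self.build_loss()

    def build_loss(self):
        raise NotImplementedError()

class MFVI(VariationalInference):
    def __init__(self, reparameterizable=True):
        self.reparameterizable = reparameterizable

    def build_loss(self):
        if self.reparameterizable:
            return self.build_reparam_loss()
        return self.build_score_loss()

    def build_score_loss(self):
        return "score_loss"

    def build_reparam_loss(self):
        return "reparam_loss"

class KLpq(VariationalInference):
    def build_loss(self):
        # only the score-function estimator exists for KL(p||q)
        return "score_loss"
```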
dustinvtran commented 8 years ago

As for how to implement this, I suggest we do this broad refactor as early as possible to avoid incurring debt. So we write this in a branch and then individually deal with any merge conflicts in each branch once the pull request is made.

akucukelbir commented 8 years ago

very nice.

wouldn't it be more flexible to have

class MAP(Inference):

again, i'm not entirely following why we want to go with this PointMass approach. is it to reduce some reimplementation of some code somehow?

mariru commented 8 years ago

By doing variational inference with a point mass, you are reusing the gradient descent routine from run() in (variational) inference. Plus, you can use the PointMass objects to encode constraints in the parameter space but then still do the same optimization as defined in run() in the unconstrained space.
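A small sketch of that second point, with softplus chosen here purely for illustration: gradient descent acts on an unconstrained parameter, and a fixed transform maps it into the constrained (here, positive) space the model sees.

```python
import math

def softplus(x):
    # smooth map from the real line to the positive reals
    return math.log1p(math.exp(x))

class PointMass:
    """Point mass over a positive parameter, stored unconstrained."""
    def __init__(self, unconstrained):
        self.unconstrained = unconstrained  # gradient descent acts on this

    def value(self):
        # constrained value that the model actually sees
        return softplus(self.unconstrained)
```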

dustinvtran commented 8 years ago

Broadly, I see inference derived from two paradigms: optimization (variational inference) and sampling (Monte Carlo methods). There are two reasons to include techniques such as MLE, MAP, MML, and MPO as part of the variational inference class:

  1. Conceptually. I personally view variational inference as an umbrella term for any posterior inference method that is formulated as an optimization problem. All these estimation techniques are crude approximate methods based on the mode. Viewing them as approximations justifies and makes clear the use case for other approximations, such as KL(p||q). (E.g., I don't think it's reasonable to distinguish between inference via approximate posterior means and inference via exact or approximate posterior modes.)
  2. Practically. All optimization-based methods share many defaults: the same optimization routine (e.g., learning rate, gradient descent method) using update(), print_progress() for the iteration and loss function's value, initialize(), and a general wrapper of all these in run(). Any of these methods can override one of the defaults or add onto it.
akucukelbir commented 8 years ago

hmm. not to be pedantic here, but i don't think i agree with either point. (also, I don't know what MPO is.)

  1. interpreting MLE, for instance, as a posterior inference method is confusing.
  2. why should all optimization-based methods share the same optimization routine? why would i want to do stochastic gradient ascent instead of conjugate gradient or BFGS if i have exact gradients of my log prob?

a broader point of 1 is i guess this: did we decide to frame blackbox as a Bayesian toolbox?

i also didn't follow some of maja's comments. perhaps this is easier to figure out over coffee :)

dustinvtran commented 8 years ago

Well, let's agree to disagree then. :)

MPO: marginal posterior optimization

All optimization methods default to gradient descent (data subsampling is optional). Latent variable sampling is currently used, e.g., in MFVI and KLpq, but it's not a necessary distinction. For example, we ideally would have coordinate ascent MFVI if someone wrote down an exponential-family graphical model with VIBES-like metadata. (@heywhoah and I are interested in this.)

akucukelbir commented 8 years ago

agree to disagree? what kind of strange proposal is that? :)

let's chat in person. i think i'm missing some things here. ( e.g. preferring coordinate ascent? much strangeness abound :) )

dustinvtran commented 8 years ago

I wrote it in the MAP branch. Here's what it looks like: https://github.com/Blei-Lab/blackbox/blob/af3f0528fd116be3dbcfc6d3871ac9119648abce/blackbox/inferences.py

akucukelbir commented 8 years ago

nice work! (i'm not saying that what you and maja propose won't work btw.)

okay, let's discuss today if you both ( @dustinvtran @mariru ) are around!