blei-lab / edward

A probabilistic programming language in TensorFlow. Deep generative models, variational inference.
http://edwardlib.org

Using Gamma distribution to approximate a posterior #389

Open pwl opened 7 years ago

pwl commented 7 years ago

Using an Exponential(x) distribution should be equivalent to using Gamma(1, x), yet I'm getting different results when I use one or the other as the approximation to the posterior in KLqp. For example, running the following code snippet (and please excuse the crudeness of the test case; it makes little theoretical sense to use an Exponential as the approximation here)

import edward as ed
import tensorflow as tf

ed.get_session()

# simulated data
N = 1000
data = ed.models.Normal(5.0, 1.0).sample_n(N).eval()

# model: Gamma prior on sigma2, Normal prior on mu
sigma2 = ed.models.Gamma([1.0], [1.0])
mu = ed.models.Normal([0.0], [1.0])
x = ed.models.Normal(tf.tile(mu, [N]), tf.tile(sigma2, [N]))

# variational approximations
qsigma2 = ed.models.Exponential(tf.nn.softplus(tf.Variable([1.0])))
qmu = ed.models.NormalWithSoftplusSigma(tf.Variable([0.5]), tf.Variable([2.3]))

inference = ed.KLqp({mu: qmu, sigma2: qsigma2}, data={x: data})
inference.run(n_iter=2000, n_print=200, n_samples=10)

gives me

Iteration    1 [  0%]: Loss = 1256261.000
Iteration  200 [ 10%]: Loss = 3044.943
Iteration  400 [ 20%]: Loss = 4020.647
Iteration  600 [ 30%]: Loss = 3809.843
Iteration  800 [ 40%]: Loss = 3955.718
Iteration 1000 [ 50%]: Loss = 3413.860
Iteration 1200 [ 60%]: Loss = 4242.256
Iteration 1400 [ 70%]: Loss = 3493.551
Iteration 1600 [ 80%]: Loss = 4002.923
Iteration 1800 [ 90%]: Loss = 4321.123
Iteration 2000 [100%]: Loss = 4139.286

which looks plausible (the algorithm converges, modulo some oscillations). Changing the approximation to qsigma2 = ed.models.Gamma([1.0], tf.nn.softplus(tf.Variable([1.0]))) results in weird values for the loss function

Iteration    1 [  0%]: Loss = 0.864
Iteration  200 [ 10%]: Loss = 0.000
Iteration  400 [ 20%]: Loss = -0.000
Iteration  600 [ 30%]: Loss = -0.000
Iteration  800 [ 40%]: Loss = -0.000
Iteration 1000 [ 50%]: Loss = -0.000
Iteration 1200 [ 60%]: Loss = -0.000
Iteration 1400 [ 70%]: Loss = -0.000
Iteration 1600 [ 80%]: Loss = -0.000
Iteration 1800 [ 90%]: Loss = -0.000
Iteration 2000 [100%]: Loss = -0.000

Ultimately I was aiming to use an InverseGamma as the posterior approximation, but I am experiencing similar problems with it. Are there any working examples of using InverseGamma as an approximation that I could refer to, in order to verify that it actually works as intended?
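
For reference, here is a quick numerical check in plain NumPy/SciPy (outside of Edward, with an arbitrary rate of 1.5) that Exponential(rate) and Gamma(shape=1, rate) really do assign the same density:

import numpy as np
from scipy.special import gammaln

rate, z = 1.5, 2.0

# log-density of Exponential(rate) at z
log_expon = np.log(rate) - rate * z
# log-density of Gamma(shape=1, rate) at z
log_gamma = 1.0 * np.log(rate) - gammaln(1.0) + (1.0 - 1.0) * np.log(z) - rate * z

print(log_expon, log_gamma)  # identical values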

dustinvtran commented 7 years ago

thanks for the bug report. i'm having trouble reproducing it. i ran

import edward as ed
import tensorflow as tf

ed.get_session()
ed.set_seed(42)

N = 1000
data = ed.models.Normal(5.0, 1.0).sample_n(N).eval()

sigma2 = ed.models.Gamma([1.0], [1.0])
mu = ed.models.Normal([0.0], [1.0])
x = ed.models.Normal(tf.tile(mu, [N]), tf.tile(sigma2, [N]))

# same model as above, but with a Gamma variational approximation for sigma2
qsigma2 = ed.models.Gamma([1.0], tf.nn.softplus(tf.Variable([1.0])))
qmu = ed.models.NormalWithSoftplusSigma(tf.Variable([0.5]), tf.Variable([2.3]))

inference = ed.KLqp({mu: qmu, sigma2: qsigma2}, data={x: data})
inference.run(n_iter=2000, n_print=200, n_samples=10)

the output is

Iteration    1 [  0%]: Loss = 670219.625
Iteration  200 [ 10%]: Loss = 3772.485
Iteration  400 [ 20%]: Loss = 34084.867
Iteration  600 [ 30%]: Loss = 3329.652
Iteration  800 [ 40%]: Loss = 3764.631
Iteration 1000 [ 50%]: Loss = 240398.750
Iteration 1200 [ 60%]: Loss = 6391.427
Iteration 1400 [ 70%]: Loss = 4067.620
Iteration 1600 [ 80%]: Loss = 3449.564
Iteration 1800 [ 90%]: Loss = 4151.013
Iteration 2000 [100%]: Loss = 4345.420

i don't think there are any current examples of it in use unfortunately.

pwl commented 7 years ago

Huh, that's weird, I'm getting

Iteration    1 [  0%]: Loss = 1.738
Iteration  200 [ 10%]: Loss = -0.000
Iteration  400 [ 20%]: Loss = -0.000
Iteration  600 [ 30%]: Loss = -0.000
Iteration  800 [ 40%]: Loss = -0.000
Iteration 1000 [ 50%]: Loss = -0.000
Iteration 1200 [ 60%]: Loss = -0.000
Iteration 1400 [ 70%]: Loss = -0.000
Iteration 1600 [ 80%]: Loss = -0.000
Iteration 1800 [ 90%]: Loss = -0.000
Iteration 2000 [100%]: Loss = -0.000

from the exact same code snippet that you posted. Maybe I'll try running it on a different machine. How can I check the versions of edward and tensorflow, to be sure we're using the same software? Which versions are you using?

EDIT: Sorry for being a little slow on that, but I managed to get the versions of the modules:
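
I checked them roughly like this (assuming the standard __version__ attributes on both packages):

import edward as ed
import tensorflow as tf

print("edward", ed.__version__)
print("tensorflow", tf.__version__)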

pwl commented 7 years ago

This issue seems to be fixed in edward 1.1.6.

dustinvtran commented 7 years ago

thanks for catching that. my guess is that this fix in 1.1.6 (https://github.com/blei-lab/edward/pull/373) is what made it work.

pwl commented 7 years ago

It seems like the issue is still there: for example, using a Gamma as a prior for the sigma of a normal distribution results in wild variations of the loss function, including negative values. You can see the example here; the KLqp inference was unable to recover the parameters of the Gamma distribution.

Maybe I'm doing something wrong, or maybe this cannot work the way I expect it to.

dustinvtran commented 7 years ago

KLqp uses a form of black box variational inference that leverages the score function gradient. This estimator is known to have difficulty with Gamma distributions. The wild variations in the loss are a result of the high variance of the gradient estimate. You can tone this down by using more than one sample, e.g., n_samples=10.
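
To see why more samples help, here is a toy sketch in plain NumPy (not Edward's internals) of the score-function estimator for the gradient with respect to a Gamma rate parameter; the objective f(z) below is a made-up stand-in for log p(x, z) - log q(z), and the point is only that the spread of the estimates shrinks as n_samples grows:

import numpy as np

alpha, beta = 1.0, 2.0              # variational Gamma(shape=alpha, rate=beta)
f = lambda z: -z + np.log(z)        # made-up stand-in for log p(x, z) - log q(z)
score = lambda z: alpha / beta - z  # d/d(beta) of log Gamma(z; alpha, rate=beta)

rng = np.random.RandomState(0)
for n_samples in [1, 10, 1000]:
    estimates = []
    for _ in range(500):
        z = rng.gamma(alpha, 1.0 / beta, size=n_samples)  # numpy uses shape, scale = 1/rate
        estimates.append(np.mean(f(z) * score(z)))        # score-function gradient estimate
    print(n_samples, np.std(estimates))                   # spread shrinks as n_samples grows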

pwl commented 7 years ago

This estimator is known to have difficulty with Gamma distributions.

Interesting, do you have a reference on that? What other distributions cause similar problems? What would be a way around this? In my other notebooks I had some luck replacing the Gamma with a very narrowly peaked Normal distribution, but mathematically this doesn't make much sense.

As for your proposed solution, I tried changing n_samples to 10, 100, and 1000; only for n_samples=1000 did the KLqp inference show some signs of convergence. Even then, the result was still far away from the original set of parameters. But this cannot be the right approach here, since performance degrades significantly as the number of samples increases.

dustinvtran commented 7 years ago

only through research unfortunately. this is what motivated the blei lab's work on rejection sampling variational inference. it tries to better handle black box gradients for dirichlets and gammas.

in practice i think you'd just have to use a different variational approximation (for example, a log-normal distribution) or a different algorithm altogether. there are certain use cases where the algorithm linked above can be helpful (and it would be cool to experiment with! #379), but i don't imagine it'll be a replacement that solves all black box variational inference problems with gamma distributions.

pwl commented 7 years ago

Thanks for the references and the explanation. I switched the posterior to a LogNormal distribution (defined via TransformedDistribution) and it works much better now. Should I close this issue now? From what you wrote, it seems like fixing it won't be possible for KLqp inference.
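
For anyone finding this later, the approximation I ended up with looks roughly like the sketch below (mu, sigma2, qmu, x, and data are the variables from the snippet earlier in the thread; the TransformedDistribution(distribution=..., bijector=...) constructor and the bijectors import path follow the TF 1.x-style API and may differ between TensorFlow versions):

import edward as ed
import tensorflow as tf
from tensorflow.contrib.distributions import bijectors

# log-normal approximation for sigma2: a Normal base distribution pushed
# through an Exp bijector, so samples are strictly positive
qsigma2 = ed.models.TransformedDistribution(
    distribution=ed.models.NormalWithSoftplusSigma(
        tf.Variable([0.0]), tf.Variable([1.0])),
    bijector=bijectors.Exp())

inference = ed.KLqp({mu: qmu, sigma2: qsigma2}, data={x: data})
inference.run(n_iter=2000, n_print=200, n_samples=10)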

Maybe it would be worth adding some information or a warning on the limitations of KLqp inference to the documentation?

dustinvtran commented 7 years ago

that's an excellent suggestion. we should definitely add those warnings (keeping this issue open for that reason).

more generally, it's important to describe warnings on research limitations and soft constraints for each algorithm: expected runtime, what sort of models we expect it to work well on, and what sort of models it will fail miserably on.

tanaka-hiroki1989 commented 6 years ago

I'm Hiroki Tanaka, an assistant researcher supervised by Prof. Taniguchi at Ritsumeikan University. We are trying to use Edward for robotics, and we are going to implement VI for GMM, HMM, and MCL.

We have a problem with inference of the precision of a one-dimensional Gaussian distribution. Is it possible to solve this problem? Do you have any further information about it? We are trying to use a logNormal instead of a Gamma, based on your suggestions.

What should we do for inference of the parameters of a multidimensional Gaussian distribution?

dustinvtran commented 6 years ago

Thanks for asking Hiroki. Can you raise an issue on Github and provide a snippet detailing some difficulties you're having?

In general, I would recommend a fully factorized log-Normal distribution, at least to get started.
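
For the diagonal of a D-dimensional precision, for example, a fully factorized log-Normal could be sketched roughly as below (assumed TF 1.x-style TransformedDistribution and bijector APIs, which vary across versions; D and qprecision are illustrative names):

import edward as ed
import tensorflow as tf
from tensorflow.contrib.distributions import bijectors

D = 3  # dimensionality of the positive parameter vector, e.g. a precision diagonal

# independent Normals pushed through an elementwise Exp bijector;
# independence across the D components is what makes it fully factorized
qprecision = ed.models.TransformedDistribution(
    distribution=ed.models.NormalWithSoftplusSigma(
        tf.Variable(tf.zeros(D)), tf.Variable(tf.ones(D))),
    bijector=bijectors.Exp())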