pwl opened this issue 7 years ago
thanks for the bug report. i'm having trouble reproducing it. i ran
import edward as ed
import tensorflow as tf

ed.get_session()
ed.set_seed(42)

# simulate N observations from Normal(5, 1)
N = 1000
data = ed.models.Normal(5.0, 1.0).sample_n(N).eval()

# model: Normal likelihood, Normal prior on the location, Gamma prior on the scale parameter
sigma2 = ed.models.Gamma([1.0], [1.0])
mu = ed.models.Normal([0.0], [1.0])
x = ed.models.Normal(tf.tile(mu, [N]), tf.tile(sigma2, [N]))

# variational approximations
qsigma2 = ed.models.Gamma([1.0], tf.nn.softplus(tf.Variable([1.0])))
qmu = ed.models.NormalWithSoftplusSigma(tf.Variable([0.5]), tf.Variable([2.3]))

inference = ed.KLqp({mu: qmu, sigma2: qsigma2}, data={x: data})
inference.run(n_iter=2000, n_print=200, n_samples=10)
the output is
Iteration 1 [ 0%]: Loss = 670219.625
Iteration 200 [ 10%]: Loss = 3772.485
Iteration 400 [ 20%]: Loss = 34084.867
Iteration 600 [ 30%]: Loss = 3329.652
Iteration 800 [ 40%]: Loss = 3764.631
Iteration 1000 [ 50%]: Loss = 240398.750
Iteration 1200 [ 60%]: Loss = 6391.427
Iteration 1400 [ 70%]: Loss = 4067.620
Iteration 1600 [ 80%]: Loss = 3449.564
Iteration 1800 [ 90%]: Loss = 4151.013
Iteration 2000 [100%]: Loss = 4345.420
i don't think there are any current examples of it in use unfortunately.
Huh, that's weird, I'm getting
Iteration 1 [ 0%]: Loss = 1.738
Iteration 200 [ 10%]: Loss = -0.000
Iteration 400 [ 20%]: Loss = -0.000
Iteration 600 [ 30%]: Loss = -0.000
Iteration 800 [ 40%]: Loss = -0.000
Iteration 1000 [ 50%]: Loss = -0.000
Iteration 1200 [ 60%]: Loss = -0.000
Iteration 1400 [ 70%]: Loss = -0.000
Iteration 1600 [ 80%]: Loss = -0.000
Iteration 1800 [ 90%]: Loss = -0.000
Iteration 2000 [100%]: Loss = -0.000
from the exact same code snippet that you posted. Maybe I'll try running it on a different machine. How can I check the version of edward and tensorflow to be sure we're using the same software? Which version are you using?
EDIT: Sorry for being a little slow on that, but I managed to get the versions of the modules:
edward 1.1.5
tensorflow 0.11.0
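(For anyone else looking for this: both packages expose __version__, so a plain Python check like the one below works; the printed values are just examples.)

import edward as ed
import tensorflow as tf

print(ed.__version__)  # e.g. 1.1.5
print(tf.__version__)  # e.g. 0.11.0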
This issue seems to be fixed in edward 1.1.6.
thanks for catching that. my prediction is that this fix in 1.1.6 (https://github.com/blei-lab/edward/pull/373) made it work.
It seems like the issue is still there. For example, using a Gamma as the prior for the sigma of a normal distribution results in wild variations of the loss function, including negative values. You can see the example here; the KLqp inference was unable to recover the parameters of the Gamma distribution.
Maybe I'm doing something wrong, or maybe this simply cannot work the way I expect it to.
KLqp uses a form of black box variational inference, leveraging the score function gradient. This is known to work poorly with Gamma distributions. The wild variations in the loss are a result of the high variance of that estimator. You can tone this down by using more than one sample, e.g., n_samples=10.
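For context (this is generic black box variational inference, not anything Edward-specific): the ELBO gradient is estimated with the score function (REINFORCE) identity as a Monte Carlo average, roughly

\nabla_\lambda \mathrm{ELBO} \approx \frac{1}{S} \sum_{s=1}^{S} \big(\log p(x, z_s) - \log q_\lambda(z_s)\big)\, \nabla_\lambda \log q_\lambda(z_s), \qquad z_s \sim q_\lambda(z).

Averaging over S = n_samples draws shrinks the variance of that estimate at roughly a 1/S rate, and for a Gamma q the per-sample variance tends to be large to begin with, which is why both the gradient and the reported loss bounce around so much.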
This is known to work poorly with Gamma distributions.
Interesting, do you have a reference on that? What other distributions cause similar problems? What would be the way around this? In my other notebooks I had some luck replacing Gamma with a very narrowly peaked Normal distribution but mathematically this doesn't make much sense.
As for your proposed solution, I tried changing n_samples to 10, 100, and 1000; only for n_samples=1000 did the KLqp inference show some signs of convergence. Even then, the result was still far from the original set of parameters. And this cannot be the right approach here anyway: the performance degrades significantly as the number of samples increases.
only through research unfortunately. this is what motivated the blei lab's work on rejection sampling variational inference. it tries to better handle black box gradients for dirichlets and gammas.
in practice i think you'd just have to use a different variational approximation (for example, a log normal distribution) or a different algorithm altogether. there are certain use cases where the algorithm linked above can be helpful (and it would be cool to experiment with! #379). but i don't imagine it'll be a replacement to solve all black box variational inference problems with gamma distributions.
Thanks for the references and the explanation. I switched the posterior to a LogNormal distribution (defined via TransformedDistribution) and it works much better now; a rough sketch of what that looks like is below. Should I close this issue now? From what you wrote it seems like fixing it won't be possible for KLqp inference.
Maybe it would be worth adding some information or a warning on the limitations of KLqp inference to the documentation?
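For reference, the approximation ended up looking roughly like the snippet below. Treat it as a sketch only: the exact TransformedDistribution constructor and the import path of the Exp bijector have moved around between TensorFlow/Edward releases, and the shapes here assume a scalar sigma2.

# a Normal in log-space pushed through exp() gives a log-Normal with support (0, inf)
qsigma2 = ed.models.TransformedDistribution(
    distribution=ed.models.Normal(
        tf.Variable([0.0]),                    # location in log-space
        tf.nn.softplus(tf.Variable([1.0]))),   # scale in log-space, kept positive
    bijector=tf.contrib.distributions.bijectors.Exp())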
that's an excellent suggestion. we should definitely add those warnings (keeping this issue open for that reason).
more generally, it's important to describe warnings on research limitations/soft constraints for each algorithm: expected runtime, what sort of models we expect it to work well on, and what sort of models it will fail miserably at.
I'm Hiroki Tanaka, an assistant researcher supervised by Prof. Taniguchi at Ritsumeikan University. We are trying to use Edward for robotics, and we are going to implement VI for GMM, HMM, and MCL.
We have a problem with inference of the precision of a one-dimensional Gaussian distribution. Is it possible to solve this problem? Do you have any further information about it? We are trying to use a logNormal instead of a Gamma based on your suggestions.
What should we do for inference of the parameters of a multidimensional Gaussian distribution?
Thanks for asking, Hiroki. Can you raise an issue on GitHub and provide a snippet detailing some of the difficulties you're having?
In general, I would recommend a fully factorized log-Normal distribution, at least to get started.
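(A sketch of what "fully factorized" could look like for a vector of D precisions, with the same caveats about version-dependent APIs as the log-Normal sketch above; D here is a made-up dimensionality.)

D = 3  # hypothetical dimensionality
loc = tf.Variable(tf.zeros([D]))
scale = tf.nn.softplus(tf.Variable(tf.ones([D])))

# D independent log-Normal factors: a diagonal Normal in log-space, pushed through exp()
qprecision = ed.models.TransformedDistribution(
    distribution=ed.models.Normal(loc, scale),
    bijector=tf.contrib.distributions.bijectors.Exp())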
Using an Exponential(x) distribution should be equivalent to using Gamma(1, x), yet I'm getting different results when I use one or the other as the approximation of the posterior in KLqp. For example, running my test snippet with the Exponential approximation (and please excuse the crudeness of the test case; it makes little theoretical sense to use Exponential as the approximation here) gives me a loss trace that looks plausible: the algorithm converges modulo some oscillations. Changing the approximation to
qsigma2 = ed.models.Gamma([1.0], tf.nn.softplus(tf.Variable([1.0])))
results in weird values for the loss function.
Ultimately I was aiming at using an InverseGamma as a posterior approximation, but I am experiencing similar problems with it. Are there any working examples of using InverseGamma as an approximation that I could refer to, to verify that it actually works as intended?
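As a quick sanity check that the two densities really do coincide (nothing Edward-specific, just scipy, which parameterizes both distributions by scale = 1/rate):

import numpy as np
from scipy import stats

rate = 2.0
x = np.linspace(0.1, 5.0, 50)

# Gamma(concentration=1, rate) and Exponential(rate) share the density rate * exp(-rate * x)
gamma_logpdf = stats.gamma.logpdf(x, a=1.0, scale=1.0 / rate)
expon_logpdf = stats.expon.logpdf(x, scale=1.0 / rate)
print(np.allclose(gamma_logpdf, expon_logpdf))  # True

So any difference between the two as variational approximations presumably comes from how they are sampled and differentiated inside KLqp rather than from the densities themselves.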