dustinvtran opened this issue 7 years ago
Jogging my memory, this falls into what @matthewdhoffman and I call "stochastic graph optimization"—calling back to Theano's deterministic graph optimization. Namely, stochastic graph optimization boils down to defining a probabilistic program and performing inference.
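For example, a minimal sketch of that pattern using Edward's standard `Normal` and `KLqp` API (the specific model and data here are purely illustrative) might look like:

```python
import edward as ed
import tensorflow as tf
from edward.models import Normal

# Define a probabilistic program: a latent mean with Gaussian observations.
mu = Normal(loc=0.0, scale=1.0)
x = Normal(loc=mu, scale=1.0, sample_shape=10)

# Perform inference: fit an approximate posterior over mu given observed data.
qmu = Normal(loc=tf.Variable(0.0), scale=tf.nn.softplus(tf.Variable(1.0)))
x_obs = tf.zeros(10)  # stand-in observations for the sketch
ed.KLqp({mu: qmu}, data={x: x_obs}).run()
```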
What this means for PPLs with computational graphs is that any sort of intelligent graph optimization should be designed such that graph manipulations and inference algorithms are tightly connected. Would love to explore how we can do this in Edward.
Those two cases (compositionality and non-bijective functions) are where the challenges are going to be. Actually, I hadn't thought about how that would need to be part of the inference algorithm, but you're absolutely right, since there will be few such cases for which a closed-form likelihood exists. (It would be nice if there were some probabilistic equivalent of the chain rule, so we could have "auto-logprob" for our stochastic graph the way we have autodiff for the deterministic graph.)
However, I suspect that implementing the simpler cases, bijective functions and addition of RVs, will make the more complex cases easier to reason about.
**`__add__`**
The way I see it there are two broad options:
If each `RandomVariable` had a `.characteristic()` method, then we could just define a new random variable with the product of the characteristic functions, then use the bijection between the ChF and the distribution function to get the log probability (not super sure about sampling, though...).

I also really like this method because the ChF of `PointMass` (and by extension `tf.Tensor`) variables is actually well defined (just `exp(i*t*value)`). Perhaps thinking about our RVs in terms of ChFs could get us an inference-free stochastic graph...
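To make that concrete, here is a rough NumPy-only sketch; the `.characteristic()` method doesn't exist, and these standalone functions just stand in for what it might return:

```python
import numpy as np

# Hypothetical characteristic functions, each a callable t -> E[exp(i*t*X)].
def normal_chf(loc, scale):
  return lambda t: np.exp(1j * t * loc - 0.5 * (scale * t) ** 2)

def pointmass_chf(value):
  return lambda t: np.exp(1j * t * value)  # ChF of a constant / PointMass

def sum_chf(chf_x, chf_y):
  # For independent X and Y, the ChF of X + Y is the product of their ChFs.
  return lambda t: chf_x(t) * chf_y(t)

# ChF of Normal(0, 1) + PointMass(2.0), which equals the ChF of Normal(2, 1);
# inverting it (e.g. via the Gil-Pelaez formula) would recover the density.
chf_z = sum_chf(normal_chf(0.0, 1.0), pointmass_chf(2.0))
```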
Bijective transforms are 'easy', but I wonder how to implement them without overloading the ed namespace too much. The obvious answer is to define a bunch of functions as below, but where should the following hypothetical `ed.softplus` live?
```python
import tensorflow as tf
from edward.models import RandomVariable, TransformedDistribution
from tensorflow.contrib.distributions import bijectors

def softplus(x):
  # Wrap random variables in a TransformedDistribution; pass plain tensors through.
  if isinstance(x, RandomVariable):
    return TransformedDistribution(x, bijector=bijectors.Softplus())
  else:
    return tf.nn.softplus(x)
```
Actually, I feel compelled to leave another comment to express my enthusiasm about a characteristic function based approach. I'm going to get in touch with my old probability theory lecturer and do some reading.
For starters, it would be a great way to do affine transformations.
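Continuing the hypothetical `.characteristic()` sketch above, an affine transform `Y = a*X + b` is a one-liner, since phi_Y(t) = exp(i*t*b) * phi_X(a*t):

```python
import numpy as np

def affine_chf(chf_x, a, b):
  # ChF of Y = a*X + b, expressed in terms of the ChF of X.
  return lambda t: np.exp(1j * t * b) * chf_x(a * t)
```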
re: characteristic functions. It's a neat property for calculating the characteristic function of derived random variables, but I'm not sure it'll help us calculate `log_prob`? Maybe it's useful for recognizing that the sum of random variables is an existing random variable class?
re: bijective transforms. Instead of wrapping TF ops, I was thinking of a more functional approach in which you apply something like `ed.infer(y)` where `y = tf.nn.softplus(x)`, and `ed.infer` returns its distribution with kwarg `value=y`. `ed.infer` would traverse the graph, recognizing that `y.op.type == 'Softplus'` and that `y.op.inputs[0]` is a normal random variable.
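A rough sketch of that traversal; `infer` and the tensor-to-RV registry below are hypothetical, not part of Edward:

```python
import tensorflow as tf
from edward.models import Normal, TransformedDistribution
from tensorflow.contrib.distributions import bijectors

rv_registry = {}  # maps a tensor in the graph to the RandomVariable behind it

def infer(y):
  # Walk back through the op that produced y and build its distribution.
  if y.op.type == 'Softplus':
    parent = rv_registry[y.op.inputs[0]]  # e.g. a Normal random variable
    return TransformedDistribution(parent,
                                   bijector=bijectors.Softplus(),
                                   value=y)
  raise NotImplementedError("op type %s is not handled" % y.op.type)

x = Normal(0.0, 1.0)
rv_registry[tf.convert_to_tensor(x)] = x
y = tf.nn.softplus(x)
y_rv = infer(y)  # TransformedDistribution whose value is y
```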
Let me do some more reading and get back to you re: characteristic functions.
My reason for leaning towards wrapping ops is so that the user has an API that feels like writing a regular TensorFlow graph, but that embeds distributions as a property of the graph. So instead of having to explicitly traverse the graph at runtime to find out the distribution of `y`, it's just there for a user or (more importantly) an inference algorithm to make use of.

In particular, it makes for a more natural API if you want to plug observed data in for `y` in an inference algorithm.
```python
import edward as ed
from edward.models import Normal

x = Normal(0.0, 1.0)
y = ed.softplus(x)  # the hypothetical wrapper from above
data = {y: y_obs}   # y_obs: observed values for y
ed.MAP([x], data).run()
```
Even in the current format using `TransformedDistribution`, TensorFlow complains at runtime, since `x` is not a `Distribution` in the graph, but its tensor representation.
Am I right to suggest that we could also implement a similar feature for comparisons in many cases? For example, comparing an RV to a constant (e.g. `X < a`) could return a `Bernoulli` with log-probs set appropriately from the CDF, and a double comparison (`a < X < b`) a `Categorical`?
You're correct. Programmatically, it's implemented with operators such as `__gt__`.
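As a hedged sketch of that idea (the helper below is hypothetical, not an existing operator overload), comparing a continuous RV with a constant could return a `Bernoulli` whose probability comes from the CDF:

```python
from edward.models import Bernoulli, Normal

def rv_less_than(X, a):
  # P(X < a) = CDF_X(a), so the event {X < a} is Bernoulli(p = CDF_X(a)).
  return Bernoulli(probs=X.cdf(a))

X = Normal(0.0, 1.0)
event = rv_less_than(X, 1.0)  # Bernoulli with p = Phi(1.0), roughly 0.84
```

A double comparison `a < X < b` could similarly return a `Categorical` over the three bins, with probabilities taken from CDF differences.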
From @cshenton on https://github.com/blei-lab/edward/issues/759#issuecomment-333026192:
Great minds think alike. :) When we wrote the ICLR paper, I had this ultimate desire to automate inferring random variable classes given operations: any differentiable 1-1 function on continuous random variables returns a `TransformedDistribution`; `Poisson(1.0) + Poisson(3.0)` returns a `Poisson(4.0)` random variable; `Normal(0.0, 1.0) * Normal(0.0, 1.0)` returns `Chi2(1)`; etc. This would be a powerful showcase of computational graphs.

This is hard in the general case as you'd need extensive symbolic algebra. However, a first step could be hard-coding the fundamental operations such as `__add__`.

In practice, this is also cool because the API doesn't drastically change. You can think of the current returned `tf.Tensor` as being augmented with random variable methods. Fetching the output would still return the same samples, sampling would be just as fast, and importantly it would have new methods such as `log_prob` and `mean`.
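As a sketch of what hard-coding one such rule might look like (the helper name is hypothetical, and it assumes the two Poissons are independent):

```python
import tensorflow as tf
from edward.models import Poisson

def add_random_variables(a, b):
  # Closed-form rule: the sum of independent Poissons is Poisson(rate_a + rate_b).
  if isinstance(a, Poisson) and isinstance(b, Poisson):
    return Poisson(rate=a.rate + b.rate)
  return tf.add(a, b)  # otherwise fall back to the plain tensor op

z = add_random_variables(Poisson(1.0), Poisson(3.0))  # behaves like Poisson(4.0)
```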