dustinvtran opened this issue 7 years ago
Jogging my memory, this falls into what @matthewdhoffman and I call "stochastic graph optimization"—calling back to Theano's deterministic graph optimization. Namely, stochastic graph optimization boils down to defining a probabilistic program and performing inference.
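For example, a minimal sketch of that pattern using Edward's standard `Normal` and `KLqp` API (the specific model and data here are purely illustrative) might look like:

```python
import edward as ed
import tensorflow as tf
from edward.models import Normal

# Define a probabilistic program: a latent mean with Gaussian observations.
mu = Normal(loc=0.0, scale=1.0)
x = Normal(loc=mu, scale=1.0, sample_shape=10)

# Perform inference: fit an approximate posterior over mu given observed data.
qmu = Normal(loc=tf.Variable(0.0), scale=tf.nn.softplus(tf.Variable(1.0)))
x_obs = tf.zeros(10)  # stand-in observations for the sketch
ed.KLqp({mu: qmu}, data={x: x_obs}).run()
```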
What this means for PPLs with computational graphs is that any sort of intelligent graph optimization should be designed such that graph manipulations and inference algorithms are tightly connected. Would love to explore how we can do this in Edward.
Those two cases (compositionality and non-bijective functions) are where the challenges are going to be. Actually, I hadn't thought about how that would need to be part of the inference algorithm, but you're absolutely right, since there will be few such cases for which a closed-form likelihood exists. (It would be nice if there were some probabilistic equivalent of the chain rule, so we could have "auto-logprob" for our stochastic graph the way we have autodiff for the deterministic graph.)
However, I suspect that implementing the simpler cases, bijective functions and addition of RVs, will make the more complex cases easier to reason about.
**`__add__`**
The way I see it there are two broad options:
If each `RandomVariable` had a `.characteristic()` method, then we could just define a new random variable with the product of the characteristic functions, then use the bijection between the ChF and the distribution function to get the log probability (not super sure about sampling, though...).

I also really like this method because the ChF of `PointMass` (and by extension `tf.Tensor`) variables is actually well defined (just `exp(i*t*value)`). Perhaps thinking about our RVs in terms of ChFs could get us an inference-free stochastic graph...
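To make that concrete, here is a rough NumPy-only sketch; the `.characteristic()` method doesn't exist, and these standalone functions just stand in for what it might return:

```python
import numpy as np

# Hypothetical characteristic functions, each a callable t -> E[exp(i*t*X)].
def normal_chf(loc, scale):
  return lambda t: np.exp(1j * t * loc - 0.5 * (scale * t) ** 2)

def pointmass_chf(value):
  return lambda t: np.exp(1j * t * value)  # ChF of a constant / PointMass

def sum_chf(chf_x, chf_y):
  # For independent X and Y, the ChF of X + Y is the product of their ChFs.
  return lambda t: chf_x(t) * chf_y(t)

# ChF of Normal(0, 1) + PointMass(2.0), which equals the ChF of Normal(2, 1);
# inverting it (e.g. via the Gil-Pelaez formula) would recover the density.
chf_z = sum_chf(normal_chf(0.0, 1.0), pointmass_chf(2.0))
```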
Bijective transforms are 'easy', but I wonder how to implement them without overloading the ed namespace too much. The obvious answer is to define a bunch of functions as below, but where should the following hypothetical `ed.softplus` live?
```python
import tensorflow as tf
from edward.models import RandomVariable, TransformedDistribution
from tensorflow.contrib.distributions import bijectors

def softplus(x):
  # Wrap random variables in a TransformedDistribution; pass plain tensors through.
  if isinstance(x, RandomVariable):
    return TransformedDistribution(x, bijector=bijectors.Softplus())
  else:
    return tf.nn.softplus(x)
```
Actually, I feel compelled to leave another comment to express my enthusiasm about a characteristic function based approach. I'm going to get in touch with my old probability theory lecturer and do some reading.
For starters, it would be a great way to do affine transformations.
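Continuing the hypothetical `.characteristic()` sketch above, an affine transform `Y = a*X + b` is a one-liner, since phi_Y(t) = exp(i*t*b) * phi_X(a*t):

```python
import numpy as np

def affine_chf(chf_x, a, b):
  # ChF of Y = a*X + b, expressed in terms of the ChF of X.
  return lambda t: np.exp(1j * t * b) * chf_x(a * t)
```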
re: characteristic functions. It's a neat property for calculating the characteristic function of derived random variables, but I'm not sure it'll help us calculate `log_prob`? Maybe it's useful for recognizing that the sum of random variables is an existing random variable class?
re: bijective transforms. Instead of wrapping TF ops, I was thinking of a more functional approach in which you apply something like `ed.infer(y)` where `y = tf.nn.softplus(x)`, and `ed.infer` returns its distribution with kwarg `value=y`. `ed.infer` would traverse the graph, recognizing that `y.op.type == 'Softplus'` and that `y.op.inputs[0]` is a normal random variable.
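A rough sketch of that traversal; `infer` and the tensor-to-RV registry below are hypothetical, not part of Edward:

```python
import tensorflow as tf
from edward.models import Normal, TransformedDistribution
from tensorflow.contrib.distributions import bijectors

rv_registry = {}  # maps a tensor in the graph to the RandomVariable behind it

def infer(y):
  # Walk back through the op that produced y and build its distribution.
  if y.op.type == 'Softplus':
    parent = rv_registry[y.op.inputs[0]]  # e.g. a Normal random variable
    return TransformedDistribution(parent,
                                   bijector=bijectors.Softplus(),
                                   value=y)
  raise NotImplementedError("op type %s is not handled" % y.op.type)

x = Normal(0.0, 1.0)
rv_registry[tf.convert_to_tensor(x)] = x
y = tf.nn.softplus(x)
y_rv = infer(y)  # TransformedDistribution whose value is y
```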
Let me do some more reading and get back to you re: characteristic functions.
My reason for leaning towards wrapping ops is so that the user has an API that feels like writing a regular TensorFlow graph, but that embeds distributions as a property of the graph. So instead of having to explicitly traverse the graph at runtime to find out the distribution of `y`, it's just there for a user or (more importantly) an inference algorithm to make use of.

In particular, it makes for a more natural API if you want to plug observed data in for `y` in an inference algorithm.
```python
import edward as ed
from edward.models import Normal

x = Normal(0.0, 1.0)
y = ed.softplus(x)  # the hypothetical wrapper from above
data = {y: y_obs}   # y_obs: observed values for y
ed.MAP([x], data).run()
```
Even in the current format using `TransformedDistribution`, TensorFlow complains at runtime, since `x` is not a `Distribution` in the graph, but its tensor representation.
Am I right to suggest that we could also implement a similar feature for comparisons in many cases? For example, comparing an RV to a constant (e.g. `X < a`) could return a `Bernoulli` with log-probs set appropriately from the CDF, and a double comparison (`a < X < b`) a `Categorical`?
You're correct. Programmatically, it's implemented with operators such as `__gt__`.
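As a hedged sketch of that idea (the helper below is hypothetical, not an existing operator overload), comparing a continuous RV with a constant could return a `Bernoulli` whose probability comes from the CDF:

```python
from edward.models import Bernoulli, Normal

def rv_less_than(X, a):
  # P(X < a) = CDF_X(a), so the event {X < a} is Bernoulli(p = CDF_X(a)).
  return Bernoulli(probs=X.cdf(a))

X = Normal(0.0, 1.0)
event = rv_less_than(X, 1.0)  # Bernoulli with p = Phi(1.0), roughly 0.84
```

A double comparison `a < X < b` could similarly return a `Categorical` over the three bins, with probabilities taken from CDF differences.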
From @cshenton on https://github.com/blei-lab/edward/issues/759#issuecomment-333026192:
Great minds think alike. :) When we wrote the ICLR paper, I had this ultimate desire to automate inferring random variable classes given operations: any differentiable 1-1 function on continuous random variables returns a `TransformedDistribution`; `Poisson(1.0) + Poisson(3.0)` returns a `Poisson(4.0)` random variable; `Normal(0.0, 1.0) * Normal(0.0, 1.0)` returns `Chi2(1)`; etc. This would be a powerful showcase of computational graphs.

This is hard in the general case as you'd need extensive symbolic algebra. However, a first step could be hard-coding the fundamental operations such as `__add__`.

In practice, this is also cool because the API doesn't drastically change. You can think of the current returned `tf.Tensor` as being augmented with random variable methods. Fetching the output would still return the same samples, sampling would be just as fast, and importantly it would have new methods such as `log_prob` and `mean`.
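As a sketch of what hard-coding one such rule might look like (the helper name is hypothetical, and it assumes the two Poissons are independent):

```python
import tensorflow as tf
from edward.models import Poisson

def add_random_variables(a, b):
  # Closed-form rule: the sum of independent Poissons is Poisson(rate_a + rate_b).
  if isinstance(a, Poisson) and isinstance(b, Poisson):
    return Poisson(rate=a.rate + b.rate)
  return tf.add(a, b)  # otherwise fall back to the plain tensor op

z = add_random_variables(Poisson(1.0), Poisson(3.0))  # behaves like Poisson(4.0)
```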