Standardising APIs between compiler and inference methods

xukai92 commented 5 years ago

See https://github.com/TuringLang/Turing.jl/issues/634#issuecomment-471339521 for a summary.

yebai commented 5 years ago

@willtebbutt @xukai92 how about we always drop all observe statements in gdemo()(), ie, we always sample all variables from the prior, even for the following case raised by @willtebbutt (slightly modified)

@model demo(y) = begin
   5 ~ Normal(0,1)   # rv1
   y ~ Normal(0,1) . # rv2
end

gdemo()()  # sample both `rv1` and `rv2` from the prior, ingoring `rv1=5`
gdemo(2)() # sample from the posterior

This leads to consistent behaviour and should be intuitive too.

xukai92 commented 5 years ago

I see. But what I mean by this issue is to actually have some concrete functions generated by evaluating gdemo() (I think @willtebbutt also proposed this some time ago).

Currently behaviour:

@model demo(y) = begin
   5 ~ Normal(0,1)   # rv1
   y ~ Normal(0,1) . # rv2
end
mf = gdemo()
mf() # => sampling from prior

But instead we could have

@model demo(y) = begin
   5 ~ Normal(0,1)   # rv1
   y ~ Normal(0,1) . # rv2
end
mf = gdemo(???) # this returns mf::UniqueRuntimeModelType
                # and create some function definitions:
logpdf(::UniqueRuntimeModelType, vi) = ... # (1) return logpdf by evaluating using vi
rand(::UniqueRuntimeModelType) = ... # (2) gdemo()() in your example
sample(::UniqueRuntimeModelType) = ... # (3) gdemo(2)() in your example

I guess it could also potentially give more space for performance because during sample we are explicitly using the specific version of logpdf and rand.

willtebbutt commented 5 years ago

@yebai your proposal has the issue that the behaviour of the following programme becomes ambiguous:

@model demo() = begin
    5 ~ Normal(0, 1)
end
demo() # what does this do?

This makes me nervous. On the one hand, this programme is weird, on the other it's a valid Turing programme, so we should ensure that the behaviour is well-defined. My proposal below will mostly resolve this.

@xukai92 I completely agree regarding the format of logpdf. Regarding rand and sample: certainly we could define one to sample from the prior and the other the posterior, but this wouldn't be particularly intuitive (there's nothing in the names to suggest the proposed behaviour). Instead, I propose the following:

Don't define sample, just use rand and logpdf (or maybe logp, since logpdf isn't defined for all models supported by Turing?).
The current proposal regarding gdemo(missing) to indicate that the prior should be used for y seems sensible to me.
We implement gdemo(Prior()) to indicate that all random variables should be sampled from the prior. In this case, both rv1 and rv2 would be sample from the prior.
rand(mf) produces a VarInfo. rand(mf, N) produces a Vector{<:VarInfo} of length N.
logpdf(mf, vi) works as proposed by @xukai92.

mohamed82008 commented 5 years ago

One the one hand, this programme is weird, on the other it's a valid Turing programme, so we should ensure that the behaviour is well-defined.

We can give an error if the model has no variables.

UniqueRuntimeModelType

This can be just const PriorModel{pvars, F} = Model{pvars, Tuple{}, F, NamedTuple{(), Tuple{}}} which means no data, all parameters. For this case, no special sampling alg is required. More generally however, we can make use of the existing Sampler type, which we can add model (and vi) to. Then we can simply do: rand(Sampler(model, alg)) or for models without parameters, it can be rand(Sampler(model)).

Btw, after the new refactor, if you call gdemo(), it will give you a PriorModel{pvars, F}, with pvars as Tuple{:y}.

willtebbutt commented 5 years ago

We can give an error if the model has no variables.

Agreed, but this programme definitely has a random variable. It's anonymous, but it's definitely there, and if we define sample-from-prior functionality, then presumably something should happen.

Btw, after the new refactor, if you call gdemo(), it will give you a PriorModel{pvars, F}, with pvars as Tuple{:y}.

This is probably fine for now, but I'm not sure it's appropriate long-term for the same reasons as above.

This can be just const PriorModel{pvars, F} = Model{pvars, Tuple{}, F, NamedTuple{(), Tuple{}}} which means no data, all parameters. For this case, no special sampling alg is required. More generally however, we can make use of the existing Sampler type, which we can add model (and vi) to. Then we can simply do: rand(Sampler(model, alg)) or for models without parameters, it can be rand(Sampler(model)).

I like this. We would have to think carefully about how the rand(Sampler(model, alg)) would work for MCMC though. We should also probably implement a few convenience functions as well i.e. rand(model) = rand(Sampler(model)) and similar.

mohamed82008 commented 5 years ago

It's anonymous, but it's definitely there, and if we define sample-from-prior functionality, then presumably something should happen.

Is there a practical application of sampling this anonymous random variable, or is it just for theoretical appeal? Note that implementation-wise, there is nothing random about a model with no explicit random variables; every time you run the model, you will get exactly the same thing, hence my difficulty digesting this anonymous random variable concept. I can see how 5 plays the role of data. So perhaps more generally this falls under likelihood models, where the only sensible thing to do with it, that I can think of, is to calculate the likelihood.

We should also probably implement a few convenience functions as well i.e. rand(model) = rand(Sampler(model)) and similar.

Sounds good.

mohamed82008 commented 5 years ago

Note that currently, we just ignore any numbers or arrays on the LHS of a ~. If you want, we can assign an anonymous dvar to it instead and assign the value on the LHS of the ~ to this anonymous dvar. This would probably be more theoretically accurate. And it allows us to view the model with 5 ~ Normal(0, 1) only as a LikelihoodModel with 1 dvar.

Edit: by ignore I mean that we don't treat it as an explicit variable, but it still affects the logp calculation.

willtebbutt commented 5 years ago

Is there a practical application of sampling this anonymous random variable, or is it just for theoretical appeal?

I'm just looking for consistency. I don't have a particular application in mind :)

Note that implementation-wise, there is nothing random about a model with no explicit random variables; every time you run the model, you will get exactly the same thing.

Not if you run the model in sample-from-prior mode.

If you want, we can assign an anonymous dvar to it instead and assign the value on the LHS of the ~ to this anonymous dvar. This would probably be more theoretically accurate. And it allows us to view the model with 5 ~ Normal(0, 1) only as a LikelihoodModel with 1 dvar.

This general kind of thing would be a good change. If the model is constructed in the regular way, it should be a dvar as you suggest, but if the model is run in sample-from-prior mode, then it should presumably magically become a dvar.

mohamed82008 commented 5 years ago

Not if you run the model in sample-from-prior mode.

@model demo() = begin
   5 ~ Normal(0,1)
end
model = demo()
logp = model(Turing.VarInfo(), Turing.SampleFromPrior())
all(isequal(logp), (model(Turing.VarInfo(), Turing.SampleFromPrior()) for i in 1:100))

vi is empty in every case, with a logp only.

mohamed82008 commented 5 years ago

then it should presumably magically become a dvar.

This should be totally possible when constructing the Sampler.

willtebbutt commented 5 years ago

Sorry, I should have been clearer. I was referring to what I think should be going on rather than what actually happens at the minute. You're totally correct regarding the current behaviour. It's my feeling that if we implement a sampler-from-prior mode, then we should take these anonymous random variables more seriously and treat them more like named random-variables.

mohamed82008 commented 5 years ago

then we should take these anonymous random variables more seriously and treat them more like named random-variables.

I don't foresee a problem handling these. I can try a proof-of-concept after the VarInfo refactor. Perhaps @trappmartin can also help with this after he reads the new compiler code.

trappmartin commented 5 years ago

Thanks for cc‘ing me. I’m pretty much in favour of the vi = rand(Sampler(model)) and the logpdf(model, vi) concept. Though, I believe it would be good if we keep the sample function. It’s semantically more clear to me what this does and that I can expect it to generate samples using a MCMC sampler. I would not expect it to draw from the prior as rand.

However, I’m not sure how inference with VI would fit into those syntaxes to stay consistent without having too many entry points for the user.

willtebbutt commented 5 years ago

However, I’m not sure how inference with VI would fit into those syntaxes to stay consistent without having too many entry points for the user.

My thinking here is that VI works fine with the rand(sampler(model, alg), N) design, where the alg parameter would be a representation of the approximate posterior distribution. Semantically this reads as: draw N samples from the posterior distribution over the random variables in model (each of which should be a VarInfo?) using alg. This works for both VI and MCMC, so provides a single point of entry for both kinds of samplers.

Though, I believe it would be good if we keep the sample function. It’s semantically more clear to me what this does and that I can expect it to generate samples using a MCMC sampler.

Again, my feeling here is that we wouldn't need the sample function, as the creation of the sampler, e.g. rand(sampler(model, HMC(...)), N), makes it quite clear that we're drawing N samples from model using the specified HMC algorithm. Although I could live with the sample function being an alias for rand, I suspect that rand is more consistent with existing stuff.

I’m pretty much in favour of the vi = rand(Sampler(model)) and the logpdf(model, vi) concept.

As am I :)

A minor stylistic thing, @yebai what would your thoughts be on making the "correct" thing to do be sampler(model, alg) rather than Sampler(model, alg)? (At least to me) the latter says make me an object of type Sampler whereas the former says make me a thing that can generate samples from model using alg, which may or may not be an object of type Sampler. The distinction is important because the Sampler option commits us to always producing an object of type Sampler if we want to produce samples, whereas the former allows us to spit out whatever object we like provided that it can be used to generate samples via rand.

xukai92 commented 5 years ago

Reading through your explanation of sample v.s. Sampler, I feel sample is better, although in most of the cases (at this moment), sample will return a Sampler to us. Also I think using a function rather than a type conversion syntax is a more standard Julia style.

yebai commented 5 years ago

Here is a summary of some proposals in this issue (biased towards personal aesthetic)

logp(m::Model, v::VarInfo, s::Selector=undef) computes log density/mass for model m at value v::VarInfo
Sampler(m, alg) creates a sampler for model using alg as the sampling method
rand(s::Sampler(m, alg), N) produces a Vector{<:VarInfo} of length N.
- rand(s::Sampler(m, SampleFromPrior()), N) produces a Vector{<:VarInfo} of length N, by sampling from the prior distribution while ignoring all observations (including something like 5~Normal(0,1)).
- sample(m, alg; N=N, args...) provides an alias to rand(Sampler(model, alg), N, args...)
- v = rand(m::Model, N=1) is an alias to v = rand(Sampler(m, SampleFromPrior()), N)

This would provide a relatively well-defined interface between the compiler and inference methods (e.g. variational methods and sampling methods). For variational methods, the most important API is the logp function and its gradients. While for sampling methods, both the logp function and the rand function are necessary. This would allow us to clean up the code in many sampler's implementations. Such as examples below:

runmodel!(m::Model, v::VarInfo, spl::AbstractSampler) ==> logp(m::Model, v::VarInfo) https://github.com/TuringLang/Turing.jl/blob/6d48b96b03617f4410fd90becd7fb3a6c0d7e7e8/src/core/VarReplay.jl#L106
v = model(v::VarInfo, s::AbstractSampler) ==> v= rand(Sampler(m, alg), 1) https://github.com/TuringLang/Turing.jl/blob/bedbb5a10d3da794957d64677f0fcfa9542ac830/src/inference/hmc.jl#L133
Introduce an intermediate algorithm type: AbstractRunner
- type SampleFromPrior <: AbstractRunner end,
- similiarly, ComputeLogDensity, SampleFromUniform, ParticleFiltering
- define assume, observe functions for Runner types
- helpful to enable sharing code among (high-level) inference algorithms, e.g. HMC, Gibbs, VI. https://github.com/TuringLang/Turing.jl/blob/6d48b96b03617f4410fd90becd7fb3a6c0d7e7e8/src/Turing.jl#L65

A long term goal is to build up a clear separation between the compiler and inference methods (see https://github.com/TuringLang/Turing.jl/issues/456). Such that inference methods can be implemented in separate packages with a dependency on Turing.

Anything missing?

xukai92 commented 5 years ago

v = rand(m::Model, N=1) is an alias to v = rand(Sampler(m, SampleFromPrior()), N)

How about enabling this alias with support to sampling algorithm as well.

v = rand(m::Model, N=1; alg=SampleFromPrior()) = rand(Sampler(m, alg), N)

xukai92 commented 5 years ago

How about AbstractRunner -> AbstractTilde

willtebbutt commented 5 years ago

I don't know that we actually need a supertype here. I mean, is there going to be a fallback that can be applied to Abstract<Thing>? If not, we can just create concrete types as required and avoid the hassle of having to choose and name for an abstract type, and forcing things to inherit from it.

yebai commented 5 years ago

Let's stick with AbstractRunner type for now and change it to something else when we got better ideas.

mohamed82008 commented 5 years ago

I suggest during the model compilation we generate 2 functions inside Model:

One for the log likelihood which replaces all the a ~ dist lines with a ~ Fixed() where the assume of Fixed returns the variable value in vi and 0 for the logpdf. This is useful for maximum likelihood estimation.
The second is the logpdf of the joint probability distribution that we have now. This is to be used for MAP and inference.

trappmartin commented 5 years ago

@mohamed82008 I think this is a good idea. Thanks! I'll take a closer look into this asap.

trappmartin commented 5 years ago

@yebai I had some time to read through your proposal. It sound quite reasonable to me. However, I have a few questions / remarks.

What is the purpose of the Selector in the logp function? Do we need that?

In favour of what Mohamed suggested and the latest discussions on model criticism, it might be good to have:

logjoint(m::Model, v::VarInfo, s::Selector=undef) == logp
logpdf(m::Model, v::VarInfo, s::Selector=undef)
logppd(m::Model, v::Vector{VarInfo}, s::Selector=undef)

As proposed by Mohamed we could do this by generating respective functions in the model macro.

willtebbutt commented 5 years ago

Could someone please link the relevant discussions here?

@trappmartin could you please explain the above proposal in a bit more depth? What are the differences between each of the functions? What are logjoin and logppd short for? Thanks!

@mohamed82008 could you please expand on your proposal? It's not obvious to me why we need a separate model construction to perform maximum likelihood / when we would actually be interested in performing maximum likelihood in our context: if the user has gone to the trouble to specify priors over parameters, why would they wish to ignore them?

trappmartin commented 5 years ago

logjoint = log p(\theta, x), i.e. the log of the joint probability function of the model. logppd= \sum_i^N log 1/T \sum_t^T p(x_i | \theta_t), i.e. the log pointwise predictive density function.

trappmartin commented 5 years ago

I don't think we want maximum likelihood. But I agree it would be good to provide an interface to compute the log likelihood.

willtebbutt commented 5 years ago

Sorry, I'm not familiar with the logppd, could you explain why it's useful?

logjoin = log p(\theta, x), i.e. the log of the joint probability function of the model.

If we're going to rename logp to something, could we please write logjoint? For the cost of an extra character it's quite a bit less ambiguous haha

trappmartin commented 5 years ago

Uh, sorry that was a typo. I meant logjoint. :D The log pointwise predictive density is useful for model criticism and model selection.

willtebbutt commented 5 years ago

Uh, sorry that was a typo. I meant logjoint. :D

haha fair enough. Glad we're on the same page with this.

The log pointwise predictive density is useful for model criticism and model selection.

Is there a good reference for this, or is this something that should be obvious?

trappmartin commented 5 years ago

See the paper by A. Gelman, J. Hwang, and A. Vehtari on Understanding predictive information criteria for Bayesian models for example.

mohamed82008 commented 5 years ago

@mohamed82008 could you please expand on your proposal? It's not obvious to me why we need a separate model construction to perform maximum likelihood / when we would actually be interested in performing maximum likelihood in our context: if the user has gone to the trouble to specify priors over parameters, why would they wish to ignore them?

I just thought it would be cool to compare ML, MAP, VI and MCMC without changing the model definition! But @trappmartin has a lot more interesting use cases :)

yebai commented 5 years ago

What is the purpose of the Selector in the logp function? Do we need that?

@trappmartin

you're right - Selector here is not necessary. This might be taken from some interface discussed in https://github.com/TuringLang/Turing.jl/issues/634#issuecomment-44965726, i.e.

# s::Sampler ==> s::Selector
logp = model(Turing.VarInfo(), s::Sampler)

Since logp will always compute the log joint probability in the new design, we don't really need s::Selector to select subsets of random variables.

I suggest during the model compilation we generate 2 functions inside Model: One for the log likelihood which replaces all the a ~ dist lines with a ~ Fixed() where the assume of Fixed returns the variable value in vi and 0 for the logpdf. This is useful for maximum likelihood estimation. The second is the logpdf of the joint probability distribution that we have now. This is to be used for MAP and inference. I just thought it would be cool to compare ML, MAP, VI and MCMC without changing the model definition! But @trappmartin has a lot more interesting use cases :)

@mohamed82008 I'm not sure I understand the point here. For MAP and VI based learning, all we need is logp and its gradient. We don't need to modify the compiler.

logjoint = log p(\theta, x), i.e. the log of the joint probability function of the model. logppd= \sum_i^N log 1/T \sum_t^T p(x_i | \theta_t), i.e. the log pointwise predictive density function. See the paper by A. Gelman, J. Hwang, and A. Vehtari on Understanding predictive information criteria for Bayesian models for example.

@trappmartin

It seems logjoint is the same as logp?

I agree it's helpful to have some support for model evaluation and criticism, but I prefer to keep the standard API between compiler and inference minimal. These additional features can be supported in a separate module (e.g. MCMCChains). In fact, it appears that logppd can be implemented in the form of a new Runner or even InferenceAlgorithm.

trappmartin commented 5 years ago

It seems logjoint is the same as logo?

Yes, seems more telling to me than logp

I agree it's helpful to have some support for model evaluation and critism, but I prefer to keep the standard API between compiler and inference minimal. These additional features in a seprate module. In fact, it apears that logppd can be implemented in the form of a new Runner or even InferenceAlgorithm.

Good point. I'll try to do so as it seems relevant for several people.

phipsgabler commented 3 years ago

Is there anything here that has not already implemented in the meantime, or couldn't be moved to the discussions at AbstractPPL?

yebai commented 3 years ago

Nothing that I know of. Please feel free to close, and/or move some useful bits to AbstractPPL.

TuringLang / DynamicPPL.jl

Standardising APIs between compiler and inference methods #16