Closed xukai92 closed 3 years ago
@willtebbutt @xukai92
how about we always drop all observe statements in gdemo()()
, ie, we always sample all variables from the prior, even for the following case raised by @willtebbutt (slightly modified)
@model demo(y) = begin
5 ~ Normal(0,1) # rv1
y ~ Normal(0,1) . # rv2
end
gdemo()() # sample both `rv1` and `rv2` from the prior, ingoring `rv1=5`
gdemo(2)() # sample from the posterior
This leads to consistent behaviour and should be intuitive too.
I see. But what I mean by this issue is to actually have some concrete functions generated by evaluating gdemo()
(I think @willtebbutt also proposed this some time ago).
Currently behaviour:
@model demo(y) = begin
5 ~ Normal(0,1) # rv1
y ~ Normal(0,1) . # rv2
end
mf = gdemo()
mf() # => sampling from prior
But instead we could have
@model demo(y) = begin
5 ~ Normal(0,1) # rv1
y ~ Normal(0,1) . # rv2
end
mf = gdemo(???) # this returns mf::UniqueRuntimeModelType
# and create some function definitions:
logpdf(::UniqueRuntimeModelType, vi) = ... # (1) return logpdf by evaluating using vi
rand(::UniqueRuntimeModelType) = ... # (2) gdemo()() in your example
sample(::UniqueRuntimeModelType) = ... # (3) gdemo(2)() in your example
I guess it could also potentially give more space for performance because during sample we are explicitly using the specific version of logpdf
and rand
.
@yebai your proposal has the issue that the behaviour of the following programme becomes ambiguous:
@model demo() = begin
5 ~ Normal(0, 1)
end
demo() # what does this do?
This makes me nervous. On the one hand, this programme is weird, on the other it's a valid Turing programme, so we should ensure that the behaviour is well-defined. My proposal below will mostly resolve this.
@xukai92 I completely agree regarding the format of logpdf
. Regarding rand
and sample
: certainly we could define one to sample from the prior and the other the posterior, but this wouldn't be particularly intuitive (there's nothing in the names to suggest the proposed behaviour). Instead, I propose the following:
sample
, just use rand
and logpdf
(or maybe logp
, since logpdf
isn't defined for all models supported by Turing?).gdemo(missing)
to indicate that the prior should be used for y
seems sensible to me.gdemo(Prior())
to indicate that all random variables should be sampled from the prior. In this case, both rv1
and rv2
would be sample from the prior.rand(mf)
produces a VarInfo
. rand(mf, N)
produces a Vector{<:VarInfo}
of length N
.logpdf(mf, vi)
works as proposed by @xukai92.One the one hand, this programme is weird, on the other it's a valid Turing programme, so we should ensure that the behaviour is well-defined.
We can give an error if the model has no variables.
UniqueRuntimeModelType
This can be just const PriorModel{pvars, F} = Model{pvars, Tuple{}, F, NamedTuple{(), Tuple{}}}
which means no data, all parameters. For this case, no special sampling alg is required. More generally however, we can make use of the existing Sampler
type, which we can add model
(and vi
) to. Then we can simply do: rand(Sampler(model, alg))
or for models without parameters, it can be rand(Sampler(model))
.
Btw, after the new refactor, if you call gdemo()
, it will give you a PriorModel{pvars, F}
, with pvars
as Tuple{:y}
.
We can give an error if the model has no variables.
Agreed, but this programme definitely has a random variable. It's anonymous, but it's definitely there, and if we define sample-from-prior functionality, then presumably something should happen.
Btw, after the new refactor, if you call gdemo(), it will give you a PriorModel{pvars, F}, with pvars as Tuple{:y}.
This is probably fine for now, but I'm not sure it's appropriate long-term for the same reasons as above.
This can be just const PriorModel{pvars, F} = Model{pvars, Tuple{}, F, NamedTuple{(), Tuple{}}} which means no data, all parameters. For this case, no special sampling alg is required. More generally however, we can make use of the existing Sampler type, which we can add model (and vi) to. Then we can simply do: rand(Sampler(model, alg)) or for models without parameters, it can be rand(Sampler(model)).
I like this. We would have to think carefully about how the rand(Sampler(model, alg))
would work for MCMC though. We should also probably implement a few convenience functions as well i.e. rand(model) = rand(Sampler(model))
and similar.
It's anonymous, but it's definitely there, and if we define sample-from-prior functionality, then presumably something should happen.
Is there a practical application of sampling this anonymous random variable, or is it just for theoretical appeal? Note that implementation-wise, there is nothing random about a model with no explicit random variables; every time you run the model, you will get exactly the same thing, hence my difficulty digesting this anonymous random variable concept. I can see how 5
plays the role of data. So perhaps more generally this falls under likelihood models, where the only sensible thing to do with it, that I can think of, is to calculate the likelihood.
We should also probably implement a few convenience functions as well i.e. rand(model) = rand(Sampler(model)) and similar.
Sounds good.
Note that currently, we just ignore any numbers or arrays on the LHS of a ~
. If you want, we can assign an anonymous dvar
to it instead and assign the value on the LHS of the ~
to this anonymous dvar
. This would probably be more theoretically accurate. And it allows us to view the model with 5 ~ Normal(0, 1)
only as a LikelihoodModel
with 1 dvar
.
Edit: by ignore I mean that we don't treat it as an explicit variable, but it still affects the logp
calculation.
Is there a practical application of sampling this anonymous random variable, or is it just for theoretical appeal?
I'm just looking for consistency. I don't have a particular application in mind :)
Note that implementation-wise, there is nothing random about a model with no explicit random variables; every time you run the model, you will get exactly the same thing.
Not if you run the model in sample-from-prior mode.
If you want, we can assign an anonymous dvar to it instead and assign the value on the LHS of the ~ to this anonymous dvar. This would probably be more theoretically accurate. And it allows us to view the model with 5 ~ Normal(0, 1) only as a LikelihoodModel with 1 dvar.
This general kind of thing would be a good change. If the model is constructed in the regular way, it should be a dvar
as you suggest, but if the model is run in sample-from-prior mode, then it should presumably magically become a dvar
.
Not if you run the model in sample-from-prior mode.
@model demo() = begin
5 ~ Normal(0,1)
end
model = demo()
logp = model(Turing.VarInfo(), Turing.SampleFromPrior())
all(isequal(logp), (model(Turing.VarInfo(), Turing.SampleFromPrior()) for i in 1:100))
vi
is empty in every case, with a logp
only.
then it should presumably magically become a dvar.
This should be totally possible when constructing the Sampler
.
Sorry, I should have been clearer. I was referring to what I think should be going on rather than what actually happens at the minute. You're totally correct regarding the current behaviour. It's my feeling that if we implement a sampler-from-prior mode, then we should take these anonymous random variables more seriously and treat them more like named random-variables.
then we should take these anonymous random variables more seriously and treat them more like named random-variables.
I don't foresee a problem handling these. I can try a proof-of-concept after the VarInfo
refactor. Perhaps @trappmartin can also help with this after he reads the new compiler code.
Thanks for cc‘ing me. I’m pretty much in favour of the vi = rand(Sampler(model))
and the logpdf(model, vi)
concept. Though, I believe it would be good if we keep the sample
function. It’s semantically more clear to me what this does and that I can expect it to generate samples using a MCMC sampler. I would not expect it to draw from the prior as rand.
However, I’m not sure how inference with VI would fit into those syntaxes to stay consistent without having too many entry points for the user.
However, I’m not sure how inference with VI would fit into those syntaxes to stay consistent without having too many entry points for the user.
My thinking here is that VI works fine with the rand(sampler(model, alg), N)
design, where the alg
parameter would be a representation of the approximate posterior distribution. Semantically this reads as: draw N samples from the posterior distribution over the random variables in model
(each of which should be a VarInfo?) using alg
. This works for both VI and MCMC, so provides a single point of entry for both kinds of samplers.
Though, I believe it would be good if we keep the sample function. It’s semantically more clear to me what this does and that I can expect it to generate samples using a MCMC sampler.
Again, my feeling here is that we wouldn't need the sample
function, as the creation of the sampler, e.g. rand(sampler(model, HMC(...)), N)
, makes it quite clear that we're drawing N
samples from model
using the specified HMC
algorithm. Although I could live with the sample
function being an alias for rand
, I suspect that rand
is more consistent with existing stuff.
I’m pretty much in favour of the vi = rand(Sampler(model)) and the logpdf(model, vi) concept.
As am I :)
A minor stylistic thing, @yebai what would your thoughts be on making the "correct" thing to do be sampler(model, alg)
rather than Sampler(model, alg)
? (At least to me) the latter says make me an object of type Sampler
whereas the former says make me a thing that can generate samples from model
using alg
, which may or may not be an object of type Sampler
. The distinction is important because the Sampler
option commits us to always producing an object of type Sampler
if we want to produce samples, whereas the former allows us to spit out whatever object we like provided that it can be used to generate samples via rand
.
Reading through your explanation of sample
v.s. Sampler
, I feel sample
is better, although in most of the cases (at this moment), sample
will return a Sampler
to us. Also I think using a function rather than a type conversion syntax is a more standard Julia style.
Here is a summary of some proposals in this issue (biased towards personal aesthetic)
logp(m::Model, v::VarInfo, s::Selector=undef)
computes log density/mass for model m
at value v::VarInfo
Sampler(m, alg)
creates a sampler
for model
using alg
as the sampling method rand(s::Sampler(m, alg), N)
produces a Vector{<:VarInfo} of length N.
rand(s::Sampler(m, SampleFromPrior()), N)
produces a Vector{<:VarInfo} of length N, by sampling from the prior distribution while ignoring all observations (including something like 5~Normal(0,1)
). sample(m, alg; N=N, args...)
provides an alias to rand(Sampler(model, alg), N, args...)
v = rand(m::Model, N=1)
is an alias to v = rand(Sampler(m, SampleFromPrior()), N)
This would provide a relatively well-defined interface between the compiler and inference methods (e.g. variational methods and sampling methods). For variational methods, the most important API is the logp
function and its gradients. While for sampling methods, both the logp
function and the rand
function are necessary. This would allow us to clean up the code in many sampler's implementations. Such as examples below:
runmodel!(m::Model, v::VarInfo, spl::AbstractSampler)
==> logp(m::Model, v::VarInfo)
https://github.com/TuringLang/Turing.jl/blob/6d48b96b03617f4410fd90becd7fb3a6c0d7e7e8/src/core/VarReplay.jl#L106
v = model(v::VarInfo, s::AbstractSampler)
==> v= rand(Sampler(m, alg), 1)
https://github.com/TuringLang/Turing.jl/blob/bedbb5a10d3da794957d64677f0fcfa9542ac830/src/inference/hmc.jl#L133
Introduce an intermediate algorithm type: AbstractRunner
type SampleFromPrior <: AbstractRunner end
, ComputeLogDensity
, SampleFromUniform
, ParticleFiltering
assume
, observe
functions for Runner
typesA long term goal is to build up a clear separation between the compiler and inference methods (see https://github.com/TuringLang/Turing.jl/issues/456). Such that inference methods can be implemented in separate packages with a dependency on Turing.
Anything missing?
v = rand(m::Model, N=1) is an alias to v = rand(Sampler(m, SampleFromPrior()), N)
How about enabling this alias with support to sampling algorithm as well.
v = rand(m::Model, N=1; alg=SampleFromPrior()) = rand(Sampler(m, alg), N)
How about AbstractRunner
-> AbstractTilde
I don't know that we actually need a supertype here. I mean, is there going to be a fallback that can be applied to Abstract<Thing>
? If not, we can just create concrete types as required and avoid the hassle of having to choose and name for an abstract type, and forcing things to inherit from it.
Let's stick with AbstractRunner
type for now and change it to something else when we got better ideas.
I suggest during the model compilation we generate 2 functions inside Model
:
a ~ dist
lines with a ~ Fixed()
where the assume
of Fixed
returns the variable value in vi
and 0 for the logpdf. This is useful for maximum likelihood estimation.@mohamed82008 I think this is a good idea. Thanks! I'll take a closer look into this asap.
@yebai I had some time to read through your proposal. It sound quite reasonable to me. However, I have a few questions / remarks.
Selector
in the logp
function? Do we need that?In favour of what Mohamed suggested and the latest discussions on model criticism, it might be good to have:
logjoint(m::Model, v::VarInfo, s::Selector=undef)
== logp
logpdf(m::Model, v::VarInfo, s::Selector=undef)
logppd(m::Model, v::Vector{VarInfo}, s::Selector=undef)
As proposed by Mohamed we could do this by generating respective functions in the model macro.
Could someone please link the relevant discussions here?
@trappmartin could you please explain the above proposal in a bit more depth? What are the differences between each of the functions? What are logjoin
and logppd
short for? Thanks!
@mohamed82008 could you please expand on your proposal? It's not obvious to me why we need a separate model construction to perform maximum likelihood / when we would actually be interested in performing maximum likelihood in our context: if the user has gone to the trouble to specify priors over parameters, why would they wish to ignore them?
logjoint
= log p(\theta, x)
, i.e. the log of the joint probability function of the model.
logppd
= \sum_i^N log 1/T \sum_t^T p(x_i | \theta_t)
, i.e. the log pointwise predictive density function.
I don't think we want maximum likelihood. But I agree it would be good to provide an interface to compute the log likelihood.
Sorry, I'm not familiar with the logppd
, could you explain why it's useful?
logjoin = log p(\theta, x)
, i.e. the log of the joint probability function of the model.
If we're going to rename logp
to something, could we please write logjoint
? For the cost of an extra character it's quite a bit less ambiguous haha
Uh, sorry that was a typo. I meant logjoint
. :D
The log pointwise predictive density is useful for model criticism and model selection.
Uh, sorry that was a typo. I meant logjoint. :D
haha fair enough. Glad we're on the same page with this.
The log pointwise predictive density is useful for model criticism and model selection.
Is there a good reference for this, or is this something that should be obvious?
See the paper by A. Gelman, J. Hwang, and A. Vehtari on Understanding predictive information criteria for Bayesian models for example.
@mohamed82008 could you please expand on your proposal? It's not obvious to me why we need a separate model construction to perform maximum likelihood / when we would actually be interested in performing maximum likelihood in our context: if the user has gone to the trouble to specify priors over parameters, why would they wish to ignore them?
I just thought it would be cool to compare ML, MAP, VI and MCMC without changing the model definition! But @trappmartin has a lot more interesting use cases :)
What is the purpose of the Selector in the
logp
function? Do we need that?
@trappmartin
you're right - Selector
here is not necessary. This might be taken from some interface discussed in https://github.com/TuringLang/Turing.jl/issues/634#issuecomment-44965726, i.e.
# s::Sampler ==> s::Selector
logp = model(Turing.VarInfo(), s::Sampler)
Since logp
will always compute the log joint probability in the new design, we don't really need s::Selector
to select subsets of random variables.
I suggest during the model compilation we generate 2 functions inside Model: One for the log likelihood which replaces all the a ~ dist lines with a ~ Fixed() where the assume of Fixed returns the variable value in vi and 0 for the logpdf. This is useful for maximum likelihood estimation. The second is the logpdf of the joint probability distribution that we have now. This is to be used for MAP and inference. I just thought it would be cool to compare ML, MAP, VI and MCMC without changing the model definition! But @trappmartin has a lot more interesting use cases :)
@mohamed82008 I'm not sure I understand the point here. For MAP and VI based learning, all we need is logp
and its gradient. We don't need to modify the compiler.
logjoint = log p(\theta, x), i.e. the log of the joint probability function of the model. logppd= \sum_i^N log 1/T \sum_t^T p(x_i | \theta_t), i.e. the log pointwise predictive density function. See the paper by A. Gelman, J. Hwang, and A. Vehtari on Understanding predictive information criteria for Bayesian models for example.
@trappmartin
It seems logjoint
is the same as logp
?
I agree it's helpful to have some support for model evaluation and criticism, but I prefer to keep the standard API between compiler and inference minimal. These additional features can be supported in a separate module (e.g. MCMCChains
). In fact, it appears that logppd
can be implemented in the form of a new Runner
or even InferenceAlgorithm
.
It seems
logjoint
is the same aslogo
?
Yes, seems more telling to me than logp
I agree it's helpful to have some support for model evaluation and critism, but I prefer to keep the standard API between compiler and inference minimal. These additional features in a seprate module. In fact, it apears that
logppd
can be implemented in the form of a newRunner
or evenInferenceAlgorithm
.
Good point. I'll try to do so as it seems relevant for several people.
Is there anything here that has not already implemented in the meantime, or couldn't be moved to the discussions at AbstractPPL?
Nothing that I know of. Please feel free to close, and/or move some useful bits to AbstractPPL
.
See https://github.com/TuringLang/Turing.jl/issues/634#issuecomment-471339521 for a summary.