Closed. cscherrer closed this 4 years ago.
@DilumAluthge, from your explanations, I do think it is now mostly clear to me what you want. Thanks a lot for being constructive and helpful.
You would want Bayesian models to be able to output an integrated kind of predictive posterior across samples. That is, if in your model (in the simpler special case of everything being continuous), you have a parameter posterior of type `Distribution{typeof(theta)}`, estimated on the training set with target type `typeY`, you want to return the integrated predictive posterior
$\int p(y_1|X_1, \theta) \cdots p(y_M|X_M, \theta)\, p(\theta \mid traindata)\, d\theta$
of type `Distribution{Array{typeY,1}}`, as a joint prediction for the $(Y_1,\ldots,Y_M)$.
As you outlined above, this will not in general be a product of marginals.
Therefore, in terms of representation, this will in general be different from returning a vector with the distributions
$\int p(y_i|X_i, \theta)\, p(\theta \mid traindata)\, d\theta$,
which would be of type `Array{Distribution{typeY},1}`.
Where in all of the above, the integrals are evaluated by MCMC (or computed analytically, if conjugate), and the representation is as a distribution which has a pdf in the argument(s) $y_i$.
Can you confirm whether my understanding is correct?
@DilumAluthge, for the case that my understanding is correct:
I think asking for this is not a good idea, and relies on a conflation of two "kinds" of distributions in the Bayesian framework:
- belief distributions, i.e., objects of type `Distribution{mytype}` that encode a Bayesian belief about the value of an object of type `mytype`
- frequency distributions, i.e., objects of type `Distribution{mytype}` that encode a frequency distribution of a random object taking values in the type `mytype`
These are not the same, and should not be conflated. It is, however, a common conflation - since both have the same type. You can typically distinguish the two by considering that:
- belief distributions, in well-posed models (identifiable, not overparameterized), converge to a delta distribution in the limit of training data. For example, you would expect that your inference on the parameter $\theta$ becomes more and more certain as your training sample size increases. That is, the distribution with pdf $p(\theta | traindata)$ will converge to a delta at the "true" value of $\theta$. Intuitively, your "belief" approaches certainty.
- frequency distributions, such as likelihoods, or predictive posteriors, do not exhibit that behaviour. For example, the predictive posterior distribution (with pdf in the variable $y_i$) $\int p(y_i|X_i, \theta)\, p(\theta | traindata)\, d\theta$ converges to some posterior distribution, which may or may not be equal to the "true" posterior (given the "true" model). The likelihood distribution (with conditional pdf in the variables $y, x, \theta$) $p(y|x, \theta)$ is assumed as part of the model specification, and does not "converge" in this sense. Intuitively, it's an error of categories to even call these a "belief".
How this applies to problems in our discussion specifically:
- the "predictive posterior" obtained by the integral rule (integrate out $\theta$) is, as said, a frequency distribution, not a "proper" posterior belief distribution! It is a heuristic to obtain a predictive frequency distribution.
- the "true" (belief) posterior of the predictive distribution is a (belief) distribution over (frequency) distributions. It has scitype `Distribution{Distribution{typeY}}`.
- if taken over the entire test sample, it has scitype `Distribution{Array{Distribution{typeY},1}}`.
Hope this makes sense? Let me know if any questions.
you want to return the integrated predictive posterior $\int p(y_1|X_1, \theta) \cdots p(y_M|X_M, \theta)\, p(\theta \mid traindata)\, d\theta$ of type `Distribution{Array{typeY,1}}`, as a joint prediction for the $(Y_1,\ldots,Y_M)$.
As you outlined above, this will not in general be a product of marginals.
Therefore, in terms of representation, this will in general be different from returning a vector with the distributions $\int p(y_i|X_i, \theta)\, p(\theta \mid traindata)\, d\theta$, which would be of type `Array{Distribution{typeY},1}`.
Where in all of the above, the integrals are evaluated by MCMC (or computed analytically, if conjugate), and the representation is as a distribution which has a pdf in the argument(s) $y_i$.
Can you confirm whether my understanding is correct?
This is correct.
- that "true" posterior of the joint predictive distribution will, by the way, be a distribution over i.i.d. distributions.
- thus, when applying the heuristic to this "true" posterior, you are left with the vector of marginals, rather than with your construction
This is not relevant to my use case. I do not have the "true posterior", so I cannot do any construction that involves applying any heuristic to the "true posterior".
The only objects I have are the ones I described in my "silly model" example.
- converges to some posterior distribution, which may or may not be equal with the "true" posterior (given the "true" model).
Again, in my use case, I am making no assumptions about the "true model" or the "true posterior", so any construction that requires knowledge about the "true model" or "true posterior" is of no use to me.
I also want to point out that this new `JointProbabilistic` model type is not restricted to Bayesian models.
Basically, I want to be able to create supervised models for which the `predict` method is allowed to output any object of type `Distribution`.
In fact, when it comes to these `JointProbabilistic` models, because they are not necessarily Bayesian models, they may not even have the notion of priors, likelihoods, posteriors, etc.
So let me make an even more general proposal that will cover all use cases.
The `JointProbabilistic` model

We will add a new supervised model type named `JointProbabilistic`, defined as follows:
abstract type JointProbabilistic <: Supervised end
The only two methods that a `JointProbabilistic` model is required to implement are `fit` and `predict`.
A `JointProbabilistic` model must implement the `fit` method as follows:
MLJModelInterface.fit(model::JointProbabilisticExampleModel, verbosity::Int, X, y) -> fitresult, cache, report
A `JointProbabilistic` model must implement the `predict` method as follows:
MMI.predict(model::JointProbabilisticExampleModel, fitresult, Xnew) -> some_distribution::Distribution{T} where T
In other words, when you call `predict` on a `JointProbabilistic` model, the output will be an object `some_distribution` which is of type `Distribution{T} where T`.
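To make the proposed contract concrete, here is a minimal, self-contained sketch. The model name `NormalFitter` is hypothetical, and plain functions are used instead of `MLJModelInterface` methods so the sketch runs with only Distributions.jl; it simply fits a single `Normal` to the target and returns it whole from `predict`:

```julia
import Distributions

# Hypothetical sketch of the proposed fit/predict contract. The model name and
# the use of plain functions (not MLJModelInterface methods) are illustrative.
struct NormalFitter end

# fit returns the (fitresult, cache, report) triple from the proposed signature
function fit(::NormalFitter, verbosity::Int, X, y)
    fitresult = Distributions.fit(Distributions.Normal, y)  # MLE fit to target
    return fitresult, nothing, nothing
end

# predict returns a single Distribution for the whole input -- a joint
# prediction, not a vector of per-row distributions
predict(::NormalFitter, fitresult, Xnew) = fitresult

y = [1.0, 2.0, 3.0]
fitresult, _, _ = fit(NormalFitter(), 0, nothing, y)
yhat = predict(NormalFitter(), fitresult, nothing)
yhat isa Distributions.Distribution  # a single Sampleable, not an AbstractVector
```

The key design point is the return type of `predict`: one `Distribution` object rather than an `AbstractVector` of distributions.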
@DilumAluthge, for the case that my understanding is correct:
I think asking for this is not a good idea, and relies on a conflation of two "kinds" of distributions in the Bayesian framework:
- belief distributions, i.e., objects of type `Distribution{mytype}` that encode a Bayesian belief about the value of an object of type `mytype`
- frequency distributions, i.e., objects of type `Distribution{mytype}` that encode a frequency distribution of a random object taking values in the type `mytype`
These are not the same, and should not be conflated. It is, however, a common conflation - since both have the same type. You can typically distinguish the two by considering that:
- belief distributions, in well-posed models (identifiable, not overparameterized), converge to a delta distribution in the limit of training data. For example, you would expect that your inference on the parameter $\theta$ becomes more and more certain as your training sample size increases. That is, the distribution with pdf $p(\theta | traindata)$ will converge to a delta at the "true" value of $\theta$. Intuitively, your "belief" approaches certainty.
- frequency distributions, such as likelihoods, or predictive posteriors, do not exhibit that behaviour. For example, the predictive posterior distribution (with pdf in the variable $y_i$) $\int p(y_i|X_i, \theta)\, p(\theta | traindata)\, d\theta$ converges to some posterior distribution, which may or may not be equal to the "true" posterior (given the "true" model). The likelihood distribution (with conditional pdf in the variables $y, x, \theta$) $p(y|x, \theta)$ is assumed as part of the model specification, and does not "converge" in this sense. Intuitively, it's an error of categories to even call these a "belief".
How this applies to problems in our discussion specifically:
- the "predictive posterior" obtained by the integral rule (integrate out $\theta$) is, as said, a frequency distribution, not a "proper" posterior belief distribution! It is a heuristic to obtain a predictive frequency distribution. The dependence between the sample components, in what looks like the "natural" generalization, is a side effect of conflating the heuristic to obtain a frequency predictor with the "true" posterior on the predictive distribution.
- the "true" (belief) posterior of the predictive distribution is a (belief) distribution over (frequency) distributions! Namely, it is the law of the random variable $p(y_i|X_i, \Theta)$, where $X_i$ is fixed (conditioned on), and $\Theta$ is distributed according to the parameter posterior, i.e., the distribution with pdf $p(\theta | traindata)$. This is a distribution over distributions! It has scitype `Distribution{Distribution{typeY}}`.
- if taken over the entire test sample, you are looking at the distribution of the random variable $\prod_{i=1}^M p(y_i|X_i, \Theta)$, with $X_i$ fixed, and $\Theta$ as above. This has scitype `Distribution{Array{Distribution{typeY},1}}`.
- that "true" posterior of the joint predictive distribution will, by the way, be a distribution over i.i.d. distributions.
- thus, when applying the heuristic to this "true" posterior, you are left with the vector of marginals, rather than with your construction
Hope this makes sense? Let me know if any questions.
I understand and appreciate that you think that having a supervised model type for which the `predict` method returns an object of type `Distribution{T} where T` is not a good idea.
However, I hope that you can understand that there are other people that would like to integrate their packages into MLJ that think that this is a good idea, and in fact is the only way to integrate their packages into MLJ.
So I hope that we can agree to disagree, and can still move forward with adding this new model type.
In particular, I think that this will be possible because this new model type `JointProbabilistic` will have absolutely no effect on any existing code or models in the MLJ ecosystem.
I think the new proposal I just made is broad enough and general enough to cover all use cases. And again I want to highlight that it is intended to support any package author that wants to add supervised models to MLJ in which the output predictions are of type `Distribution`. This new model type will not be specific to Bayesian statistics in any way.
Because my new proposal is as broad and general as possible, I think that we can end the discussion about mathematics and Bayesian statistics.
This new proposal will cover any model that wants to return `Distribution`s. Thus, we cannot make any mathematical assumptions about these models.
@ablaom @tlienart My new proposal is general enough that it also includes the section of the Adding Models for General Use page with the heading "Models that learn a probability distribution", which describes models that learn a probability distribution, or more generally a "sampler" object. Currently you special-case models that fit a distribution to the target `y` given a void input feature `X = nothing`. This is simply a special case of my new proposal. So we can actually unify the APIs into the most general case that anyone can use :)
Can you confirm whether my understanding is correct?
This is correct.
@DilumAluthge, good that we are on one page now.
However, I feel you have not carefully read my response, or appreciated its mathematical nature.
This is not relevant to my use case. I do not have the "true posterior", so I cannot do any construction that involves applying any heuristic to the "true posterior". The only objects I have are the ones I described in my "silly model" example.
It seems you misunderstand what I meant, or you misunderstand what is a "proper" Bayesian belief posterior. I mean it's a posterior that you can write using Bayes' rule without modification. Can you please make an effort to read carefully what I wrote, and explain it back to me, just like I did with your explanations until we agreed that we agree on the content? Just so we know we are both on the same page.
Short summary:
The "predictive posterior" that comes from the integral rule is not a "proper" Bayesian belief posterior in this sense. It is a predictive frequency distribution.
The conclusion of my discussion is that the "joint posterior" that you want is not a "proper" Bayesian belief posterior, but an artefact of computing the "wrong" integral.
More generally, it does not make sense in the i.i.d. setting, for any model, to predict joint frequency distributions over test samples, since we already know the test samples are independent.
What I've also explained above is why the "joint posterior" doesn't make sense from a Bayesian perspective. You're simply computing the "wrong" integral in the sense of the reasoning, and the fact that you get joint posteriors is an artefact of that, rather than what the posterior really is.
Because my new proposal is as broad and general as possible, I think that we can end the discussion about mathematics and Bayesian statistics.
I don't see how the second part of the sentence would follow from the first part, even if I agreed that the first part were true.
I also don't understand what your `Distribution{T} where T` above means.
However, I hope that you can understand that there are other people that would like to integrate their packages into MLJ that think that this is a good idea, and in fact is the only way to integrate their packages into MLJ.
I don't think having a joint (over samples) return type makes sense. Further, if you have an empirical distribution that is possibly joint, e.g., from MCMC, it's very easy to compute marginals, so it is not a major integration impediment. Or, use the "right" formula for the predictive posterior in the first place, which leads you to the same outcome.
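The point that marginals are easy to recover from joint MCMC output can be sketched as follows (a toy stand-in for MCMC draws, not any MLJ or Soss API; all names illustrative):

```julia
import Distributions
using Statistics

# Given joint draws (e.g. from MCMC), marginals are cheap: `joint_draws` is a
# stand-in (n_draws × M) matrix of joint predictive samples, and the i-th
# marginal is just the i-th column, here summarised by a fitted Normal.
joint_draws = randn(100_000, 3) .+ [1.0 2.0 3.0]  # stand-in for MCMC output
marginals = [Distributions.fit(Distributions.Normal, joint_draws[:, i]) for i in 1:3]
mean.(marginals)  # ≈ [1.0, 2.0, 3.0]
```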
I also don't understand what your `Distribution{T} where T` above means.
`Distribution{T} where T` is a UnionAll type.
julia> using Distributions
julia> Distribution{T} where T
Distribution{T,S} where S<:ValueSupport where T
julia> typeof(Distribution{T} where T)
UnionAll
My point is that my new proposal accounts for any model that produces a `Distribution` as output. This is in no way restricted to Bayesian models, so the specific discussion about Bayesian models is not relevant.
Actually, I'm further broadening my proposal. Instead of returning a `Distribution`, we will allow the `JointProbabilistic` model to have a `predict` method that returns an object of type `Distributions.Sampleable`.
`Distribution{T} where T` is a UnionAll type.
ah, thanks for clarifying. Makes sense.
I don't think having a joint (over samples) return type makes sense.
Again, I fully acknowledge that you don't think that makes sense. But other people have use cases for which they believe this makes sense.
If you don't need this particular feature, you do not need to use it. It will not change any of the existing `Probabilistic` models.
Again, I fully acknowledge that you don't think that makes sense. But other people have use cases for which they believe this makes sense.
This I believe, though it might imply non-trivial work on the interface. And I'm slightly disappointed that you don't seem to want to make the effort to understand what I've been saying - but no one can force you, of course.
I think I've outlined the relevant arguments above, so in the end @ablaom may want to weigh them up.
tl;dr, I think you are using a "bad" formula for your Bayesian posterior and/or the algorithm that you want to interface, which makes you believe you want joints across samples (currently the one motivating use case). You further seem to be subtly conflating some pieces of Bayesian theory. Also, we already know the test data are i.i.d., so predictive distributions that depend between samples do not make much sense in the general case either.
(currently the one motivating use case)
As I point out above, another use case that this covers is the case in which you are e.g. fitting a distribution to data by e.g. kernel density estimation. Currently, this is given as a special case here:
My new proposal covers this use case as well: https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Models-that-learn-a-probability-distribution-1
So, as I understand it, you agree that I have a "joint posterior" or "predictive posterior" or "predictive frequency distribution" or whatever you want to call it.
And, if I understand you correctly, you also agree that the components of this "joint posterior" are not marginally independent.
Is that all correct?
But then you make the argument that it is not mathematically correct to construct or correct this "joint posterior", is that correct?
It would help if you could provide some sources (textbook, lecture notes, monograph, journal article, etc.) that prove why this "joint posterior" is not a useful or correct mathematical object to return.
As I point out above, another use case that this covers is the case in which you are e.g. fitting a distribution to data by e.g. kernel density estimation. Currently, this is given as a special case here:
You are probably referring to conditional density estimation? I don't think it is accurate to claim this is another use case: CDEs give you predictive distributions that may be dependent over variables, but they are independent over the samples.
The use case of `JointProbabilistic` is going to be: any model in which the result of `predict` is a distribution, i.e. an object of type `Distributions.Sampleable`.
Consider the example here: https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Models-that-learn-a-probability-distribution-1
When you call `yhat = predict(mach, nothing)`, there is absolutely no way that `yhat` can be a vector of distributions. `yhat` must be an object of type `Distributions.Sampleable`.
Note that `Distributions.Distribution` is a subtype of `Distributions.Sampleable`:
julia> import Distributions
julia> Distributions.Distribution <: Distributions.Sampleable
true
I've opened several pull requests.
Since there are multiple pull requests across multiple repositories, I have opened the following meta-issue to keep track of all of the pull requests: https://github.com/alan-turing-institute/MLJ.jl/issues/633
julia> import Distributions
julia> y = rand(Distributions.Normal(1,2), 100)
100-element Vector{Float64}:
julia> yhat = Distributions.fit(Distributions.Normal, y)
Distributions.Normal{Float64}(μ=0.9995819568314163, σ=1.8659336378188145)
julia> typeof(yhat)
Distributions.Normal{Float64}
julia> yhat isa Distributions.Distribution
true
julia> yhat isa Distributions.Sampleable
true
julia> yhat isa AbstractVector
false
This is an example of a `Supervised` model in which the `yhat` is not a vector of distributions. The current `Probabilistic` interface requires that `predict` output a `yhat` that is a vector of distributions. So this example cannot be a `Probabilistic` model. But it will be able to be a `JointProbabilistic` model, or whatever we end up calling it.
@fkiraly Do you agree with the following statement:
There exist supervised machine learning models for which the `predict` method will return a `yhat` object in which `yhat` is of the type `Distributions.Sampleable`.
So, as I understand it, you agree that I have a "joint posterior" or "predictive posterior" or "predictive frequency distribution" or whatever you want to call it.
This is an important point in my argument! As I said, there are two kinds of distributions: belief and frequency distributions. The problem arises from not keeping them conceptually apart. I am aware that some Bayesian schools only think there are just one "kind" of distribution, and everything is just belief (which, I believe, is not a conceptually coherent belief).
I don't agree with you fully:
But then you make the argument that it is not mathematically correct to construct or correct this "joint posterior", is that correct?
Yes, that is the type of the argument I make. Based on a certain (possibly narrow) definition of "Bayesian posterior", namely that it is a distribution which indicates the degree belief in the value of a variable, which should approach certainty in the data asymptotic limit.
It would help if you could provide some sources (textbook, lecture notes, monograph, journal article, etc.) that prove why this "joint posterior" is not a useful or correct mathematical object to return.
I wanted to cite Bernardo/Smith, Bayesian Theory, chapter 5.1.3 as a reference - though it appears I was mistaken in remembering the content, and indeed a joint predictive posterior similar to what you propose is constructed there. I'm slightly surprised about this, though Bernardo/Smith doesn't discuss prediction conditional on covariates, which also surprised me slightly. I'll look into this.
Bishop, by the way, provides predictive distributions for individual test points only - see e.g., section 3.2.2. I don't think the "joint" one appears in Bishop at all, does it?
@fkiraly Do you agree with the following statement:
There exist supervised machine learning models for which the predict method will return a yhat object in which yhat is of the type Distributions.Sampleable.
This is not a well-defined statement, because it is a matter of definition, depending on what you meant with "supervised machine learning model". Since it is not well-defined, I neither agree nor disagree, but think it's not well-defined.
You can of course define your supervised ML model in this way, but then I would contest that such definition is sensible, or the most useful one for an ML toolbox framework.
Perhaps a more useful discussion is: what would you do with an output of type `Distributions.Sampleable`? How would you evaluate the utility of such an output?
This is not a well-defined statement, because it is a matter of definition, depending on what you meant with "supervised machine learning model". Since it is not well-defined, I neither agree nor disagree, but think it's not well-defined.
You can of course define your supervised ML model in this way, but then I would contest that such definition is sensible, or the most useful one for an ML toolbox framework.
Consider the specific example above, taken directly from the MLJ documentation, for a model that fits a distribution to data. For example, in this case, you provide a vector `y`, and the model tries to fit a univariate normal distribution to the data, which it then returns as the output of `predict`. The model is a subtype of `Supervised`. Do you believe that this example is a machine learning model that is appropriate for MLJ?
Perhaps a more useful discussion is: what would you do with an output of type `Distributions.Sampleable`? How would you evaluate the utility of such an output?
There is no single answer to this question. The authors of such models will define appropriate performance evaluation metrics.
For example, for Soss models, I imagine that Chad and I will implement some performance evaluation metrics.
Consider the specific example above, taken directly from the MLJ documentation,
Can you provide a link to the example (above, where is it?), and to the MLJ docs, please?
There is no single answer to this question. The authors of such models will define appropriate performance evaluation metrics.
But this is an important question! You want X to be implemented. So, what are the most common and important things X is used for? What is the most common way to measure whether X was good? Pointers/examples would be helpful.
Saying "there are many things" is just as helpful as saying nothing here...
Consider the specific example above, taken directly from the MLJ documentation,
Can you provide a link to the example (above, where is it?), and to the MLJ docs, please?
I have provided this link multiple times in this pull request.
Also, what's a Soss model? Genuinely unaware/curious. (lit ref please)
There is no single answer to this question. The authors of such models will define appropriate performance evaluation metrics.
But this is an important question! You want X to be implemented. So, what are the most common and important things X is used for? What is the most common way to measure whether X was good? Pointers/examples would be helpful.
Here is a concrete example. Suppose I am doing multiclass classification in Soss. An example performance metric is: expected value of the Brier score.
Soss is one of the probabilistic programming languages (PPLs) in Julia: https://github.com/cscherrer/Soss.jl
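As a sketch of that metric (illustrative helper names, not SossMLJ's actual API): average the Brier score over posterior draws of the class-probability vector.

```julia
using Statistics

# Hypothetical sketch: expected Brier score for one multiclass observation,
# averaged over posterior draws of the class-probability vector (e.g. draws
# from a Soss model's posterior predictive). `onehot` encodes the true class.
brier(p, onehot) = sum((p .- onehot) .^ 2)
expected_brier(p_draws, onehot) = mean(brier(p, onehot) for p in p_draws)

# Toy usage: two posterior draws over 3 classes; the observed class is class 2.
p_draws = [[0.2, 0.7, 0.1], [0.3, 0.5, 0.2]]
onehot = [0.0, 1.0, 0.0]
expected_brier(p_draws, onehot)  # (0.14 + 0.38) / 2 = 0.26
```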
Thanks for the Bernardo/Smith recommendation.
If you look in section 5.1.6 (should start on page 263):
This "predictive density" is exactly the object that I want to return from the `predict` method, in the example that you and I have been discussing.
As you said above, Bernardo and Smith endorse the construction of this object.
If you have the time, it would be great if you could find a source that explains why this construction is incorrect.
This "predictive density" is exactly the object that I want to return from the predict method, in the example that you and I have been discussing.
No, I feel this is likely a misunderstanding of yours, of the notation. In Bernardo/Smith, the x/y are not features/labels, as you might now think! Just because they are y-s and x-es does not mean it is the same as in the supervised learning setting. Instead, the x/y are what one could refer to as "training"/"test" set.
It is hence not identical with what you are looking for - in the predictive case, you need a predictive density that is conditional on covariates. Which Bernardo/Smith, to my surprise, does not contain, as far as I could see - this one is conditional on the "training set" only. The formulation is for a generative distribution (unconditional on covariates).
Thus, as far as I see, you cannot argue that the Bernardo/Smith book would advocate, or endorse, the kind of return type you want.
If you have the time, it would be great if you could find a source that explains why this construction is incorrect.
In science, the burden of proof is with the one making the positive claim - i.e., you need to prove that what you're doing is sensible. https://en.wikipedia.org/wiki/Hitchens%27s_razor https://en.wikipedia.org/wiki/Argument_from_ignorance
I see what you mean. So you are saying that Bernardo and Smith are constructing:
`p(y_testing | y_training)`
And your point is that I want to construct:
`p(y_testing | y_training, x_training, x_testing)`
Are you arguing that the construction of `p(y_testing | y_training)` is valid, but the construction of `p(y_testing | y_training, x_training, x_testing)` by the same method is not valid?
Are you arguing that the construction of p(y_testing | y_training) is valid but the construction of p(y_testing | y_training, x_training, x_testing) by the same method is not valid?
No, I merely say that the reference does not contain a construction for `p(y_testing | y_training, x_training, x_testing)` (and that this surprised me).
I'm also not sure whether the "construction for `p(y_testing | y_training, x_training, x_testing)` by the same method" is identical with yours.
Further, on a minor note, one could also worry that a naive construction for `p(y_testing | y_training, x_training, x_testing)` leaks information from parts of the test set to other parts of the test set - aren't all the other test features then used to fit the predictive method?
I think the situation is more like this:
The author of a popular Julia PPL (https://github.com/cscherrer/Soss.jl) would like to integrate his PPL library into MLJ. There are currently no PPLs integrated in MLJ. And, as far as I understand it, the authors of the other Julia PPLs do not have the time and energy to spend on integrating their PPLs into MLJ.
Additionally, the author is willing to take the lead on integrating his PPL into MLJ: see e.g. Chad's work in the https://github.com/tlienart/SossMLJ.jl and https://github.com/cscherrer/SossMLJ.jl repositories.
However, in order for him to do so, there will need to be a new feature added to MLJ, namely the ability to have supervised machine learning models for which the `predict` method outputs objects of the type `Distributions.Distribution`, or more generally of the type `Distributions.Sampleable`.
How much effort are you and the other MLJ team members willing to spend helping Chad integrate Soss into MLJ?
At this point, I have spent more time on this discussion than I can justify. I apologize, but I cannot spend more time on this discussion.
Thank you to everyone that has been a part of this discussion, including but not limited to: @cscherrer, @azev77, @ablaom, @fkiraly, and @tlienart. (My apologies if I have inadvertently omitted anyone from this list!) I know everyone has put a lot of energy and effort into this discussion. I am very grateful for the time that everyone has spent commenting on this issue.
How much effort are you and the other MLJ team members willing to spend helping Chad integrate Soss into MLJ?
Me? 0.
But I'm not an active MLJ team member, so everyone is very welcome to ignore my ramblings and not consider my opinion in any way "official" for MLJ :-) Just interested in supervised probabilistic predictive models really (and I have been involved with designing the proba interface).
How much effort are you and the other MLJ team members willing to spend helping Chad integrate Soss into MLJ?
I'd say: Chad should be open to work towards interface contracts by the MLJ team, instead of insisting on a substantial re-write that may affect other users - unless the MLJ team is onboard with a substantial re-write (in which case Chad may want to drive it).
I apologize, but I cannot spend more time on this discussion.
Sorry for that, I don't want this to keep you off the thread - I'll withdraw then and finish posting stuff, since I've already posted my opinions on this; the thread now also contains some references and clarifications too that will hopefully be useful to the MLJ team. Feel free to continue discussing here, all.
Also, we already know the test data are i.i.d., so predictive distributions that depend between samples do not make much sense in the general case either.
You're not understanding. Here, maybe this will help you:
Suppose we have an unknown parameter `θ`, and just two observations, `(x1, y1)` and `(x2, y2)`. These are not i.i.d. (such models are just not interesting), but are exchangeable, i.e., they're conditionally independent, given `θ`.
So we're given
P(θ)
P(y1|x1, θ)
P(y2|x2, θ)
(with the same functional form)
Then our goal is to compute
P(y2 | x1, y1, x2)
= ∫ P(y2, θ | x1, y1, x2) dθ
= ∫ P(y2 | θ, x1, y1, x2) P(θ | x1, y1, x2) dθ
= ∫ P(y2 | θ, x2) P(θ | x1, y1) dθ
This shows how the dependence on y values arises from a common dependence on θ. In particular, the typical way to sample from this is to first draw θ from P(θ | x1, y1), and then draw y2 from P(y2 | θ, x2).
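The dependence this derivation produces can be checked numerically. Below is a toy conjugate sketch (covariates suppressed, all numbers illustrative) using exactly the two-step sampling scheme just described: because both predictions share the same draw of θ, their covariance equals the posterior variance of θ.

```julia
import Distributions
using Statistics, Random

Random.seed!(1)

# Toy conjugate model (covariates suppressed): θ ~ N(0,1), y_i | θ ~ N(θ, 1).
# After observing n points with mean ȳ, the posterior is
# θ | data ~ N(n*ȳ/(n+1), 1/(n+1)).
n, ȳ = 5, 2.0
post = Distributions.Normal(n * ȳ / (n + 1), sqrt(1 / (n + 1)))

# Sample the joint predictive: draw θ from the posterior, then y1, y2 | θ
# independently -- the two-step scheme described above.
N = 200_000
θs = rand(post, N)
y1 = θs .+ randn(N)
y2 = θs .+ randn(N)

# The shared draw of θ induces dependence between the components:
# cov(y1, y2) = Var(θ | data) = 1/(n+1) ≈ 0.167
cov(y1, y2)
```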
In science, the burden of proof is with the one making the positive claim - i.e., you need to prove that what you're doing is sensible.
This discussion has not been a technical challenge but a pedagogical one. What we've proposed has been well-established through a long history of active research.
But I'm not an active MLJ team member, so everyone is very welcome to ignore my ramblings and not consider my opinion in any way "official" for MLJ :-)
Works for me.
This discussion has not been a technical challenge but a pedagogical one.
You're being very impolite here... That's basically saying I'm stupid and childish. I tend to think that lines such as these reflect more on the writer than they have relevance for the reader.
Suppose we have an unknown parameter θ, and just two observations, (x1,y1) and (x2,y2). These are not i.i.d. (such models are just not interesting), but are exchangeable, i.e., they're conditionally independent, given θ.
PS: you're conflating the generative data model and the model-based inferences.
But now I'm really gone, bye :-)
@fkiraly we are all on the same team! I think there were a lot of honest misunderstandings in this discussion.
At least that’s been true for me. Now I’m less confused than before, thanks to this discussion...
Yes, thanks to all for the discussion, which I have just caught up on now.
I shall reflect on this a little more before responding in the next few days.
Thank you everyone for a discussion that has been mostly very patient. I can understand the frustration that arose from not having a shared blackboard/whiteboard.
I really appreciate all the comments from @fkiraly, who thinks very carefully about probabilistic prediction, and @cscherrer, who has created a very flexible PPL which from the start supported posterior predictives directly. I remember a discussion with @fkiraly about whether the predictive distribution should really be a random measure as opposed to a posterior average (this is particularly important for HMMs, where marginally the prediction would look unimodal, but as a distribution over trajectories it might be multimodal).
Potential misunderstandings I spotted are:
- `p(y_testing | y_training, x_training, x_testing)` leaks information about the test set
- in the i.i.d. case there is a difference between (X, Y) being i.i.d. and X and Y each being i.i.d. and independent of each other
- $p(\theta \mid x_I, y_I) \propto p(x_I, y_I \mid \theta)\, p(\theta)$
Moving on, a key question is reporting uncertainties and evaluations. I liked the discussion early on about Measurements.jl:
Uncertainties and posterior exporting
Validation
Also we should look at how this is done elsewhere, e.g. mlr3proba or @fkiraly's skpro; see e.g. https://github.com/alan-turing-institute/skpro/blob/master/skpro/vendors/pymc.py .
More later ;)
Thanks @vollmersj , ...
> * `p(y_testing | y_training, x_training, x_testing)` leaks information about the test set

Can you give some more details here? It seems to me predicting `y_testing | y_training, x_training, x_testing` is pretty universal in supervised learning, and the only distinction here is to represent this probabilistically. But maybe I'm missing something.
> * in the i.i.d. case, there is a difference between (X, Y) being i.i.d. and X and Y each being i.i.d. and independent of each other
I didn't see anyone suggesting the former. My point was that, to be very precise about it, the `Y` values in, say, a normal linear model are not i.i.d. If they were, you wouldn't need the `X`s! Instead they're conditionally independent, given `X`. That works for the frequentist case where the parameters are "fixed but unknown". In Bayesian analysis we also need to include the parameters in the conditional. This leads to dependence in the predictions, because we're never given the parameters.
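This dependence is easy to see numerically. The following sketch (Python purely for illustration; all numbers are invented) draws a parameter from a stand-in posterior and then two observations that are conditionally independent given that parameter; marginally the two predictions are strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Stand-in for the posterior p(theta | D) over the unknown parameter.
theta = rng.normal(loc=0.0, scale=1.0, size=n)

# Given theta, the two predictions are conditionally independent:
#   y_i | theta ~ Normal(theta, 0.5^2)
y1 = theta + rng.normal(scale=0.5, size=n)
y2 = theta + rng.normal(scale=0.5, size=n)

# Marginally (theta integrated out) they are correlated:
#   Cov(y1, y2) = Var(theta) = 1,  Var(y_i) = 1.25,  so corr = 0.8.
print(round(float(np.corrcoef(y1, y2)[0, 1]), 2))  # → 0.8
```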
For some new data $x_J, y_J$ there are two options for evaluating the prediction, jointly or marginally:

1. log marginal likelihood = log score $\sum_{j\in J} \int \log(p(x_j, y_j | \theta)) \, p(\theta | x_I, y_I) \, d\theta$
2. log marginal likelihood = log score $\sum_{j\in J} \log(p(y_j | x_j, x_I, y_I))$, effectively treating $x$ as not random.

Both are done in practice and in theory, e.g. Ghosal, Ghosh et al.'s results on posterior consistency for random and non-random covariates.
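To make the difference between the two scoring schemes concrete, here is a small sketch (Python, with a hypothetical conjugate normal setup and invented numbers; the covariate is treated as fixed, so only the log-inside-vs-outside-the-integral distinction is illustrated, not the randomness of $x$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical conjugate setup (all numbers invented):
#   posterior   p(theta | D)  = Normal(m, s^2)
#   likelihood  y | theta     ~ Normal(theta, sigma^2)
m, s, sigma = 1.0, 0.6, 0.5
y_new = 1.4  # one held-out observation; x is treated as fixed here

theta = rng.normal(m, s, size=500_000)  # samples from the posterior

def log_norm_pdf(y, mu, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - (y - mu) ** 2 / (2 * sd**2)

# Scheme 1: posterior expectation of the log-likelihood,
#   int log(p(y | theta)) p(theta | D) dtheta, estimated by Monte Carlo.
score1 = log_norm_pdf(y_new, theta, sigma).mean()

# Scheme 2: log of the posterior predictive density,
#   log int p(y | theta) p(theta | D) dtheta = log Normal(y; m, sigma^2 + s^2).
score2 = log_norm_pdf(y_new, m, np.sqrt(sigma**2 + s**2))

# The two schemes give different numbers (score1 <= score2 by Jensen's inequality).
print(score1 < score2)  # True
```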
This is really interesting, as it's an entirely different sense of "joint" than I've been discussing. I've been taking `X` as known. Then, even for univariate `Y`, there are correlations between the `Y`s.
Please let me know if this still isn't clear. I've tried to give detailed examples, but if you help me see what's not getting through I'm happy to dig in some more.
Uncertainties and posterior exporting
* Even with access to a joint posterior, credible sets can have different forms (e.g. ellipsoids)
* I really like `Distributions.sample`; e.g. MLJ could query for more samples
* do we capture the mismatch between the _true posterior_ and its numerical approximation? (this is hard)
There's a wide range of possibilities here. I've been assuming the MLJ side here is an abstract type, does that sound right?
In some cases like a Gaussian, the distributional result can be exact. Otherwise, it will often take the form of a pair:
There's an example of this here: https://github.com/cscherrer/SossMLJ.jl/issues/7
Validation
* There are lots of metrics for evaluating a probability distribution against a data-generating process, e.g. Wasserstein, KL, Stein, proper scoring rules (log-loss, Brier)
* valuable point by @fkiraly about the complexity of constructing useful metrics over mixed spaces
These metrics are great in cases where we have a "true" distribution and would like to evaluate some approximation to it. I don't see that that's really the case here. Given the model assumptions, we can find a function to sample from the posterior predictive distribution. Where do you see an approximation coming into play? Or maybe you mean this would be for cases where we can compute the posterior but we choose to approximate it, maybe with variational inference or a more constrained model?
OTOH it would be very valuable to have some capability to do posterior predictive checks. I can do this from Soss, but maybe some of it needs to be from the MLJ side as well. Then again, this may be getting ahead of ourselves :)
For better readability of editable comments below:
Yes, @vollmersj, I certainly miss being in the same room with a blackboard!
Here is my initial response to the proposal as kindly detailed by @DilumAluthge. I'm sorry this does include some more technical discussion. Be assured I appreciate acutely your patience to endure these so far. I am being more verbose than you may like, but only to mitigate further possible misunderstandings.
At present, the only kind of probabilistic supervised learning model that MLJ is designed to interface is a model that:
(i) Assumes data $(X_1, y_1), (X_2, y_2), ... $ is generated by an i.i.d process; and
(ii) Is capable of delivering, after seeing training data $D$, a probability distribution $p (y | x, D)$, defined for each new single input observation $x$.
Given a probabilistic scoring rule (e.g., Brier score), the expected loss of the model is then well-defined. There are a number of algorithms, such as cross-validation, implemented in MLJ (and all such ML toolboxes) that take the function $p$ as input and estimate this loss. While not without controversy, these estimates are ubiquitous and well-studied. Furthermore, neither the definition of the expected loss nor the algorithms for estimating it depend on any other feature of the model (e.g., "model is Bayesian", or "model is linear"). It is therefore possible to compare all such models in a consistent way using such estimates, which is crucial.
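As a schematic illustration of such a loss estimate (Python purely for illustration; this is not MLJ's `brier_score` implementation, and the probabilities are invented), the expected Brier loss of a binary probabilistic predictor can be estimated by averaging the pointwise score over held-out data, using nothing but the predicted distributions:

```python
import numpy as np

def brier(p1, y):
    """Classic binary Brier loss for predicted P(y = 1) = p1 and outcome y in {0, 1}."""
    return (p1 - y) ** 2

# Invented predictions p(y = 1 | x_i, D) for four held-out points, with true labels:
p_hat = np.array([0.9, 0.2, 0.7, 0.4])
y_true = np.array([1, 0, 1, 1])

# The expected loss is estimated by the average pointwise score, exactly as
# cross-validation-style estimators do; no other feature of the model is used.
estimate = float(np.mean(brier(p_hat, y_true)))
print(round(estimate, 3))  # 0.125
```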
Here are goals that have been articulated so far, as best as I can gather:
(i) We integrate into MLJ Bayesian models that fit into the framework outlined in 1.
(ii) Certain functionality of Bayesian models not shared by all models (but not unique to them) is exposed in MLJ. Specifically, "correlated predictions" (see 5. below) should be exposed.
(iii) New functionality is added to MLJ that would allow evaluation of Bayesian models in ways that do not fit into the framework outlined in 1, even though the models themselves may do so. This goal requires (ii). (Here I'm thinking of things like implementing `brier_score(dist::Distribution{Vector{T}}) where T`, as discussed in @DilumAluthge's proposal.)
(iv) We additionally integrate Bayesian models that do not fit into the framework outlined in 1, such as models for non i.i.d. processes. Here "integrate" is not the best word, because currently MLJ has little to offer in the way of meta-algorithms to support such models. But the implication seems to be that realizing (iii) would change this (?)
In principle, I do not have objections to any one goal. However:
For me (and I expect most general MLJ users) (i) is the highest priority. It seems to me (i) can be achieved independently of the other goals. I would not support adding i.i.d. Bayes models to MLJ that can be fit into framework 1 but do not actually implement the necessary part of the API needed to include them. This is not to say that i.i.d. models are only valuable as part of the framework, only that integration into MLJ only makes sense if they participate. I realize that to implement this goal it may be necessary to generalize some measures so that they can deal with (vectors of) `Sampler` predictions, and not just `Distribution` predictions, which I would support.
It seems (ii) can be readily achieved somehow. However, I do see a flaw in the current proposal for doing so, in which goal (i) is compromised. See 6 below. I may also simply misunderstand the proposal.
I don't believe (iii) is a trivial undertaking. I therefore suggest these enhancements be added by a new third-party package. Here are some reasons: (a) MLJBase is already large, and the performance measures interface (which should be a package in its own right) is large, growing, and a bit complicated; (b) limited resources now mean it's hard to justify an enhancement that is neither small, simple, nor adding value across the board (to all models); (c) having this externalized might help you rally the necessary expertise and would give you independence (you wouldn't have to wait a week+ for every PR review from me).
Well, (iv) seems to depend on (iii). Realistically, it is probably out-of-scope for now.
Before responding to the specific design proposal, I think I need to clarify the relationship between the API specs and the framework defined in 1.
While the MLJ API specifies that each `Probabilistic` model should implement a `predict(mach, Xnew)` method that returns a vector of probability distributions `[d1, d2, ..., dk]` for each multi-observation input `X` (a table with `k` rows, say), it is tacitly assumed that this method is equivalent to broadcasting a single-observation predict method, corresponding to the distribution $p$ above. In other words, `predict` should just be an implementation of the vector-valued function $P$, given by

$$P (y_1, y_2, ..., y_k | x_1, x_2, ..., x_k, D) = (p(y_1 |x_1, D), p(y_2 |x_2, D), ..., p(y_k | x_k, D)).$$
This assumption is necessary unless we agree to depart from the framework 1 (which would exclude us from comparing all models in a consistent way).
Let me note here a trivial corollary of our assumption: the single component of $P(y_1 | x_1, D)$ is the same thing as the first component of $P(y_1, y_2 | x_1, x_2, D)$, or in MLJ syntax:

`predict(mach, Xnew[1, :])[1] == predict(mach, Xnew)[1, :]`

for any table `Xnew` with two rows. I will call this property *consistency* below.
Distilling previous discussions:
Given a family of probabilistic predictors $p_\theta$, parameterized by $\theta$ (each fitting into the framework of 1 above) and a mixing pdf $w(\theta)$ (possibly depending on the training data $D$) then we can construct a multivariate distribution function
$$ p_{corr}(y_1, ..., y_k | x_1, ..., x_k, D) = \int \prod_i p_\theta(y_i | x_i, D) \, w(\theta) \, d\theta $$
whose marginals are generally correlated. This framework includes Bayesian models, where $w(\theta)$ is the posterior for model hyperparameters.
Note that if we take the special case $k=1$ our multivariate distribution becomes a univariate one, and we obtain a candidate $p(y| x, D)$ for placing a mixture model into the framework 1 (I'm assuming the setting is i.i.d data). However, this does not appear to factor into @DilumAluthge's proposal.
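The fact that the marginals of $p_{corr}$ are correlated, so that the joint is not the product of the marginals, can be verified in the simplest case of an equal mixture of two binary classifiers (numbers invented for illustration; Python used just as a calculator):

```python
# Equal mixture of two binary classifiers: under component A each of two
# observations has P(y = 1) = 0.9; under component B it is 0.1 (invented numbers).
# Within each component the two observations are independent.
w, pA, pB = 0.5, 0.9, 0.1

# Joint probability P(y1 = 1, y2 = 1) under the mixture p_corr:
joint_11 = w * pA * pA + (1 - w) * pB * pB

# Each marginal is P(y_i = 1) = 0.5, so the product of the marginals is 0.25:
marginal_1 = w * pA + (1 - w) * pB
product_11 = marginal_1 ** 2

# 0.41 != 0.25: the mixture's joint is not the product of its marginals.
print(round(joint_11, 2), round(product_11, 2))  # 0.41 0.25
```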
To achieve goal 2(ii), @DilumAluthge is proposing that for a class of models with the new subtype `ProbabilisticJoint`, we should declare that `predict(mach, Xnew)` return a representation of the correlated predictions $p_{corr}$, evaluated on the rows `x_1`, ..., `x_k` of the table `Xnew`. (Actually, version 2 of the proposal just says this needs to be a probability distribution, but in that case there is no suggestion as to how to fit the model into framework 1.)
As I understand it (and maybe I have this wrong), one then obtains the "vector of distributions" required for fitting the model into framework 1 by computing the marginals of `p_corr`? If that is so, and we call the result of this operation `predict_marginals(mach, Xnew)`, then it must be consistent, in the sense of 4. That is, we require

`predict_marginal(mach, Xnew[1, :])[1] == predict_marginal(mach, Xnew)[1, :]`

for any table `Xnew` with two rows. This is evidently not the case. An equal mixture of two binary classifiers already provides a counterexample.
I would be very surprised if there is any way to construct a consistent "vector of distributions" from the correlated predictions `predict(mach, Xnew)`. Of course, the `predict` function (as opposed to a single evaluation) can be used to get this, as the last observation in 5 shows. But the idea that we can convert a `ProbabilisticJoint` model into a regular `Probabilistic` model by simply composing it with a marginalization operation (or any operation) would not appear to work, right?
Very happy to see a revised update to the proposal or have my misunderstandings corrected. However, my own view is that this is not the right approach. Since i.i.d. Bayes models are expected to implement framework 1, they must share all the behaviour of the existing `Probabilistic` models and so ought to have this type. Extra functionality goes on top by adding methods, such as `predict_joint`. To flag those models that support the extra functionality, we could introduce a subtype `JointProbabilistic <: Probabilistic` or a trait. (Actually, the `implemented_methods` trait may already serve this purpose.)
I understand @DilumAluthge has already given this approach a lot of thought, which I appreciate:
I believe that this would lead to way too much code churn throughout the entire MLJ ecosystem. Additionally, it would require a lot of breaking changes in both the MLJ ecosystem as well as all Julia packages that currently implement Probabilistic MLJ models. I think that this would be quite disruptive and would require a lot of person-hours.
It's not clear to me why adding functionality should be disruptive. Could you give an example?
Thanks @ablaom! I'll defer to @cscherrer on the technical specifics.
Broadly speaking, what I'm hearing is that implementing Bayesian models within the existing `Probabilistic` framework will be preferable to adding a new `JointProbabilistic` model type.

In that case, we will require Bayesian models (like all `Probabilistic` models) to implement a `predict` method that satisfies all of the following criteria:
* `predict` returns an `AbstractVector` of `Distributions.Sampleable`s
* the `predict` method is equivalent to broadcasting a single-observation predict method
* `predict` meets the consistency property as defined in @ablaom's comment above.

All of this is consistent with the current implementation of the `Probabilistic` interface, except that we are expanding `predict` to allow an `AbstractVector` of `Distributions.Sampleable`s instead of just an `AbstractVector` of `Distributions.Distribution`s. This is based on the following snippet from @ablaom's comment:

> I realize that to implement this goal it may be necessary to generalize some measures so that they can deal with (vectors of) `Sampler` predictions, and not just `Distribution` predictions, which I would support.
Then, in addition, we allow `Probabilistic` models to optionally implement a `predict_joint` method. We can add a subtype `JointProbabilistic <: Probabilistic` for these models.
Summary:

1. We expand `Probabilistic` models to have `predict` methods that return `AbstractVector{<:Distributions.Sampleable}`, instead of just `AbstractVector{<:Distributions.Distribution}`.
2. We allow measures to take `AbstractVector{<:Distributions.Sampleable}` as input, instead of just `AbstractVector{<:Distributions.Distribution}`.
3. We add a subtype `JointProbabilistic <: Probabilistic`.
4. Models of type `JointProbabilistic` must implement the `predict_joint` method. The result of `predict_joint` must be of type `Distributions.Sampleable`.

Of course, this all requires PPLs (such as Soss.jl) to have a `predict` method that returns `AbstractVector{<:Distributions.Sampleable}`.
Number 3 and number 4 are implemented by: https://github.com/alan-turing-institute/MLJModelInterface.jl/pull/63
See the meta-issue: https://github.com/alan-turing-institute/MLJ.jl/issues/642
For Probabilistic models, the Quick Start Guide says the returned value of `predict` should be a

In Bayesian modeling, the posterior distribution of the parameters leads to a correlation on the predictions, even if the noise on top of this happens to be independent.

Is it currently possible to have correlated responses in `predict`ions? If not, I'd love to see an update to allow this. For example, this would be very easy if `predict` were allowed to return a `Distribution{Vector{T}}` instead of a `Vector{Distribution{T}}`.

For some more context, I'd like to use this for `SossMLJ.jl` and an MLJ interface for `BayesianLinearRegression.jl`.

EDIT: Oops, the types I have here for `Distribution` are wrong, the type here was supposed to indicate the type of the return value.