Correlated predictions for Probabilistic models

cscherrer commented 4 years ago

For Probabilistic models, the Quick Start Guide says the returned value of predict should be a

vector of Distribution objects, for classifiers in particular, a vector of UnivariateFinite

In Bayesian modeling, the posterior distribution of the parameters leads to a correlation on the predictions, even if noise on top is this happens to be independent.

Is it currently possible to have correlated responses in predictions? If not, I'd love to see an update to allow this. For example, this would be very easy if predict were allowed to return a Distribution{Vector{t}} instead of a Vector{Distribution{T}}.

For some more context, I'd like to use this for SossMLJ.jl and an MLJ interface for BayesianLinearRegression.jl.

EDIT: Oops, the types I have here for Distribution are wrong, the type here was supposed to indicate the type of the return value.

fkiraly commented 4 years ago

@DilumAluthge, from your explanations, I do think it is now mostly clear to me what you want. Thanks a lot for being constructive and helpful.

You would want Bayesian models to be able output an integration kind of predictive posterior across samples. That is, if in your model (in the simpler special case of everything being continuous), you have:

a parameter posterior, say of pdf $p(\theta | traindata)$ varying in $\theta$, of type Distribution{typeof(theta)}, estimated on the training set
a predictive likelihood, say with conditional pdf $p(y|x,\theta)$ where $x$ is the feature value and $y$ is the label value
a test set $(X_1,Y_1),...,(X_M,Y_M)$, assumed coming from the i.i.d. generative data distribution. Let's say $Y_i$ are of type typeY.

you want to return the integrated predictive posterior $\int p(y_1|X_1, theta) ... p(y_M|X_M, theta) p(\theta | traindata) dtheta$ of type Distribution{Array{typeY,1}}, as a joint prediction for the $(Y_1,...,Y_M)$.

As you outlined above, this will not in general be a product of marginals.

Therefore, in terms of representation, this will in general be different from returning a vector with the distributions $\int p(y_i|X_i, theta) p(\theta | traindata) dtheta$, which would be of type Array{Distribution{typeY},1}.

Where in all of the above, the integrals are evaluated by MCMC (or computed analytically, if conjugate), and the representation is as a distribution which has a pdf in the argument(s) $y_i$.

Can you confirm whether my understanding is correct?

fkiraly commented 4 years ago

@DilumAluthge, for the case that my understanding is correct:

I think asking for this is not a good idea, and relies on conflation of two "kinds" of distributions in the Bayesian framework:

belief distributions, i.e., objects of type Distribution{mytype} that encode a Bayesian belief over the value of an object with type mytype
frequency distribution, i.e., objects of type Distribution{mytype} that encode a frequency distribution over an random object taking values in the type mytype

These are not the same, and should not be conflated. It is, however, a common conflation - since both have the same type. You can typically distinguish the two by considering that:

belief distributions, in well-posed models (identifiable, not overparameterized), converge to a delta distribution in the limit of training data. For example, you would expect that your inference on the parameter $\theta$ becomes more and more certain as your training sample size increases. That is, the distribution with pdf $p(\theta | traindata)$ will converge to a delta at the "true" value of $\theta$. Intuitively, your "belief" approaches certainty.
frequency distributions, such as likelihoods, or predictive posteriors, do not exhibit that behaviour. For example, the predictive posterior distribution (with pdf in variable y_i) $\int p(y_i|X_i, \theta) p(\theta | traindata) dtheta$ converges to some posterior distribution, which may or may not be equal with the "true" posterior (given the "true" model). The likelihood distribution (with conditional pdf in variables y,x,\theta) $\int p(y|x, \theta)$ is assumed as part of the model specification, and does not "converge" in this sense. Intuitively, it's an error of categories to even call these a "belief".

How this applies to problems in our discussion specifically:

the "predictive posterior" obtained by the integral rule (integrate out $\theta$) is, as said, a frequency distribution, not a "proper" posterior belief distribution! It is a heuristic to obtain a predictive frequency distribution. The dependence between the sample components, in what looks like the "natural" generalization, is a side effect of conflating the heuristic to obtain a frequency predictor with the "true" posterior on the predictive distribution.
the "true" (belief) posterior of the predictive distribution is a (belief) distribution over (frequency) distributions! Namely, it is the law of the random variable $p(y_i|X_i, \Theta)$, where $X_i$ is fixed (conditioned on), and $\Theta$ is distributed according to the parameter posterior, i.e., the distribution with pdf $p(\theta | traindata)$. This is a distribution over distributions! It has scitype Distribution{Distribution{typeY}}.
if taken over the entire test sample, you are looking at the distribution of the random variable $\prod_{i=1}^M p(y_i|X_i, \Theta),$ with $X_i$ fixed, and $\Theta$ as above. This has scitype Distribution{Array{Distribution{typeY}},1}.
that "true" posterior of the joint predictive distribution will, by the way, be a distribution over i.i.d. distributions.
thus, when applying the heuristic to this "true" posterior, you are left with the vector of marginals, rather than with your construction

Hope this makes sense? Let me know if any questions.

DilumAluthge commented 4 years ago

you want to return the integrated predictive posterior $\int p(y_1|X_1, theta) ... p(y_M|X_M, theta) p(\theta | traindata) dtheta$ of type Distribution{Array{typeY,1}}, as a joint prediction for the $(Y_1,...,Y_M)$.

As you outlined above, this will not in general be a product of marginals.

Therefore, in terms of representation, this will in general be different from returning a vector with the distributions $\int p(y_i|X_i, theta) p(\theta | traindata) dtheta$, which would be of type Array{Distribution{typeY},1}.

Where in all of the above, the integrals are evaluated by MCMC (or computed analytically, if conjugate), and the representation is as a distribution which has a pdf in the argument(s) $y_i$.

Can you confirm whether my understanding is correct?

This is correct.

DilumAluthge commented 4 years ago

that "true" posterior of the joint predictive distribution will, by the way, be a distribution over i.i.d. distributions.

thus, when applying the heuristic to this "true" posterior, you are left with the vector of marginals, rather than with your construction

This is not relevant to my use case. I do not have the "true posterior", so I cannot do any construction that involves applying any heuristic to the "true posterior".

The only objects I have are the ones I described in my "silly model" example.

DilumAluthge commented 4 years ago

converges to some posterior distribution, which may or may not be equal with the "true" posterior (given the "true" model).

Again, in my use case, I am making no assumptions about the "true model" or the "true posterior", so any construction that requires knowledge about the "true model" or "true posterior" is of no use to me.

DilumAluthge commented 4 years ago

I also want to point out that this new JointProbabilistic model is not restricted to Bayesian models.

Basically, I want to be able to create supervised models for which the predict method is allowed to output any object of type Distribution.

In fact, when it comes to these JointProbabilistic models, because they are not necessarily Bayesian models, they may not even have the notion of priors, likelihoods, posteriors, etc.

So let me make an even more general proposal that will cover all use cases.

DilumAluthge commented 4 years ago

My new proposal for the new `JointProbabilistic` model

We will add a new supervised model type named JointProbabilistic defined as such:

abstract type JointProbabilistic <: Supervised end

The only two methods that a JointProbabilistic is required to implement are fit and predict.

Fitting

A JointProbabilistic model must implement the fit method as follows:

MLJModelInterface.fit(model:: JointProbabilisticExampleModel, verbosity::Int, X, y) -> fitresult, cache, report

Predicting

A JointProbabilistic model must implement the predict method as follows:

MMI.predict(model::JointProbabilisticExampleModel, fitresult, Xnew) -> some_distribution::Distribution{T} where T

In other words, when you call predict on a JointProbabilistic model the output will be an object some_distribution which is of type Distribution{T} where T.

DilumAluthge commented 4 years ago

@DilumAluthge, for the case that my understanding is correct:

I think asking for this is not a good idea, and relies on conflation of two "kinds" of distributions in the Bayesian framework:

belief distributions, i.e., objects of type Distribution{mytype} that encode a Bayesian belief over the value of an object with type mytype

frequency distribution, i.e., objects of type Distribution{mytype} that encode a frequency distribution over an random object taking values in the type mytype

These are not the same, and should not be conflated. It is, however, a common conflation - since both have the same type. You can typically distinguish the two by considering that:

belief distributions, in well-posed models (identifiable, not overparameterized), converge to a delta distribution in the limit of training data. For example, you would expect that your inference on the parameter $\theta$ becomes more and more certain as your training sample size increases. That is, the distribution with pdf $p(\theta | traindata)$ will converge to a delta at the "true" value of $\theta$. Intuitively, your "belief" approaches certainty.

frequency distributions, such as likelihoods, or predictive posteriors, do not exhibit that behaviour. For example, the predictive posterior distribution (with pdf in variable y_i) $\int p(y_i|X_i, \theta) p(\theta | traindata) dtheta$ converges to some posterior distribution, which may or may not be equal with the "true" posterior (given the "true" model). The likelihood distribution (with conditional pdf in variables y,x,\theta) $\int p(y|x, \theta)$ is assumed as part of the model specification, and does not "converge" in this sense. Intuitively, it's an error of categories to even call these a "belief".

How this applies to problems in our discussion specifically:

the "predictive posterior" obtained by the integral rule (integrate out $\theta$) is, as said, a frequency distribution, not a "proper" posterior belief distribution! It is a heuristic to obtain a predictive frequency distribution. The dependence between the sample components, in what looks like the "natural" generalization, is a side effect of conflating the heuristic to obtain a frequency predictor with the "true" posterior on the predictive distribution.

the "true" (belief) posterior of the predictive distribution is a (belief) distribution over (frequency) distributions! Namely, it is the law of the random variable $p(y_i|X_i, \Theta)$, where $X_i$ is fixed (conditioned on), and $\Theta$ is distributed according to the parameter posterior, i.e., the distribution with pdf $p(\theta | traindata)$. This is a distribution over distributions! It has scitype Distribution{Distribution{typeY}}.

if taken over the entire test sample, you are looking at the distribution of the random variable $\prod_{i=1}^M p(y_i|X_i, \Theta),$ with $X_i$ fixed, and $\Theta$ as above. This has scitype Distribution{Array{Distribution{typeY}},1}.

that "true" posterior of the joint predictive distribution will, by the way, be a distribution over i.i.d. distributions.

thus, when applying the heuristic to this "true" posterior, you are left with the vector of marginals, rather than with your construction

Hope this makes sense? Let me know if any questions.

I understand and appreciate that you think that having a supervised model type for which the predict method returns an object of type Distribution{T} where T is not a good idea.

However, I hope that you can understand that there are other people that would like to integrate their packages into MLJ that think that this is a good idea, and in fact is the only way to integrate their packages into MLJ.

So I hope that we can agree to disagree, and can still move forward with adding this new model type.

In particular, I think that this will be possible because this new model type JointProbabilistic will have absolutely no affect or change on any existing code or models in the MLJ ecosystem.

I think the new proposal I just made is broad enough and general enough to cover all use cases. And again I want to highlight that it is intended to support any package author that wants to add supervised models to MLJ in which the output predictions are of type Distribution. This new model will not be specific to Bayesian statistics in any way.

DilumAluthge commented 4 years ago

Because my new proposal is as broad and general as possible, I think that we can end the discussion about mathematics and Bayesian statistics.

This new proposal will cover any model that wants to return Distributions. Thus, we cannot make any mathematical assumptions about these models.

DilumAluthge commented 4 years ago

@ablaom @tlienart My new proposal is general enough that it also includes the section of the Adding Models for General Use page with the heading "Models that learn a probability distribution" and describes "Models that learn a probability distribution, or more generally a "sampler" object". Currently you special-case models that fit a distribution to the target y given a void input feature X = nothing. This is simply a special case of my new proposal. So we can actually unify the APIs into the most general case that anyone can use :)

fkiraly commented 4 years ago

Can you confirm whether my understanding is correct?

This is correct.

@DilumAluthge, good that we are on one page now.

However, I feel you have not carefully read my response, or appreciated its mathematical nature.

This is not relevant to my use case. I do not have the "true posterior", so I cannot do any construction that involves applying any heuristic to the "true posterior". The only objects I have are the ones I described in my "silly model" example.

It seems you misunderstand what I meant, or you misunderstand what is a "proper" Bayesian belief posterior. I mean it's a posterior that you can write using Bayes' rule without modification. Can you please make an effort to read carefully what I wrote, and explain it back to me, just like I did with your explanations until we agreed that we agree on the content? Just so we know we are both on the same page.

Short summary:

The "predictive posterior" that comes from the integral rule is not a "proper" Bayesian belief posterior in this sense. It is a predictive frequency distribution.

The conclusion of my discussion is that the "joint posterior" that you want is:

not a proper Bayesian construction
not useful in the case of i.i.d. data

More generally, it does not make sense in the i.i.d. setting, for any model, to predict joint frequency distributions over test samples, since we already know the test samples are independent.

What I've also explained above why the "joint posterior" doesn't make sense from Bayesian perspective. You're simply computing the "wrong" integral in the sense of the reasoning, and that you get joint posteriors is an artefact of that, rather than what the posterior really is.

Because my new proposal is as broad and general as possible, I think that we can end the discussion about mathematics and Bayesian statistics.

I don´t see how the second part of the sentence would follow from the first part, even if I agreed that the first part were true.

I also don´t understand what your Distribution{T} where T above means.

However, I hope that you can understand that there are other people that would like to integrate their packages into MLJ that think that this is a good idea, and in fact is the only way to integrate their packages into MLJ.

I don't think having a joint (over samples) return type makes sense. Further, if you have an empirical distribution that is possibly joint, e.g., from MCMC, it's very easy to compute marginals, so it is not a major integration impediment. Or, use the "right" formula for the predictive posterior in the first place, which leads you to the same outcome.

DilumAluthge commented 4 years ago

I also don´t understand what your Distribution{T} where T above means.

Distribution{T} where T is a UnionAll type.

julia> using Distributions

julia> Distribution{T} where T
Distribution{T,S} where S<:ValueSupport where T

julia> typeof(Distribution{T} where T)
UnionAll

DilumAluthge commented 4 years ago

My point is that my new proposal accounts for any model that produces a Distribution as output. This is in no way restricted to Bayesian models, so the specific discussion about Bayesian models is not relevant.

DilumAluthge commented 4 years ago

Actually, I'm further broadening my proposal. Instead of returning a Distribution, we will allow the JointProbabilistic model to have a predict method that returns an object of type Distributions.Sampleable.

fkiraly commented 4 years ago

Distribution{T} where T is a UnionAll type.

ah, thanks for clarifying. Makes sense.

DilumAluthge commented 4 years ago

I don't think having a joint (over samples) return type makes sense.

Again, I fully acknowledge that you don't think that makes sense. But other people have use cases for which they believe this makes sense.

If you don't need this particular feature, you do not need to use it. It will not have any changes on the existing Probabilistic models.

fkiraly commented 4 years ago

Again, I fully acknowledge that you don't think that makes sense. But other people have use cases for which they believe this makes sense.

This I believe, though it might imply non-trivial work on the interface. And I'm slightly disappointed that you don't seem to want to make the effort to understand what I've been saying - but no one can force you, of course.

I think I've outlined the relevant arguments above, so in the end @ablaom may want to weigh them up.

tl;dr, I think you are using a "bad" formula for your Bayesian posterior and/or your algorithm that you want to interface, which makes you believe you want joints across samples (currently the one motivating use case). You further seem to be subtly conflating some pieces of Bayesian theory. Also, we already know the test data are i.i.d., so predictive ditributions that depend between samples do not make much sense in the general case either.

DilumAluthge commented 4 years ago

(currently the one motivating use case)

As I point out above, another use case that this covers is the case in which you are e.g. fitting a distribution to data by e.g. kernel density estimation. Currently, this is given as a special case here:

My new proposal covers this use case as well: https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Models-that-learn-a-probability-distribution-1

DilumAluthge commented 4 years ago

So, as I understand it, you agree that I have a "joint posterior" or "predictive posterior" or "predictive frequency distribution" or whatever you want to call it.

And, if I understand you correctly, you also agree that the components of this "joint posterior" are not marginally independent.

Is that all correct?

But then you make the argument that it is not mathematically correct to construct or correct this "joint posterior", is that correct?

It would help if you could provide some sources (textbook, lecture notes, monograph, journal article, etc.) that prove why this "joint posterior" is not a useful or correct mathematical object to return.

fkiraly commented 4 years ago

As I point out above, another use case that this covers is the case in which you are e.g. fitting a distribution to data by e.g. kernel density estimation. Currently, this is given as a special case here:

You are probably referring to conditional density estimation? I don't think it is accurate to claim this is another use case: CDE give you predictive distributions that may be dependent over variables, but they are independent over the samples.

DilumAluthge commented 4 years ago

The use case of JointProbabilistic is going to be: any model in which the result of predict is a distribution, i.e. an object of type Distributions.Sampleable.

Consider the example here: https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Models-that-learn-a-probability-distribution-1

When you call yhat = predict(mach, nothing), there is absolutely no way that yhat can be a vector of distributions. yhat must be an object of type Distributions.Sampleable.

DilumAluthge commented 4 years ago

Note that objects of type Distributions.Distribution are subtypes of Distributions.Sampleable.

julia> import Distributions

julia> Distributions.Distribution <: Distributions.Sampleable
true

DilumAluthge commented 4 years ago

I've opened several pull requests.

Since there are multiple pull requests across multiple repositories, I have opened the following meta-issue to keep track of all of the pull requests: https://github.com/alan-turing-institute/MLJ.jl/issues/633

DilumAluthge commented 4 years ago

julia> import Distributions

julia> y = rand(Distributions.Normal(1,2), 100)
100-element Vector{Float64}:

julia> yhat = Distributions.fit(Distributions.Normal, y)
Distributions.Normal{Float64}(μ=0.9995819568314163, σ=1.8659336378188145)

julia> typeof(yhat)
Distributions.Normal{Float64}

julia> yhat isa Distributions.Distribution
true

julia> yhat isa Distributions.Sampleable
true

julia> yhat isa AbstractVector
false

This is an example of a Supervised model in which the yhat is not a vector of distributions. The current Probabilistic interface requires that predict ouput a yhat in which yhat is a vector of distributions. So this example cannot be a Probabilistic model. But, it will be able to be a JointProbabilistic model, or whatever we end up calling it.

DilumAluthge commented 4 years ago

@fkiraly Do you agree with the following statement:

There exist supervised machine learning models for which the predict model will return a yhat object in which yhat is of the type Distributions.Sampleable.

fkiraly commented 4 years ago

So, as I understand it, you agree that I have a "joint posterior" or "predictive posterior" or "predictive frequency distribution" or whatever you want to call it.

This is an important point in my argument! As I said, there are two kinds of distributions: belief and frequency distributions. The problem arises from not keeping them conceptually apart. I am aware that some Bayesian schools only think there are just one "kind" of distribution, and everything is just belief (which, I believe, is not a conceptually coherent belief).

I don't agree with you fully:

I agree that you are considering some distribution that is joint
I think you are not precise enough in saying what distribution it is
I disagree with what I suspect is your "implicit" opinion on what it is; alternatively, you are not computing it in the right way

But then you make the argument that it is not mathematically correct to construct or correct this "joint posterior", is that correct?

Yes, that is the type of the argument I make. Based on a certain (possibly narrow) definition of "Bayesian posterior", namely that it is a distribution which indicates the degree belief in the value of a variable, which should approach certainty in the data asymptotic limit.

It would help if you could provide some sources (textbook, lecture notes, monograph, journal article, etc.) that prove why this "joint posterior" is not a useful or correct mathematical object to return.

I wanted to cite Bernardo/Smith, Bayesian Theory, chatper 5.1.3 as a reference - though it appears I was mistaken in remembering the content, and indeed the joint predictive posterior is similar to what you propose it is constructed there. I'm slightly surprised about this, though Bernardo/Smith doesn't discuss prediction conditional on covariates, which also surprised me slightly. I'll look into this.

Bishop, by the way, provides predictive distributions for individual test points only - see e.g., section 3.2.2. I don't think the "joint" one appears in Bishop at all, does it?

fkiraly commented 4 years ago

@fkiraly Do you agree with the following statement:

There exist supervised machine learning models for which the predict model will return a yhat object in which yhat is of the type Distributions.Sampleable.

This is not a well-defined statement, because it is a matter of definition, depending on what you meant with "supervised machine learning model". Since it is not well-defined, I neither agree nor disagree, but think it's not well-defined.

You can of course define your supervised ML model in this way, but then I would contest that such definition is sensible, or the most useful one for an ML toolbox framework.

Perhaps a more useful discussion is: what would you do with an output of type Distributions.Sampleable? How would you evaluate the utility of such an output?

DilumAluthge commented 4 years ago

This is not a well-defined statement, because it is a matter of definition, depending on what you meant with "supervised machine learning model". Since it is not well-defined, I neither agree nor disagree, but think it's not well-defined.

You can of course define your supervised ML model in this way, but then I would contest that such definition is sensible, or the most useful one for an ML toolbox framework.

Consider the specific example above, taken directly from the MLJ documentation, for a model that fits a distribution to data. For example, in this case, you provide a vector y, and the model tries to fit a univariate normal distribution to the data, which it then returns as the output of predict. The model is a subtype of Supervised. Do you believe that this example is a machine learning model that is appropriate for MLJ?

Perhaps a more useful discussion is: what would you do with an output of type Distributions.Sampleable? How would you evaluate the utility of such an output?

There is no single answer to this question. The authors of such models will define appropriate performance evaluation metrics.

For example, for Soss models, I imagine that Chad and I will implement some performance evaluation metrics.

fkiraly commented 4 years ago

Consider the specific example above, taken directly from the MLJ documentation,

Can you provide a link to the example (above, where is it?), and to the MLJ docs, please?

There is no single answer to this question. The authors of such models will define appropriate performance evaluation metrics.

But this is an important question! You want X to be implemented. So, what are the most common and important things X is used for? What is the most common way to measure whether X was good? Pointers/examples would be helpful.

Saying "there are many things" is just as helpful as saying nothing here...

DilumAluthge commented 4 years ago

Consider the specific example above, taken directly from the MLJ documentation,

Can you provide a link to the example (above, where is it?), and to the MLJ docs, please?

https://alan-turing-institute.github.io/MLJ.jl/dev/adding_models_for_general_use/#Models-that-learn-a-probability-distribution-1

I have provided this link multiple times in this pull request.

fkiraly commented 4 years ago

Also, what's a soss model? Genuienly unaware/curious. (lit ref please)

DilumAluthge commented 4 years ago

There is no single answer to this question. The authors of such models will define appropriate performance evaluation metrics.

But this is an important question! You want X to be implemented. So, what are the most common and important things X is used for? What is the most common way to measure whether X was good? Pointers/examples would be helpful.

Here is a concrete example. Suppose I am doing multiclass classification in Soss. An example performance metric is: expected value of the Brier score.

Soss is one of the probabilistic programming languages (PPLs) in Julia: https://github.com/cscherrer/Soss.jl

DilumAluthge commented 4 years ago

Thanks for the Bernardo/Smith recommendation.

If you look in section 5.1.6 (should start on page 263):

This "predictive density" is exactly the object that I want to return from the predict method, in the example that you and I have been discussing.

As you said above, Bernardo and Smith endorse the construction of this object.

If you have the time, it would be great if you could find a source that explains why this construction is incorrect.

fkiraly commented 4 years ago

This "predictive density" is exactly the object that I want to return from the predict method, in the example that you and I have been discussing.

No, I feel this is likely a misunderstanding of yours, of the notation. In Bernardo/Smith, the x/y are not features/labels, as you might now think! Just because they are y-s and x-es does not mean it is the same as in the supervised learning setting. Instead, the x/y are what one could refer to as "training"/"test" set.

It is hence not identical with what you are looking for - you need a predictive density in the predictive case that's conditional on covariates. Which Bernardo/Smith, to my surprise, does not contain, as far as I could see - this is conditional on the "training set" only. The formulation is for a generative distribution (unconditional on covariates).

Thus, as far as I see, you cannot argue that the Bernardo/Smith book would advocate, or endorse, the kind of return type you want.

fkiraly commented 4 years ago

If you have the time, it would be great if you could find a source that explains why this construction is incorrect.

In science, the burden of proof is with the one making the positive claim - i.e., you need to prove that what you're doing is sensible. https://en.wikipedia.org/wiki/Hitchens%27s_razor https://en.wikipedia.org/wiki/Argument_from_ignorance

DilumAluthge commented 4 years ago

I see what you mean. So you are saying that Bernardo and Smith are constructing:

p(y_testing | y_training)

And your point is that I want to construct:

p(y_testing | y_training, x_training, x_testing)

Are you arguing that the construction of p(y_testing | y_training) is valid but the construction of p(y_testing | y_training, x_training, x_testing) by the same method is not valid?

fkiraly commented 4 years ago

Are you arguing that the construction of p(y_testing | y_training) is valid but the construction of p(y_testing | y_training, x_training, x_testing) by the same method is not valid?

No, I merely say that the reference does not contain a construction for p(y_testing | y_training, x_training, x_testing) (and that this surprised me). I'm also not sure whether the "construction for p(y_testing | y_training, x_training, x_testing) by the same method" is identical with yours.

Further, on a minor note, one could also worry that a naive construction for p(y_testing | y_training, x_training, x_testing) leaks information from parts of the test set to other parts of the test set - aren't all the other test features used then to fit the predictive method?

DilumAluthge commented 4 years ago

I think the situation is more like this:

The author of a popular Julia PPL (https://github.com/cscherrer/Soss.jl) would like to integrate his PPL library into MLJ. There are currently no PPLs integrated in MLJ. And, as far as I understand it, the authors of the other Julia PPLs do not have the time and energy to spend on integrating their PPLs into MLJ.

Additionally, the author is willing to take the lead on integrating his PPL into MLJ: see e.g. Chad's work in the https://github.com/tlienart/SossMLJ.jl and https://github.com/cscherrer/SossMLJ.jl repositories.

However, in order for him to do so, there will need to be a new feature added to MLJ, namely the ability to have supervised machine learning models for which the predict method outputs objects of the type Distributions.Distribution, or more generally of the type Distributions.Sampleable.

How much effort are you and the other MLJ team members willing to spend helping Chad integrate Soss into MLJ?

At this point, I have spent more time on this discussion than I can justify. I apologize, but I cannot spend more time on this discussion.

Thank you to everyone that has been a part of this discussion, including but not limited to: @cscherrer, @azev77, @ablaom, @fkiraly, and @tlienart. (My apologies if I have inadvertently omitted anyone from this list!) I know everyone has put a lot of energy and effort into this discussion. I am very grateful for the time that everyone has spent commenting on this issue.

fkiraly commented 4 years ago

How much effort are you and the other MLJ team members willing to spend helping Chad integrate Soss into MLJ?

Me? 0.

But I'm not an active MLJ team member, so everyone is very welcome to ignore my ramblings and not consider my opinion in any way "official" for MLJ :-) Just interested in supervised probabilistic predictive models really (and I have been involved with desiging the proba interface).

How much effort are you and the other MLJ team members willing to spend helping Chad integrate Soss into MLJ?

I'd say: Chad should be open to work towards interface contracts by the MLJ team, instead of insisting on a substantial re-write that may affect other users - unless the MLJ team is onboard with a substantial re-write (in which case Chad may want to drive it).

I apologize, but I cannot spend more time on this discussion.

Sorry for that, I don't want this to keep you off the thread - I'll withdraw then and finish posting stuff, since I've already posted my opinions on this; the thread now also contains some references and clarifications too that will hopefully be useful to the MLJ team. Feel free to continue discussing here, all.

cscherrer commented 4 years ago

Also, we already know the test data are i.i.d., so predictive ditributions that depend between samples do not make much sense in the general case either.

You're not understanding. Here, maybe this will help. you...

Suppose we have an unknown parameter θ, and just two observations, (x1,y1) and (x2,y2). These are not i.i.d. (such models are just not interesting), but are exchangeable, i.e., they're conditionally independent, given θ.

So we're given P(θ) P(y1|x1, θ) P(y2|x2, θ) (with the same functional form)

Then our goal is to compute

P(y2 | x1, y1, x2)
= ∫ P(y2, θ | x1, y1, x2) dθ
= ∫ P(y2 | θ, x1, y1, x2) P(θ | x1, y1, x2) dθ
= ∫ P(y2 | θ, x2) P(θ | x1, y1) dθ

This shows how the dependence on y values arises from a common dependence on θ. In particular, the typical way to sample from this is

Sample θ from the posterior
Sample y2 from P(y2 | θ, x2)

In science, the burden of proof is with the one making the positive claim - i.e., you need to prove that what you're doing is sensible.

This discussion has not been a technical challenge but a pedagogical one. What we've proposed has been well-established through a long history of active research.

But I'm not an active MLJ team member, so everyone is very welcome to ignore my ramblings and not consider my opinion in any way "official" for MLJ :-)

Works for me.

fkiraly commented 4 years ago

This discussion has not been a technical challenge but a pedagogical one.

You're being very impolite here... That's basically saying I'm stupid and childish. I tend to think that lines such as these reflect more on the writer than they have relevance for the reader.

Suppose we have an unknown parameter θ, and just two observations, (x1,y1) and (x2,y2). These are not i.i.d. (such models are just not interesting), but are exchangeable, i.e., they're conditionally independent, given θ.

PS: you're conflating the generative data model and the model based inferences.

But now I'm really gone, bye :-)

azev77 commented 4 years ago

@fkiraly we are all on the same team! I think there were a lot of honest misunderstandings in this discussion.

At least that’s been true for me. Now I’m less confused then before, thanks to this discussion...

ablaom commented 4 years ago

Yes, thanks to all for the discussion, which I have just caught up on now.

I shall reflect on this a little more before responding in the next few days.

vollmersj commented 4 years ago

Thank you everyone for discussion that has been mostly very patient. I can understand the frustration that arose of not having a joint black/white board.

I really appreciate all the comments that have been from @fkiraly who very careful think about probablistic prediction and @csherer who has created a very flexible PPL which from the start supported posterior predictives directly. I remember a discussion with @fkiraly about should the predictive distribution really be random measure as opposed to posterior averaging (this is particularly important for HMMs where marginally the prediction would look unimodal but as a distribution over trajectories it might be multimodel.)

!Potential! Misunderstandings I spoted are=

p(y_testing | y_training, x_training, x_testing)leaks information about the test set
the i.i.d. case there is a difference if (X,Y) are i.i.d and X and Y are each i.i.d. and independent of each other
In this setting with a general bayesian regression model e.g. $I=1:n$ and $J=(n+1):m$ p(theta |x_{I},y_{I})\propto p(x_{I},y_{I}|theta)p(theta)
for some new data there are two problems - Evaluating the prediction for new data $x_J,y_J$ jointly or marginally:
1. log marginal likelihood = log score $\sum{j\in J} \int \log( p(x{J},y{j}|theta)) p(theta |x{I},y_{I})d\theta$
2. log marginal likelihood = log score $\sum{j\in J} \log( py{J}| x{J}, x{I},y_{I} )$ Effectively treating x at not random. Both is done in practice and in theory e.g. Ghosal, Ghosh et al results on posterior consistency for random and non-random covariates.

Moving on a key question is reporting uncertainties and evaluations. I liked the discussion early on about measurements.jl:

Uncertainties and posterior exporting

Even if having access to a join posterior credible sets can have different forms (e.g. ellipsoids)
I really like Distributions.sample e.g. MLJ could query for more samples
do we capture mismatch true posterior numerical approximation (is hard)

Validation

There are lots of metrics for evaluating probability distribution to data-generating process e.g. Wasserstein, KL, Stein, proper scoring rules (log-loss, Brier)
valuable point of the complexity of constructing valuable metrics over mixed spaces @fkirlay

Also we should look at how this is done elsewhere e.g. mlr3proba or @fkiraly skpro see e.g. https://github.com/alan-turing-institute/skpro/blob/master/skpro/vendors/pymc.py .

More later ;)

cscherrer commented 4 years ago

Thanks @vollmersj , ...

* p(y_testing | y_training, x_training, x_testing)leaks information about the test set

Can you give some more details here? It seems to me predicting y_testing | y_training, x_training, x_testing is pretty universal in supervised learning, and the only distinction here is to represent this probabilistically. But maybe I'm missing something.

* the i.i.d. case there is a difference if (X,Y) are i.i.d and X and Y are each i.i.d. and independent of each other

I didn't see anyone suggesting the former. My point was that to be very precise about it, the Y values in, say, a normal linear model are not i.i.d. If they were, you wouldn't need the Xs! Instead they're conditionally independent, given X. That works for the frequentist case where the parameters are "fixed but unknown". In Bayesian analysis we also need to include the parameters in the conditional. This leads to dependence in the predictions, because we're never given the parameters.

for some new data there are two problems - Evaluating the prediction for new data $x_J,y_J$ jointly or marginally:

1. log marginal likelihood = log score $\sum_{j\in J} \int \log( p(x_{J},y_{j}|theta))  p(theta |x_{I},y_{I})d\theta$

2. log marginal likelihood = log score $\sum_{j\in J} \log(  py_{J}| x_{J}, x_{I},y_{I} )$
   Effectively treating x at not random.
   Both is done in practice and in theory e.g. Ghosal, Ghosh et al results on posterior consistency for random and non-random covariates.

This is really interesting, as it's an entirely different sense of "joint" than I've been discussing. I've been taking X as known. Then, even for univariate Y, there are correlations between the Ys.

Please let me know if this still isn't clear. I've tried to give detailed examples, but if you help me see what's not getting through I'm happy to dig in some more.

Uncertainties and posterior exporting

* Even if having access to a join posterior  credible sets can have different forms (e.g. ellipsoids)

* I really like Distributions.sample e.g. MLJ could query for more samples

* do we capture mismatch _true posterior_ numerical approximation (is hard)

There's a wide range of possibilities here. I've been assuming the MLJ side here is an abstract type, does that sound right?

In some cases like a Gaussian, the distributional result can be exact. Otherwise, it will often take the form of a pair:

A vector of "particles" of parameters drawn from the posterior, and
A function from (x,theta) to a distribution over y (the conditional distribution I mentioned above)

There's an example of this here: https://github.com/cscherrer/SossMLJ.jl/issues/7

Validation

* There are lots of metrics for evaluating probability distribution to data-generating process e.g. Wasserstein, KL, Stein, proper scoring rules (log-loss, Brier)

* valuable point of the complexity of constructing valuable metrics over mixed spaces @fkirlay

These metrics are great in cases where we have a "true" distribution and would like to evaluate some approximation to it. I don't see that that's really the case here. Given the model assumptions, we can find a function to sample from the posterior predictive distribution. Where do you see an approximation coming into play? Or maybe you mean this would be for cases where we can compute the posterior but we choose to approximate it, maybe with variational inference or a more constrained model?

OTOH it would be very valuable to have some capability to do posterior predictive checks. I can do this from Soss, but maybe some of it needs to be from the MLJ side as well. Then again, this may be getting ahead of ourselves :)

ablaom commented 4 years ago

For better readability of editable comments below:

ppl.pdf

Yes, @vollermsj, I certainly miss being in the same room with a blackboard!

Here is my initial response to the proposal as kindly detailed by @DilumAluthge. I'm sorry this does include some more technical discussion. Be assured I appreciate acutely your patience to endure these so far. I am being more verbose than you may like, but only to mitigate further possible misunderstandings.

1. Loss estimates in MLJ and their scope of application

At present, the only kind of probablistic supervised learning model that MLJ designed to interface is a model that:

(i) Assumes data $(X_1, y_1), (X_2, y_2), ... $ is generated by an i.i.d process; and

(ii) Is capable of delivering, after seeing training data $D$, a probability distribution $p (y | x, D)$, defined for each new single input observation $x$.

Given a probabilistic scoring rule (e.g., Brier score) the expected loss of the model is then well-defined. There a number of algorithms, such as cross-validation, implemented in MLJ (and all such ML toolboxes) that take the function $p$ as input and estimate this loss. While not without controversy, these estimates are ubiquitous and well-studied. Furthermore, both the definition of the expected loss, and the algorithms for estimating the loss do not depend on any other feature of the model (e.g., "model is Bayesian", or "model is linear"). It is therefore possible to compare all such models in a consistent way using such estimates, which is crucial.

2. Goals for Bayesian model / MLJ interaction

Here are goals that have been articulated so far, as best as I can gather:

(i) We integrate into MLJ Bayesian models that fit into the framework outlined in 1.

(ii) Certain functionality of Bayesian models not shared by all models (but not unique to them) is exposed in MLJ. Specifically, "correlated predictions" (see 5. below) should be exposed.

(iii) New functionality is added to MLJ that would allow evaluation of Bayesian models in ways that do not fit into the framework outlined in 1, even though the models themselves may do so. This goal requires (ii). (Here I'm thinking of things like implementing brier_score(dist::Distribution{Vector{T}}) where T, as discussed in @DilumAluthge's proposal.)

(iv) We additionally integrate Bayesian models that do not fit into the framework outlined in 1, such as models for non i.i.d. processes. Here "integrate" is not the best word, because currently MLJ has little to offer in the way of meta-algorithms to support such models. But the implication seems to be that realizing (iii) would change this (?)

3. Comment

In principle, I do not have objections to any one goal. However:

For me (and I expect most general MLJ users) (i) is the highest priority. It seems to me (i) can be achieved independently of the other goals. I would not support adding i.i.d. Bayes models to MLJ that can be fit into framework 1 but do no actually implement the necessary part of the API needed to include them. This is not to say that i.i.d. models are only valuable as part of the framework, only that integration into MLJ only makes sense if they participate. I realize that to implement this goal it may be necessary to generalize some measures so that they can deal with (vectors of)Sampler predictions, and not just Distriubtion predictions, which I would support.
It seems (ii) can be readily achieved somehow. However, I do see a flaw in the current proposal for doing so, in which goal (i) is compromised. See 6 below. I may also simply misunderstand the proposal.
I don't believe (iii) is a trivial undertaking. I therefore suggest these enhancements be added by a new third party package. Here are some reasons: (a) MLJBase is already large and the performance measures interface (which should be a package in its own right) is large, growing and a bit complicated; (b) Limited resources now mean it's hard to justify an enhancement that is neither small, simple, nor adding value across the board (to all models); (c ) Having this externalized might help you rally the necessary expertise and would give you independence (you wouldn't have to wait a week+ for every PR review from me).
Well, (iv) seems to depend on (iii). Realistically, it is probably out-of-scope for now.

4. A clarification of the API for probablisitic models

Before responding to the specific design proposal, I think I need to clarify the relationship between the API specs and the framework defined in 1.

While the MLJ API specifies that each Probabilistic model should implement a predict(mach, Xnew) method that returns a vector of probability distributions [d1, d2, ..., dk] for each multi-observation input X (a table with k rows, say) it is tacitly assumed that this method is equivalent to broadcasting a single observation predict method, corresponding to the distribution $p$ above. In other words, predict should just be an implementation of the vector-valued function $P$, given by

$$P (y_1, y_2, ..., y_k | x_1, x_2, ..., x_k, D) = (p(y_1 |x_1, D), p(y_2 |x_2, D), ..., p(y_k | x_k, D)).$$

This assumption is necessary unless we agree to depart from the framework 1 (which would exclude us from comparing all models in a consistent way).

Let me note here a trivial corollary of our assumption: the single component of $P(y_1 | x_1, D)$ is the same thing as the first component of $P(y_1, y_2 | x_1, x_2, D)$, or in MLJ syntax:

predict(mach, Xnew[1, :])[1] == predict(mach, Xnew)[1, :]

for any table Xnew with two rows. I will call this property consistency below.

5. Correlated predictions for "mixture models"

Distilling previous disucssions:

Given a family of probabilistic predictors $p_\theta$, parameterized by $\theta$ (each fitting into the framework of 1 above) and a mixing pdf $w(\theta)$ (possibly depending on the training data $D$) then we can construct a multivariate distribution function

$$ p_{corr}(y_1, ..., y_k | x_1, ..., x_k |D) = \int \prodi p\theta(y_i | x_i, D) d\theta $$

whose marginals are generally correlated. This framework includes Bayesian models, where $w(\theta)$ is the posterior for model hyperparameters.

Note that if we take the special case $k=1$ our multivariate distribution becomes a univariate one, and we obtain a candidate $p(y| x, D)$ for placing a mixture model into the framework 1 (I'm assuming the setting is i.i.d data). However, this does not appear to factor into @DilumAluthge's proposal.

6. On the proposal to integrate models into MLJ

To achieve goal 2(ii) @DilumAluthge is proposing that for a class of models with the new subtype ProbablisiticJoint, we should declare that predict(mach, Xnew) return a representation of correlated predictions $p_{corr}$, evaluated on the rows x_1, ..., x_k of the table Xnew. (Actually, version 2 of the proposal just says this needs to be probability distribution, but in that case there is no suggestion as to how to fit the model into framework 1.)

As I understand it (and maybe I have this wrong) one then obtains the "vector of distributions" required for fitting the model into framework 1 by computing the marginals of p_corr? If that is so, and we call the result of this operation predict_marginals(mach, Xnew), then it must be consistent, in the sense of 4. That is, we require

predict_marginal(mach, Xnew[1, :])[1] == predict_marginal(mach, Xnew)[1, :]

for any table Xnew with two rows. This is evidently not the case. An equal mixture of two binary classifiers already provides a counter example.

I would be very surprised if there is any way to construct a consistent "vector of distributions" from the correlated predictions predict(mach, Xnew). Of course, the predict function (as opposed to a single evaluation) can be used to get this, as the last observation in 5 shows. But the idea that we can convert a ProbablisiticJoint model into a regular Probabilistic model by simply composing it with a marginalization operation (or any operation) would not appear to work, right?

7 Comment

Very happy to see a revised update to the proposal or have my misunderstandings corrected. However, my own view is that this is not the right approach. Since i.i.d Bayes models are expected to implement framework 1, they must share all the behaviour of the existing Probabilistic models and so ought to have this type. Extra functionality goes on top by adding methods, such as predict_joint. To flag those models that support the extra functionality, we could introduce a subtype JointProbabilistic <: Probabilistic or a trait. (Actually, the implemented_methods trait may already serve this purpose.)

I understand @DilumAluthge has already given this approach a lot of thought, which I appreciate:

I believe that this would lead to way too much code churn throughout the entire MLJ ecosystem. Additionally, it would require a lot of breaking changes in both the MLJ ecosystem as well as all Julia packages that currently implement Probabilistic MLJ models. I think that this would be quite disruptive and would require a lot of person-hours.

It's not clear to me is why adding functionality should be disruptive. Could you give an example?

DilumAluthge commented 4 years ago

Thanks @ablaom! I'll defer to @cscherrer on the technical specifics.

Broadly speaking, what I'm hearing is that implementing Bayesian models within the existing Probabilistic framework will be preferable to adding a new JointProbabilistic model type.

In that case, we will require Bayesian models (like all Probabilistic models) to implement a predict method that satisfies all of the the following criteria:

predict returns an AbstractVector of Distributions.Sampleables
this predict method is equivalent to broadcasting a single observation predict method
predict meets the consistency property as defined in @ablaom's comment above.

All of this is consistent with the current implementation of the Probabilistic interface, except that we are expanding predict to allow an AbstractVector of Distributions.Sampleables instead of just an an AbstractVector of Distributions.Distributions. This is based on the following snippet from @ablaom's comment:

I realize that to implement this goal it may be necessary to generalize some measures so that they can deal with (vectors of) Sampler predictions, and not just Distribution predictions, which I would support.

Then, in addition, we allow Probabilistic models to optionally implement a predict_joint method. We can add a subtype JointProbabilistic <: Probabilistic for these models.

DilumAluthge commented 4 years ago

Summary:

Allow Probabilistic models to have predict methods that return AbstractVector{<:Distributions.Sampleable}, instead of just AbstractVector{<:Distributions.Distributions}.
Generalize performance metrics to accept AbstractVector{<:Distributions.Sampleable} as input, instead of just AbstractVector{<:Distributions.Distributions}.
Add the subtype JointProbabilistic <: Probabilistic.
Models of type JointProbabilistic must implement the predict_joint method. The result of predict_joint must be of type Distributions.Sampleable.

DilumAluthge commented 4 years ago

Of course, this all requires PPLs (such as Soss.jl) to have a predict method that returns AbstractVector{<:Distributions.Sampleable}.

DilumAluthge commented 4 years ago

Number 3 and number 4 are implemented by: https://github.com/alan-turing-institute/MLJModelInterface.jl/pull/63

See the meta-issue: https://github.com/alan-turing-institute/MLJ.jl/issues/642

JuliaAI / MLJ.jl