JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Correlated predictions for Probabilistic models #552

Closed cscherrer closed 4 years ago

cscherrer commented 4 years ago

For Probabilistic models, the Quick Start Guide says the returned value of predict should be a

vector of Distribution objects, for classifiers in particular, a vector of UnivariateFinite

In Bayesian modeling, the posterior distribution of the parameters leads to correlation among the predictions, even if the noise on top of this happens to be independent.

Is it currently possible to have correlated responses in predictions? If not, I'd love to see an update to allow this. For example, this would be very easy if predict were allowed to return a Distribution{Vector{T}} instead of a Vector{Distribution{T}}.

For some more context, I'd like to use this for SossMLJ.jl and an MLJ interface for BayesianLinearRegression.jl.

EDIT: Oops, the types I have here for Distribution are wrong; they were only meant to indicate the type of the return value.
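To make the distinction concrete, here is a minimal sketch (plain Distributions.jl, not MLJ API; a Gaussian stands in for whatever posterior predictive a Bayesian regressor would produce, and the numbers are made up):

```julia
using Distributions

# Hypothetical posterior-predictive parameters for three new observations.
μ = [1.0, 2.0, 3.0]
Σ = [1.0 0.8 0.5;
     0.8 1.0 0.8;
     0.5 0.8 1.0]

# What the Quick Start Guide currently prescribes: one univariate distribution
# per observation, which necessarily discards the off-diagonal correlations.
marginal_predictions = [Normal(μ[i], sqrt(Σ[i, i])) for i in 1:3]

# What this issue asks to allow: a single joint distribution over all three
# responses, preserving the correlations induced by the posterior.
joint_prediction = MvNormal(μ, Σ)
```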

ablaom commented 4 years ago

@DilumAluthge This sounds good to me. If there is a devil in the details, it will be in this part:

  1. Generalize performance metrics to accept AbstractVector{<:Distributions.Sampleable} as input, instead of just AbstractVector{<:Distributions.Distribution}.

One question: Do we actually need this for the single-target classification metrics? Can't the sampling be done on the ppl side to get the UnivariateFinite distribution objects? This would avoid creating an interface point on the metrics side for how many samples to take, and so forth.

Edit: Also, it would mean we only sample once, and not every time one wants to approximate the pdf.
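For concreteness, here is a sketch of what sampling on the ppl side could look like; the joint draws are simulated here, and Categorical stands in for UnivariateFinite:

```julia
using Distributions

# Pretend the PPL hands us joint posterior-predictive class draws:
# one row per posterior sample, one column per new observation.
nclasses = 3
joint_samples = rand(1:nclasses, 1000, 5)

# Collapse to one finite distribution per observation by tabulating the
# marginal class frequencies; in MLJ these would become UnivariateFinite.
marginals = map(eachcol(joint_samples)) do draws
    probs = [count(==(c), draws) for c in 1:nclasses] ./ length(draws)
    Categorical(probs)
end
```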

Feel free to spawn a new thread for this discussion.

DilumAluthge commented 4 years ago

@DilumAluthge This sounds good to me. If there is a devil in the details, it will be in this part:

  1. Generalize performance metrics to accept AbstractVector{<:Distributions.Sampleable} as input, instead of just AbstractVector{<:Distributions.Distribution}.

One question: Do we actually need this for the single-target classification metrics? Can't the sampling be done on the ppl side to get the UnivariateFinite distribution objects? This would avoid creating an interface point on the metrics side for how many samples to take, etc.

Feel free to spawn a new thread for this discussion.

Yeah this part is tricky. @cscherrer has some thoughts here: https://github.com/cscherrer/SossMLJ.jl/issues/7

cscherrer commented 4 years ago

Really nice summary, thank you @ablaom !

This was especially helpful:

At present, the only kind of probabilistic supervised learning model that MLJ is designed to interface with is a model that:

(i) Assumes data $(X_1, y_1), (X_2, y_2), \ldots$ is generated by an i.i.d. process; and

(ii) Is capable of delivering, after seeing training data $D$, a probability distribution $p(y | x, D)$, defined for each new single input observation $x$.

Aha! I think I finally understand your use of iid. To this point, I had thought you (and several others) meant for the y values to be iid. But here you're talking about (x,y) pairs being iid, which makes much more sense.

This assumption doesn't hold in a Bayesian context - there (say with parameter theta) we can only say the (x,y) pairs are conditionally independent given theta.
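A tiny simulation of that point (illustrative only): draw theta once per dataset and the y's independently given theta; marginally, the y's end up strongly correlated.

```julia
using Distributions, Statistics

# y's are independent *given* θ, but sharing θ makes them marginally correlated.
function draw_dataset(n)
    θ = rand(Normal(0, 1))        # latent parameter shared by all observations
    return rand(Normal(θ, 0.1), n)
end

datasets = [draw_dataset(2) for _ in 1:10_000]
cor(first.(datasets), last.(datasets))   # ≈ 0.99, despite conditional independence
```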

Say we have a function

marginalize(<: Distribution{Vector}) :: Vector{Distribution}

In terms of scoring models, the correlated predictions will be most useful when making a decision based on an aggregate. Most scoring rules (including Brier) take each predicted distribution in turn, so evaluation on a given d <: Distribution{Vector} will be equivalent to evaluation on marginalize(d). The evaluation function "factors through marginalization".
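For the Gaussian case such a function is easy to write down (an illustrative method only; marginalize is not an existing API):

```julia
using Distributions

# Project a joint Gaussian prediction onto its per-observation marginals.
# Off-diagonal covariance is dropped, but each marginal is exact.
marginalize(d::MvNormal) = [Normal(m, sqrt(v)) for (m, v) in zip(mean(d), var(d))]

d = MvNormal([0.0, 1.0], [1.0 0.9; 0.9 1.0])
marginalize(d)   # 2-element Vector{Normal{Float64}} with the correct marginals
```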

For an example of an "aggregate decision" like this, say we want a simple Markov chain model to predict the price of a given stock tomorrow. x might include a "ticker symbol" ID and yesterday's price, and theta is some latent "market conditions". y is today's price for that same stock. Our model might make the assumption that a given y depends only on the corresponding x and the latent theta.

We could instead make this "wide data", but there are some reasons we might not want to do that.

Now, imagine we buy some collection of stocks and want to forecast its net value, which is a univariate distribution. If we only have access to the marginals, we'll drastically underestimate the variance of this.
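A sketch with made-up numbers: summing the marginals gets the mean of the net value right, but not its variance.

```julia
using Distributions, LinearAlgebra

# Joint prediction for tomorrow's prices of three positively correlated stocks
# (illustrative covariance).
μ = [100.0, 50.0, 75.0]
Σ = [4.0 3.0 2.0;
     3.0 4.0 3.0;
     2.0 3.0 4.0]
w = [1.0, 1.0, 1.0]                    # shares held of each stock

var_joint     = dot(w, Σ * w)          # 28.0: uses the correlations
var_marginals = dot(w .^ 2, diag(Σ))   # 12.0: what the marginals alone can give

netvalue = Normal(dot(w, μ), sqrt(var_joint))   # the forecast we actually care about
```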

I would be very surprised if there is any way to construct a consistent "vector of distributions" from the correlated predictions predict(mach, Xnew).

As I understand your "consistency", it's the same as saying the marginals are correct. Is that right? I agree that it's very hard to do efficiently for the general case, so the fallback method might be very slow. But at least for things like Gaussians (BayesianLinearRegression.jl) or for Soss models using MCMC, it should be relatively easy.

DilumAluthge commented 4 years ago

Meta-issue to track the JointProbabilistic <: Probabilistic subtype: https://github.com/alan-turing-institute/MLJ.jl/issues/642

DilumAluthge commented 4 years ago

We now have the JointProbabilistic model type and the predict_joint generic function, which together satisfy the original use case of this issue.

@cscherrer I think this issue can be closed now?

There may be additional opportunities for design discussion and improvements, but I think those can take place in separate issues that have more narrow and specific scopes.