Correlated predictions for Probabilistic models

JuliaAI / MLJ.jl

A Julia machine learning framework

https://juliaai.github.io/MLJ.jl/

Other

1.78k stars 157 forks source link

Correlated predictions for Probabilistic models #552

Closed cscherrer closed 4 years ago

cscherrer commented 4 years ago

For Probabilistic models, the Quick Start Guide says the returned value of predict should be a

vector of Distribution objects, for classifiers in particular, a vector of UnivariateFinite

In Bayesian modeling, the posterior distribution of the parameters leads to a correlation on the predictions, even if noise on top is this happens to be independent.

Is it currently possible to have correlated responses in predictions? If not, I'd love to see an update to allow this. For example, this would be very easy if predict were allowed to return a Distribution{Vector{t}} instead of a Vector{Distribution{T}}.

For some more context, I'd like to use this for SossMLJ.jl and an MLJ interface for BayesianLinearRegression.jl.

EDIT: Oops, the types I have here for Distribution are wrong, the type here was supposed to indicate the type of the return value.

azev77 commented 4 years ago

@cscherrer this is fascinating. Can you help me understand w/ more precise examples? The Boston housing data has 506 census tracts in Boston in 1970. For each tract (=row): y=median house value ($000s), X=13 predictors (Crime, # rooms etc).

Suppose I train the data on 406 rows. Next for the remaining 100 rows depending on the model I get:

a vector of 100 predictions of E[y_i|X_i] the expected house value in town "i" w/ X_i. Or my "estimand" can be a quantile: conditional median, conditional 90%ile whatever...
a vector of 100 univariate pdfs f(y_i|X_i)
a single multivariate pdf f(y|X) of the predicted median house price in 100 towns. From this joint dist, I can compute E[y|X], marginal pdfs, & get correlations between predictions etc I think 1 & 2 can be obtained from 3.

I think you want 3, unless I misunderstood. I don't think MLJ currently has models that will generate this kind of joint distribution of predictions.

Can you give an example of when I'd want an entire joint distribution? Can you give examples of other libraries that predict joint distributions?

BTW, I'm super excited about SossMLJ.jl !!!

Currently w/ GLM:

X, y =  @load_boston;
model = @load LinearRegressor pkg = GLM
mach = machine(model, X, y)
fit!(mach)
y_hat = predict(mach)

Distributions.Normal{Float64}(μ=30.212372064783466, σ=4.787101274343532)
Distributions.Normal{Float64}(μ=25.267233800068816, σ=4.787101274343532)
Distributions.Normal{Float64}(μ=30.849358585028362, σ=4.787101274343532)

cscherrer commented 4 years ago

Sure, let's have a look at BayesianLinearRegression.jl. There's not yet an MLJ interface for this, but I'd like to move it in that direction.

To fit the Boston housing data, you could start with something like

julia> using MLJ: @load_boston, matrix

julia> X, y =  @load_boston;

julia> using BayesianLinearRegression

julia> m = BayesianLinReg(matrix(X),y);

julia> fit!(m, callback=stopAtIteration(1000))
BayesianLinReg model

Log evidence: -1582.46
Prior scale:  1.76
Noise scale:  5.02

Coefficients:
  1: -0.10 ± 0.03
  2:  0.05 ± 0.01
  3:  0.01 ± 0.06
  4: -0.27 ± 1.56
  5:  5.77 ± 0.27
  6: -0.00 ± 0.01
  7: -0.94 ± 0.20
  8:  0.19 ± 0.07
  9: -0.01 ± 0.00
 10: -0.40 ± 0.11
 11:  0.02 ± 0.00
 12: -0.44 ± 0.05

There you can see the uncertainty on the weights. For data X and weights w, predictions are just X * w + ε, where ε is the noise. This model has uncorrelated noise, but there are correlations in the weights, as you can see with

julia> posteriorVariance(m)
12×12 Array{Float64,2}:
  0.00120121   -4.45211e-5    0.0001189     0.000795587  -0.000117752  …  -0.000595811   3.77849e-7   -0.00010241    1.03071e-5   -0.000312121
 -4.45211e-5    0.000209953   0.000107879   0.000247711  -0.0010403        0.000113984  -1.27839e-5    0.00058897   -7.61066e-7   -4.00438e-5
  0.0001189     0.000107879   0.00390144   -0.0133273     0.00132908       0.00115856   -0.000117487  -0.000762676   3.83036e-6   -0.000405398
  0.000795587   0.000247711  -0.0133273     2.43453      -0.110196         0.000550265  -0.000444113   0.000174963  -0.000219387  -0.0104082
 -0.000117752  -0.0010403     0.00132908   -0.110196      0.0754069        0.0012629    -6.94665e-5   -0.0155167    -0.000165223   0.00574599
  5.11844e-6    2.58532e-5   -6.43591e-5   -0.00272463   -0.00134445   …   4.81465e-5   -2.63727e-6   -4.73274e-5   -2.59417e-6   -0.000275098
  0.000659355  -0.00125357    0.00346245    0.0155312    -0.00808413       0.00171254   -3.87101e-5   -0.00890773   -4.38673e-5   -0.00136053
 -0.000595811   0.000113984   0.00115856    0.000550265   0.0012629        0.00447659   -0.000207638  -9.03099e-5    3.00598e-5    0.00021231
  3.77849e-7   -1.27839e-5   -0.000117487  -0.000444113  -6.94665e-5      -0.000207638   1.51238e-5   -8.25571e-5   -1.26977e-7   -1.20455e-5
 -0.00010241    0.00058897   -0.000762676   0.000174963  -0.0155167       -9.03099e-5   -8.25571e-5    0.0120152    -8.07061e-5   -0.0016579
  1.03071e-5   -7.61066e-7    3.83036e-6   -0.000219387  -0.000165223  …   3.00598e-5   -1.26977e-7   -8.07061e-5    7.29049e-6    7.62124e-6
 -0.000312121  -4.00438e-5   -0.000405398  -0.0104082     0.00574599       0.00021231   -1.20455e-5   -0.0016579     7.62124e-6    0.00243169

Does that help?

cscherrer commented 4 years ago

Come to think of it, you get this whether weights are correlated or not. The problem is that the predictive distribution depends on the weights. Predictions are (for many models) independent conditional on the weights, but for predict we want marginal uncertainty. The common dependence on parameters leads to correlations.

Oh, as for other libraries that do this, I think it comes up any time you're doing Bayesian modeling. In that case we capture the uncertainty of any parameter estimates, and sharing this uncertainty across observations will lead to correlations.

azev77 commented 4 years ago

What you're saying is true more broadly (even w/o Bayesian stuff). Under mild assumptions: the least squares estimator's asymptotic distribution is Normal. It follows that since the coefficients are asy-normal then the predictions are asy-normal.

x =[ones(506) matrix(X)]; n,p =size(x);
β̂ = x\y; ŷ = x*β̂; ε̂ = y-ŷ; s² = (ε̂'ε̂)/(n-p); Σ̂ = s²*(x'x)^-1;
sqrt(s²)
"Theory: If ε~(0,σ²*In) ⇒ β̂ ~ₐ N(β,Σ̂)"

using GLM
m = lm(x,y)
coef(m) - β̂
vcov(m) - Σ̂
GLM.predict(m) -ŷ

#I believe this implies the following
"Theory: If β̂ ~ₐ N(β,Σ̂) ⇒ ŷ=Xβ̂ ~ₐ N(Xβ,XΣ̂X')"
Predicted asy dist of y: f(y|x) =N(Xβ̂,XΣ̂X')
marginal distribution of y: f(yi|xi) =N(Xβ̂ [i],XΣ̂X'[i,i])
conditional expectation of y: E[yi|xi] = Xβ̂ [i]

vcov(m) is the covariance matrix of the estimated coef, not the predictions. x*vcov(m)*x' is the covariance matrix of the predictions. Right now: MLJ.GLM gives Normal(μ=x[i,:]*β̂, sqrt(s²)) So technically MLJ.GLM can give a joint distribution of the predictions, instead it reports the marginals.

In general when you use a statistical method to estimate a vector of parameters θ, you will have a covariance matrix of your estimator. In general the estimates are correlated. Moreover, you will have an asymptotic (or finite sample depending on your assumptions) distribution of θ (else you couldn't do inference). Using this you will have a joint distribution of predictions. I think that's what you're getting at.

@ablaom @tlienart I'm not sure MLJ.GLM gives the correct σ It sets σ_i=s² & μ_i=x[i,:]*β̂ Distributions.Normal{Float64}(μ=30.212372064783466, σ=4.787101274343532) I believe it should be σ_i = sqrt.(diag(x*Σ̂*x'))[i]

Here is source:

function MMI.predict(model::LinearRegressor, fitresult, Xnew)
    μ = MMI.predict_mean(model, fitresult, Xnew)
    σ̂ = GLM.dispersion(fitresult)
    return [GLM.Normal(μᵢ, σ̂) for μᵢ ∈ μ]
end

ablaom commented 4 years ago

Sorry for being a little dense, but I am a little confused about the distribution you are wanting your predict method to return. It would help me if, using Bishop's notation, say, you state precisely the distribution you are wanting return, in the case of Bayesian linear regression. For simplicity, let's suppose the precision parameters α and β are prescribed a priori (vanilla Bayes, not empirical Bayes). Then, could you furthermore state how one makes the passage from your distribution to the usual vector of predictive distributions (as defined by 3.57, p. 156, in Bishop). (As I read your explanation, this doesn't appear to be possible - hence my confusion.)

The reason the API requirement for predict has the form it has is because performance evaluation metrics, such as cross-entropy or Brier score, expect that form. Perhaps you can conceive of metrics which make use of the "more informative" predictions you have in mind (assuming that is indeed what they are). However, this is probably of limited utility because other probability-predicting models, with which one would like to compare to Bayesian models, either do not or cannot deliver the more informative predictions.

It's true that we could put the responsibility on the user to convert the more informative predictions to standard vectors of predictive distributions, before passing to evaluations, as we do for users who want to compare (ordinary) probablisitic models to deterministic ones. However, I strongly expect most users (who are not hard-core Bayesians) would prefer a Bayesian model to behave just like any other probabilistic supervised model.

So, I would maintain that we still need a predict method that returns a vector of distributions (the "predictive distributions"). To expose the "more informative" predictions we could add a new operation, like predict_??? (suggest a name) and define a trait or new Probabilistic subtype to mark those models that can do this. But we should agree on what the output type of such a method should look like, etc.

Alternatively, without adding to the API at all, we could expose the method as part of fitted_params or report.

cscherrer commented 4 years ago

Ok. Bishop's notation of t vs bold t is very confusing to me, so I hope it's ok to write his 3.57 like this:

eq1

One really nice thing about Monte Carlo is that integration over a given sampled variable just means ignoring its output. So sampling looks like

Sample from w | y, α, β
Sample from ynew | w, β
Ignore the w

The ynew components are independent, conditional on w. But they are not marginally independent. In a given sample, the components must share a w, or the result will be invalid. Each component on its own will be fine, but taken together they can be way off.

As for the name, maybe predict_joint? Then we could contrast this with predict which is implicitly marginal.

One other thing... in a sampling context, "prediction" could mean sampling y, or just sampling yhat (with model uncertainty, but no observation noise). Maybe we need a kwarg to distinguish the two?

ablaom commented 4 years ago

Thanks for taking the time to explain this, @cscherrer . This has forced me to straighten out some Bayesian basics that I had overlooked before. I believe I understand what you do now.

More abstractly, if I have a collection of parameterized models Ω_θ (a space plus probability measure) describing "sampling once", and I have worked out a posterior distribution p(θ), after looking as some historical data (sampling many times), then I can get a new parameter-independent model for "sampling n-times" in two non-equivalent ways: (i) I can average the Ω_θ over θ using p(θ) and then construct the product of these averages; or (ii) I can first take the product Ω_θ ⊗ ⋯ ⊗ Ω_θ and then average. Both spaces have the same "look-at-one component" marginals, but theses marginals are only mutually independent in case (i).

You argue that the space in case (ii) is more meaningful because sampling there is like taking multi-samples from the full space Ω × Θ but excluding those that do not share the same θ value, which makes sense.

Is that the gist of it?

edit

Both spaces have the same "look-at-one component" marginals

Actually, I no longer believe this. The marginals are generally different!

ablaom commented 4 years ago

@azev77 I would maintain that the implementation of GLM in MLJ is correct. The std of the predicted target distribution is independent of the feature input x, by definition. The Bayesian case is different because you do not select a particular model from the initial parameterised family but an average of these models over the posterior (or their multi-sample versions as discussed above) to get the final parameter-free model, and this averaged model is generally not in the family you started with.(Linear Bayesian is already a case-in-point. In that case the std of the probability distribution for Y given a single new observation x indeed depends on the observation.)

But maybe I am missing something?

azev77 commented 4 years ago

@ablaom I thought more about it. I think we have different estimands. You have the predicted distribution. This is likely what MLJ users expect. I have the distribution of the point prediction E[y|x]. Read my note: https://stats.stackexchange.com/questions/469195/distribution-of-ols-predictions

You have a mighty big brain.

That being said, it would be a natural extension is to allow heteroskedasticity.

ablaom commented 4 years ago

That being said, it would be a natural extension is to allow heteroskedasticity.

Feel free to open an issue. We a could at least expose more of the outcomes of fitting to make this possible, if they are not already. Maybe the coefficient/intercept variances are already returned in report or as fitted_params - I forget.

azev77 commented 4 years ago

Gonna have to think about this.

Homoskedastic (current) model: y ~ N(xβ, σ)

use training data to estimate β, σ
predicted distribution (plug in estimates): y | x_new ~ N(x_new β^, σ^)

Heteroskedastic model: y ~ N(xβ, σ=h(x,α)) w/ skedastic fcn h()

use training data to estimate β
generate squared residuals ε_i^2
use training data to estimate ε_i^2 = exp(x α) {eg: exponential-affine}
predicted distribution: y | x_new ~ N(x_new β^ , exp(x_new α^) )

Here is a recent paper that uses SVR to estimate the skedastic function (w/ R code): http://econ.ucsb.edu/~startz/FGLS_MachineLearning.pdf

cscherrer commented 4 years ago

@DilumAluthge and I had some discussion of this on Slack, to which @ablaom responded:

Just catching this discussion now. Quite swamped just now but wanted to just jump in briefly. For the reasons @dilum has re-iterated (performance metrics act on the “marginal” predictions not on the more detailed “joint” predictions) I don’t think there is justification for implementing new behaviour for predict in the case of PP models. We add a predict_joint as Chad suggested at the issue. But the PP model will need to implement predict as well, a method that returns a vector of distributions (or samplers). I see some discussion above has centred around the next question, what exactly should the object predict_joint return, which makes sense. Thanks for pushing this along.

Thanks Anthony.

Just to be clear, there's really nothing here specific to PPL. It's hard for me to imagine a useful model where predictions are really conditionally independent for a given X.

For MCMC inference, we'll end up with a collection of models. If the observations are modeled as conditionally independent, each of these would fit neatly into the current Probabilistic type.

Some things to note here:

This conditional independent often doesn't hold, for example in time-series models
It's worth considering the comparison to bagged decision trees, for which we have a collection of Deterministic models
In all of these cases, the "collection" is really a weighted average. More specifically, it's a mixture model.
There's no reason this mixture has to be discrete. Outside of PPL, it's very common to have a continuous latent space. Variational autoencoders are a good example of this, as are multivariate Gaussian models.

DilumAluthge commented 4 years ago

EDIT: this proposal has been replaced by my new proposal in this comment: https://github.com/alan-turing-institute/MLJ.jl/issues/552#issuecomment-671031252

My proposal

I've been giving this some thought. I think we need to create a new type of supervised MLJ model called something like GeneralizedProbabilistic. So we will have:

abstract type GeneralizedProbabilistic <: Supervised end

Name of this model type

Here are some ideas for naming this new model type:

GeneralizedProbabilistic
Bayesian

Bayesian is more concise, but as Chad as pointed out, this use case is not restricted to Bayesian models, so GeneralizedProbabilistic might be better.

Suggestions for other names are welcome.

The `fit` method

Fitting a GeneralizedProbabilistic model will work the same as fitting any other kind of supervised model.

The `predict` method

When the predict method is called on a GeneralizedProbabilistic model, it must return a Vector{Distribution{T}}.

Performance metric example

Suppose I have a Probabilistic model, and I want to evaluate its performance using the Brier score. I do so as follows:

I fit my model as usual.
I call predict on my model. The input to predict is the model, the fitresult, and my holdout testing data set. The output of predict is a Vector{Distribution{T}} where T is the scitype of the y that was passed to fit.
I take this Vector{Distribution{T}} and I pass it as input to brier_score. As output, I receive a single Float64, which is the Brier score of my fitted model when evaluated on the holdout testing set.

To summarize:

Calling brier_score(::Vector{Distribution{T}}) gives me the Brier score of my fitted Probabilistic model when evaluated on the holdout testing set.

Now, instead, suppose I have a GeneralizedProbabilistic model, and I want to evaluate its performance using the Brier score. I do so as follows:

I fit my model as usual.
I call predict on my model. The input to predict is the model, the fitresult, and my holdout testing data set. The output of predict is a Distribution{Vector{T}} where T is the scitype of the y that was passed to fit.
I take this Distribution{Vector{T}} and I pass it as input to brier_score_distribution. As output, I receive a Vector{Float64}, which numerically represents the probability distribution on the Brier score.

Therefore, I have arrived at the probability distribution on the Brier score. Now, I can, for example, plot this probability distribution to get an idea of what my Brier score looks like for my GeneralizedProbabilistic model. Or, for example, I could do kernel density estimation to estimate the probability density function of the probability distribution on the Brier score.

Of course, users will want a quick and easy way to summarize this probability distribution on the Brier score. Taking the expected value is one reasonable approach. So we can define convenience methods like this:

function brier_score(dist::Distribution{Vector{T}}) where T
    return mean(brier_score_distribution(dist))
end

Therefore, brier_score(dist::Distribution{Vector{T}}) is giving me the expected value of the Brier score for my fitted model when evaluated on the holdout testing set.

To summarize:

Calling brier_score(::Distribution{Vector{T}}) gives me the expected value of the Brier score of my fitted GeneralizedProbabilistic model when evaluated on the holdout testing set.
Calling brier_score_distribution(::Distribution{Vector{T}}) gives me the probability distribution of the Brier score of my fitted GeneralizedProbabilistic model when evaluated on the holdout testing set.

Of course, currently we do not have methods like brier_score_distribution(dist::Distribution{Vector{T}}) for any of our performance metrics. So, we will need to work on implementing these methods for the most popular performance metrics. I can work on implementing these methods, although I will probably need Chad to review/check my code.

Possible questions

Question: Why do we need a new type of supervised model? Why not just make modifications to the existing Probabilistic type to make it support this more general case?

Answer: After a lot of contemplation, I believe that this would lead to way too much code churn throughout the entire MLJ ecosystem. Additionally, it would require a lot of breaking changes in both the MLJ ecosystem as well as all Julia packages that currently implement Probabilistic MLJ models. I think that this would be quite disruptive and would require a lot of person-hours. In contrast, adding a new GeneralizedProbabilistic model type will not break any existing Probabilistic MLJ models or any existing implementations of probabilistic metrics.

It certainly may be the case Probabilistic type is a special case of the proposed GeneralizedProbabilistic type. However, I just don't think that we can justify the huge amount of code churn that we would cause if we made any changes to the existing Probabilistic type.

Furthermore, although it may be a special case, I think it's a very common use case, and a lot of people will work entirely in the context of the current Probabilistic models without needing to think about the more general case.

Question: Can we make GeneralizedProbabilistic a subtype of Probabilistic?

Answer: No. According to the documentation, when the predict method is called on a Probabilistic model, it must return an AbstractVector whose elements are Distributions, with one distribution for each row of the input features X that were passed to predict. As Chad has explained, there is no correct way for a Soss model to implement a predict method that outputs a vector of distributions. Therefore, these model cannot be subtypes ofProbabilistic.

This means that we need a new model type GeneralizedProbabilistic that is not a subtype of Probabilistic. Instead,GeneralizedProbabilistic will directly subtype Supervised, i.e. GeneralizedProbabilistic <: Supervised.

cscherrer commented 4 years ago

Hi @DilumAluthge , thanks for writing this up!

As I think you pointed out in Slack, the current Probabilistic is restricted to representing the marginals. In contrast, it's generally useful to have access to the joint distribution. maybe Joint should be part of the name to make this explicit?

Also, it's natural to have a function

marginalize(joint::Distribution{Vector}) :: Vector{Distribution}

that just "forgets" the dependencies between the components. Note that the result of marginalize (and Probabilistic models in general) will have a strong tendency to give much too broad credible intervals on the predictions.

fkiraly commented 4 years ago

@ablaom, @cscherrer, @DilumAluthge, let me perhaps chime in, since I spot some common mistunderstandings on probabilistic predictions.

I think there are two cases to distinguish, in terms of what one "might" want: (i) predicting a joint distribution, where joint is taken across indices in the test sets. For example, you predict a univariate regression output, but across M test points. In this case, you may like to predict a M-variate distribution, e.g., an M-variate normal parameterized by M-variate mean and an MxM covariance matrix (ii) predicting a joint distribution, where joint is taken across variables. For example, you wish to predict a multivariate outcome with n variables, in which case you wish to predict an a n-variate distribution, e.g., an M-variate normal parameterized by n-variate mean and an nxn covariance matrix.

There is of course also the "union" case, where you wish to make a prediction for M instances across n variables, where you want to predict an Mn-variate distribution. I won´t discuss this below due to reasons stated below.

One misunderstanding that I think I spot: If your data is i.i.d., then doing (i) does not make any sense, in general! Why: the marginal prediction is always at least as good, as any given joint, in terms of predictive likelihood (this is a mathematical theorem). The intuition behind this is due to the statisticals independence of the "true" labels in the generative processes. Minor note: for other scoring rules, this may not be true, but improvement may be due to Stein type paradoxes, so can be argued not to be "real".

There is no such problem for endeavour (ii), as long as you use proper losses or scoring rules across variables - since the variables are in general dependent in the generative process. But of course you have the problem in the "combination" case, which means there´s no real case for the Mn-variate joint.

Further, one problem that @DilumAluthge otherwise excellent design overlooks: the "typical" performance metrics will not work if you have mixed discrete/continuous scitype, or mixed predictive distributions. You need composite losses here, there is no vanilla Brier for the mixed case. Thus, you would have to even further extend the "losses" interface, in a way where the scitype interface becomes compatible with the distributions scitype interface - e.g., ::ProbabilisticLoss{Vector{T}}.

fkiraly commented 4 years ago

PS:

I'm against calling probabilistic predictive models "Bayesian", simply since not all are.
I'm still in favour, and @ablaom may remember this discussion, to assign a scitype not only to data containers, but also to models. The scitype of a model is a composite type, with components indicating input and output scitype - where in this case the output would be mixed/multivariate and probabilistic.

cscherrer commented 4 years ago

Thanks @fkiraly . I mostly agree, with the exception of this:

If your data is i.i.d., then doing (i) does not make any sense, in general! Why: the marginal prediction is always at least as good, as any given joint, in terms of predictive likelihood (this is a mathematical theorem).

I'm not sure what you mean here. Whether you marginalize or not gives you two different things - the predictive density or the marginal predictive density. The first includes the value of the parameter, so you can't really compare these.

In any case, the predictive density is interesting, but it's really just the beginning. The typical use case is to evaluate the pushforward of the predictions through some loss function and evaluate its expected value. We can then use this to make decisions (optimal ones, when the model is correct).

Order of operations matter here, and marginalizing before computing the loss will give incorrect results unless the loss happens to be linear.

DilumAluthge commented 4 years ago

Alright, so let's decide on a name for this new type of model. It seems like Bayesian probably isn't the best name.

Here is a list of some candidates:

GeneralProbabilistic
GeneralizedProbabilistic
JointProbabilistic - from Chad's suggestion above

DilumAluthge commented 4 years ago

Once we have picked a name, I can make a pull request to add this new type of supervised model. After that PR is merged, then I think @cscherrer will be able to continue work on the Soss.jl + MLJ.jl integration and BayesianLinearRegression.jl + MLJ.jl integration.

DilumAluthge commented 4 years ago

Also, it's natural to have a function
marginalize(joint::Distribution{Vector}) :: Vector{Distribution}
that just "forgets" the dependencies between the components.

This would be a nice helper function to have. I'm not sure where it should live.

Chad, is it possible to implement this function for arbitrary Distribution{Vector}? Or will we need to have specific methods for each specific Distribution?

cscherrer commented 4 years ago

I think it will have to be specific to each case.

Say we have X and y as usual, and also some parameters θ. For MCMC inference, θ will be represented as a collection, typically a Vector but it's also possible to have this as an iterator.

But that's not always the case. For multivariate normal models (and some others), we'll have the posterior distribution of θ in closed form. This will also be the case for models fit using variational inference.

If we have methods for each case, we should still be able to common methods outside MLJ, right? For different kinds of models and/or different PPLs? So for example, I should be able to say, "use this method when Soss is doing HMC", that kind of thing.

DilumAluthge commented 4 years ago

If we have methods for each case, we should still be able to common methods outside MLJ, right?

Definitely.

tlienart commented 4 years ago

Alright, so let's decide on a name for this new type of model. It seems like Bayesian probably isn't the best name.

Here is a list of some candidates:

GeneralProbabilistic

GeneralizedProbabilistic

JointProbabilistic - from Chad's suggestion above

Maybe Generative ? otherwise JointProbabilistic has my vote as it's the one that's most clearly differentiated from the existing Probabilistic

fkiraly commented 4 years ago

I'm not sure what you mean here. Whether you marginalize or not gives you two different things - the predictive density or the marginal predictive density. The first includes the value of the parameter, so you can't really compare these.

In any case, the predictive density is interesting, but it's really just the beginning. The typical use case is to evaluate the pushforward of the predictions through some loss function and evaluate its expected value. We can then use this to make decisions (optimal ones, when the model is correct).

@cscherrer, I´m making a mathematical statement.

Namely, as long as your data generating process is i.i.d., the expectation you mention (which you can´t compute but only estimate) is going to be lower for the marginal prediction than for the joint, for any joint distribution that is not a product of marginals.

Hence, there is no sense in trying to make probabilistic predictions that are non-trivially joint across test data points. In two senses:

it would be less performant
we know the "truth" is a product of marginals

fkiraly commented 4 years ago

Also, it's natural to have a function

marginalize(joint::Distribution{Vector}) :: Vector{Distribution}

that just "forgets" the dependencies between the components.

This would be a nice helper function to have. I'm not sure where it should live.

I´d say: distributions.jl

fkiraly commented 4 years ago

If we have methods for each case, we should still be able to common methods outside MLJ, right? For different kinds of models and/or different PPLs? So for example, I should be able to say, "use this method when Soss is doing HMC", that kind of thing.

I don´t think there needs to be any special design considerations for Bayesian predictive models. They can be just treated as any other model that produces a probabilistic prediction. Whatever that is, as long as it is a distribution object, is up to the method, e.g., empirical/discrete as MCMC spits out, or parametric, as variational or conjugate spit out.

fkiraly commented 4 years ago

Regarding naming: why can´t it just be Probabilistic with prediction type being multi-variate?

DilumAluthge commented 4 years ago

Regarding naming: why can´t it just be Probabilistic with prediction type being multi-variate?

See:

According to the documentation, when the predict method is called on a Probabilistic model, it must return an AbstractVector whose elements are Distributions, with one distribution for each row of the input features X that were passed to predict. As Chad has explained, there is no correct way for a Soss model to implement a predict method that outputs a vector of distributions. Therefore, these model cannot be subtypes of Probabilistic.

fkiraly commented 4 years ago

As Chad has explained, there is no correct way for a Soss model to implement a predict method that outputs a vector of distributions

I see now why you think that, but I don´t really follow the key point of the argument, i.e., that a soss model can't implement a predict method that outputs a vector a distributions.

Can you please specify the mathematical scitype of the output? Why is that not a vector of distributions? Trying to understand the logical implication here, which is simply a statement about types.

cscherrer commented 4 years ago

Namely, as long as your data generating process is i.i.d., the expectation you mention (which you can´t compute but only estimate) is going to be lower for the marginal prediction than for the joint, for any joint distribution that is not a product of marginals.

You're talking apples and oranges here. The benefit of computing expected loss isn't that it's "small", but that it correctly propagates parameter uncertainty through the predictions.

Now, it's true that some metrics are defined in terms of the marginals. In that case, of course you need to marginalize before doing the computation.

Hence, there is no sense in trying to make probabilistic predictions that are non-trivially joint across test data points. In two senses:

it would be less performant

Again, apples and oranges.

we know the "truth" is a product of marginals

This is incorrect. Correlated parameters yield correlated predictions, even if the response values are modeled as exchangeable.

This would be a nice helper function to have. I'm not sure where it should live.

I´d say: distributions.jl

It could make sense for the function signature to be there, but the question here is about the implementation. @DilumAluthge I'd say this method should be defined wherever the distribution itself is defined.

I don´t think there needs to be any special design considerations for Bayesian predictive models. They can be just treated as any other model that produces a probabilistic prediction. Whatever that is, as long as it is a distribution object, is up to the method, e.g., empirical/discrete as MCMC spits out, or parametric, as variational or conjugate spit out.

I agree, in terms of the interface. But we'll need implementations specialized to different cases.

Regarding naming: why can´t it just be Probabilistic with prediction type being multi-variate?

In principle, yes. I understood MLJ to be have strong requirements in terms of shapes. Is there a simple example you can point me to with a design matrix X and response vector y (one for each row of X) where responses are treated as multivariate?

I see now why you think that, but I don´t really follow the key point of the argument, i.e., that a soss model can't implement a predict method that outputs a vector a distributions.

It's not a question of Soss's capabilities. Emphasis is on correct, not Soss.

Can you please specify the mathematical scitype of the output? Why is that not a vector of distributions? Trying to understand the logical implication here, which is simply a statement about types.

The distinction here is between Distribution{::Vector} and Vector{::Distribution}. I'll be writing up details more concretely, maybe that will make this easier to follow.

fkiraly commented 4 years ago

we know the "truth" is a product of marginals

This is incorrect. Correlated parameters yield correlated predictions

Can you kindly just write down, in math, what you even mean here, or what precisely you think is incorrect.

In the i.i.d. model of supervised learning, the test set consists of i.i.d samples $(X_1,Y_1), ..., (X_M,Y_M)$, where $X_i$ and $Y_i$ are random variabes with "data frame row" type. By definition, the perfect prediction is the conditional law, of (Y_1,...,Y_M)|(X_1,...,X_M). Which, by the assumption, that everything is i.i.d., is just the product of the conditional laws of Y_i|X_i.

Thus the "true" (or perfect) prediction is a product of marginals. q.e.d.

fkiraly commented 4 years ago

It's not a question of Soss's capabilities. Emphasis is on correct, not Soss.

would you then mind to spell out the argument, i.e., what you mean by a "correct" model?

The distinction here is between Distribution{::Vector} and Vector{::Distribution}. I'll be writing up details more concretely, maybe that will make this easier to follow.

I know that this is the key scitype difference, and I am not conflating this.

The other key difference, at danger of conflation, is variables vs samples. Rows vs columns in a data frame, if you so will. Would you kindly mind specifying precisely, optimally in math, what form the prediction takes? And what the "joint" distribution is over?

DilumAluthge commented 4 years ago

One really nice thing about Monte Carlo is that integration over a given sampled variable just means ignoring its output. So sampling looks like

Sample from w | y, α, β

Sample from ynew | w, β

Ignore the w

The ynew components are independent, conditional on w. But they are not marginally independent. In a given sample, the components must share a w, or the result will be invalid. Each component on its own will be fine, but taken together they can be way off.

fkiraly commented 4 years ago

apples and oranges.

To specify which type of fruit, vegetable, or animal you mean: may I kindly ask you to specify what precise mathematical object you wish for a posterior and/or predictive distribution of, given the setting of i.i.d. supervised learning that I assume we are discussing here. Just to make sure we shouldn´t be talking about potatoes.

fkiraly commented 4 years ago

One really nice thing about Monte Carlo is that integration over a given sampled variable just means ignoring its output. So sampling looks like

    Sample from w | y, α, β
    Sample from ynew | w, β
    Ignore the w

The ynew components are independent, conditional on w. But they are not marginally independent. In a given sample, the components must share a w, or the result will be invalid. Each component on its own will be fine, but taken together they can be way off.

@DilumAluthge, I´ve seen this - but this is ambiguous, since you haven´t defined your symbols. What is ynew, what is its type, how does it relate to the symbols in the common supervised learning setting.

cscherrer commented 4 years ago

Can you kindly just write down, in math, what you even mean here, or what precisely you think is incorrect.

Sure. https://github.com/alan-turing-institute/MLJ.jl/issues/552#issuecomment-634366688

In the i.i.d. model of supervised learning, the test set consists of i.i.d samples $(X_1,Y_1), ..., (X_M,Y_M)$, where $X_i$ and $Y_i$ are random variabes with "data frame row" type. By definition, the perfect prediction is the conditional law, of (Y_1,...,Y_M)|(X_1,...,X_M). Which, by the assumption, that everything is i.i.d., is just the product of the conditional laws of Y_i|X_i.

People use "i.i.d." is this context very loosely. Even here, you're implicitly conditioning on X. And in almost all models, the y values depend on the value of the parameters. When people say "i.i.d." they almost always mean "exchangeable", or conditionally independent, given X and the parameters.

I´ve seen this - but this is ambiguous, since you haven´t defined your symbols. What is ynew, what is its type, how does it relate to the symbols in the common supervised learning setting.

It's not ambiguous, you just need to do the reading.

I'd like to move this forward and would love for you to help if you're interested. But honestly, this is not helping.

DilumAluthge commented 4 years ago

https://github.com/cscherrer/BayesianLinearRegression.jl/blob/master/README.md

DilumAluthge commented 4 years ago

Also, the Bishop textbook is available free on Microsoft's website. Go to https://www.microsoft.com/en-us/research/people/cmbishop/ and click on "Pattern Recognition and Machine Learning by Christopher Bishop".

fkiraly commented 4 years ago

I´ve seen this - but this is ambiguous, since you haven´t defined your symbols. What is ynew, what is its type, how does it relate to the symbols in the common supervised learning setting.

It's not ambiguous, you just need to do the reading.

I´m very sorry to insist, but it is ambiguous since you have not defined the type of ynew. Is this a single test data point? Or are these multiple? It seems quite crucial given the discussion and the difference betwen the joints (i) and (ii) that I outline above.

No amount of reading of other sources will give me clarity about your personal intention at the meaning of the symbol.

I also think I am familiar with Bayesian linear regression and conjugate predictive posteriors.

I'd like to move this forward and would love for you to help if you're interested. But honestly, this is not helping.

I hope to help - in a way - by trying to make clear what it is you are suggesting. Let´s be friendly and constructive :-)

Sorry if my comments came across otherwise - let´s separate disagreement in the matter (e.g., whether a mathematical statement is correct or not) from disapproval of a person. Apologies if any formulation of mine gave a different impression.

fkiraly commented 4 years ago

Also, the Bishop textbook is available free on Microsoft's website.

Thanks - I already have it and I read it.

fkiraly commented 4 years ago

People use "i.i.d." is this context very loosely. Even here, you're implicitly conditioning on X. And in almost all models, the y values depend on the value of the parameters. When people say "i.i.d." they almost always mean "exchangeable", or conditionally independent, given X and the parameters.

Sorry, but exchangeability and i.i.d., as conditions, are not identical. These are different mathematical properties/assumptions.

For an ordered tuple of random variables (Z_1,...,Z_N), Exchangeability means: the law of the tuple (Z_pi(1), ..., Z_pi(N)), where pi is a permutation of the integers 1,...,N, does not depend on pi. i.i.d.-ness means: the Z_i are (mutually) independent, and follow the same law.

A counterexample for exchangeable variables being i.i.d. is as follows: Let X_1,X_2,X_3 be i.i.d. standard normal. Let Zi = sum{i\neq j} X_j. Then (one can show that) the tuple (Z_1,Z_2,Z_3) is exchangeable, but not i.i.d.

The converse holds though - an i.i.d. tuple is always exchangeable.

fkiraly commented 4 years ago

conditionally independent

I´m not sure what conditional independence you are referring to here, can you explain? What is assumed to be conditionally independent of what?

DilumAluthge commented 4 years ago

Let's take a really silly model as an example.

We have our training features matrix X_train, which is a matrix of floats with n_train rows and p columns.

We have our training labels vector y_train, which is a column vector of floats with n_train elements.

Our model will have a single scalar parameter θ. The domain of θ will be the set {1, -1}

Use the notation y_i to mean the i-th element of y. Use the notation X_i to mean the i-th row of X.

Let's define our likelihood: p(y_i | θ, X_i) as follows.

If θ = 1, then there is a 50% chance that y = 2 and a 50% chance that y = 3
If θ = -1, then there is a 50% chance that y = -2 and a 50% chance that y = -3

Now suppose that the posterior distribution p(θ | y, X) has the following form:

There is a 50% chance that θ = 1 and a 50% chance that θ = -1.

Now I receive a new features matrix X_test, which is a matrix of floats with n_test rows and p columns. I don't know what the y_test values are, so I want to predict them. That is, I want to understand the distribution p(y_test | X_test, y_train, X_train). I can sample from this distribution as follows.

Sample (e.g. use MCMC) from p(θ | y_train, X_train).
Take the value of θ from step 1, and use it to sample from p(y_test | θ, X_test) = p(y_test | θ, X_test, y_train, X_train). More specifically, for i in {1, 2, ..., n_test}, I sample from p(y_test_i | θ, X_test_i) = p(y_test_i | θ, X_test_i, y_train, X_train).
Ignore the θ

To be clear, X_test is a matrix of floats with n_test rows and p columns. y_test is a column vector of floats with n_test elements. I use X_test_i to denote the ith row of X_test. I use y_test_i to denote the ith row of y_test.

Let's do this. Suppose that n_test = 2. I sample from p(θ | y_train, X_train). I know that there is a 50% chance that I get 1, and a 50% chance that I get -1. Suppose that I get θ = 1. Now, for i in {1, 2}, I draw a sample from p(y_test_i | θ, X_test_i). Since θ = 1, I know that for each row, there is a 50% chance that I get 2 and a 50% chance that I get 3. Suppose that I end up with the sample y_test_1 = 2 and y_test_2 = 3; in other words, I get the sample y_test = [2, 3]. So my sample is θ = 1 and y_test = [2, 3]; this is a sample from the distribution p(y_test, θ | X_test, y_train, X_train). I throw away the θ, and I am left with the sample y_test = [2, 3]; This is a sample from the distribution p(y_test | X_test, y_train, X_train).

y_test_1 and y_test_2 are independent conditioned on θ. If θ = 1, then y_test_1 and y_test_2 each have the same distribution: 50% chance of being 2 and 50% chance of being 3. If θ = -1, then y_test_1 and y_test_2 each have the same distribution: 50% chance of being -2 and 50% chance of being -3. If you already know the value of θ, then getting the value of y_test_1 gives you no information on y_test_2, and vice versa.

But y_test_1 and y_test_2 are not independent. For example, if you don't know the value of θ, and I tell you that y_test_1 = 2, then you know that y_test_2 must be either 2 or 3. So, if you don't know the value of θ, then getting the value of y_test_1 gives you information on y_test_2, and vice versa.

cscherrer commented 4 years ago

Sorry, but exchangeability and i.i.d., as conditions, are not identical. These are different mathematical properties/assumptions.

Exactly. You said

If your data is i.i.d., then doing (i) does not make any sense, in general!

And my point is that if the y values are really i.i.d., then X doesn't matter. And Bayesian analysis is right out! If your y values depend on an X and a θ, then they''re not i.i.d.

People still use the term in this case, as you seemed to be doing.

I´m not sure what conditional independence you are referring to here, can you explain? What is assumed to be conditionally independent of what?

de Finetti's Theorem makes a connection between conditional independence and exchangeability. In this case the y values are conditionally independent of each other, given X and θ.

But now I can't help but ask... @fkiraly , what exactly is your goal and interest in this?

It seems to me you've come here with an a priori assumption that there's no merit in this approach. I've done my best to be clear, but you're still asking us to repeat things we've already said (e.g., ctrl-f for "given").

Would you really like to understand this, or are you just hoping to convince us it's a bad idea?

I'll be writing up a more concrete example that I hope will help. I'm sorry my patience has worn thin, but we're just going around in circles.

fkiraly commented 4 years ago

But now I can't help but ask... @fkiraly , what exactly is your goal and interest in this?

I try to help you make your ideas more precise, and - more generally - the MLJ project with interface design decisions.

What I think is happening here, in my opinion, is that you are conflating certain mathematical objects (e.g., generative distributions and posterior distributions) which gets us walking in circles when trying to suggest a "sensible" software interfae.

I´m trying to make the discussion more precise, since I´m not entirely sure whether the discussion is based on a misunderstanding - "approach without merit", as you provocatively formulate - or whether there is something to learn here or the MLJ interface, and it´s just a bit buried under a superficial layer of imprecision.

If I may try to explain what I think is going on here, as far as I can see. Please correct me if I´m wrong.

You want MLJ support joint predictive posteriors. So far so good - a goal I would agree in. But there´s two ways the posteriors could be joint - (i) across samples, and (ii) across variables. Since my first post in this thread, I don´t think you have answered or tried to help establish which of these you want. I also gave you my opinion that (i) might not necessarily be helpful in an i.i.d. setting, or desirable to have in a toolbox like MLJ, while (ii) certainly is.

Down the line, I feel there is also confusion between

assumptions that the Bayesian model makes about the generative data model ("reality")
the generative data model - "reality"

It is not an uncommon confusion, but a confusion nonetheless to conflate statements that a model/algorithm makes with what can be generally assumed.

For software design this is important, because it is conneted to the question of attributing an output to an algorithm/model or to the framework. Obviously, you wouldn´t want statements or objects that are algorithm specific in the framework.

For example, a posterior will always be a "statement" subject to a model assumption, therefore model specific. In software, this thus needs to be associated with the model, not the framework. And it seems to me you want the framework interface to be subject to posteriors from a very specific Bayesian model (soss or otherwise).

More concretely, in your last post when you say: "And Bayesian analysis is right out! If your y values depend on an X and a θ, then they''re not i.i.d." it seems to me you are conflating two things:

the generative data model, where the (X_i,Y_i) are well i.i.d., which has nothing to do with any model
the model predictions - more precisely, the joint conditionals on the model, and the model´s predictions (or inferences) which of course won´t be i.i.d., since they are all conditional on the model.

This is compounded by the fact that you are not very precise in saying what mathematical objects you are referring to. Dropping indices, not saying what objects are, switching notation mid-paragraph, or using synomously mathematical concepts that are different, all that doesn´t help.

In summary, I´m all for exploring the connection interface between MLJ and the Bayesian modelling paradigm, but we ought to be really precise - both in math and in algorithmics/software, as the devil is in the details. We could start with defining the posterior you want the model to produce - trying to specify and separate:

the data model (model agnostic)
the theoretical predictive posterior that you are referring to
the posterior produced by the Bayesian algorithm, which in general (especially in MCMC) will not be identical with the theoretical posterior

DilumAluthge commented 4 years ago

But there´s two ways the posteriors could be joint - (i) across samples, and (ii) across variables.

We are discussing joint across samples, i.e. joint across indices in the test set.

Can you take a look at my toy example above, and let me know if there is anything you disagree with? My toy example demonstrates that y_test_1 and y_test_2 are independent conditioned on θ, but y_test_1 and y_test_2 are not independent.

cscherrer commented 4 years ago

Thanks @fkiraly , this is much more clear.

But there´s two ways the posteriors could be joint - (i) across samples, and (ii) across variables. Since my first post in this thread, I don´t think you have answered or tried to help establish which of these you want.

I've tried to make this clear. Even in my May 24 example before you arrived, the response is univariate. We're talking about dependent values across samples.

Of course we could also have multivariate response, in which case those dependencies need to be represented as well. I've been assuming MLJ can already do this.

I also gave you my opinion that (i) might not necessarily be helpful in an i.i.d. setting, or desirable to have in a toolbox like MLJ, while (ii) certainly is.

Right, but we're not talking about i.i.d. models. We're talking about models with exchangeable responses. It seems to me you're conflating these.

Down the line, I feel there is also confusion between

assumptions that the Bayesian model makes about the generative data model ("reality")

the generative data model - "reality"

We've mostly been talking about the former, though of course "all models are wrong", and one of the goals of posterior predictive checks is to characterize this wrongness.

This is compounded by the fact that you are not very precise in saying what mathematical objects you are referring to.

Github's not very LaTeX-friendly. Here's an excerpt from Bishop's book that might be helpful:

bishop-predictive-distribution

He uses t for the response instead of y (always a bit jarring to me). And when we use MCMC, we can only approximate this using our sample. [I've advocated for iterator-based samplers instead of arrays, which would allow more samples after the fact for a better approximation. But that's another story]

DilumAluthge commented 4 years ago

This would be a nice helper function to have. I'm not sure where it should live.

I´d say: distributions.jl

It could make sense for the function signature to be there, but the question here is about the implementation. @DilumAluthge I'd say this method should be defined wherever the distribution itself is defined.

I think that the empty generic function stub can go in MLJModelInterface.jl. So MLJModelInterface.jl will have the following single line of code only:

function marginalize end

And then I think that each specific method definition can go wherever the relevant distribution is defined.

For example:

If Soss.jl defines MySossDistribution, then the method MLJModelInterface.marginalize(dist::MySossDistribution) will live in Soss.jl.
If Turing.jl defines MyTuringDistribution, then the method MLJModelInterface.marginalize(dist::MyTuringDistribution) will live in Turing.jl.
If Example.jl defines MyExampleDistribution, then the method MLJModelInterface.marginalize(dist::MyExampleDistribution) will live in Example.jl.
If Distributions.jl defines SomeJointDistribution, then the method MLJModelInterface.marginalize(dist::SomeJointDistribution) will live in Distributions.jl.

fkiraly commented 4 years ago

Can you take a look at my toy example above, and let me know if there is anything you disagree with? My toy example demonstrates that y_test_1 and y_test_2 are independent conditioned on θ, but y_test_1 and y_test_2 are not independent.

Thanks, @DilumAluthge. You are not entirely consistent with notation (dropping the index i here and there), but I think I understand what you are saying - you are computing the predictive posterior of the "integrating out" kind, using MCMC, and obtaining a posterior distribution across samples which is not independent, correct?

I´ll comment on that in a moment in a more extensive post.

One secondary point on your example: I think for the argument to be pertinent, you would need to separate the algorithm that computes the posterior (MCMC) from the "theoretical" posterior, these are in general not the same. However, the property of "non-independence" between samples would also hold for the theoretical one, so I´ll talk about that "cleaned" version of your argument below.

fkiraly commented 4 years ago

@cscherrer, please be constructive in your responses.

As I said:

I´m familiar with the contents of the Bishop book
I´m familiar with Bayesian predictive posteriors

There´s no need to copy-paste to me generic sections from the Bishop book, I would kindly ask you to comment on the mathematical specifics on my question.

Right, but we're not talking about i.i.d. models. We're talking about models with exchangeable responses. It seems to me you're conflating these.

"we're not talking about i.i.d. models" is an unjustified assumption - perhaps you aren´t.

The statistical data setting that MLJ, and for that, most other ML toolbox packages assumes, is i.i.d. data.

Further, I feel that both of you are conflating:

the statistical data setting, without any model being present - this is i.i.d.
distributions that drop out of a model, e.g., a Bayesian predictive posterior - these may or may not be i.i.d.

I am asking you:

please specify the precise output object that you would like your model to return - optimally, using mathematical symbols. If this is a belief posterior, say over what object the belief distribution is.
please use clean mathematical notation, well-defined symbols/concepts. Optimally, you also refer to the symbols I introduced, or the cases (i) or (ii) I defined. You can of course also use your own symbols, but please be clear and specific.

The page you copied is generic in the sense that it describes a specific kind of predictive posterior for a Gaussian distributed model, and unhelpful in the sense that it does not make reference to my specific asks or in a way that links notation used.

JuliaAI / MLJ.jl

Correlated predictions for Probabilistic models #552

My proposal

Name of this model type

The fit method

The predict method

Performance metric example

Possible questions

The `fit` method

The `predict` method