ablaom opened this issue 5 years ago
@ablaom sorry about the delay.
The best option would be to implement LinearAlgebra.norm for Distributions. PR welcome, but you can do that in your own repo (this is technically type piracy, but if it's not implemented in Distributions, it's fine).
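For concreteness, here is a minimal sketch of what such a method could look like in your own repo, assuming the standard closed form for a normal density, ∫ pdf(x)^2 dx = 1/(2σ√π); the signature and the restriction to p = 2 are just assumptions for illustration:

using Distributions, LinearAlgebra

# Sketch only (and, outside Distributions.jl, type piracy as noted above):
# for X ~ Normal(μ, σ), ∫ pdf(x)^2 dx = 1 / (2σ√π), so the L2 norm is its square root.
function LinearAlgebra.norm(d::Normal, p::Real = 2)
    p == 2 || throw(ArgumentError("only p = 2 is implemented in this sketch"))
    return sqrt(1 / (2 * d.σ * sqrt(π)))
end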
@fkiraly Do you want to suggest a shortlist of distributions for which we want to implement the L2 norm, aka LinearAlgebra.norm(•, 2)? Here are all the final types of Distributions.Distribution:
MixtureModel
UnivariateGMM
MvNormal
MvNormalCanon
Arcsine
Bernoulli
Beta
BetaBinomial
BetaPrime
Binomial
Biweight
Categorical
Cauchy
Chi
Chisq
Cosine
Dirichlet
DirichletMultinomial
DiscreteUniform
MvLogNormal
Distributions.GenericMvTDist
EdgeworthMean
EdgeworthSum
EdgeworthZ
EmpiricalUnivariateDistribution
Epanechnikov
Erlang
Exponential
FDist
Frechet
Gamma
GeneralizedExtremeValue
GeneralizedPareto
Geometric
Gumbel
Hypergeometric
InverseGamma
InverseGaussian
InverseWishart
KSDist
KSOneSided
Kolmogorov
Laplace
Levy
LocationScale
LogNormal
Logistic
Multinomial
NegativeBinomial
NoncentralBeta
NoncentralChisq
NoncentralF
FisherNoncentralHypergeometric
WalleniusNoncentralHypergeometric
NoncentralT
Normal
NormalCanon
NormalInverseGaussian
Pareto
Poisson
PoissonBinomial
Rayleigh
Semicircle
Skellam
SymTriangularDist
TDist
TriangularDist
Triweight
Truncated
Uniform
VonMises
VonMisesFisher
Weibull
Wishart
PR welcome. Also, since this is a new feature, a page in the documentation would be great.
Sure - I'll try to make the shortlist short...
Composites: MixtureModel Truncated
Continuous atoms: Beta Chi Chisq Dirichlet Exponential Gamma GeneralizedExtremeValue GeneralizedPareto Gumbel Laplace Logistic Normal Pareto Uniform VonMises Weibull
Discrete atoms: Bernoulli Categorical DiscreteUniform Geometric Hypergeometric Multinomial NegativeBinomial Poisson Skellam
Though, having thought about the matter a little bit: the L2 norm is useful - but what would be even more useful is the integral of the squared pdf between 0 and x, for any x. The same for the cdf.
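To make that quantity concrete, here is a hedged sketch; the helper name is made up, and the quadrature is purely illustrative (it is not a proposal for how to implement this):

using Distributions, QuadGK

# Hypothetical helper, for illustration only: the integral of pdf(d, t)^2
# from the lower end of the support up to x, computed numerically.
function pdf_squared_integral(d::ContinuousUnivariateDistribution, x; lo = minimum(d))
    return first(quadgk(t -> pdf(d, t)^2, lo, x))
end

pdf_squared_integral(Uniform(0, 1), 0.3)  # ≈ 0.3, since the pdf is 1 on [0, 1]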
In addition, I'm not sure whether LinearAlgebra.norm(d) is the best way to do this, because it is the L2 norm of the pdf rather than of the distribution - the latter would be an ill-defined concept.
After all, you could take the L2 norm of any distribution-defining function (if it exists) - pdf, cdf, mgf, etc. - and the result would be different. Why should one single out the pdf, given that it is merely one of multiple ways to uniquely define a distribution, and given that it doesn't even exist for mixed distributions?
Also, would it not be more natural if something like l2norm(d), or similar, gave the distribution of the random variable |X|_2, where X is distributed according to d? That is, if it returned a distribution object rather than a number?
I'm not saying that such behaviour is something I'd particularly be interested in, just that it seems to me like the more natural behaviour when applying functions to distribution objects.
Ok, yes, I see your point; in that case a new function could be created, something like dist_norm. It will have to appear in the documentation.
I might be misunderstanding, but wouldn't the Brier loss simply be:

using Distributions  # provides Distribution, mean, var

function brier_loss(d::Distribution, x0)
    m = mean(d)
    return var(d) + m * (m - 2 * x0) + x0
end
I'm not opposed to having it here, but it does seem that scoring rules like this could also go in a separate package.
@simonbyrne - no, that's not the Brier loss. The Brier loss would be

function brier_loss(d::Distribution, x0)
    return -2 * pmf(d, x0) + l2norm(d)^2
end

with pdf in place of pmf for absolutely continuous distributions.
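As a self-contained illustration of this recipe, here is a hedged sketch for a Poisson prediction; the helper name pmf_squared_l2norm and the closed form Σₖ pmf(k)^2 = e^(-2λ) I₀(2λ) are assumptions for illustration (note that in Distributions.jl, pdf also plays the role of the pmf for discrete distributions):

using Distributions, SpecialFunctions

# Hypothetical closed form: for X ~ Poisson(λ),
# Σₖ pmf(k)^2 = exp(-2λ) * I₀(2λ), where I₀ is the modified Bessel function besseli(0, ·).
pmf_squared_l2norm(d::Poisson) = exp(-2 * d.λ) * besseli(0, 2 * d.λ)

# The Brier loss in the form proposed above:
brier_loss(d::Poisson, x0::Integer) = -2 * pdf(d, x0) + pmf_squared_l2norm(d)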
@simonbyrne Or is your statement that the two are always, analytically, identical? (Proof?)
@simonbyrne It's actually not; here's a disproof: your score is linear in x0. The Brier score (as defined by me) is linear in pmf(x0) and doesn't otherwise depend on x0. Hence the Brier score is in general not linear in x0, so the two can't be identical.
Ignore my comment above, I misunderstood what was being proposed.
From what I can tell, the Brier score is really only defined for binary outcomes (i.e. Bernoulli predictions). It seems like there would be many ways one could generalize it to other distributions, e.g.

brier_loss(d, x0) = (x0 - mean(d))^2

would be an obvious candidate.
Additionally, from what I can tell, I think your definition is off by a linear scaling and offset: e.g. for the Bernoulli case, if x0 == 0, then your expression is equal to

-2*(1-p) + p^2 + (1-p)^2 == (-2 + 2p) + p^2 + (p^2 - 2p + 1) == 2p^2 - 1

(which should be p^2).
Another generalization, which would have your desired property of depending on x0 only through the pdf, would be

brier_score(d, x0) = (1 - pdf(d, x0))^2

though this obviously wouldn't work for continuous distributions.
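One quick way to see the problem for continuous distributions: pdf values are not bounded by 1, e.g.

using Distributions

pdf(Normal(0, 0.1), 0)  # ≈ 3.989, so (1 - pdf(d, x0))^2 is not sensible here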
@simonbyrne I think we should be systematic - there are three high-level questions here: (i) whether the original Brier/squared loss is important and relevant; (ii) whether Distributions functionality supporting it should be implemented; (iii) whether your alternatives are reasonable and viable - this last one is primarily a theory question rather than an implementation question.
I think (i) yes, (ii) yes, and (iii) no - and in addition you are probably not too familiar with probabilistic losses, and maybe a little confused. I hope you do not take this too negatively; I'll try to explain the theory below.
Regarding (i), I believe this is a standard thing, at least in the classification setting. Brier is implemented by relevant packages such as sklearn and mlr, and it is one of multiple standard ways to train or evaluate probabilistic classification models. For regression models, the generalizations are standard as well, but rarer in practice, since common packages do not have good probabilistic interfaces (MLJ does!).
The answer to (ii) is of course up to Distributions.jl devs to decide, but I was of the impression that their feeling was "yes".
Regarding (iii), I think we need to clarify a few points. One can't just arbitrarily define losses for probabilistic predictions - they need to sensibly measure how close the prediction is to the truth, and properness is one such "reasonableness" property.
The alternatives you propose are not proper for the multi-class classification case or the regression case. For binary classification, it turns out that -2 p(y) + |p|_2^2 and the other (commonly used) expression (1 - p(y))^2 happen to be the same up to scaling and offset, so they measure the same thing.
But this is not true in the general case - one simple way to see this is that (1 - p(y))^2 doesn't make much sense for continuous pdfs, which are not upper bounded by 1.
For the same reason, I believe that even in the binary classification case the expression -2 p(y) + |p|_2^2 is the more helpful one, since it is more or less the only one that generalizes to all relevant cases (except mixed distributions, but that's another story).
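The binary-case equivalence can be checked directly: expanding both expressions gives -2 p(y) + |p|_2^2 == 2 (1 - p(y))^2 - 1, for instance

# Bernoulli(p) prediction, observed class y ∈ {0, 1}:
p = 0.3
for y in (0, 1)
    p_y = y == 1 ? p : 1 - p             # pmf of the observed class
    score = -2 * p_y + p^2 + (1 - p)^2   # -2 p(y) + |p|_2^2
    @assert score ≈ 2 * (1 - p_y)^2 - 1  # affine in the classic (1 - p(y))^2
end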
> The answer to (ii) is of course up to Distributions.jl devs to decide, but I was of the impression that their feeling was "yes".
I don't really mind, but my point is that there is nothing really that special about having such rules in Distributions.jl: they could easily go in another package (say ScoringRules.jl, or be incorporated into LossFunctions.jl). Indeed, it is often desirable to work in smaller packages, since it is much easier and quicker to iteratively develop.
In terms of implementation, the main question is: how would you handle cases where no closed-form solution exists?
That is a good point about propriety, and your L2 loss definition does match up with Selten's characterization.
@simonbyrne, great that we are on the same page regarding theory!
Let's chat about implementation then. If you want to support computation of the key proper scoring rules (or proper losses, with the sign convention more common in ML) for the continuous case, you have two options that I can see: (i) on-demand symbolic integration, based on symbolic expressions that the distributions provide; or (ii) the distributions provide explicit expressions for the integrals etc. that appear in the scoring rules. For squared/Brier, this is the 2-norm of the pdf.
Option (i) is infeasible, in my opinion, since you would need to write a symbolic computation engine à la Mathematica from scratch, which seems like overkill.
The other alternative, numerical integration, isn't really one (I would argue), since it can be arbitrarily wrong: that's not only problematic if you have to do it for each data point, it's also a no-go if you want to use the losses for unbiased evaluation (where "unbiased" is meant not in the statistical but in the empirical sense).
I think it is fair to say that symbolic operations are far out of scope of this package.
We could certainly have a pdfnorm(d::Distribution) (or something similar) function for distributions where it can be calculated easily (it would throw a MethodError in other cases).
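A minimal sketch of how that could look; the names and closed forms below are illustrative assumptions, not a settled API, and the MethodError fallback comes for free from dispatch:

using Distributions

# Closed-form squared L2 norms of the pdf, where known:
pdf_squared_l2norm(d::Normal) = 1 / (2 * d.σ * sqrt(π))   # ∫ pdf(x)^2 dx for Normal(μ, σ)
pdf_squared_l2norm(d::Uniform) = 1 / (d.b - d.a)          # ∫ pdf(x)^2 dx for Uniform(a, b)

pdfnorm(d::Distribution) = sqrt(pdf_squared_l2norm(d))

# pdfnorm(Cauchy())  # throws a MethodError: no pdf_squared_l2norm method defined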
yes, exactly!
Wishing for a method l2norm(d::Distribution) that returns the L2 norm of the probability density function for d. Our use case is computing the Brier loss and the integrated squared loss for machine learning models that learn probability distributions: MLJ issue #34