alexpghayes / distributions3

Probability Distributions as S3 Objects
https://alexpghayes.github.io/distributions3/
Other
100 stars 16 forks source link

Add Empirical() distribution? #98

Open zeileis opened 1 year ago

zeileis commented 1 year ago

Question

For some of our practical meteorological applications we need to deal with probability distributions and probabilistic forecasts that are made via empirical distributions (so-called "ensembles" in the weather forecasting). So to enable all the nice infrastructure from distributions3 and topmodels we (= mostly @RetoStauffer with some input from me) have written a first draft of a distribution and corresponding d/p/q/r functions.

My feeling is that this distribution class is of general interest and could also be relevant in introductory statistics. So should we prepare a PR for distributions3?

Implementation strategy

The idea is that the empirical observations are handled internally like the parameters of a probability distribution. The moments and quantiles are computed from the empirical observations directly. The cdf() uses the empirical CDF (ECDF) and the pdf() can either use hist() or density().

Illustration

Packages: The current code is in topmodels but unexported at the moment.

install.packages("topmodels", repos = "https://R-Forge.R-project.org")
library("distributions3")

Set up a simple empirical sample and a shifted version with some extra observations to obtain another empirical sample:

x <- c(-3, -2, -1, -0.5, 0, 0.5, 1, 2, 3)
y <- c(0, 0, x + 3)
d <- topmodels:::Empirical(list(x, y))
d
## [1] "Empirical distribution (Min. -3, Max.  3, N = 9)" 
## [2] "Empirical distribution (Min.  0, Max.  6, N = 11)"

Moments and quantiles are straightforward:

mean(d)
## [1] 0.000000 2.454545
variance(d)
## [1] 3.562500 4.322727
quantile(d, 0.5)
## [1] 0.0 2.5

Random sampling is with replacement (i.e., corresponding to bootstrapping):

set.seed(0)
random(d, 10)
##      r_1 r_2 r_3 r_4 r_5 r_6 r_7 r_8 r_9 r_10
## [1,]   3   1  -2  -2  -1   0 0.5   1   0    3
## [2,]   1   0   3   6   0   2 5.0   4   2    4

The ECDF is uniquely defined but the PDF via histogram or kernel density depends on binning/bandwidth:

cdf(d, 0)
## [1] 0.5555556 0.2727273
pdf(d, 0, method = "hist") ## default method
## [1] 0.2222222 0.3636364
pdf(d, 0, method = "hist", breaks = 4)
## [1] 0.1666667 0.2272727
pdf(d, 0, method = "density")
## [1] 0.1978979 0.1271395
pdf(d, 0, method = "density", bw = 1)
## [1] 0.1894239 0.1378353
alexpghayes commented 1 year ago

Hmmm. This is interesting. Morally I feel like the Empirical distribution is a discrete distribution over the observed data with density 1 / n on each point. I'm open to this but I think that the implementation might be starting to blur the line between a distribution as a population, and a distributional estimate of that population.

zeileis commented 1 year ago

That was also my first impulse. Reto convinced me that for the data he was interested in, an aaproximated density would be more useful in practice. But given your reaction I would say we should make the discrete behavior the default. The only question remains if we allow histogram/kernel density to be available as alternatives "for convenience". Would you be open to that?

alexpghayes commented 1 year ago

The only question remains if we allow histogram/kernel density to be available as alternatives "for convenience". Would you be open to that?

I'm fairly opposed as currently implemented, because the class structure does not map on to the structure of the statistical objects. I think the right way to do this is to have distribution objects and distributional_estimate objects, and for these to be distinct. The idea being roughly that a distribution object corresponds to a single CDF F, and a distributional estimate corresponds to a single estimate F(X) of F for data X. This is essentially the same complaint I originally had about prodist(). In parametric settings, the distinctions might be otherwise minor, but in non-parametric settings they become more clear. A histogram or a kernel density estimate is very much a distributional estimate, but I don't there is a obvious corresponding distribution. More precisely, I think it is misleading to pick a semi-arbitrary non-parametric estimate and claim that it corresponds to an Empirical distribution. I understand this is splitting hairs to some degree.

In practice, I don't know a convenient way to implement a bunch of parametric distribution S3 objects, and then to also derive a bunch of distributional_estimate S3 classes that correspond to each distribution, but this kind of dual class system is my preference here. At the very least, distributions and distributional_estimates classes should have different print methods. Then things like KDEs and histograms should be implemented as standalone distributional estimators.