JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems

Variance/bias-corrected entropy estimators #237

Closed kahaaga closed 1 year ago

kahaaga commented 1 year ago

Warning: wall of text

Context and literature review

The method entropy(::Shannon, ::ProbabilitiesEstimator, args...), although we don't state it anywhere, is in the literature referred to as the "naive", "maximum likelihood", "plug-in" or "empirical" estimator of Shannon entropy. It is simply the entropy of the empirical distribution, as determined by relative frequency counts. Denote this estimator H_naive.
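To make this concrete, here is a minimal sketch of H_naive (illustrative helper names, not the package API): the relative frequencies are plugged directly into the Shannon formula.

# Naive / plug-in / maximum-likelihood Shannon entropy from raw outcome counts.
# `counts[i]` is the number of times outcome i was observed.
function plugin_shannon(counts::AbstractVector{<:Integer}; base = 2)
    N = sum(counts)
    probs = counts ./ N   # relative frequencies = MLE of the outcome probabilities
    return -sum(p * log(base, p) for p in probs if p > 0)
end

plugin_shannon([5, 3, 2])   # entropy of the empirical distribution [0.5, 0.3, 0.2]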

It is a well-known fact that this estimate is negatively biased. It has also been shown that the variance of this estimator has an upper bound (a function of the sample size), and that it has a certain asymptotic bias that depends on 1) the number of outcomes/states with nonzero probability and 2) the sample size (e.g. Paninski, 2003). Consequently, there exist not just one, but many entropy estimators that try to correct this bias and/or minimise the variance of the estimated entropy. Some existing estimators are:

EDIT: bias-corrected estimators for Rényi entropy:

There are many papers detailing/summarizing different discrete entropy (corrected) estimators. A nice, short example is Arora et al. (2022), in which I found the references for many of the estimators above.

It seems pretty standard in many applications to use the corrected estimators (although they are not as prevalent in the NLD literature I've seen, which presents all these entropy-based complexity measures).

Moreover, corrections such as these appear not only in the isolated context of estimating entropy, but also become relevant when estimating e.g. discrete mutual information. I wanted to bring up the issue here, so that we can agree on an overall implementation strategy, before I spend a lot of time on it upstream.

How to implement this type of error correction?

To reiterate: entropy(::Entropy, ::ProbabilitiesEstimator) is in the literature referred to as the maximum-likelihood estimator of entropy. This holds true for any entropy type, not just Shannon. It is called the maximum likelihood estimator because it is based on the maximum likelihood estimates of the probabilities (i.e. relative frequencies). Plugging these maximum-likelihood estimates of probabilities into any formula will give a plug-in/maximum-likelihood/naive estimator of whatever quantity that formula computes. Upstream, for discrete (conditional) mutual information, it is implicitly this naive estimator that is used at the moment (but corrections are possible, and should be available).

The estimators referred to above all make corrections to discrete Shannon entropy. But the plug-in/maximum-likelihood/naive estimator is equally valid for any other entropy. The other corrected estimators may not be valid estimators of other types of entropies (I presume, but haven't actually delved that deeply into the literature yet).
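For example, a plug-in Rényi estimator is obtained by substituting the same relative frequencies into the Rényi definition (again just a sketch with illustrative names):

# Plug-in Rényi entropy of order q ≠ 1: substitute the MLE probabilities into
# H_q = log(sum(p_i^q)) / (1 - q).
function plugin_renyi(counts::AbstractVector{<:Integer}; q = 2, base = 2)
    probs = counts ./ sum(counts)
    return log(base, sum(p^q for p in probs if p > 0)) / (1 - q)
end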

The question then is:

What is a good strategy for implementing these different estimators here?

One thing we definitely have to do straight away is to specify that we only provide the naive estimator, perhaps citing Paninski (2003) for that. We should also mention that probabilities returns the naive/maximum-likelihood estimates of the probabilities (yes, I think "maximum likelihood" should appear explicitly, because it is so commonly used in the entropy literature).

Some options:

Introduce a DiscreteEntropyEstimator type

Just like for differential entropy, introduce a `DiscreteEntropyEstimator` abstract type, and let `MLE <: DiscreteEntropyEstimator` (short for maximum likelihood estimator), `MillerMadow <: DiscreteEntropyEstimator`, `JackknifeEntropy <: DiscreteEntropyEstimator`, and `SchurmannGrassberger <: DiscreteEntropyEstimator` be subtypes. Each of these subtypes has a mandatory field `est::ProbabilitiesEstimator`.
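Roughly sketched (treat this as pseudocode for the proposed hierarchy, not a final API):

abstract type DiscreteEntropyEstimator end

# Each concrete estimator wraps the probabilities estimator it operates on.
struct MLE{P<:ProbabilitiesEstimator} <: DiscreteEntropyEstimator
    est::P
end

struct MillerMadow{P<:ProbabilitiesEstimator} <: DiscreteEntropyEstimator
    est::P
end

struct JackknifeEntropy{P<:ProbabilitiesEstimator} <: DiscreteEntropyEstimator
    est::P
end

struct SchurmannGrassberger{P<:ProbabilitiesEstimator} <: DiscreteEntropyEstimator
    est::P
end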

This could be non-breaking by defining a default method entropy(e::Entropy, est::ProbabilitiesEstimator, args...) = entropy(e, MLE(est), args...). In practice, you'd then do

x = Dataset(rand(10000, 3))
est = # some estimator
entropy(Shannon(), est, x) # a wrapper around `entropy(e, MLE(est), x)`.
entropy(Shannon(), MLE(est), x) # Equivalent to the above
entropy(Shannon(), MillerMadow(est), x)
entropy(Shannon(), JackknifeEntropy(est), x)

entropy(Renyi(q = 2), est, x) # equivalent to the below
entropy(Renyi(q = 2), MLE(est), x) # MLE works for Renyi too, because it is the naive plug-in estimate
entropy(Renyi(q = 2), JackknifeEntropy(est), x) # Also works, because the jackknife estimator is a completely generic variant of the naive estimator
entropy(Renyi(q = 2), MillerMadow(est), x) # throws an error, because `MillerMadow` only works with Shannon

Introduce a field in EntropyDefinition

We can also specify the corrective algorithm as a field of EntropyDefinition, presumably as type instances, as described above:

x = Dataset(rand(10000, 3))
est = # some estimator
entropy(Shannon(alg = MillerMadow(), base = 2), est, x)

However, I think this is way uglier, and doesn't make sense on a theoretical level, because an EntropyDefinition is just that - a definition. It has nothing to do with the estimation. Therefore, I vastly prefer the first option at the moment, because it is a direct analogue of the DifferentialEntropyEstimator.

Other suggestions are welcome.

Caveats

We should be careful about naming these estimator types, because some of them, like the jackknife estimator, are completely generic and not specific to entropy. To use those estimators upstream (e.g. for (conditional) mutual information), we shouldn't name it JackknifeENTROPYEstimator, but something that can be re-used elsewhere too. For some of the estimators, this isn't a problem, because they only appear in the context of entropy estimation.

Summary

I might have much more to add here after I dive deeper into the literature and learn more, but this is an iterative learning experience for me, so it might take some time.

I'm tagging this with the 3.0 milestone, because, depending on how this is implemented, it might be breaking.

In the long run, it is important to settle on an API here, because it is also relevant for discrete estimators of (conditional) mutual information and other entropy-based quantities upstream. To keep it simple and avoid duplicate effort, upstream discrete estimators are for now purely based on the empirical/naive/MLE/plug-in estimator, and error correction is not possible at the moment.

References

Not a complete list, but here are some relevant references.

Datseris commented 1 year ago

(I've removed the 3.0 milestone, because this can be done as a non-breaking change, since we have separated the definitions of entropies from the discrete estimators.)

kahaaga commented 1 year ago

I was looking into implementing these now, but came across something we need to resolve:

Many of these estimators are functions of the raw counts of the outcomes, not the normalised counts (i.e. probabilities). To accommodate this, I propose that we simply add another field frequencies to Probabilities, where frequencies[i] corresponds to probs[i]. This way, both counts and probabilities are readily available.
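To illustrate why normalized probabilities are not enough, here is a sketch of the Miller-Madow correction (a hypothetical helper, not package code): the correction term needs the total number of observations and the number of outcomes actually observed, both of which come from the raw counts.

# Miller-Madow corrected Shannon entropy (in nats): H_MM = H_plugin + (m - 1) / (2N),
# where N = total number of observations and m = number of outcomes with nonzero count.
# Neither N nor m can be recovered from a normalized probability vector alone.
function miller_madow_shannon(counts::AbstractVector{<:Integer})
    N = sum(counts)
    m = count(!iszero, counts)
    probs = counts ./ N
    h_plugin = -sum(p * log(p) for p in probs if p > 0)
    return h_plugin + (m - 1) / (2N)
end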

What do you think, @Datseris?

Datseris commented 1 year ago

This sounds like a big change, because many internal functions already compute probabilities (mainly fasthist), so I can't even count off the top of my head how many files you would need to alter to correctly track this change "everywhere" :(

But I can't think of a different way. If you really need the true counts.

So let's go with this. This means that the internal normed argument of the Probabilities constructor must be dropped. Actually, this is a good way for you to find out which methods will fail. Just remove the normed as now we must never give pre-normed probabilities. The test suite will tell you how many fail.

Let's do this PR before anything else to keep things clean. I am really worried about this change; I fear it may break many things.

Datseris commented 1 year ago

I wouldn't name the field frequencies. I would name it counts. I always found frequency an odd word to use for integer numbers.

kahaaga commented 1 year ago

But I can't think of a different way. If you really need the true counts.

Yes, several of the estimators do require the raw counts. I can attempt a PR and see how involved it will be.

So let's go with this. This means that the internal normed argument of the Probabilities constructor must be dropped. Actually, this is a good way for you to find out which methods will fail. Just remove the normed as now we must never give pre-normed probabilities. The test suite will tell you how many fail.

Agreed.

Datseris commented 1 year ago

Given the current API, I am not sure how to estimate frequencies given arbitrary input x to Probabilities. I guess one method should dispatch to Array{<:AbstractFloat} and one method dispatches to Array{<:Int}. The first method tries to somehow magically estimate raw counts by extracting a count quantum = 1/minimum(p), then multiplying everything with the quantum and rounding to integer. The second method assumes the array contains counts already?
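Sketched out, that float heuristic might look like this (purely illustrative, not package code):

# Heuristic: assume the smallest nonzero probability corresponds to one count,
# so the "quantum" is 1/minimum(p); scale and round to recover integer counts.
function counts_from_probs(probs::AbstractVector{<:AbstractFloat})
    quantum = 1 / minimum(p for p in probs if p > 0)
    return round.(Int, probs .* quantum)
end

counts_from_probs([0.1, 0.3, 0.4, 0.2])   # ≈ [1, 3, 4, 2]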

kahaaga commented 1 year ago

Hm. Do we actually have to estimate frequencies all the time? I think we can do

struct Probabilities
    probs::AbstractVector
    counts::Union{Nothing, AbstractVector}
end

This way, if isnothing(counts), then only MLEntropy can be used (or other estimators that operate directly on probabilities). On the other hand, if !isnothing(counts), then entropy estimators that demand raw counts can also be used.

The user doesn't even need to know that Probabilities stores counts. We just make sure to also include counts wherever possible. Trying to call entropy(WhateverFancyEstimator(Shannon()), est, x) will then error generically, because counts are not defined.
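A sketch of how that generic error could be produced (hypothetical accessor name):

# Estimators that need raw counts go through an accessor that errors when the
# counts were never recorded; probability-only estimators never touch it.
function outcome_counts(p::Probabilities)
    isnothing(p.counts) && error("this estimator requires raw outcome counts, but this `Probabilities` was constructed without them")
    return p.counts
end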

Datseris commented 1 year ago

Well, there is nothing stopping us from estimating the counts with the method I described, so why not do it altogether. The user will never know of the existence of the counts field anyway.

kahaaga commented 1 year ago

Well, there is nothing stopping us from estimating the counts with the method I described, so why not do it altogether. The user will never know of the existence of the counts field anyway.

For the integer version, it is straightforward:

# If given integer vector, then it is assumed that elements are counts of the different outcomes
Probabilities(x::AbstractVector{Int})

But for the float version, I don't see how that would work. If I input, say, x = [0.1, 0.3, 0.4], and I want to convert it to a probability vector by normalizing, I have no idea how many counts underlie those initial fractions. To get actual counts, I'd also need to specify n (the total number of observations)?

kahaaga commented 1 year ago

To me, it seems like there should be three constructors.

kahaaga commented 1 year ago

We could of course just create imaginary counts whose ratio respects the initial input data, but I feel uneasy doing so, because we're pretending to know information we don't have.

kahaaga commented 1 year ago

Ah, but the input data gives n automatically, so scaling like you proposed should work (up to rounding errors).

Datseris commented 1 year ago

We could of course just create imaginary counts whose ratio respects the initial input data, but I feel uneasy doing so, because we're pretending to know information we don't have.

That's what we should do. It's fine. Besides, we don't expect users to directly initialize Probabilities. Instead, they should give input data to the probabilities function.

kahaaga commented 1 year ago

Hey @Datseris,

Since we're moving to 3.0 due to the new infoestimator-stores-the-definition API, would it make sense to do something similar for the discrete info estimators? We have the old-style syntax

struct FancyDiscreteEst{I} <: DiscreteInfoEstimator
    measure::I # the info measure, e.g. `Shannon()`
end

function information(est::FancyDiscreteEst{<:Shannon}, pest::ProbabilitiesEstimator, x)
    probs = probabilities(pest, x)
    # ...
end

Or, we could let the DiscreteInfoEstimator store the ProbabilitiesEstimator too, so that we get

struct FancyDiscreteEst{I, P <: ProbabilitiesEstimator} <: DiscreteInfoEstimator
    measure::I  # the info measure, e.g. `Shannon()`
    probest::P  # e.g. `CountOccurrences()`
end

function information(est::FancyDiscreteEst{<:Shannon}, x)
    probs = probabilities(est.probest, x)
    # ...
end

Any preference?

kahaaga commented 1 year ago

I don't think it's a huge problem to have two different signatures for information - we already do, so I am slightly leaning towards the first alternative. The reason is that it is more pedagogic - one needs to pick both an entropy estimator AND a probabilities/frequencies estimator to estimate an entropy from data. That gets hidden a bit in the second alternative.

Datseris commented 1 year ago

DiscreteInfoEstimator does not depend on the probabilities estimation in any way, so there is no reason to make it one of its fields. So I vote for no changes.

kahaaga commented 1 year ago

DiscreteInfoEstimator does not depend on the probabilities estimation in any way, so there is no reason to make it one of its fields. So I vote for no changes.

Ok, then I'll just stick with information(est::DiscreteInfoEstimator, pest::ProbabilitiesEstimator, x).
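For reference, a usage sketch of that signature, using estimator names floated in this thread (not necessarily the final released names):

x = rand(1:4, 10_000)   # discrete data with four possible outcomes

information(MLEntropy(Shannon()), CountOccurrences(), x)      # naive / plug-in estimate
information(MillerMadow(Shannon()), CountOccurrences(), x)    # bias-corrected estimate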