Closed kahaaga closed 1 year ago
(I've removed the 3.0 milestone, because this can be done as a non-breaking change since we have separated the definitions of entropies from discrete estimators)
I was looking into implementing these now, but came across something we need to resolve:
Many of these estimators are functions of the raw counts of the outcomes, not the normalised counts (i.e. probabilities). To accommodate this, I propose that we simply add another field `frequencies` to `Probabilities`, where `frequencies[i]` corresponds to `probs[i]`. This way, both counts and probabilities are readily available.
What do you think, @Datseris?
This sounds like a big change, because many internal functions already compute probabilities, mainly `fasthist`, so I can't even count from the top of my head how many files you would need to alter to correctly track this change "everywhere" :(

But I can't think of a different way, if you really need the true counts.
So let's go with this. This means that the internal `normed` argument of the `Probabilities` constructor must be dropped. Actually, this is a good way for you to find out which methods will fail: just remove `normed`, as now we must never give pre-normed probabilities. The test suite will tell you how many fail.

Let's do this PR before anything else to keep things clean. I am really worried about this change; I fear it may break many things.
I wouldn't name the field `frequencies`. I would name it `counts`. I always found frequency an odd word to use for integer numbers.
> But I can't think of a different way, if you really need the true counts.
Yes, several of the estimators do require the raw counts. I can attempt a PR and see how involved it will be.
> So let's go with this. This means that the internal `normed` argument of the `Probabilities` constructor must be dropped. Actually, this is a good way for you to find out which methods will fail: just remove `normed`, as now we must never give pre-normed probabilities. The test suite will tell you how many fail.
Agreed.
Given the current API, I am not sure how to estimate frequencies given arbitrary input `x` to `Probabilities`. I guess one method should dispatch to `Array{<:AbstractFloat}` and one method dispatches to `Array{<:Int}`. The first dispatch tries to somehow magically estimate raw counts by extracting a count `quantum = 1/minimum(p)` and then multiplying everything with the quantum and rounding to integer. The second method assumes the array contains counts already?
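The float dispatch described above could be sketched as follows (a minimal sketch; `estimate_counts` is a hypothetical helper name, and it assumes the rarest outcome was observed exactly once):

```julia
# Hypothetical helper (not part of the package): recover approximate integer
# counts from a normalized probability vector, assuming the smallest
# probability corresponds to exactly one observation.
function estimate_counts(p::AbstractVector{<:AbstractFloat})
    quantum = 1 / minimum(p)          # implied sample size if the rarest outcome has count 1
    return round.(Int, p .* quantum)  # scale each probability up and round to an integer count
end

estimate_counts([0.1, 0.3, 0.6])  # → [1, 3, 6]
```

Note that this silently fabricates a sample size, which is exactly the concern raised below about rounding errors.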
Hm. Do we actually have to estimate frequencies all the time? I think we can do

```julia
struct Probabilities
    probs::AbstractVector
    counts::Union{Nothing, AbstractVector}
end
```
This way, if `isnothing(counts)`, then only `MLEntropy` can be used (or other estimators that operate directly on probabilities). On the other hand, if `!isnothing(counts)`, then entropy estimators that demand raw counts can also be used.

The user doesn't even need to know that `Probabilities` stores counts. We just make sure to also include counts wherever possible. Trying to call `entropy(WhateverFancyEstimator(Shannon()), est, x)` will then error generically, because counts are not defined.
Well, there is nothing stopping us from estimating the counts with the method I described, so why not do it altogether. The user will never know of the existence of the `counts` field anyway.
> Well, there is nothing stopping us from estimating the counts with the method I described, so why not do it altogether. The user will never know of the existence of the `counts` field anyway.
For the integer version, it is straightforward:

```julia
# If given an integer vector, it is assumed that elements are counts of the different outcomes
Probabilities(x::AbstractVector{Int})
```
But for the float version, I don't see how that would work. If I input say `x = [0.1, 0.3, 0.4]`, and I want to convert it to a probability vector by normalizing, I have no idea how many counts underlie those initial fractions. To get actual counts, I'd need to also specify `n` (the total number of observations)?
To me, it seems like there should be three constructors:

1. `Probabilities(::AbstractVector{<:AbstractFloat})` (the current behaviour; leaves `counts` as `nothing`).
2. `Probabilities(::AbstractVector{<:Int})`. Treats the inputs as counts.
3. `Probabilities(::AbstractVector{<:AbstractFloat}, ::Int)`. Like the first, but also defines counts, because the total number of observations is known.

We could of course just create imaginary counts whose ratio respects the initial input data, but I feel uneasy doing so, because we're pretending to know information we don't have.
Ah, but the input data gives `n` automatically, so scaling like you proposed should work (up to rounding errors).
> We could of course just create imaginary counts whose ratio respects the initial input data, but I feel uneasy doing so, because we're pretending to know information we don't have.
That's what we should do. It's fine. Besides, we don't expect users to directly initialize `Probabilities`. Instead, they should give input data to the `probabilities` function.
Hey @Datseris,
Since we're moving to 3.0 due to the new infoestimator-stores-the-definition API, would it make sense to do something similar for the discrete info estimators? We have the old-style syntax

```julia
struct FancyDiscreteEst{I} <: DiscreteInfoEstimator
    measure::I # the info measure, e.g. `Shannon()`
end

function information(est::FancyDiscreteEst{<:Shannon}, pest::ProbabilitiesEstimator, x)
    probs = probabilities(pest, x)
    # ...
end
```
Or, we could let the `DiscreteInfoEstimator` store the `ProbabilitiesEstimator` too, so that we get

```julia
struct FancyDiscreteEst{I, P <: ProbabilitiesEstimator} <: DiscreteInfoEstimator
    measure::I # the info measure, e.g. `Shannon()`
    probest::P # e.g. `CountOccurrences()`
end

function information(est::FancyDiscreteEst{<:Shannon}, x)
    probs = probabilities(est.probest, x)
    # ...
end
```
Any preference?
I don't think it's a huge problem to have two different signatures for `information` - we already do, so I am slightly leaning towards the first alternative. The reason is that it is more pedagogic: one needs to pick both an entropy estimator AND a probabilities/frequencies estimator to estimate an entropy from data. That gets hidden a bit in the second alternative.
`DiscreteInfoEstimator` does not depend on the probabilities estimation in any way, so there is no reason to make it its field. So I vote that no changes be done.
> `DiscreteInfoEstimator` does not depend on the probabilities estimation in any way, so there is no reason to make it its field. So I vote that no changes be done.
Ok, then I just stick with `information(est::DiscreteInfoEstimator, pest::ProbabilitiesEstimator, x)`.
Warning: wall of text
## Context and literature review
The method `entropy(::Shannon, ::ProbabilitiesEstimator, args...)`, although we don't state it anywhere, is in the literature referred to as the "naive", "maximum likelihood", "plug-in" or "empirical" estimator of Shannon entropy. It is simply the entropy of the empirical distribution, as determined by relative frequency counts. Denote this estimator `H_naive`.

It is a well-known fact that this estimate is negatively biased. It is also shown that the variance of this estimator has an upper bound (a function of the sample size), and that it has a certain asymptotic bias that depends on 1) the number of outcomes/states with nonzero probability and 2) the sample size (e.g. Paninski, 2003). Consequently, there exists not just one, but many entropy estimators that try to correct this bias and/or minimise the variance of the estimated entropy. Some existing estimators are:
- `H_naive`, which we implement.
- `H_mm = H_naive + (|X| - 1)/(2n)`, where `n` is the sample size and `|X|` is the number of nonzero-probability bins.

EDIT: bias-corrected estimators for Rényi entropy:
There are many papers detailing/summarizing different discrete entropy (corrected) estimators. A nice, short example is Arora et al. (2022), in which I found the references for many of the estimators above.
It seems that it is pretty standard in many applications to use the corrected estimators (although not as prevalent in the NLD literature I've seen, which presents all these entropy-based complexity measures).
Moreover, corrections such as these appear not only in the isolated context of estimating entropy, but also become relevant when estimating e.g. discrete mutual information. I wanted to bring up the issue here, so that we can agree on an overall implementation strategy, before I spend a lot of time on it upstream.
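As a concrete illustration of the kind of correction discussed here, the Miller-Madow estimator `H_mm = H_naive + (|X| - 1)/(2n)` listed above can be written as a small stand-alone sketch (the function names are mine, not the package's):

```julia
# Naive (plug-in/MLE) Shannon entropy of a probability vector, in nats.
naive_entropy(p) = -sum(x -> x > 0 ? x * log(x) : 0.0, p)

# Miller-Madow correction: H_mm = H_naive + (|X| - 1) / (2n), where n is
# the sample size and |X| the number of bins with nonzero probability.
function miller_madow_entropy(counts::AbstractVector{<:Integer})
    n = sum(counts)
    p = counts ./ n
    X = count(>(0), counts)  # nonzero-probability bins
    return naive_entropy(p) + (X - 1) / (2n)
end
```

Note that the correction needs the raw counts (via `n`), which is exactly why the `counts` discussion earlier in this issue matters.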
## How to implement this type of error correction?
To reiterate:

- `entropy(::Entropy, ::ProbabilitiesEstimator)` is in the literature referred to as the maximum-likelihood estimator of entropy. This holds true for any entropy type, not just Shannon. It is called the maximum-likelihood estimator because it is based on the maximum-likelihood estimate of the probabilities (i.e. relative frequencies). Plugging these maximum-likelihood estimates of probabilities into any formula gives a plug-in/maximum-likelihood/naive estimator of whatever quantity that formula computes. Upstream, for discrete (conditional) mutual information, it is implicitly this naive estimator that is used at the moment (but corrections are possible, and should be available).
- The estimators referred to above all make corrections to discrete Shannon entropy. But the plug-in/maximum-likelihood/naive estimator is equally valid for any other entropy. The other corrected estimators may not be valid estimators of other types of entropies (I presume, but haven't actually delved that deeply into the literature yet).
The question then is: what is a good strategy for implementing these different estimators here?

One thing we definitely have to do straight away is to specify that we only provide the naive estimator, and perhaps cite Paninski (2003) for that. We should also mention that `probabilities` returns the naive/maximum-likelihood estimates of the probabilities (yes, I think "maximum likelihood" should appear explicitly, because it is so commonly used in the entropy literature).

Some options:
### Introduce a `DiscreteEntropyEstimator` type

Just like for differential entropy, introduce a `DiscreteEntropyEstimator` abstract type, and let

- `MLE <: DiscreteEntropyEstimator` (short for maximum likelihood estimator)
- `MillerMadow <: DiscreteEntropyEstimator`
- `JackknifeEntropy <: DiscreteEntropyEstimator`
- `SchurmannGrassberger <: DiscreteEntropyEstimator`

be subtypes. Each of these subtypes has a mandatory field `est::ProbabilitiesEstimator`.

This could be non-breaking by defining a default method `entropy(e::Entropy, est::ProbabilitiesEstimator, args...) = entropy(e, MLE(est), args...)`. In practice, you'd then do
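The non-breaking fallback can be demonstrated with stand-in types (every name below is a hypothetical placeholder defined locally for the sketch, not the package's actual definitions):

```julia
# Stand-in type hierarchy, for illustration only:
abstract type ProbabilitiesEstimator end
abstract type DiscreteEntropyEstimator end

struct Shannon end                                     # entropy definition stand-in
struct CountOccurrences <: ProbabilitiesEstimator end  # probabilities estimator stand-in

struct MLE{P<:ProbabilitiesEstimator} <: DiscreteEntropyEstimator
    est::P
end

# New-style method: dispatch on the discrete entropy estimator.
entropy(e::Shannon, est::MLE, x) = "naive/plug-in estimate"

# Non-breaking fallback: the old signature wraps the probabilities
# estimator in the default MLE discrete estimator.
entropy(e::Shannon, est::ProbabilitiesEstimator, args...) =
    entropy(e, MLE(est), args...)

entropy(Shannon(), CountOccurrences(), rand(100))  # → "naive/plug-in estimate"
```

Old calls with a bare `ProbabilitiesEstimator` keep working, while corrected estimators slot in by replacing `MLE` with, say, `MillerMadow`.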
. In practice, you'd then doIntroduce a field to
EntropDefinition
We can also specify the corrective algorithm as a field to
EntropyDefinition
, presumably as type instances, as described above:However, I think this is way uglier, and doesn't make sense on a theoretical level, because an
EntropyDefinition
is just that - a definition. It has nothing to do with the estimation. Therefore, I vastly prefer the first option at the moment, because it is a direct analogue of theDifferentialEntropyEstimator
.Other suggestions are welcome.
## Caveats

We should be careful about naming these estimator types, because some of them, like the jackknife estimator, are completely generic and not specific to entropy. To use those estimators upstream (e.g. for (conditional) mutual information), we shouldn't name it `JackknifeENTROPYEstimator`, but something that can be re-used elsewhere too. For some estimators, this isn't a problem, because they only appear in the context of entropy estimation.

## Summary
I might have much more to add here after I dive deeper into the literature and learn more, but this is an iterative learning experience for me, so it might take some time.
I'm tagging this with the 3.0 milestone, because, depending on how this is implemented, it might be breaking.
In the long run, it is important to settle on an API here, because it is also relevant for discrete estimators of (conditional) mutual information and other entropy-based quantities upstream. To keep it simple and avoid duplicate effort, for now, upstream discrete estimators are purely based on the empirical/naive/MLE/plug-in estimator, and error correction is not possible at the moment.
## References

Not a complete list, but here are some relevant references.