JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems

Finalize the syntax for `information` #325

Closed Datseris closed 11 months ago

Datseris commented 12 months ago

My problem now is that information is simply way too complex a function. It can have up to 5 dispatches in CausalityTools.jl; see e.g.

https://github.com/JuliaDynamics/CausalityTools.jl/issues/352

My argument is that this is becoming too complex. Multiple dispatch should not be used to operate on more than 3 arguments. Not because it is impossible, but because it becomes too complex for the human brain, and incredibly difficult to track in code.

Here is the call signature I propose:

information(info_estimator, probs_estimator, timeseries...)

This signature can only dispatch three times: first, on the information estimator that contains the definition of the measure being estimated. Second, on the probabilities estimator that contains the outcome space to be used. In both cases, the reference definition or outcome space can be used directly instead of an estimator. Lastly, information dispatches on the number of timeseries (i.e., if timeseries... is only x, ComplexityMeasures.jl is called; if it is x, y, ..., then CausalityTools.jl is called). This dispatch instruction can go into the Dev Docs as well.

This minimizes dispatch and makes the code easier to track. It is fine to allow leaner versions of information, where e.g., probs_estimator can be skipped for a default value. But I am against allowing versions with more than 3 dispatch.
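As a rough illustration of the proposed dispatch scheme, here is a toy sketch in plain Julia. All names below are illustrative stand-ins, not the actual ComplexityMeasures.jl API; the point is only that the number of input timeseries selects the code path.

```julia
# Toy sketch of the proposed dispatch scheme; names are illustrative only.
abstract type InfoEstimator end
struct ShannonEstimator <: InfoEstimator end

# One timeseries → handled by the ComplexityMeasures.jl-style code path.
information(est::InfoEstimator, x::AbstractVector) = :complexitymeasures_path

# Two or more timeseries → handled by the CausalityTools.jl-style code path.
function information(est::InfoEstimator, x::AbstractVector, y::AbstractVector,
                     rest::AbstractVector...)
    return :causalitytools_path
end

information(ShannonEstimator(), rand(100))            # single input
information(ShannonEstimator(), rand(100), rand(100)) # multiple inputs
```

The `Vararg` tail means only two methods are needed regardless of how many timeseries are passed, which is exactly the "dispatch on the number of timeseries" rule described above.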

While reviewing #316 , it has already become very hard for me to track the source code. Given a call signature such as

perm_ent_y_q2 = entropy(Renyi(;q = 2.0), BayesianEstimation(), OrdinalPatterns(), y)

I am not sure where this would go or come from.


For this design to succeed, we only need to establish two rules:

  1. Probabilities estimators reference an outcome space. We have not done this yet, but it should be done after PR #316 is merged to not overcomplicate PR merging. (if we agree on this design)
  2. Measure estimators reference the definition they estimate. We have already successfully done this in ComplexityMeasures.jl, but not yet in CausalityTools.jl.
Datseris commented 12 months ago

For CausalityTools.jl, the probs_estimator above can be a special type that not only wraps an outcome space, but also tells you whether you want to discretize/encode row-wise or column-wise before computing probabilities.

kahaaga commented 12 months ago

Measure estimators reference the definition they estimate. We have already successfully done this in ComplexityMeasures.jl, but not yet in CausalityTools.jl.

I am in favor of this decision, and I am actively re-designing the CausalityTools.jl codebase to allow for this, both for the information measures and for other measures. This will allow us to do this internally.

This unified signature would then work seamlessly with SurrogateTest, LocalPermutationTest, as well as with infer_graph. HOWEVER:

Second, on the probabilities estimator that contains the outcome space to be used. Probabilities estimators reference an outcome space.

This is not a good idea for CausalityTools.jl, and in fact it complicates things a lot. The probabilities estimators must be completely agnostic to where the counts from which they estimate probabilities come from. The reason is that in CausalityTools.jl, you can get counts using both Encodings and OutcomeSpaces. Enforcing that OutcomeSpaces are coupled to the ProbabilitiesEstimators would require a non-intuitive and complicated workaround in CausalityTools.jl. This was the entire reason why we did #316 in the first place: if the ProbabilitiesEstimators are coupled to OutcomeSpaces, it leads to problems for CausalityTools.jl.

Before making any more redesign decisions, I think it is important that I finish up a basic skeleton for CausalityTools based on the current state of main here. I think I can get this done within the day.

Lastly, information dispatches on the number of timeseries (i.e., if timeseries... is only x, ComplexityMeasures.jl is called; if it is x, y, ..., then CausalityTools.jl is called). This dispatch instruction can go into the Dev Docs as well.

Yes, that is a good idea, and is already how I do it in my (non-published yet) proposal in CausalityTools.jl, e.g. something like:

function counts(x::Vararg{VectorOrStateSpaceSet, N}) where N
    if N == 1
        return ComplexityMeasures.counts(UniqueElements(), first(x))
    else
        # Does something completely different than in ComplexityMeasures.jl,
        # but also returns a `Counts` instance.
        return counts(UniqueElements(), x...)
    end
end
kahaaga commented 12 months ago

We've gone back and forth on these matters quite a few times, and all redesign decisions have arisen for one reason: the designs we settled on turned out not to work as well in practice as we thought. I think the best course of action now is to actually try to finish up both packages with respect to the current API, and then evaluate how the changes proposed here will affect both packages.

Since we're really developing an extendable API in ComplexityMeasures.jl, we need to make absolutely sure that it is in fact extendable, and that the API here doesn't lead to massive code complications and workarounds in CausalityTools.jl. Having the OutcomeSpace as a field of ProbabilitiesEstimators was one decision that worked for one package, but not the other.

Multiple dispatch should not be used to operate on more than 3 arguments.

I'm not entirely convinced about this argument, neither from the human side nor from the programmatic side.

It was clear before #316 that not having these three components separated was a bad idea programmatically, which is why we did #316 in the first place.

For the human perspective, we're trying to use code to model something with three distinct steps/components: choosing a formula (definition), a probabilities estimator, and a discretization approach.

I don't see why the call signature shouldn't mirror this. If these steps are required to compute the measure I'm interested in, then I should think actively about which formula, which probabilities estimator, and which discretization approach I want to use. In this sense, I find information(def, probest, discretization_method, input_data...) perfectly natural. It mirrors in code exactly the modularity we want to achieve.

I could, as many times before, change my mind, though. Modularity can be achieved in other ways too, for example like you suggest. But the suggested wrapping doesn't work for CausalityTools.

I'm guessing we end up with something like

""" The contingency estimator is a generic discrete estimator for any probabilities-based measure """
struct Contingency
    def::InfoMeasureDefinition
    pest::ProbabilitiesEstimator
    d::DiscretizationScheme # wraps `OutcomeSpace`s and/or `Encodings` and controls the choice of row/columns
end
Datseris commented 12 months ago

Second, on the probabilities estimator that contains the outcome space to be used. Probabilities estimators reference an outcome space.

This is not a good idea for CausalityTools.jl, and in fact it complicates things a lot. The probabilities estimators must be completely agnostic to where the counts from which they estimate probabilities come from. The reason is that in CausalityTools.jl, you can get counts using both Encodings and OutcomeSpaces. Enforcing that OutcomeSpaces are coupled to the ProbabilitiesEstimators would require a non-intuitive and complicated workaround in CausalityTools.jl. This was the entire reason why we did #316 in the first place: if the ProbabilitiesEstimators are coupled to OutcomeSpaces, it leads to problems for CausalityTools.jl.

From what I understand, it seems that in general probabilities estimators are simply incompatible with CausalityTools.jl. In fact, I have trouble understanding how they would be used; CausalityTools.jl uses counts anyway. Is it valid (scientifically) to just apply the probabilities estimators to the marginals? If probabilities estimators can't be used anyway, this design decision leaves CausalityTools.jl unaffected, as it would have to throw an error if a probabilities estimator is given instead of an outcome space.

Furthermore, what does it mean that "you can get counts both using Encodings and OutcomeSpaces"? These two things are one and the same: count-based outcome spaces use an encoding. I don't get the differentiation here. Plus, encodings are not part of the public API of information anyway. A user cannot pass an encoding to information, so I just don't understand this statement.

Multiple dispatch should not be used to operate on more than 3 arguments.

I'm not entirely convinced about this argument, neither from the human side nor from the programmatic side.

Sure, but I feel that perm_ent_y_q2 = entropy(Renyi(;q = 2.0), BayesianEstimation(), OrdinalPatterns(), y) is a complexity overkill (pun intended).

I am not opposed to the counter-proposal of not making probabilities estimators reference an outcome space. That's fine. What we need is a resolution of this very high amount of input-argument dispatch complexity. You may see it as not too complex, but looking at things through the eyes of a newcomer, I fear it is.

I made a recommendation for a resolution. But there is another way: simply forbid passing probabilities estimators to information. One would have to explicitly create Probabilities from data, given an estimator and outcome space, and then pass the Probabilities instance to information.
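A toy sketch of that two-step workflow in plain Julia (the function names here are made up for illustration; they are not the ComplexityMeasures.jl API): first build probabilities from data explicitly, and only then hand the probabilities to the entropy computation.

```julia
# Step 1: explicitly build probabilities from data. This is a toy stand-in
# for "create Probabilities given an estimator and outcome space";
# the name `probs_from_counts` is illustrative only.
function probs_from_counts(x)
    cts = Dict{eltype(x),Int}()
    for xi in x
        cts[xi] = get(cts, xi, 0) + 1
    end
    return collect(values(cts)) ./ length(x)
end

# Step 2: the information function only ever sees probabilities.
shannon_entropy(p; base = 2) = -sum(pᵢ * log(base, pᵢ) for pᵢ in p if pᵢ > 0)

p = probs_from_counts([1, 1, 2, 2])
shannon_entropy(p)  # → 1.0 bit for two equiprobable outcomes
```

With this split, the entropy function needs exactly one dispatch argument (the probabilities), which is the complexity reduction being proposed.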

Datseris commented 12 months ago

Since we're really developing an extendable API in ComplexityMeasures.jl, we need to make absolutely sure that it is in fact extendable, and that the API here doesn't lead to massive code complications and workarounds in CausalityTools.jl. Having the OutcomeSpace as a field of ProbabilitiesEstimators was one decision that worked for one package, but not the other.

We have made absolutely sure that it is extendable. Extendability of a package only has meaning for the API the package itself defines, not other packages. The set of features and the API of ComplexityMeasures.jl is perfectly extendable: one can add more estimators for complexity measures or outcome spaces. CausalityTools.jl tries to define additional API on top of the existing one. Don't forget that nothing stops you from creating new function names and new types in CausalityTools.jl. There is no requirement that you use only functions exported by ComplexityMeasures.jl.

I would argue that it is much more important that this is a good package, simple to learn and use, in its own right. All of this comes first and has higher precedence than the package being extendable by another package in ways that are not part of the API of the original package. I would argue that this should in fact be the case for any design decision: the package by itself needs to be good.

Naturally, we try to achieve everything and make everything perfect. But what I want to point out is that, in case it is not possible for everything to be perfectly harmonious, it is the dependent package that needs to adjust, not the independent one. We should be basing the decisions in CausalityTools.jl on this, not the decisions in ComplexityMeasures.jl. Does this make sense? Creating new API in the dependent package is easy and free; altering the existing API is hard.

kahaaga commented 12 months ago

I made a recommendation for a resolution. But there is another way: simply forbid passing probabilities estimators to information. One would have to explicitly create Probabilities from data, given an estimator and outcome space, and then pass the Probabilities instance to information.

I like this suggestion. It would completely avoid complex signatures, because it takes away two of the steps: discretization and probabilities estimation.

From what I understand it seems that in general probabilities estimators are simply incompatible with CausalityTools.jl. In fact, I have trouble understanding how they would be used. CausalityTools.jl uses counts anyways. Is it valid (scientifically) to just apply the probabilities estimators to the marginals? If probabilities estimators can't be used anyways, this design decision leaves CausalityTools.jl unaffected, as it would have to throw an error if a probability estimator is given instead of an outcome space.

There are two main ways of computing the multivariate information measures in CausalityTools.jl. The first is directly, through a double/triple/quadruple/etc. sum over a joint N-dimensional Probabilities instance. This is achieved by first estimating an N-dimensional Counts instance, to which a ProbabilitiesEstimator can be applied (not possible at the moment, but doable programmatically; I'm not sure if anyone has done so before, or whether it is valid theoretically).

The second option is to decompose the measures in terms of entropies. Then the entire machinery from ComplexityMeasures.jl is used to first discretize, then estimate probabilities (using any ProbabilitiesEstimator), then compute entropies, and finally combine these entropies. This sum-of-entropies approach is more biased than computing the joint probabilities directly, but is much faster. So there is a bias-speed tradeoff to be made.

In any case, both approaches are and should be possible. And in both cases, we need ProbabilitiesEstimators. Furthermore, these estimators should be separate from the outcome spaces.
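For concreteness, here is a minimal pure-Julia sketch of the entropy-decomposition idea for Shannon mutual information, I(X;Y) = H(X) + H(Y) − H(X,Y). This is illustrative only; the actual CausalityTools.jl machinery generalizes this to other measures and estimators.

```julia
# Plug-in Shannon entropy (bits) of a probability vector.
H(p) = -sum(pᵢ * log2(pᵢ) for pᵢ in p if pᵢ > 0)

# Mutual information via the sum-of-entropies decomposition
# I(X;Y) = H(X) + H(Y) - H(X,Y), with plug-in marginal/joint probabilities.
function mi_decomposed(x, y)
    n = length(x)
    px  = [count(==(a), x) / n for a in unique(x)]
    py  = [count(==(b), y) / n for b in unique(y)]
    pxy = [count(i -> x[i] == a && y[i] == b, 1:n) / n
           for a in unique(x), b in unique(y)]
    return H(px) + H(py) - H(vec(pxy))
end

mi_decomposed([1, 1, 2, 2], [1, 1, 2, 2])  # → 1.0 (fully dependent)
mi_decomposed([1, 1, 2, 2], [1, 2, 1, 2])  # → 0.0 (independent)
```

Each marginal entropy term could, in principle, be computed with a different ProbabilitiesEstimator, which is why the decomposition route exposes that choice to the user.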

We have made absolutely sure that it is extendable. Extendability of a package only has meaning for the API the package itself defines, not other packages. The set of features and the API of ComplexityMeasures.jl is perfectly extendable: one can add more estimators for complexity measures or outcome spaces. CausalityTools.jl tries to define additional API on top of the existing one. Don't forget that nothing stops you from creating new function names and new types in CausalityTools.jl. There is no requirement that you use only functions exported by ComplexityMeasures.jl. I would argue that it is much more important that this is a good package, simple to learn and use, in its own right. All of this comes first and has higher precedence than the package being extendable by another package in ways that are not part of the API of the original package. I would argue that this should in fact be the case for any design decision: the package by itself needs to be good. Naturally, we try to achieve everything and make everything perfect. But what I want to point out is that, in case it is not possible for everything to be perfectly harmonious, it is the dependent package that needs to adjust, not the independent one. We should be basing the decisions in CausalityTools.jl on this, not the decisions in ComplexityMeasures.jl. Does this make sense? Creating new API in the dependent package is easy and free; altering the existing API is hard.

Yep, this makes sense.

The existing API in ComplexityMeasures.jl is perfectly compatible with what I envision for CausalityTools.jl. Complications only arise if we re-couple the ProbabilitiesEstimators and OutcomeSpaces.

I think a good compromise that avoids too much complexity is to consider what you said above:

Simply forbid passing probabilities estimators to information. One would have to explicitly create Probabilities from data, given an estimator and outcome space, and then pass the Probabilities instance to information.

Furthermore, what does it mean that "you can get counts both using Encodings and OutcomeSpaces"? These two things are one and the same: count-based outcome spaces use an encoding. I don't get the differentiation here.

Encodings operate directly on input data. OutcomeSpaces use encodings internally, and may additionally add a preprocessing step, for example a delay embedding, before the encodings are used. The first is directly applicable to state vectors. The second requires (for some, not all, outcome spaces) a data transformation first, before the encodings are applied.
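A toy sketch of this distinction (all names here are illustrative, not the actual API): an encoding maps a single state vector directly to a symbol, while an outcome-space-style computation may first transform the timeseries (here, a delay embedding) and only then apply the encoding point-wise.

```julia
# "Encoding": acts directly on one state vector / point.
ordinal_encode(v) = sortperm(v)

# "Outcome space"-style usage: preprocess the timeseries (delay embedding
# with dimension m and lag 1), then apply the encoding to each point.
function ordinal_symbols(x::AbstractVector, m::Int)
    points = [x[i:i+m-1] for i in 1:length(x)-m+1]
    return ordinal_encode.(points)
end

ordinal_encode([3.0, 2.0, 4.0])           # encode a state vector directly
ordinal_symbols([1.0, 3.0, 2.0, 4.0], 3)  # embed first, then encode
```

In the multivariate setting, the first form applies row-wise to points of a StateSpaceSet marginal, while the second applies column-wise to a raw timeseries, which is the use-case split described above.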

kahaaga commented 12 months ago

... is a complexity overkill (pun intended).

😁

kahaaga commented 12 months ago

Furthermore, what does it mean that "you can get counts both using Encodings and OutcomeSpaces "? These two things are one and the same. Count-based outcome spaces use an encoding. I don't get the differentiation here. Plus, encodings are not part of the public API of information anyway. A user cannot pass an encoding to information, so I just don't understand this statement.

This will be more clear once I push the PR I'm working on. OutcomeSpaces can be applied to discretize in a column-wise manner. Encodings can be applied to points/rows of the marginals directly, without any data transformation occurring first. These are different use cases, and will both be possible in CausalityTools.jl.

Datseris commented 12 months ago

Well I will do a PR that does this:

Simply forbid passing probabilities estimators to information. One would have to explicitly create Probabilities from data, given an estimator and outcome space, and then pass the Probabilities instance to information.

Additionally, I will re-work the docs a bit to put at the forefront the function

information(est::InformationEstimator, p::Probabilities)

because this is the signature we really care about. Everything else is a convenience dispatch that saves one or at most two lines of code before ending up calling this.

In CausalityTools.jl this is no longer the case. There it is much more complex to estimate all the marginals or individual entropies, plus to do the embeddings. So there the importance of information(estimator, outcomespace, x, y, ...) is much higher.

Datseris commented 12 months ago

I think a probabilities estimator in CausalityTools.jl is probably best given as an argument to the information estimator? Transfer entropy takes in the definition, the estimator for TE, and even an embedding. It might as well take in a probabilities estimator.

But since no scientific paper has done this so far, I think we should leave it out. It seems like a possibility for a future paper, and there we could allow the use of probabilities estimators. For now it appears straightforward to simply disallow it in CausalityTools.jl?

kahaaga commented 12 months ago

Like so?

kahaaga commented 12 months ago

I think a probabilities estimator in CausalityTools.jl is probably best given as an argument to the information estimator? Transfer entropy takes in the definition, the estimator for TE, and even an embedding. It might as well take in a probabilities estimator.

Yep. I think there will be a single estimator that has everything needed as fields.

kahaaga commented 12 months ago

But since no scientific paper has done this so far, I think we should leave it out. It seems like a possibility for a future paper, and there we could allow the use of probabilities estimators. For now it appears straightforward to simply disallow it in CausalityTools.jl?

Yep, we can just leave it out for now. It is quick to implement later if we decide to.

kahaaga commented 12 months ago

Well I will do a PR that does this.

Perfect!

Datseris commented 12 months ago

I am confused about the source code... It appears that the call signature

information(DiscreteInfoEstimator, Probabilities)

was never possible generically? How do you take into account

function information(hest::ChaoShen{<:Shannon}, pest::ProbabilitiesEstimator, o::OutcomeSpace, x)
    (; definition) = hest

    # Count singletons in the sample
    cts = counts(o, x)
    f₁ = 0
    for f in cts
        if f == 1
            f₁ += 1
        end
    end
    # ... (remainder of the method elided)
end

for example?

kahaaga commented 12 months ago

Hm. I think you're right. In this case, the pest isn't used at all, since the ChaoShen estimator explicitly uses the plug-in probabilities.

But it could be modified so that probs = Probabilities(cts) becomes probs = probabilities(pest, cts), and then the correction is applied to those probabilities; the case of the PlugIn estimator would then correspond to the current Chao-Shen behavior.

In any case, it is possible to do information(est::ChaoShen, p::Probabilities); however, you'd then have to re-compute the counts needed for the correction, since they are not part of the input Probabilities. That is double the work. I'm not sure the reduction in signature complexity is worth slowing down the code that much...
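For reference, here is a rough pure-Julia sketch of why the correction needs raw counts: the Chao-Shen estimator uses the number of singletons to estimate sample coverage, information that is lost once counts are normalized to probabilities. This is a simplified illustration of the textbook Chao-Shen formula, not the package's implementation.

```julia
# Simplified Chao-Shen Shannon entropy (nats) from raw counts.
# It needs the counts themselves: the singleton count f1 cannot be
# recovered from normalized probabilities alone.
function chao_shen_entropy(cts::Vector{Int})
    n  = sum(cts)
    f1 = count(==(1), cts)          # number of outcomes observed exactly once
    f1 == n && (f1 = n - 1)         # guard against zero estimated coverage
    C  = 1 - f1 / n                 # estimated sample coverage
    ps = C .* (cts ./ n)            # coverage-adjusted probabilities
    # Horvitz-Thompson-style weighting corrects for unobserved outcomes.
    return -sum(p * log(p) / (1 - (1 - p)^n) for p in ps if p > 0)
end

chao_shen_entropy([2, 2])  # slightly above the plug-in value log(2)
```

Since `f1` must be read off the counts, computing this from a Probabilities instance alone would indeed require re-deriving the counts, which is the double work mentioned above.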

Datseris commented 12 months ago

By looking at the source code it appears that not a single entropy estimator can be used with Probabilities, besides the PlugIn one. All require counts. I have to think about this.

kahaaga commented 12 months ago

By looking at the source code it appears that not a single entropy estimator can be used with Probabilities, besides the PlugIn one. All require counts. I have to think about this.

Yes, most of them need the input data to work.

EDIT: ah, this is of course true for the ChaoShen estimator too.

kahaaga commented 12 months ago

We have made absolutely sure that it is extendable. Extendability of a package only has meaning for the API the package itself defines. Not other packages. The set of features and API of CompelxityMeasure.sjl is perfectly extendable, one can add more estiamtors for complexity measures or outcome spaces. CausalityTools.jl tries to define additional API on top of the existing one. Don't forget that nothing stops you to create new function names and new types in CausalityTools.jl. There is no enforcement that you can only use functions only exported by ComplexityMeasures.jl.

I just want to point out again that the current state of the main branch works perfectly. The complications related to entropy estimators and probabilities estimators only matter if we change the call signature of information again. If we don't, everything works just fine, at the expense of a few more input arguments, which I argue is a better compromise than yet another complete re-do of a syntax that we've arrived at through long trial and error and that works just fine.

kahaaga commented 12 months ago

Naturally, we try to achieve everything, and make everything perfect. But what I want to point out is that, in case it is not possible for everything to be perfectly harmonious, it is the dependent package that needs to adjust, not the independent. We should be basing the decisions in CausalityTools.jl on this, not the decisions in ComplexityMeasures.jl. Does this make sense? It's because creating new API in the dependent is easy, and free. Altering the existing API is hard.

Another thought: yes, it makes sense that things should be as good as possible in the independent package. However, development in the dependent package (which is part of the same ecosystem) has essentially been on hold, information-measure-wise, for the entire duration of the re-writes we've done here. It would be a bit silly if all the developments here ended up being useless for the dependent package, and the dependent package essentially had to write new types and functions that do precisely the same thing as the existing types in ComplexityMeasures.jl.

Naturally, we try to achieve everything, and make everything perfect.

I'd argue, at this stage, given the complications of the last suggestion, that we don't try to make everything perfect, since we have something that works, is modular, is intuitive, and is extendable. It is near-inevitable that new choices will lead to suboptimality in some other regard.

Datseris commented 12 months ago

Sure, but what is the syntax for CausalityTools.jl? It is still not clear to me. Let's take the current state of main at face value.

kahaaga commented 12 months ago

Sure, but what is the syntax for CausalityTools.jl? It is still not clear to me. Let's take the current state of main at face value.

Let me finish up my draft for the changes after #316 was merged, so you can see. I'll try to finish it tonight.

kahaaga commented 11 months ago

https://github.com/JuliaDynamics/CausalityTools.jl/issues/352 is now closed. Short summary:

There is only one function exposed to users of CausalityTools:

information(est::MultivariateInformationMeasureEstimator{<:MultivariateInformationMeasure}, x...)

The estimator always contains the definition (est.definition), which is also the first type parameter. Convenience functions are just one-line wrappers to information, and have signatures like

joint_entropy(est::JointProbabilities, x, y) → h::Real

conditional_entropy(est::JointProbabilities, x, y) → h::Real

mutualinfo(est::MutualInformationEstimator, x, y, z) → mi::Real
mutualinfo(est::JointProbabilities, x, y, z) → mi::Real
mutualinfo(est::EntropyDecomposition, x, y, z) → mi::Real

condmutualinfo(est::ConditionalMutualInformationEstimator, x, y, z) → cmi::Real
condmutualinfo(est::JointProbabilities, x, y, z) → cmi::Real
condmutualinfo(est::EntropyDecomposition, x, y, z) → cmi::Real
condmutualinfo(est::MIDecomposition, x, y, z) → cmi::Real

This is all based on the current state of main here. So from my part, I'm happy with the syntax as is.

Datseris commented 11 months ago

How did you deal with the co-existence of various different probabilities estimators and outcome spaces?

Datseris commented 11 months ago
joint_entropy(est::JointProbabilities, x, y) → h::Real

But this isn't what I think of as a convenience function. If you replace joint_entropy with information, the code would work just as well, right? So this isn't really a convenience.

What I would consider a convenience function is

joint_entropy(o::OutcomeSpace, x, y) -> h
kahaaga commented 11 months ago

How did you deal with the co-existence of various different probabilities estimators and outcome spaces?

When estimating some multivariate measure as a combination of entropy terms, the user can decide which ProbabilitiesEstimator to use, since the marginals are estimated independently and you are in principle free to estimate each of them in any way you see fit. The estimator defaults to RelativeAmount. This is the signature of the entropy-based estimator:

EntropyDecomposition(
        definition::MultivariateInformationMeasure,
        est::DiscreteInfoEstimator,
        discretization::OutcomeSpace,
        pest::ProbabilitiesEstimator = RelativeAmount()
)

In all other cases, the user doesn't have the option (yet) to specify a ProbabilitiesEstimator, because we don't yet know how these estimators will work for higher-dimensional Counts/Probabilities. For example, the JointProbabilities estimator has the following signature:

JointProbabilities(
    definition::MultivariateInformationMeasure,
    discretization::Discretization # e.g. an `OutcomeSpace`
)

What I would consider a convenience function is joint_entropy(o::OutcomeSpace, x, y) -> h

If needed, one could define

function joint_entropy(o::OutcomeSpace, x, y)
    est = JointProbabilities(JointEntropyShannon(), o)
    return information(est, x, y)
end
kahaaga commented 11 months ago

Btw, I'm writing up a tutorial to summarize the multi-variable information API, including estimation, like you did here. It will be part of my PR soon.

Datseris commented 11 months ago

Okay this all seems fine to me. I am still unhappy about

perm_ent_y_q2 = entropy(Renyi(;q = 2.0), BayesianEstimation(), OrdinalPatterns(), y)

but there doesn't seem to be a way to resolve this.

kahaaga commented 11 months ago

Okay this all seems fine to me. I am still unhappy about

perm_ent_y_q2 = entropy(Renyi(;q = 2.0), BayesianEstimation(), OrdinalPatterns(), y)

but there doesn't seem to be a way to resolve this.

Yep, not if we want to be able to customize the probabilities estimator.