JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems
MIT License

Current status of API is unclear #184

Closed Datseris closed 1 year ago

Datseris commented 1 year ago

@kahaaga I've merged main (which is #168 ) into my branch that defines the encoding API and also implements it for histograms. Things remain unclear to me, mainly because discussions were spread across many different issues and PRs and I no longer remember where the information is supposed to be.

Is it correct that all main API functions start with the estimator types? This means that we have:

But as far as I can tell, these are the only two functions that follow this approach of having the type first. The following functions take the opposite approach of having the input x first, and then the probabilities type:

Can you please explain what is the overall plan? Should we make all functions follow the first or the second version in the order of arguments?
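To make the two orders concrete, here is a minimal sketch (the estimator name and toy implementation are illustrative, not the actual package code):

```julia
# Hypothetical minimal probabilities estimator to illustrate the two argument orders.
abstract type ProbabilitiesEstimator end
struct CountOccurrences <: ProbabilitiesEstimator end

# Version 1: estimator type first, then data.
probabilities(est::CountOccurrences, x) =
    [count(==(v), x) / length(x) for v in sort(unique(x))]

# Version 2: data first, then estimator type.
probabilities(x, est::CountOccurrences) = probabilities(est, x)

x = [1, 1, 2, 3]
probabilities(CountOccurrences(), x)  # type-first call
probabilities(x, CountOccurrences()) # data-first call
```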

kahaaga commented 1 year ago

My opinion: we make all functions follow the first order. That is: type first, then the argument.

Datseris commented 1 year ago

Ok, I'll do this in a new PR.

Datseris commented 1 year ago

@kahaaga and what do we do about the encode and decode functions? They were encode(χ, e::Encoding) before. Should the order be changed?

I have to admit I find it more natural to have encode(element, encoding), because it reads "encode this value using that encoding." The other way around doesn't read naturally: "encode, using this encoding, this value."

Which also makes me think, why did we change the probabilities interface to have the type first? For the entropies I got your argument: it made extending entropies to joint entropies natural. But does this argument carry over to probabilities? Notice that the Julia style guide says "put type first", but this is a bit misleading: that guideline is for cases where the type decides the return type, as in rand(Int, x). In our case the estimator doesn't decide the return type, only the return value.

On the other hand, it would be good to have an idea of how the broader Julia community generally does this. Is the "algorithm-deciding argument" before or after the input data? In JuliaDynamics overall it is input data first, then algorithm-deciding argument. E.g., that's how it is done in RecurrenceAnalysis.jl or TimeseriesSurrogates.jl: surrogate(x, method). I could be wrong, though, and everywhere else it is done the other way around. Let's try to collect at least a couple of examples from other repos?

Datseris commented 1 year ago

I've opened this: https://discourse.julialang.org/t/design-q-order-of-arguments-input-data-and-algorithm-type/91768

kahaaga commented 1 year ago

Good idea on consulting with the community!

I have to admit I find it more natural to have encode(element, encoding), because it reads "encode this value using that encoding." The other way around doesn't read naturally: "encode, using this encoding, this value."

Well, it depends on how you phrase it. Take encode(x, method::A). There's "encode x using method of type A", and there's "A-encode x". Both feel pretty natural to me.

I don't feel strongly about either option beyond the varargs argument, but that argument weighs pretty heavily in my opinion. I've gone through several rounds of design proposals for the CausalityTools methods without coming up with anything reasonably developer-friendly, because without varargs we're back to writing a bunch of specialized methods for every combination of input arguments. That is painful, error-prone, hard to document, and hard to debug.

A third option is to have the type as the first argument only for methods that will never accept varargs as the last argument. But how do we know whether a method won't be extended with varargs later? In some cases it is obvious that it never will, but in other cases it is not so clear.
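A minimal sketch of why varargs favor type-first (names are illustrative): in Julia, a varargs parameter must be the last one in a method signature, so a single method can accept any number of input arrays only if the estimator comes first.

```julia
# Hypothetical estimator type, just to demonstrate dispatch with varargs.
abstract type Estimator end
struct SomeEstimator <: Estimator end

# One method covers measure(est, x), measure(est, x, y), measure(est, x, y, z), ...
# Here it just returns the number of input arrays it received.
measure(est::Estimator, xs...) = length(xs)

# With data first, measure(x, y, ..., est) cannot be expressed as a single
# dispatchable signature, since varargs must come last in a method definition.
measure(SomeEstimator(), rand(10))            # one input array
measure(SomeEstimator(), rand(10), rand(10))  # two input arrays
```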

Which also makes me think, why did we change the probabilities interface to have the type first?

I think the reasoning was just "follow the same style everywhere". But since there are no established guidelines on the topic, we could also just decide on a style guide ourselves.

Anyways, we should see what the community thinks too.

I'll try to find some examples from well-established packages in the meanwhile.

kahaaga commented 1 year ago

Some examples from other packages with functionality for estimating some statistic from data (I'll fill this out as I investigate more)

StatsBase.jl

Implements different algorithms as different functions, with parameters as keyword arguments. They have knuths_sample and fisher_yates_sample, where we'd do sample!(Knuths(), args...) and sample(FisherYates(), args...).

Distributions.jl

Uses a similar type-based approach, where each distribution is its own type. They dispatch type-first, then input. For example, fit_mle(dist::D, x), where D <: Distribution. However, here the return value is also of type D, so this again is not really comparable to our situation.

Distances.jl

Perhaps the best example of the same approach as ours: distance type first, then input data, and the return type is not the same as the distance type. For example, pairwise(d::Distance, args...) returns a matrix of distances.

kahaaga commented 1 year ago

Ok, a radically different suggestion, regardless of argument order: use functors instead, like we do in TimeseriesSurrogates. For motivation, see the last comment in https://discourse.julialang.org/t/how-are-functors-useful/24110/28.

What if, instead of having entropy(e::Shannon, args...), we defined EntropyShannon() and used it as a functor? In general, it would look something like:

struct EntropyShannon; params...; end
struct EntropyRenyi; params...; end
struct EntropyTsallis; params...; end
...

(e::EntropyShannon)(est, x...)
(e::EntropyRenyi)(est, x...)
(e::EntropyTsallis)(est, x...)

This would also simplify stuff in CausalityTools, where on the v2 branch one would do

struct MIShannon <: MutualInformation; params...; end
struct MITsallis <: MutualInformation; params...; end
....

mutualinfo(def::MITsallisTypeA,  measure::MITsallis, est::SomeEstimator, x, y)
mutualinfo(def::MITsallisTypeB,  measure::MITsallis, est::SomeEstimator, x, y)
....

With a functor approach, this would be the much less verbose

(mi::MITsallis)(def::MITsallisTypeA, est, x, y, ....) # seamlessly compute multi-information using varargs
(mi::MITsallis)(def::MITsallisTypeB, est, x, y, ...)

I actually like this a lot more than the current approach. Moreover, implementing generator-type structs, as desired in #70, would come very naturally.

One would have to think a bit about how to implement the multiscale interface, but I think that should be doable...
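For concreteness, here is a minimal runnable version of the functor idea sketched above (the type name, its field, and the plug-in probability vector are all hypothetical):

```julia
# Hypothetical functor type: the instance stores parameters (here, the
# logarithm base), and calling the instance computes the entropy.
struct EntropyShannon
    base::Float64
end

# Shannon entropy of a probability vector, skipping zero-probability bins.
function (e::EntropyShannon)(p::AbstractVector)
    -sum(pᵢ * log(e.base, pᵢ) for pᵢ in p if pᵢ > 0)
end

h = EntropyShannon(2.0)
h([0.5, 0.5])  # entropy of a fair coin, in bits
```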

Datseris commented 1 year ago

It seems that the community generally leans towards type first, input afterwards. So let's do this everywhere.


I don't see any immediate benefit of functors for entropies. At the moment it appears that you have to create exactly the same number of structs and define the same number of new functions. And a user would have to learn exactly the same number of new things plus one more, the concept of functors, which I don't think is nearly as basic a concept as a function.

So why use something conceptually more complicated than functions?


mutualinfo(def::MITsallisTypeA,  measure::MITsallis, est::SomeEstimator, x, y)

Off-topic, but I think this is too much for a user. I wouldn't define MIType at all and would just use different functions altogether. Having three different types to decide the "algorithm" used is probably a good hint that the specialization should occur in the name of the function.

Datseris commented 1 year ago

Ok, I've now read the discourse post you linked, which argues in favor of functors for establishing a one-argument-function interface. Hence, functors would only be valid if we could achieve that. We can't; we will always have a probabilities estimator and a data input. The alternative of passing the probability estimator as a "parameter" to the entropy functor goes against what we want to achieve here on an educational level: that the entropy and the probability extraction are two genuinely different things.

kahaaga commented 1 year ago

And a user would have to learn exactly the same number of new things plus one more, the concept of functors, which I don't think is nearly as basic a concept as a function.

Good point.

So why use something conceptually more complicated than functions?

The advantage would be pre-allocation for speed-ups during repeated computations (e.g. surrogate testing). But this is a deeper issue that we should address later, when all related packages have a stable release. That is, there are probably advantages to using a similar approach for entropy, probabilities, mutualinfo, transferentropy etc. for pre-allocation and repeated computations.

It seems that the community generally leans towards type first, input afterwards. So let's do this everywhere.

Ok, fine with me! Let's do that.

Ok, I've now read the discourse post you linked, which argues in favor of functors for establishing a one-argument-function interface. Hence, functors would only be valid if we could achieve that. We can't; we will always have a probabilities estimator and a data input. The alternative of passing the probability estimator as a "parameter" to the entropy functor goes against what we want to achieve here on an educational level: that the entropy and the probability extraction are two genuinely different things.

Yes, let's stick to our current approach.

kahaaga commented 1 year ago

Off-topic, but I think this is too much for a user. I wouldn't define MIType at all and would just use different functions altogether. Having three different types to decide the "algorithm" used is probably a good hint that the specialization should occur in the name of the function.

I opened an issue for this in CausalityTools. See the linked issue above.

Datseris commented 1 year ago

The advantage would be pre-allocation for speed-ups during repeated computations (e.g. surrogate testing).

This is possible just as well with a struct. The pre-allocation happens once you initialize e = Shannon(). It doesn't matter whether you call e as e(est, x) or entropy(e, est, x). The same could have been done in our surrogate package: you could have r = RandomPhase(); surrogenerator(r, x), where r has pre-initialized the FT plan.
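A minimal sketch of the point (the Histogram type and its binning are hypothetical): the buffer is allocated once in the struct's constructor, and an ordinary function reuses it on every call, so no functor is needed.

```julia
# Hypothetical estimator whose count buffer is pre-allocated at construction.
struct Histogram
    bins::Int
    counts::Vector{Int}  # allocated once, reused across calls
end
Histogram(bins::Int) = Histogram(bins, zeros(Int, bins))

# Plain function call: reuses est.counts instead of allocating per call.
function probabilities!(est::Histogram, x)
    fill!(est.counts, 0)
    for v in x
        # Assume v ∈ (0, 1]; map it to a bin index.
        est.counts[clamp(ceil(Int, v * est.bins), 1, est.bins)] += 1
    end
    est.counts ./ length(x)
end

est = Histogram(4)              # the allocation happens here, once
probabilities!(est, rand(100))  # repeated calls reuse the same buffer
```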


okay, I'll finish #187 tomorrow with the final, decided-upon API: types are always the first argument, but no functors.

kahaaga commented 1 year ago

This is possible just as well with a struct. The pre-allocation happens once you initialize e = Shannon(). It doesn't matter whether you call e as e(est, x) or entropy(e, est, x). The same could have been done in our surrogate package: you could have r = RandomPhase(); surrogenerator(r, x), where r has pre-initialized the FT plan.

Ah, yes. Let's keep this in mind for later.

okay, I'll finish https://github.com/JuliaDynamics/Entropies.jl/pull/187 tomorrow with the final, decided-upon API: types are always the first argument, but no functors.

Sounds good.