JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems
MIT License

Encoding API #171

Closed Datseris closed 1 year ago

Datseris commented 1 year ago

Alright, let's think about how we want to set up the encodings. In my view, encodings are an intermediate interface used by probabilities estimators. Here is what I propose:

An Encoding encodes elements into integers exclusively. If the encoded values are not integers, I actually don't see the need for an intermediate encoding. The production interface is:

  1. encoding(x, est::ProbabilitiesEstimator) -> e::AbstractEncoding produces a type that can encode elements of x as integers. Not all probabilities estimators have encodings.
  2. encode(element, e::Encoding) -> i::Int encodes the element into an integer.
  3. decode(i::Int, e::Encoding) -> ω decodes the encoding into an outcome from the outcome space of the probabilities estimator used to create the encoding.

For binnings, I already have this encoding because I actually use it in another project. It uses CartesianIndices and LinearIndices to go back and forth between the encoding and the decoding.
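
To make the idea concrete, here is a minimal sketch of how a rectangular-grid encoding could map between bins and integers with LinearIndices and CartesianIndices. This is not the code from the branch mentioned above: the type `GridEncoding` and its fields are hypothetical, and decoding to the lower-left bin corner is just one possible choice of outcome.

```julia
# Hypothetical type for illustration; not the package's actual encoder.
struct GridEncoding{D}
    mins::NTuple{D,Float64}    # lower corner of the grid
    widths::NTuple{D,Float64}  # bin width along each dimension
    nbins::NTuple{D,Int}       # number of bins along each dimension
end

# Encode a point as the linear index of the bin that contains it.
# Points outside the grid raise a BoundsError in this sketch.
function encode(x, e::GridEncoding{D}) where {D}
    # 1-based bin index along each dimension
    bin = ntuple(d -> floor(Int, (x[d] - e.mins[d]) / e.widths[d]) + 1, D)
    return LinearIndices(e.nbins)[bin...]
end

# Decode an integer back to an outcome: here, the lower-left corner of the bin.
function decode(i::Int, e::GridEncoding{D}) where {D}
    bin = Tuple(CartesianIndices(e.nbins)[i])
    return ntuple(d -> e.mins[d] + (bin[d] - 1) * e.widths[d], D)
end

e = GridEncoding((0.0, 0.0), (0.5, 0.5), (2, 2))
i = encode((0.7, 0.2), e)   # linear index of the bin containing the point
ω = decode(i, e)            # corner of that bin
```

With this layout, `encode` and `decode` are inverse to each other at the level of bins: decoding an integer and encoding any point inside the resulting bin gives the same integer back.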

kahaaga commented 1 year ago

At first glance/thought, it seems reasonable to require encodings to return integers and decodings to return elements of the outcome space.

That way, the only tricky part for the various estimators is to decide precisely what the outcome space is/should be, which is partly a matter of taste.

Datseris commented 1 year ago

Should we quickly decide what an encoding should return when invalid data points are given to it? E.g., a data point outside the fixed histogram bounds, or a 5-dimensional data point for order-4 ordinal patterns.

I am not sure how to proceed, but erroring (what currently happens in my branch of the bin encoding) is not useful, at least not for the binnings. Points falling "outside" the histogram should simply not affect the probabilities, but right now they halt the whole computation by throwing an error.

kahaaga commented 1 year ago

I'm also not sure what the best way to proceed is in general.

If we want to maintain the relation encode(x[i], ...) -> Int for an input dataset x, then we need a systematic way of handling it. But I'm not sure that's the best way to deal with it.

For ordinal patterns, I guess it would make sense to just throw an OutOfBoundsError or something like that, because it is not possible to encode a D1-dimensional vector to a D2-dimensional ordinal pattern if D1 != D2.
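
As a rough sketch of the erroring behaviour for ordinal patterns: the type `OrdinalEncoding` below is hypothetical, the integer is obtained from the Lehmer code of the sorting permutation (one possible choice, not necessarily what the package does), and Base's `DimensionMismatch` stands in for the "OutOfBoundsError or something like that".

```julia
# Hypothetical encoding type, for illustration only.
struct OrdinalEncoding
    m::Int  # order (length) of the ordinal pattern
end

function encode(x::AbstractVector, e::OrdinalEncoding)
    # A D1-dimensional vector cannot be encoded as a D2-dimensional pattern.
    length(x) == e.m || throw(DimensionMismatch(
        "expected a vector of length $(e.m), got length $(length(x))"))
    p = sortperm(x)  # ordinal pattern of x
    # Lehmer code: maps each permutation to a unique integer in 1:factorial(m)
    code = 0
    for i in 1:e.m
        smaller = count(j -> p[j] < p[i], (i + 1):e.m)
        code += smaller * factorial(e.m - i)
    end
    return code + 1
end

enc = OrdinalEncoding(3)
encode([0.1, 0.5, 0.9], enc)   # increasing pattern, encoded as 1
encode([0.9, 0.5, 0.1], enc)   # decreasing pattern, encoded as 6
# encode(rand(5), OrdinalEncoding(4)) would throw a DimensionMismatch
```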

Perhaps throwing an error should be the default behaviour, but we allow exceptions, e.g. for the histograms. In that case, encoding using a RectangularBinEncoder simply returns encodings::Vector{Int} where length(encodings) <= length(x), i.e. points outside the binning are simply discarded. I think this should be okay if it is documented.
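
Something like the following is what I have in mind for the histogram exception. It is only a sketch: `BinEncoding1D` and `encode_all` are hypothetical names (not the actual RectangularBinEncoder), and returning `nothing` for points outside the binning is an assumed convention, not something we have agreed on.

```julia
# Hypothetical 1D encoder over [xmin, xmin + n*width), for illustration only.
struct BinEncoding1D
    xmin::Float64
    width::Float64
    n::Int
end

# Returns the bin index, or `nothing` if x lies outside the binning.
function encode(x::Real, e::BinEncoding1D)
    i = floor(Int, (x - e.xmin) / e.width) + 1
    return 1 <= i <= e.n ? i : nothing
end

# Encode a whole dataset, discarding points outside the binning,
# so that length(encodings) <= length(xs).
function encode_all(xs, e::BinEncoding1D)
    encodings = Int[]
    for x in xs
        i = encode(x, e)
        i === nothing || push!(encodings, i)
    end
    return encodings
end

e = BinEncoding1D(0.0, 0.25, 4)
xs = [0.1, 0.3, 1.7, -0.2, 0.9]
encodings = encode_all(xs, e)   # 1.7 and -0.2 fall outside and are dropped
```

Probabilities computed from `encodings` are then normalized over the retained points only, which is the part that would need to be documented.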