Declare `all_possible_outcomes` and implement it for `FixedRectangularBinning`.

Datseris commented 1 year ago

This PR does:

[x] Reorganize the binning source file a bit and remove many unnecessary methods
[x] Declare the public API function all_possible_outcomes
[x] Implement it for ValueHistogram with FixedRectangularBinning, both of which call the method for RectangularBinEncoder, which errors for non-fixed binning.
[ ] Update tests.

Datseris commented 1 year ago

I like outcome_space a lot as well btw as an alternative name.

kahaaga commented 1 year ago

I like outcome_space a lot as well btw as an alternative name.

Then let's use outcome_space. That also aligns well with how the docstrings for the various probability estimators are worded at the moment.

Datseris commented 1 year ago

I am thinking of not restricting outcome_space to always return a Vector. For example, for the binning here the most efficient return value is simply the Cartesian indices in array form, right? For most estimators the return type is indeed a vector.
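A minimal sketch of why a non-Vector return can be cheap, assuming a hypothetical 2-D binning with 3 bins along the first dimension and 2 along the second (the sizes and variable names are illustrative, not taken from the PR). `CartesianIndices` from Base is already an `AbstractArray`, so it could serve as the outcome space without being collected first:

```julia
# Hypothetical number of bins along each dimension (illustrative values).
nbins = (3, 2)

# CartesianIndices is a lazy AbstractArray: it stores the index ranges,
# not every individual index.
bin_space = CartesianIndices(nbins)

# It still supports the array interface a caller would expect.
n_outcomes = length(bin_space)        # total number of bins
first_bin = first(bin_space)          # the bin at index (1, 1)
as_vector = vec(collect(bin_space))   # materialize only when a Vector is needed
```

Whether the actual implementation returns this type is up to the PR; the point is only that an array-like return does not have to be a `Vector`.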

kahaaga commented 1 year ago

Allows FixedRectangularBinning to have different N along each dimension. (Why didn't we do this from the start? If the epsilons are fixed, N can still be different along each dimension while we still gain the unique histogram we want.)

This should definitely be included. I didn't need it at the time, so I never spent time implementing it.

This functionality isn't relevant for the implementation of all_possible_outcomes, which this PR is about, though. Perhaps just open an issue for different-N-along-each-dimension, and implement it in a separate PR? This isn't crucial for 2.0 anyway.

Datseris commented 1 year ago

@kahaaga I realize now that we can remove one mandatory extension: by default we can define total_outcomes(x, est) = length(outcome_space(x, est)). Only if getting the full outcome space is costly does one need to extend total_outcomes directly. Yay or nay?
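A minimal sketch of this fallback pattern, using hypothetical stand-in types (`ToyEstimator`, `ToyBinning`, `ToyHugeSpace`) and locally defined functions rather than the package's real ones:

```julia
# Hypothetical stand-ins for the package's abstract type and two estimators.
abstract type ToyEstimator end

struct ToyBinning <: ToyEstimator
    nbins::Int
end

struct ToyHugeSpace <: ToyEstimator
    m::Int
end

# Each estimator defines its own outcome space...
outcome_space(x, est::ToyBinning) = 1:est.nbins

# ...and the generic fallback derives the count from it.
total_outcomes(x, est::ToyEstimator) = length(outcome_space(x, est))

# An estimator whose outcome space is costly to enumerate extends the count directly.
total_outcomes(x, est::ToyHugeSpace) = factorial(est.m)

x = rand(100)
n_binning = total_outcomes(x, ToyBinning(5))   # goes through the fallback
n_huge = total_outcomes(x, ToyHugeSpace(4))    # uses the specialized method
```

With this layout a new estimator only has to implement `outcome_space`; dispatch picks the specialized `total_outcomes` method when one exists.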

kahaaga commented 1 year ago

@kahaaga I realize now that we can remove one mandatory extension: by default we can define total_outcomes(x, est) = length(outcome_space(x, est)). Only if getting the full outcome space is costly does one need to extend total_outcomes directly. Yay or nay?

How didn't we think of this before? Super elegant. A definitive yay!

Datseris commented 1 year ago

@kahaaga we need to agree on and decide what to do with your:

"""
    outcomes(x, scheme::Encoding) → Vector{Int}
    outcomes!(s, x, scheme::Encoding) → Vector{Int}

Map each `xᵢ ∈ x` to a distinct outcome according to the encoding `scheme`.

Optionally, write outcomes into the pre-allocated symbol vector `s` if the `scheme`
allows for it. For usage examples, see individual encoding scheme docstrings.

See also: [`RectangularBinEncoding`](@ref), [`GaussianCDFEncoding`](@ref),
[`OrdinalPatternEncoding`](@ref).
"""
function outcomes(x::X, ::Encoding) where X
    throw(ArgumentError("`outcomes` not defined for input data of type $(X)."))
end

First, the default extension isn't helpful. It isn't the type of X that should be deciding if the method exists, it's the type of the encoding. Second, the encodings return integers? We really have to decide and clarify this. Personally I think outcomes must be meaningful and intuitive, and each estimator must define its own outcome space.

Here are the possibilities:

Anything that returns outcomes must return elements of the outcome space. This is m-dimensional vectors for permutation, bins for the binning, etc.
Outcomes are integers. The outcome space is whatever other thing, so that outcome i::Int is the "element" Omega[i].

All of this gets so confusing... On the one hand, for performance and processing in downstream packages, having just integers is very helpful. On the other hand, this is really, really confusing for me. I am already lost and don't know what to return in probabilities_and_outcomes for practically any estimator...

kahaaga commented 1 year ago

It isn't the type of X that should be deciding if the method exists, it's the type of the encoding.

Some encodings work on univariate timeseries, while others also work on multivariate data. That's why this method exists. RectangularBinEncoder works on both, while GaussianCDFEncoding is only defined for univariate timeseries.
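One way both points could coexist, sketched with hypothetical names (`ToyEncoding`, `ToyUnivariateEncoding`, `toy_outcomes`) and not taken from the package: each concrete method restricts the input type it supports, while the fallback dispatches on the encoding and reports both types in its error.

```julia
# Hypothetical stand-ins; not the package's actual types or functions.
abstract type ToyEncoding end
struct ToyUnivariateEncoding <: ToyEncoding end

# Supported case: univariate input given as a vector of reals.
# The encoding here is a toy rule: 1 for positive values, 2 otherwise.
toy_outcomes(x::AbstractVector{<:Real}, ::ToyUnivariateEncoding) =
    [xi > 0 ? 1 : 2 for xi in x]

# Fallback for any other combination: name the encoding and the input type.
function toy_outcomes(x, e::ToyEncoding)
    throw(ArgumentError(
        "`toy_outcomes` not defined for $(typeof(e)) with input of type $(typeof(x))."))
end

encoded = toy_outcomes([0.5, -1.0, 2.0], ToyUnivariateEncoding())

# Multivariate input (a matrix) is not supported by this toy encoding.
err = try
    toy_outcomes(rand(2, 2), ToyUnivariateEncoding())
    nothing
catch e
    e
end
```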

Second, the encodings return integers?

The signature of this method must be updated to reflect the new situation: that different probabilities estimators have different outcome spaces. The signature should be

    outcomes(x, scheme::Encoding) → some_iterable
    outcomes!(s, x, scheme::Encoding) → some_iterable

Converts each xᵢ ∈ x to an outcome. The type of the outcomes depends on the encoding `scheme`,
but this method always returns an iterable of outcomes. 

Optionally, write outcomes into the pre-allocated symbol vector `s` if the `scheme`
allows for it. The element type of `s` must match what the scheme returns. 

For usage examples, see individual encoding scheme docstrings.

(or something like that)

anything that returns outcomes must return elements of the outcome space. This is m-dimensional vectors for permutation, bins for the binning, etc.

I strongly favor this option. However, going for this option will cause us to be "in limbo" for a while, until #87 is resolved. This is not a problem, as long as we document it and release a breaking change when #87 is fixed.

For many estimators, the outcome space will indeed be integers:

WaveletOverlap: outcome space is the wavelet scale levels 1, 2, ..., maxlevel.
SymbolicPermutation, SymbolicWeightedPermutation, SymbolicAmplitudeAwarePermutation: outcome space is (for now) the integers 1, 2, ..., m!
RectangularBinEncoder{FixedRectangularBinning}: outcome space is ... an iterable something else?
etc.

kahaaga commented 1 year ago

I think the most important part about the outcomes is that the user knows what they are. Good docstrings for the estimators (as I think we have now) take care of that.

Datseris commented 1 year ago

I think the most important part about the outcomes is that the user knows what they are. Good docstrings for the estimators (as I think we have now) take care of that.

That's one important factor. There are two more important factors:

They are sensible. They make immediate sense and the user immediately understands what they mean. The integers don't fulfill this role. I don't know what outcome=1 is, because it's an enumerator. It can be anything from a sorted list of "things".
They can be used as dictionary keys. For me a crucial thing, which I hadn't thought of so far, is being able to track how the probability of a specific event changes with a change of a system parameter, or in any case changes across a parametric dimension. I want to be able to plot p_i versus some setting α. How do I keep track of p_i consistently? I can't just use probabilities because the vector is arbitrarily ordered and may or may not have zeros.

Hence, here is my proposal:

Outcomes must be as detailed as possible, conveying the most information possible without requiring further processing. However, they must also be simple enough to be used as dictionary keys, i.e., they must be hashable. Integers, floats, and static vectors of them are hashable, for example. Strings are also hashable. Functions aren't.

If you can make outcomes dictionary keys then you can keep track of changes in a probability of a specific outcome by using probabilities_and_outcomes and using outcome[i] as a key for probability[i].
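A minimal sketch of this bookkeeping, with made-up probabilities for two hypothetical parameter values. The outcomes are ordinal patterns stored as tuples (an assumption chosen because tuples are hashable), and the two runs list them in different orders:

```julia
# Made-up results for two values of a parameter α (illustrative numbers only).
outcomes_a = [(1, 2, 3), (3, 2, 1)]
probs_a    = [0.75, 0.25]
outcomes_b = [(3, 2, 1), (2, 1, 3), (1, 2, 3)]
probs_b    = [0.5, 0.25, 0.25]

# Use outcome[i] as the key for probability[i].
dict_a = Dict(zip(outcomes_a, probs_a))
dict_b = Dict(zip(outcomes_b, probs_b))

# Track one specific outcome across α; an unobserved outcome gets probability 0.
target = (3, 2, 1)
p_target = [get(d, target, 0.0) for d in (dict_a, dict_b)]

# An outcome that never appeared in the first run.
p_missing = get(dict_a, (2, 1, 3), 0.0)
```

Because the lookup is by outcome rather than by position, the ordering of the probability vector and the presence of zeros no longer matter.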

Yay or nay?

This means that the outcomes of the symbolic permutation estimators cannot be integers, because an integer doesn't give you a clue as to which outcome it actually is. Is integer 1 the pattern 3, 2, 1, or any of the other five possible patterns?

For the wavelets we have to think about what the best outcome type is. I guess for most estimators we have to decide what the outcome type should be. Integer is the easy way out, but it's not intuitive. You never know what integer "1" means.

Datseris commented 1 year ago

An alternative is as follows:

probabilities_and_outcomes returns outcomes, which is always a Vector{Int}. This saves us space and computations (maybe???). Then, outcome_space returns a Vector{Any}, which contains the outcomes described in as much detail as possible. j::Int = outcome[i] is the integer of the outcome with probability p[i], and the outcome itself is described in full detail in outcome_space[j].
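A minimal sketch of this two-layer alternative, using a hypothetical outcome space of ordinal patterns of length 3 and made-up probabilities (none of the variable names below are package API):

```julia
# Hypothetical descriptive outcome space: all ordinal patterns of length 3.
toy_outcome_space = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 1, 2], [3, 2, 1]]

# What a probabilities_and_outcomes-like call would return under this alternative:
p = [0.5, 0.5]       # made-up probabilities
outcome = [6, 1]     # integer outcomes, always a Vector{Int}

# Recover the descriptive outcome belonging to p[1] by indexing.
j = outcome[1]
described = toy_outcome_space[j]
```

The extra indirection keeps the returned outcomes compact, at the cost of one more lookup whenever the user wants to know what an integer means.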

That said, I am not sure whether this extra layer is worth the effort. Its only benefit would be to allow an outcome space that isn't hashable. But in the worst-case scenario, such as wavelets, we can still make the outcome space the integers and just say in the docstring what the integers are...?

Datseris commented 1 year ago

@kahaaga I am looking at the dispersion method. That method requires the "encoding" to be the integers, right? In this case, perhaps outcomes, probabilities_and_outcomes stay purely in the "outcome space", but the encodings themselves return integers?

Datseris commented 1 year ago

We should have a zoom call to finalize this, it's much more efficient. I also want to argue that Encoders have no reason to be part of the public API. A single Dispersion method doesn't justify having such a huge addition to the public API, and, furthermore, the Dispersion method can just take in probability estimators instead... Encodings should remain internal.

kahaaga commented 1 year ago

We should have a zoom call to finalize this, it's much more efficient.

Yes, let's do that, @Datseris. If you want it done swiftly, I'm available today before 1700 ECT, or today after 2000 ECT. Alternatively, sometime during work hours tomorrow (i.e. 09-17 ECT ish).

Datseris commented 1 year ago

ok, I booked it tomorrow at 10am CET

kahaaga commented 1 year ago

ok, I booked it tomorrow at 10am CET

Sweet. I'll share a Zoom link here just before 10am then.

kahaaga commented 1 year ago

@Datseris Here's the meeting invitation:

https://uib.zoom.us/j/61840466135?pwd=eVNIbDRudDQ4WVdlSXpLYkZUU3dKZz09

Datseris commented 1 year ago

Okay, I am stopping here. Before merging this PR, I'll update/fix all the tests, or as many as possible. Then I will merge. What is left to do, to be done in different PRs:

[ ] Update documentation to remove encodings. Simply, the file utils.md has to be deleted. In this PR, the Dispersion estimator needs to be updated to not rely on GaussianCDFEncoding in its documentation string.
[ ] Update transfer operator
[ ] Update the symbolic permutation estimators to use the outcome space

kahaaga commented 1 year ago

Okay, I am stopping here. Before merging this PR, I'll update/fix all the tests, or as many as possible. Then I will merge.

Sounds good.

What is left to do, to be done in different PRs:

It would be nice if you could open issues for these outstanding todos, so we'll remember to address them.

JuliaDynamics / ComplexityMeasures.jl

Declare `all_possible_outcomes` and implement it for `FixedRectangularBinning`. #162