Closed Datseris closed 1 year ago
I like outcome_space a lot as well btw as an alternative name.
Then let's use outcome_space. That also aligns well with how the docstrings for the various probability estimators are worded at the moment.
I am thinking of not restricting outcome_space to return a Vector no matter what. For example, for the binning here the most efficient return is simply the Cartesian indices in array form. Right? For most estimators the return type is indeed a vector.
Allows FixedRectangularBinning to have different N along each dimension. Why didn't we do this from the start? If the epsilons are fixed, N can still be different along each dimension while we still gain the unique histogram we want.
This should definitely be included. I didn't implement it to begin with because I didn't need it at the time, so I didn't spend any time on actually implementing it.
This functionality isn't relevant for the implementation of all_possible_outcomes, which this PR is about, though. Perhaps just open an issue for different-N-along-each-dimension, and implement it in a separate PR? This isn't crucial for 2.0 anyways.
@kahaaga I realize now, we can remove one mandatory extension: we can make total_outcomes(x, est) = length(outcome_space(x, est)) the default. Only if getting the full outcome space is costly does one need to extend total_outcomes directly. Yay or nay?
How didn't we think of this before? Super elegant. A definitive yay!
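A minimal sketch of the fallback agreed on above. The types ToyEstimator, ToyOrdinal and ToyBinning are hypothetical stand-ins invented for this example, not the package's estimators; only the names total_outcomes and outcome_space come from the discussion.

```julia
# Hypothetical stand-in types; not the package's actual definitions.
abstract type ToyEstimator end

struct ToyOrdinal <: ToyEstimator
    m::Int  # pattern length
end

struct ToyBinning <: ToyEstimator
    n_bins::NTuple{2, Int}  # number of bins along each of two dimensions
end

# Generic fallback: count the elements of the full outcome space.
total_outcomes(x, est::ToyEstimator) = length(outcome_space(x, est))

# Each estimator only has to define its outcome space...
outcome_space(x, est::ToyOrdinal) = collect(1:factorial(est.m))
outcome_space(x, est::ToyBinning) = vec(collect(CartesianIndices(est.n_bins)))

# ...unless enumerating it is costly, in which case the count is given directly.
total_outcomes(x, est::ToyBinning) = prod(est.n_bins)

x = rand(100)
n_ordinal = total_outcomes(x, ToyOrdinal(3))       # goes through the fallback
n_binning = total_outcomes(x, ToyBinning((4, 5)))  # uses the direct method
```

With this layout a new estimator gets total_outcomes for free once outcome_space exists, and the more specific method is an optimization rather than a requirement.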
@kahaaga we need to agree on and decide what to do with your:
"""
    outcomes(x, scheme::Encoding) → Vector{Int}
    outcomes!(s, x, scheme::Encoding) → Vector{Int}

Map each `xᵢ ∈ x` to a distinct outcome according to the encoding `scheme`.

Optionally, write outcomes into the pre-allocated symbol vector `s` if the `scheme`
allows for it. For usage examples, see individual encoding scheme docstrings.

See also: [`RectangularBinEncoding`](@ref), [`GaussianCDFEncoding`](@ref),
[`OrdinalPatternEncoding`](@ref).
"""
function outcomes(x::X, ::Encoding) where X
    throw(ArgumentError("`outcomes` not defined for input data of type $(X)."))
end
Firstly, the default extension isn't helpful. It isn't the type of X that should be deciding if the method exists, it's the type of the encoding. Second, the encodings return integers? We have to really decide and clarify this. Personally I think outcomes must be meaningful, intuitive, and each estimator must define their own outcome space.
Here are the possibilities:
- Anything that returns outcomes must return elements of the outcome space. This is m-dimensional vectors for permutation, bins for the binning, etc.
- i::Int is the "element" Omega[i].
All of this gets so confusing... On one hand, for performance and processing in downstream packages, having just integers is very helpful. On the other hand, this is really, really confusing for me. I am already lost and don't know what to return in probabilities_and_outcomes for practically any estimator...
It isn't the type of X that should be deciding if the method exists, it's the type of the encoding.
Some encodings work on univariate timeseries, while others also work on multivariate data. That's why this method exists. RectangularBinEncoder works on both, while GaussianCDFEncoding is only defined for univariate timeseries.
Second, the encodings return integers?
The signature of this method must be updated to reflect the new situation: that different probability estimators have different outcome spaces. The signature should be
outcomes(x, scheme::Encoding) → some_iterable
outcomes!(s, x, scheme::Encoding) → some_iterable
Converts each xᵢ ∈ x to an outcome. The type of the outcomes depends on the encoding `scheme`,
but this method always returns an iterable of outcomes.
Optionally, write outcomes into the pre-allocated symbol vector `s` if the `scheme`
allows for it. The element type of `s` must match what the scheme returns.
For usage examples, see individual encoding scheme docstrings.
(or something like that)
anything that returns outcomes must return elements of the outcome space. This is m-dimensional vectors for permutation, bins for the binning, etc.
I strongly favor this option. However, going for this option will cause us to be "in limbo" for a while, until #87 is resolved. This is not a problem, as long as we document it and release a breaking change when #87 is fixed.
For many estimators, the outcome space will indeed be integers:
- WaveletOverlap: outcome space is the wavelet scale levels 1, 2, ..., maxlevel.
- SymbolicPermutation, SymbolicWeightedPermutation, SymbolicAmplitudeAwarePermutation: outcome space is (for now) the integers 1, 2, ..., m!
- RectangularBinEncoder{FixedRectangularBinning}: outcome space is ... an iterable something else?
I think the most important part about the outcomes is that the user knows what they are. Good docstrings for the estimators (as I think we have now) takes care of that.
That's one important factor. There are two more important factors:
p_i versus some setting α. How do I keep track of p_i consistently? I can't just use probabilities because the vector is arbitrarily ordered and may or may not have zeros.
Hence, here is my proposal:
Outcomes must be as complicated as possible, to convey the most information possible without requiring further processing. However, they must also be simple enough to be used as dictionary keys. I.e., they must be hashable. Integers, floats and static vectors of them are hashable, for example. Strings are also hashable. Functions aren't.
If you can make outcomes dictionary keys then you can keep track of changes in the probability of a specific outcome by using probabilities_and_outcomes and using outcome[i] as a key for probability[i].
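A rough sketch of the dictionary-key idea, using made-up data in place of real estimator output. The outcome tuples, the probabilities and all variable names are assumptions for illustration; the two pairs of vectors stand for what probabilities_and_outcomes might return under two different settings α.

```julia
# Toy data: outcomes estimated under two settings α. The order of the returned
# outcomes differs, and the second run lacks one outcome entirely.
# All names and values here are illustrative assumptions.
outcomes_a = [(1, 2, 3), (3, 2, 1), (2, 1, 3)]
probs_a    = [0.5, 0.3, 0.2]

outcomes_b = [(3, 2, 1), (1, 2, 3)]
probs_b    = [0.6, 0.4]

# Tuples of integers are hashable, so they can serve directly as Dict keys.
table_a = Dict(zip(outcomes_a, probs_a))
table_b = Dict(zip(outcomes_b, probs_b))

# Track one specific outcome across settings; unobserved outcomes get probability 0.
target = (3, 2, 1)
p_target = [get(t, target, 0.0) for t in (table_a, table_b)]
p_missing = get(table_b, (2, 1, 3), 0.0)
```

Because the lookup goes through the outcome itself, neither the arbitrary ordering of the probability vector nor missing zero entries matters.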
Yay or nay?
This means that the outcomes of permutation symbolic stuff cannot be integers, because this doesn't give you a clue of what outcome this actually is. Is integer 1 the pattern 3, 2, 1, or any of the other 5 possible patterns?
For the wavelets we have to think what the best outcome type is. I guess for most estimators we have to guess what the outcome type should be. Integer is the easy way out, but it's not intuitive. You never know what integer "1" means.
An alternative is as follows: probabilities_and_outcomes returns outcomes, which is always Vector{Int}. This saves us space and computations (maybe???). Then, outcome_space returns a vector of Any, which holds the outcomes described in as much detail as possible. j::Int = outcome[i] is the integer of the outcome with probability p[i], and the outcome itself is encoded in high accuracy in outcome_space[j].
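The indirection in this alternative could look roughly as follows. The values and the variable names space, p, outcome and described are assumptions made up for this example; nothing here is the package's actual return format.

```julia
# Full description of each outcome; the element type can be anything.
# (Toy values chosen for illustration only.)
space = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 1, 2], [3, 2, 1]]

# What a hypothetical probabilities_and_outcomes would return under this scheme:
p       = [0.7, 0.3]
outcome = [6, 1]      # always Vector{Int}: indices into `space`

# Recover the detailed outcome for each probability p[i].
described = [space[j] for j in outcome]
```

The cost of this layout is one extra lookup per outcome; the gain is that the elements of the outcome space do not need to be hashable.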
Although, I am not sure whether this extra layer is worth the effort. Its only benefit would be to allow an outcome space that isn't hashable. But in the worst case scenario, such as wavelets, we can still make the outcome space the integers and just say in the docstring what the integers are...?
@kahaaga I am looking at the dispersion method. That method requires the "encoding" to be the integers, right? In this case, perhaps outcomes, probabilities_and_outcomes stay purely in the "outcome space", but the encodings themselves return integers?
We should have a zoom call to finalize this, it's much more efficient. I also want to argue that Encoders have no reason to be part of the public API. A single Dispersion method doesn't justify having such a huge addition to the public API, and, furthermore, the Dispersion method can just take in probability estimators instead... Encodings should remain internal.
We should have a zoom call to finalize this, it's much more efficient.
Yes, let's do that, @Datseris. If you want it done swiftly, I'm available today before 1700 ECT, or today after 2000 ECT. Alternatively, sometime during work hours tomorrow (i.e. 09-17 ECT ish).
ok, I booked it tomorrow at 10am CET
Sweet. I'll share a Zoom link here just before 10am then.
@Datseris Here's the meeting invitation:
https://uib.zoom.us/j/61840466135?pwd=eVNIbDRudDQ4WVdlSXpLYkZUU3dKZz09
Okay, I am stopping here. Before merging this PR, I'll update/fix all the tests, or as many as possible. Then I will merge. What is left to do, to be done in different PRs:
utils.md has to be deleted. In this PR, the Dispersion estimator needs to be updated to not rely on GaussianCDFEncoding in its documentation string.
Okay, I am stopping here. Before merging this PR, I'll update/fix all the tests, or as many as possible. Then I will merge.
Sounds good.
What is left to do, to be done in different PRs:
It would be nice if you could open issues for these outstanding todos, so we'll remember to address them.
This PR does:
all_possible_outcomes
ValueHistogram with FixedRectangularBinning, which both call the method for RectangularBinEncoder, which errors for non-fixed binning.