Adjust `GaussianCDFEncoding` and related stuff for new encoding API

kahaaga commented 1 year ago

I have now adjusted the codebase so that GaussianCDFEncoding obeys the encoding API. This addresses another point in #190.

Previously, GaussianCDFEncoding was only applicable to vector-valued input, meaning that we had to encode an entire timeseries in bulk. By adjusting the type from GaussianCDFEncoding(; c::Int = 3) to GaussianCDFEncoding(; μ, σ, c::Int = 3), we now require the user to explicitly specify the mean and standard deviation of the desired normal CDF. Thus, encode and decode now both handle scalar-valued inputs (i.e. no need for entire timeseries, because μ and σ must be pre-computed).
In the Dispersion struct, instead of explicitly constructing an encoding (i.e. GaussianCDFEncoding), we just use c::Int, which dictates how many bins to divide the range of the CDF into. However, we add a field encoding, which specifies a type of encoding, and can be any encoding type that accepts the keyword c. The default is GaussianCDFEncoding. It is now trivial to replace GaussianCDFEncoding with WhateverCDFEncoding, by just doing e.g. Dispersion(; encoding = WhateverCDFEncoding, c = 5). Internally, this will instantiate a WhateverCDFEncoding(; c = 5) in the relevant location in the code. However, the encoding field is not yet part of the public API. I don't think it should be public yet either, because I have an idea to make this even more generic.
Empirical mean and standard deviation computations for Dispersion are delegated to symbolize_for_dispersion, which also instantiates the encoding.
I let the linear mapping from the (0, 1) range of the CDF to the integers 1, 2, ..., c to just use an equidistant binning. This is essentially a FixedRectangularBinning(0, 1, c) (but without explicitly constructing a `RectangularBinEncoding, since it would have to be constructed once per encoding, which is expensive). This differs slightly from the original paper, but makes much more sense. This choice is now well documented, and makes much more intuitive sense than the previous mapping.

The alternative to this entire approach would be to extend the encoding API to be defined for vector-valued inputs, which I don't want to do, because it complicates things unnecessarily. I think the approach here is more elegant anyways, because it is now trivial to user some other encoding by just specifying the encoding keyword to Dispersion (again, this is not in the public API atm).

Other stuff:

Fixed some more NaiveKernel tests.
[x] Test pass.
[x] Documentation generation is successful.

@datseris I will merge this immediately when CI is done, so I can finish up #126 (which depends on this stuff). If you have any comments, just leave them here (or open issues if necessary), and I will address them while working on #126.

Datseris commented 1 year ago

it's okay, I have planned that we have to do a full review of the entire code base before publish anyways. I just haven't told you about this plan yet xD xD xD

kahaaga commented 1 year ago

I figured as much 🤖

JuliaDynamics / ComplexityMeasures.jl

Adjust `GaussianCDFEncoding` and related stuff for new encoding API #199