JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems
MIT License

Generalize `GaussianCDFEncoding` to arbitrary CDF encoding #218

Open Datseris opened 1 year ago

Datseris commented 1 year ago

This issue proposes some generality improvements to the current GaussianCDFEncoding and Dispersion. In general, any CDF could be used in the source code of the encoding; one could store the CDF function in the encoding struct. E.g., given some timeseries x, generate the function:

using Statistics: mean, std
m, s = mean(x), std(x)
f = x -> gaussian(x, m, s) # `gaussian` is the existing function used by GaussianCDFEncoding

Any other univariate function could be generated instead of f. This function is then stored as a field in a new struct CDFEncoding, which uses the exact source code of GaussianCDFEncoding but with f in place of the existing gaussian function.
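To make the idea concrete, here is a minimal sketch of an encoding that stores an arbitrary univariate CDF as a field. All names (`SketchCDFEncoding`, `encode_sketch`) are hypothetical and this is not the ComplexityMeasures.jl source. It also assumes the stored function is already a CDF with a closed form (a logistic CDF is used purely for convenience), so no numerical integration is needed.

```julia
using Statistics: mean, std

# Hypothetical names throughout; not the ComplexityMeasures.jl implementation.
struct SketchCDFEncoding{F}
    cdf::F   # univariate CDF, mapping a real number to [0, 1]
    c::Int   # number of categories (symbols)
end

# Map a value to one of the integers 1, ..., c by splitting [0, 1] into c equal bins.
function encode_sketch(e::SketchCDFEncoding, x::Real)
    y = e.cdf(x)
    return clamp(ceil(Int, y * e.c), 1, e.c)
end

# Example: a logistic CDF, chosen only because it has a closed form.
ts = [1.0, 2.0, 3.0, 4.0, 5.0]
m, s = mean(ts), std(ts)
logistic_cdf = t -> 1 / (1 + exp(-(t - m) / s))

enc = SketchCDFEncoding(logistic_cdf, 4)
symbols = [encode_sketch(enc, t) for t in ts]
```

Parameterizing the struct on the function type `F`, rather than using a `Function`-typed field, keeps the field concretely typed; that is a design choice of this sketch, not something decided in the issue.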

Then, this is easily propagated into Dispersion: that type should initialize a CDFEncoding and store the encoding directly as its field. If given only a timeseries, it defaults to computing the mean and std and initializing the Gaussian encoding.

kahaaga commented 1 year ago

The encoding could in principle be generalized to multidimensional CDFs as well, although I'm not sure something like that exists in the literature yet in the context of these "entropy-like" quantities. The function f is just the input to quadgk (which only handles univariate functions at the moment).

A completely generic version of CDFEncoding could be something like

Base.@kwdef struct CDFEncoding{T} <: Encoding
    precomputed_stuff::NamedTuple # e.g. (μ = mean(x), σ = std(x))
    # Gaussian kernel by default, or something else for another CDF
    f::Function = xᵢ -> exp(-(xᵢ - precomputed_stuff.μ)^2 / (2 * precomputed_stuff.σ^2))
    lb::T # lower integration bound
    up::T # upper integration bound
    integrator::Function = quadgk
end

Or something along those lines, depending on the call signature of quadgk or whatever other integrator one would use for multidimensional input.

EDIT:

Alternatively, one could drop the integrator field from CDFEncoding and instead have CDFEncoding{D} <: Encoding, where D is the dimension of the data. Then one could dispatch separately for 1D data (using quadgk for integration) and ≥2D data (using some other integrator).
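A rough sketch of what dispatching on the dimension parameter could look like. The names `SketchEncoding`, `DimCDFEncoding` and `integration_strategy` are hypothetical, and the methods only return a symbol naming the integrator family instead of performing any integration.

```julia
# Hypothetical stand-in for the package's abstract Encoding type.
abstract type SketchEncoding end

# D is the dimension of the data, F the type of the stored function.
struct DimCDFEncoding{D, F} <: SketchEncoding
    f::F
end

# Convenience constructor so that only D has to be given explicitly.
DimCDFEncoding{D}(f::F) where {D, F} = DimCDFEncoding{D, F}(f)

# 1D data: a univariate integrator such as quadgk would be used here.
integration_strategy(::DimCDFEncoding{1}) = :univariate

# Any other dimension: some multidimensional integrator would be needed.
integration_strategy(::DimCDFEncoding{D}) where {D} = :multivariate

enc1 = DimCDFEncoding{1}(x -> exp(-x^2 / 2))
enc3 = DimCDFEncoding{3}(v -> exp(-sum(abs2, v) / 2))
```

The more specific `DimCDFEncoding{1}` method takes precedence for 1D encodings, so the generic method acts as the fallback for higher dimensions.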

Datseris commented 1 year ago

You don't need to have precomputed_stuff. Simply make the function f = x -> exp(-(x - μ)^2 / (2σ^2)) by calculating or using μ, σ. The closure already stores the numbers. But also, I'm not sure what the use of hyper-generalizing is here: higher dimensions and different integrator functions don't really fit the need of the struct. Univariate cumulative distribution functions still make sense in context though. There is also no need for the integration bounds: integrating from -Inf to x makes sense because that is, by definition, what gives you the probability from a CDF.
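The closure argument can be illustrated with a small sketch. To stay dependency-free, it swaps quadgk for a plain composite trapezoid rule and truncates the lower limit -Inf to μ - 10σ; both are assumptions of this example, and the names `trapezoid` and `gaussian_cdf` are hypothetical.

```julia
using Statistics: mean, std

ts = [1.0, 2.0, 3.0, 4.0, 5.0]
μ, σ = mean(ts), std(ts)

# The closure captures μ and σ, so no separate "precomputed" field is needed.
f = t -> exp(-(t - μ)^2 / (2σ^2))

# Stand-in for an adaptive integrator such as quadgk: composite trapezoid rule on [lo, hi].
function trapezoid(g, lo, hi; n = 10_000)
    h = (hi - lo) / n
    acc = (g(lo) + g(hi)) / 2
    for i in 1:(n - 1)
        acc += g(lo + i * h)
    end
    return acc * h
end

# Normalized Gaussian CDF; the lower limit -Inf is truncated to μ - 10σ (assumption).
gaussian_cdf(t) = trapezoid(f, μ - 10σ, t) / (σ * sqrt(2π))
```

With an integrator that accepts infinite limits, the truncation would not be needed and the lower bound could be passed as -Inf directly.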