JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems

Statistical complexity #260

Closed ikottlarz closed 1 year ago

ikottlarz commented 1 year ago

Hi there,

I just added the (generalized version of the) statistical complexity. There are two open questions (quoted in the replies below):

kahaaga commented 1 year ago

In the current API, complexity returns only one value. The statistical complexity is directly dependent on a corresponding information measure and therefore only used in combination with it. Currently, I'm returning only the statistical complexity although I calculate the information measure as well. Is this the way to go?

The API doesn't specify what the return value is (see the documentation string for ComplexityMeasures.complexity). It simply states that the complexity measure is returned. If the "statistical complexity" is a quantity that consists of a tuple of two separate quantities (complexity, measure), then the entirety of that tuple is the complexity measure. That's my take, at least. What do you think, @Datseris?

The algorithm for the "minimum-/ maximum complexity curves" assumes that we have some ProbabilitiesEstimator where the total_outcomes are known a priori, without a specific time series being needed. Since we allow for generic ProbabilitiesEstimators here, I'm currently throwing an ArgumentError if the ProbabilitiesEstimator is not the "classical" SymbolicPermutation (or derived estimator). Should this be extended to general estimators before the merge (I currently don't have a clear idea how)?

Judging from the code, it seems to me that the statistical complexity is simply a function of a probability distribution. If so, it should work (without any extra effort from your side when coding) with any ProbabilitiesEstimator.

If my understanding is correct, you don't actually need to implement anything extra for the algorithm to be completely generic. Just remove the ArgumentError, and you'll be good to go. The only potential issue is if entropy_normalized isn't defined for the particular ProbabilitiesEstimator, but then an informative error message will be thrown when the line involving entropy_normalized is called. Likewise, the algorithm may fail if the ProbabilitiesEstimator doesn't have an implementation for total_outcomes, but then again: an informative error message will be thrown.

I may be wrong, though. Is there any indication in Rosso et al. (2007) that specifies that the distribution must have been obtained using permutation-based symbolization? Or does any symbolization work? (e.g. discretization using binning is also a form of symbolization)
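
For concreteness, a rough sketch of what generic usage could look like (the StatisticalComplexity constructor signature here is hypothetical, since the PR's API is still in flux):

using ComplexityMeasures

x = randn(10_000)

# permutation-based estimator: outcome space has factorial(m) elements, known a priori
c_perm = StatisticalComplexity(SymbolicPermutation(m = 3))
complexity(c_perm, x)

# binning-based estimator: outcome space is the fixed set of bins, also known a priori
c_bin = StatisticalComplexity(ValueHistogram(RectangularBinning(10)))
complexity(c_bin, x)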

ikottlarz commented 1 year ago

The API doesn't specify what the return value is (see the documentation string for ComplexityMeasures.complexity). It simply states that the complexity measure is returned. If the "statistical complexity" is a quantity that consists of a tuple of two separate quantities (complexity, measure), then the entirety of that tuple is the complexity measure. That's my take, at least. What do you think, @Datseris?

I changed it now so that the StatisticalComplexity type holds the entropy value as a Ref, so it can be accessed this way. For convenience, it might make sense, though, to add an entropy_complexity function that calls complexity and then returns both values as a Tuple?
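
Schematically, the type now looks something like this (a simplified sketch; the actual field names in the PR may differ):

using ComplexityMeasures

# simplified sketch, not the PR's exact definition
struct StatisticalComplexity{P <: ProbabilitiesEstimator}
    est::P                             # underlying probabilities estimator
    entr_val::Base.RefValue{Float64}   # mutable slot, written during `complexity`
end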

I may be wrong, though. Is there any indication in Rosso et al. (2007) that specifies that the distribution must have been obtained using permutation-based symbolization? Or does any symbolization work? (e.g. discretization using binning is also a form of symbolization)

They don't actually specify this, and in fact suggest other means of obtaining PDFs in the 2013 paper. But I haven't yet figured out how to implement this for distributions that don't have a fixed number of bins. The equations in Martin et al. (2006) (as well as the algorithm in statcomp, which is just an implementation of these equations) assume that this is the case, as far as I can tell.
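
For reference, the quantity itself is short to write down. A rough sketch (the helper name is made up, and this is not the PR's implementation) that shows where the fixed number of outcomes N enters, both in the entropy normalization and in the normalization of the disequilibrium:

# statistical complexity C = Q_J * H_S of Rosso et al. (2007), for a
# probability vector `p` over N a-priori known outcomes
function statistical_complexity_sketch(p::AbstractVector{<:Real})
    N = length(p)                        # total_outcomes: must be known a priori
    pe = fill(1 / N, N)                  # uniform ("equilibrium") distribution
    S(q) = -sum(x -> x > 0 ? x * log(x) : 0.0, q)   # Shannon entropy
    Hs = S(p) / log(N)                   # normalized entropy
    js = S((p .+ pe) ./ 2) - S(p) / 2 - S(pe) / 2   # Jensen-Shannon divergence
    # maximum of `js` against the uniform distribution (attained for a delta
    # distribution); this normalization constant depends only on N
    jsmax = -0.5 * ((N + 1) / N * log(N + 1) - 2 * log(2 * N) + log(N))
    return (js / jsmax) * Hs             # disequilibrium Q_J times entropy H_S
end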

codecov[bot] commented 1 year ago

Codecov Report

Merging #260 (b8dbb5b) into main (3298d33) will increase coverage by 0.93%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #260      +/-   ##
==========================================
+ Coverage   81.14%   82.08%   +0.93%     
==========================================
  Files          49       50       +1     
  Lines        1241     1306      +65     
==========================================
+ Hits         1007     1072      +65     
  Misses        234      234              
Impacted Files                                       Coverage            Δ
src/complexity_measures/statistical_complexity.jl   100.00% <100.00%>   (ø)
src/probabilities.jl                                 89.28% <100.00%>    (ø)


kahaaga commented 1 year ago

But I haven't yet figured out how to implement this for distributions that don't have a fixed number of bins.

I read the Martin et al. paper, and as far as I can see, they just refer to generic probability distributions with a fixed number of elements, so we just do the same here. Implementing the algorithm for a non-fixed number of bins is far beyond the scope of this PR, and would probably be a good research paper in itself (if it's even possible to do).

You don't actually need to implement anything here.

If the ProbabilitiesEstimator has an implementation for total_outcomes, it means that the number of bins is known ahead of time. So your implementation will work as long as this is the case. If the number of outcomes is not known ahead of time, then an error message about that will be thrown.

You can just add a note in the docs saying that any instance of a ProbabilitiesEstimator will work, as long as the size of outcome space is deducible from the estimator.
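
For example, something along these lines works for estimators whose outcome space is deducible without data (the exact total_outcomes signature may differ, but the point is that no time series is needed):

using ComplexityMeasures

total_outcomes(SymbolicPermutation(m = 3))  # factorial(3) = 6 ordinal patterns, known a priori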

kahaaga commented 1 year ago

I changed it now so that the StatisticalComplexity type holds the entropy value as a Ref, so it can be accessed this way. For convenience, it might make sense, though, to add an entropy_complexity function that calls complexity and then returns both values as a Tuple?

I like that. Then the wrapper is just something like

function entropy_complexity(c::StatisticalComplexity, x)
    r = complexity(c, x)
    return (c.entropyval, r) # or whatever the field is called
end
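
Hypothetical usage (the estimator is just for illustration):

c = StatisticalComplexity(SymbolicPermutation(m = 3))
ent, compl = entropy_complexity(c, randn(1000))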

It should be noted in the docstring of StatisticalComplexity how the entropy value can be accessed, and it should refer to the wrapper too. Do you agree that it is appropriate to introduce a wrapper here, @Datseris? Or should we just return both the entropy and the complexity as a tuple directly?

kahaaga commented 1 year ago

When I think about it... I kind of lean towards just returning both directly, since the statistical complexity is intrinsically linked to both values. That results in cleaner code and fewer convenience functions that are not really convenient, but just introduce another way of doing exactly the same thing.

kahaaga commented 1 year ago

It would also be nice to have a documentation example for StatisticalComplexity, for example reproducing one of the examples from the Martin et al. paper. But it can also be contributed later. If you don't include an example here, @ikottlarz, could you please open an issue for it? That would be a nice beginner issue.

Datseris commented 1 year ago

I've created a simple script of how I envisioned using this functionality to scatter-plot paintings onto a complexity-entropy plane (which is exactly what someone else did in a talk I just saw):

using ComplexityMeasures # our software for analyzing complexity
using Images # package for loading and processing images from files
using CairoMakie # any Makie backend works; needed for Figure, Axis, scatter!, lines!

# paths to three paintings:
paintings = [desktop("painting_$(i).jpg") for i in 1:3]

# load images into matrices of pixels
images = [load(p) for p in paintings]

# convert to grayscale
gray_images = [Gray.(image) for image in images]

# start a figure to plot the complexity entropy plane:
fig = Figure()
ax = Axis(fig[1,1]; xlabel = "entropy", ylabel = "complexity")

# define how to extract entropy and complexity for these images
# First, define how to extract entropy from image. Here, 2-by-2 symbolic permutation
stencil = ((2,2), (1,1))
E = SpatialSymbolicPermutation(
    stencil, gray_images[1]; periodic = false
)
# Then, decide the complexity measure. Here, as defined in ...
C = StatisticalComplexity(E)

# Loop over the paintings and plot them:
for gimg in gray_images
    compl, ent = complexity_entropy(C, gimg)
    scatter!(ax, ent, compl)
end

# add the curves as well for guiding the eye
lower, upper = complexity_entropy_curves(C)
lines!(ax, lower; color = "gray")
lines!(ax, upper; color = "gray")

Datseris commented 1 year ago

Based on the above, our complexity value API should remain as is: it returns a real number. We should actually specify this explicitly in complexity. A wrapper function, as @kahaaga said, returns both values. Note that you need to do x.entr_val[] with [] because of the Ref.
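
For reference, since the Ref bit trips people up: a Ref is just a mutable container for a single value, and [] reads or writes it.

r = Ref(0.0)  # mutable container holding a single Float64
r[] = 0.42    # write through the Ref
r[]           # read the stored value back: 0.42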

The functions that return the curves return two Vector{SVector}, so that they can be plotted with Makie directly.
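
For example (with made-up curve data standing in for the real minimum/maximum complexity curves):

using CairoMakie, StaticArrays

# made-up stand-in for one curve: (entropy, complexity) pairs
hs = range(0, 1; length = 100)
curve = [SVector(h, h * (1 - h)) for h in hs]  # Vector{SVector{2, Float64}}

fig = Figure()
ax = Axis(fig[1, 1]; xlabel = "entropy", ylabel = "complexity")
lines!(ax, curve; color = :gray)  # Makie accepts vectors of 2-element SVectors directly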

kahaaga commented 1 year ago

I've created a simple script of how I envisioned using this functionality to scatter-plot paintings onto a complexity-entropy plane [...]

For the documentation, we already load a package for sample images. It would be nice to showcase this example, just using a few of those example images instead.

kahaaga commented 1 year ago

Based on the above, our complexity value API should remain as is: it returns a real number. We should actually specify this explicitly in complexity.

Yes, the docstring doesn't specify that complexity returns a single value at the moment, so this should be fixed.

A wrapper function, as @kahaaga said, returns both values. Note that you need to do x.entr_val[]

Ok, let's go for the convenience wrapper function then.

Datseris commented 1 year ago

For the documentation, we already load a package for sample images. It would be nice to showcase this example, just using a few of those example images instead.

Okay, I'll contribute this. I know the author of the work that did this for paintings, and I'll cite it in the example.

kahaaga commented 1 year ago

This is good to go from my side, @ikottlarz, once the two remaining points are addressed.

Okay, I'll contribute this. I know the author of the work that did this for paintings, and I'll cite it in the example.

Excellent! Thanks, @Datseris. You can either contribute it directly to this branch, or do it in a separate PR.

ikottlarz commented 1 year ago

@Datseris I think this is ready to merge; I addressed the two points @kahaaga raised. I'll contribute an example from the Rosso paper in a separate PR later.