Closed · ikottlarz closed this 1 year ago
> In the current API, `complexity` returns only one value. The statistical complexity is directly dependent on a corresponding information measure and therefore only used in combination with it. Currently, I'm returning only the statistical complexity although I calculate the information measure as well. Is this the way to go?
The API doesn't specify what the return value is (see the documentation string for `ComplexityMeasures.complexity`). It simply states that the complexity measure is returned. If the "statistical complexity" is a quantity that consists of a tuple of two separate quantities (`complexity`, `measure`), then the entirety of that tuple is the complexity measure. That's my take, at least. What do you think, @Datseris?
> The algorithm for the "minimum/maximum complexity curves" assumes that we have some `ProbabilitiesEstimator` where the `total_outcomes` are known a priori, without a specific time series being needed. Since we allow for generic `ProbabilitiesEstimator`s here, I'm currently throwing an `ArgumentError` if the `ProbabilitiesEstimator` is not the "classical" `SymbolicPermutation` (or derived estimator). Should this be extended to general estimators before the merge (I currently don't have a clear idea how)?
Judging from the code, it seems to me that the statistical complexity is simply a function of a probability distribution. If so, it should work (without any extra effort from your side when coding) with any `ProbabilitiesEstimator`.

If my understanding is correct, you don't actually need to implement anything extra for the algorithm to be completely generic. Just remove the `ArgumentError`, and you'll be good to go. The only potential issue is if `entropy_normalized` isn't defined for the particular `ProbabilitiesEstimator`, but then an informative error message will be thrown when the line involving `entropy_normalized` is called. Likewise, the algorithm may fail if the `ProbabilitiesEstimator` doesn't have an implementation for `total_outcomes`, but then again: an informative error message will be thrown.
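To sketch why no explicit `ArgumentError` is needed: with multiple dispatch, a computation that only relies on `total_outcomes` is automatically generic over estimators, and estimators lacking that method fail with an informative `MethodError` on their own. All names below are illustrative mock-ups, not the actual ComplexityMeasures.jl API.

```julia
# Mock estimator hierarchy standing in for ComplexityMeasures.jl types.
abstract type MockProbabilitiesEstimator end

struct MockPermutation <: MockProbabilitiesEstimator
    m::Int  # pattern length; the outcome space has factorial(m) permutations
end

# Estimators whose outcome space is known a priori implement this method.
total_outcomes(est::MockPermutation) = factorial(est.m)

# Generic over any estimator that defines total_outcomes; no type check needed.
function uniform_reference(est::MockProbabilitiesEstimator)
    N = total_outcomes(est)
    return fill(1 / N, N)  # the uniform reference distribution of size N
end
```

Calling `uniform_reference` with an estimator that has no `total_outcomes` method throws a `MethodError` naming the missing method, which is exactly the informative failure mode described above.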
I may be wrong, though. Is there any indication in Rosso et al. (2007) that specifies that the distribution must have been obtained using permutation-based symbolization? Or does any symbolization work? (e.g. discretization using binning is also a form of symbolization)
> The API doesn't specify what the return value is (see the documentation string for `ComplexityMeasures.complexity`). It simply states that the complexity measure is returned. If the "statistical complexity" is a quantity that consists of a tuple of two separate quantities (`complexity`, `measure`), then the entirety of that tuple is the complexity measure. That's my take, at least. What do you think, @Datseris?
I changed it now so that the `StatisticalComplexity` type holds the entropy value as a `Ref` value, so it can be accessed this way. For convenience, it might make sense, though, to add an `entropy_complexity` function that calls `complexity` and then returns both values as a `Tuple`?
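A minimal sketch of the `Ref` pattern being described here; the type, field, and function names are illustrative stand-ins, not the actual PR code:

```julia
# The complexity type stores the most recently computed entropy in a Ref,
# so it can be read back after the complexity computation without the
# function having to return two values.
struct MockStatisticalComplexity
    entr_val::Base.RefValue{Float64}
end
MockStatisticalComplexity() = MockStatisticalComplexity(Ref(0.0))

# Toy computation: writes the entropy through the Ref (this mutates the
# container, not the immutable struct) and returns a stand-in complexity.
function mock_complexity(c::MockStatisticalComplexity, h::Real)
    c.entr_val[] = h
    return h * (1 - h)  # placeholder, not the real statistical complexity
end

# The proposed convenience wrapper would then just bundle both values:
function mock_entropy_complexity(c::MockStatisticalComplexity, h::Real)
    compl = mock_complexity(c, h)
    return (c.entr_val[], compl)
end
```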
> I may be wrong, though. Is there any indication in Rosso et al. (2007) that specifies that the distribution must have been obtained using permutation-based symbolization? Or does any symbolization work? (e.g. discretization using binning is also a form of symbolization)
They don't actually specify this, and in fact suggest other means of obtaining PDFs in the 2013 paper. But I haven't yet figured out how to implement this for distributions that don't have a fixed number of bins. The equations in Martin et al. (2006) (as well as the algorithm in `statcomp`, which is just an implementation of these equations) assume that this is the case, as far as I can tell.
Merging #260 (b8dbb5b) into main (3298d33) will increase coverage by 0.93%. The diff coverage is 100.00%.
```diff
@@            Coverage Diff             @@
##             main     #260      +/-   ##
==========================================
+ Coverage   81.14%   82.08%   +0.93%
==========================================
  Files          49       50       +1
  Lines        1241     1306      +65
==========================================
+ Hits         1007     1072      +65
  Misses        234      234
```

| Impacted Files | Coverage Δ |
|---|---|
| src/complexity_measures/statistical_complexity.jl | 100.00% <100.00%> (ø) |
| src/probabilities.jl | 89.28% <100.00%> (ø) |
> But I haven't yet figured out how to implement this for distributions that don't have a fixed number of bins.
I read the Martin et al. paper, and as far as I can see, they just refer to generic probability distributions with a fixed number of elements, so we just do the same here. Implementing the algorithm for a non-fixed number of bins is far beyond the scope of this PR, and would probably be a good research paper in itself (if it's even possible to do?).
You don't actually need to implement anything here. If the `ProbabilitiesEstimator` has an implementation for `total_outcomes`, it means that the number of bins is known ahead of time, so your implementation will work as long as this is the case. If the number of outcomes is not known ahead of time, then an error message about that will be thrown.

You can just add a note in the docs saying that any instance of a `ProbabilitiesEstimator` will work, as long as the size of the outcome space is deducible from the estimator.
> I changed it now so that the `StatisticalComplexity` type holds the entropy value as a `Ref` value so it can be accessed this way. For convenience, it might make sense though to add an `entropy_complexity` function that calls `complexity` and then returns both values as a `Tuple`?
I like that. Then the wrapper is just something like

```julia
function entropy_complexity(c::StatisticalComplexity, x)
    r = complexity(c, x)
    # `entropyval` is a placeholder for whatever the field is called;
    # the [] dereferences the Ref
    return (c.entropyval[], r)
end
```
It should be noted in the docstring of `StatisticalComplexity` how the entropy value can be accessed, and it should refer to the wrapper too. Do you agree that it is appropriate to introduce a wrapper here, @Datseris? Or should we just return both the entropy and the complexity as a tuple directly?

When I think about it... I kind of lean towards just returning both directly, since the statistical complexity is intrinsically linked to both values. That results in cleaner code and fewer convenience functions that are not really convenient, but just introduce another way of doing exactly the same thing.
It would also be nice to have a documentation example for `StatisticalComplexity`, for example reproducing one of the examples from the Martin et al. paper. But it can also be contributed later. If you don't include an example here, @ikottlarz, could you please open an issue for it? That would be a nice beginner issue.
I've created a simple script of how I envisioned using this functionality to scatter plot paintings onto a complexity-entropy plane (which is exactly what someone else did in a talk I just saw):
```julia
using ComplexityMeasures # our software for analyzing complexity
using Images             # package for loading and processing images from files
using CairoMakie         # plotting backend providing Figure, Axis, scatter!, lines!

# paths to three paintings (`desktop` is a placeholder path helper):
paintings = [desktop("painting_$(i).jpg") for i in 1:3]
# load images into matrices of pixels
images = [load(p) for p in paintings]
# convert to grayscale
gray_images = [Gray.(image) for image in images]

# start a figure to plot the complexity-entropy plane:
fig = Figure()
ax = Axis(fig[1, 1]; xlabel = "entropy", ylabel = "complexity")

# define how to extract entropy and complexity for these images.
# First, define how to extract entropy from an image: here, 2-by-2 symbolic permutation
stencil = ((2, 2), (1, 1))
E = SpatialSymbolicPermutation(stencil, gray_images[1]; periodic = false)
# Then, decide the complexity measure. Here, as defined in ...
C = StatisticalComplexity(E)

# Loop over the paintings and plot them:
for gimg in gray_images
    compl, ent = complexity_entropy(C, gimg)
    scatter!(ax, ent, compl)
end

# add the curves as well, for guiding the eye
lower, upper = complexity_entropy_curves(C)
lines!(ax, lower; color = "gray")
lines!(ax, upper; color = "gray")
```
Based on the above, our `complexity` value API should remain as is. It returns a real number. We should specify this explicitly in `complexity`, actually. A wrapper function, as @kahaaga said, returns both values. Note that you need to do `x.entr_val[]` with `[]` because of the `Ref`.
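The `[]` dereferencing is plain base Julia, nothing package-specific; a quick illustration:

```julia
# A Ref wraps a value in a small mutable container.
r = Ref(0.7)

# Reading and writing both go through []:
x = r[]      # read the wrapped value
r[] = 0.9    # overwrite it in place
```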
The functions that return the curves return two `Vector{SVector}` so that they can be plotted with Makie directly.
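A sketch of that return shape, using 2-tuples instead of `SVector` to stay dependency-free (Makie's `lines!` accepts vectors of point-like 2-element objects either way); the curve formula is a toy placeholder, not the actual minimum/maximum complexity curve:

```julia
# Each curve comes back as one vector of (entropy, complexity) points,
# so `lines!(ax, curve)` works directly, with no splitting into x/y vectors.
function mock_complexity_curve(npoints::Int = 50)
    hs = range(0, 1; length = npoints)
    return [(h, h * (1 - h)) for h in hs]  # toy parabola as a stand-in curve
end
```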
> I've created a simple script of how I envisioned using this functionality to scatter plot paintings onto a complexity-entropy plane [script above]
For the documentation, we already load a package for sample images. It would be nice to showcase this example, just using a few of those example images instead.
> Based on the above, our `complexity` value API should remain as is. It returns a real number. We should specify this explicitly in `complexity`, actually.
Yes, the docstring doesn't specify that `complexity` returns a single value at the moment, so this should be fixed.
> A wrapper function as @kahaaga said returns both values. Note that you need to do `x.entr_val[]`
Ok, let's go for the convenience wrapper function then.
> For the documentation, we already load a package for sample images. It would be nice to showcase this example, just using a few of those example images instead.
Okay, I'll contribute this. I know the author of the work that did this for paintings, and I'll cite it in the example.
This is good to go from my side @ikottlarz, when:
> Okay, I'll contribute this. I know the author of the work that did this for paintings, and I'll cite it in the example.
Excellent! Thanks, @Datseris. You can either contribute it directly to this branch, or do it in a separate PR.
@Datseris I think this is ready to merge, I adjusted the two points @kahaaga had left. I'll contribute an example from the Rosso paper in a separate PR later.
Hi there,
I just added the (generalized version) of the statistical complexity. There are two open questions:
1. In the current API, `complexity` returns only one value. The statistical complexity is directly dependent on a corresponding information measure and therefore only used in combination with it. Currently, I'm returning only the statistical complexity although I calculate the information measure as well. Is this the way to go?
2. The algorithm for the "minimum/maximum complexity curves" assumes that we have some `ProbabilitiesEstimator` where the `total_outcomes` are known a priori, without a specific time series being needed. Since we allow for generic `ProbabilitiesEstimator`s here, I'm currently throwing an `ArgumentError` if the `ProbabilitiesEstimator` is not the "classical" `SymbolicPermutation` (or derived estimator). Should this be extended to general estimators before the merge (I currently don't have a clear idea how)?