Closed Datseris closed 1 year ago
I think the docs won't build until I merge. Locally everything is fine. @kahaaga what is our current estimate of "X", the number of quantities this package can compute?

> what is our current estimate of "X", the number of quantities this package can compute?
Let's estimate a lower bound:
```julia
# Recursively count all concrete subtypes of `T`.
# A variant of https://discourse.julialang.org/t/counting-all-subtypes/38797/6
using InteractiveUtils: subtypes

function countsubtypes(T::Type; recursive = true)
    s = subtypes(T)
    abstract_types = filter(isabstracttype, s)
    concrete_types = filter(x -> x ∉ abstract_types, s)
    n = length(concrete_types)
    if !recursive || isempty(abstract_types)
        return n
    end
    # Recurse only into the abstract subtypes; concrete types have no subtypes.
    return n + sum(countsubtypes(A; recursive = true) for A in abstract_types)
end

function measurecount_lowerbound()
    # Discrete quantities
    n_multiscale = countsubtypes(ComplexityMeasures.MultiScaleAlgorithm)
    n_complexity_measures = countsubtypes(ComplexityMeasure) * n_multiscale
    n_probests = countsubtypes(ProbabilitiesEstimator)
    n_entropy_types = countsubtypes(EntropyDefinition)
    n_entropies = n_entropy_types * n_probests * n_multiscale
    # Differential entropies (these only work for Shannon for now, so no multiplying)
    n_diffentests = countsubtypes(DifferentialEntropyEstimator)
    return n_entropies, n_diffentests, n_probests
end
```
```julia
julia> measurecount_lowerbound() |> sum
178
```
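To illustrate the recursion on something self-contained, here is a small sketch on a made-up type hierarchy (the `Measure`/`EntropyLike` names below are invented for the example, not part of the package):

```julia
# Made-up toy hierarchy, only to illustrate how the recursion counts concrete leaves.
using InteractiveUtils: subtypes

abstract type Measure end
abstract type EntropyLike <: Measure end   # abstract: not counted itself
struct ToyShannon <: EntropyLike end
struct ToyRenyi <: EntropyLike end
struct ToyLempelZiv <: Measure end

function countsubtypes(T::Type; recursive = true)
    s = subtypes(T)
    abstract_types = filter(isabstracttype, s)
    n = length(s) - length(abstract_types)  # concrete subtypes at this level
    (!recursive || isempty(abstract_types)) && return n
    return n + sum(countsubtypes(A) for A in abstract_types)
end

countsubtypes(Measure)  # ToyLempelZiv + (ToyShannon, ToyRenyi) = 3
```

With `recursive = false` only the top level is inspected, so the same call would return 1 (only `ToyLempelZiv` is concrete there).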
So currently 178 unique measures, when including each multiscale variant as its own "thing". EDIT: I have at least two more probabilities estimators as part of CausalityTools atm, so I think this number will be >200 once everything is ported here.
The real number is higher, because methods appear under different names and might be considered "unique" concepts based on types of input data (univariate/multivariate). Many of our estimators, through keyword arguments, can also be tweaked to behave in certain ways that might count as estimators of their own, if judging from scientific papers.
EDIT: we can probably present this information on the front page in a figure or something. It's possible to hide code blocks with Documenter, so we can just do the above calculation and show a figure summarizing the numbers.
> The git diff is massive but the changes are pretty minimal:
I created a separate PR (https://github.com/JuliaDynamics/ComplexityMeasures.jl/pull/235) for my review, so it is easier to see the changes I propose. Overall this looks good, but I did some minor changes and fixed some minor stuff:
- `ComplexityMeasure` suddenly appeared mid-sentence when talking about entropies.
- The `Project.toml` file for the documentation had the old UUID. I have fixed this in my PR.

> I think the docs won't build until I merge.

It should build now, once we register the package, I think. It works locally for me too.
Take-away: This is good to go when #235 is merged.
I don't think it's appropriate to multiply the number of multiscale algorithms with all measures and say we have more, because one first has to show that this would give a meaningful and useful result. Like, the multiscale of the `ValueHistogram` is clearly bad: coarse sampling the data strictly reduces the accuracy of the histogram and hence its entropy. Also, downscaling in this case means subsampling. Estimators that do not care about the temporal order of the data also don't appreciate downscaling. The same is true for symbolic permutation of a dataset (not a delay embedding).
> `n_probests = countsubtypes(ProbabilitiesEstimator)`

is counting too few, because some estimators (e.g., the symbolic ones) are subtypes of an intermediate supertype that is itself a subtype of `ProbabilitiesEstimator`.
Finally, multiscale isn't part of the v2 release, and I care about this number to quote it in the update message, so let's get the conservative estimate for now.
I think the docs won't build because ChaosTools relies on the old Entropies. I'll push a commit now that makes it rely on the ComplexityMeasures.
> I don't think it's appropriate to multiply the number of multiscale algorithms with all measures and say we have more, because one first has to show that this would give a meaningful and useful result.
We don't make any claims on the validity or usefulness of the results for arbitrary combinations of measures. What we do provide is the possibility of arbitrary combinations, many of which have yet to be explored.
We could never explore the usefulness of all the possible combinations that would exist here. It would simply be too large an effort. The only thing we promise is a modular design where you (in the future, when we resolve #223) can plug any complexity measure in with any multiscale sampling approach (and potentially other sampling schemes, if we decide to implement that).
This review on multiscale measures shows some of the possibilities that have been explored. There exist many variants of the presented multiscale schemes. Which of them are "useful" depends entirely on the data you have, the questions you ask, and the acceptable robustness of the results. It is not up to us to answer the question of usefulness a priori, because we can't.
> Like, the multiscale of the `ValueHistogram` is clearly bad: coarse sampling the data strictly reduces the accuracy of the histogram and hence its entropy.
"Bad" may not be the most appropriate word here. It is context and data dependent. If my time series are from process for which I've sampled 10^7 data points, and my intention is to compute some sort of multiscale entropy, then by virtue of just having a lot of data points, the histogram approach is perfectly fine to approximate the multiscale entropy of this process, provided I terminate the procedure at an appropriate scale level. If I only had 100 points, however, one could argue it would be a "bad" idea to apply the multiscale algorithm with a histogram approach beyond the second scale (or whatever scale).
I'd be perfectly fine applying any coarse graining procedure with any multiscale algorithm if I had a year of per-second-resolution data. I probably wouldn't be doing so if I had per-week data. This is a judgement that is entirely up to the user.
> Also, downscaling in this case means subsampling. Estimators that do not care about the temporal order of the data also don't appreciate downscaling. The same is true for symbolic permutation of a dataset (not a delay embedding).
I'm not sure what you mean by "appreciate downscaling".
The downsampling methods are not simply picking a subset of values from the original time series. They apply some function to subsets of the original data to create the coarse-grained data. This function can be completely arbitrary, and may alter amplitudes and ordinal patterns of the coarse-grained time series. `SymbolicPermutation` and friends would therefore be sensitive to the downsampling.

Common choices for this function are `mean` and `std`, but you could easily define arbitrary functions that transform the data in arbitrary ways, preserving (or not preserving) any given property of the data, which may or may not affect the result for any probabilities/complexity estimator.
Using `SymbolicPermutation` (or one of the other symbolic estimators) on plainly subsampled data would just be a slightly fancier way of doing `SymbolicPermutation` with an embedding lag greater than one. Nothing prevents you from doing so, but whether it is a good idea is context dependent.
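To make the kind of coarse graining discussed above concrete, here is a hedged sketch; the `coarsegrain` name and signature are made up for this illustration and are not the package's multiscale API:

```julia
# Hypothetical sketch of scale-`s` coarse graining: apply a summary function `f`
# to consecutive non-overlapping windows of length `s`. Not the actual API.
using Statistics: mean, std

function coarsegrain(x::AbstractVector, s::Integer, f = mean)
    n = div(length(x), s)
    return [f(view(x, (i - 1) * s + 1 : i * s)) for i in 1:n]
end

x = collect(1.0:10.0)
coarsegrain(x, 2)       # windowed means: [1.5, 3.5, 5.5, 7.5, 9.5]
coarsegrain(x, 5, std)  # windowed dispersion instead of the mean
```

With `f = mean` this is the classic mean coarse graining used in multiscale entropy; swapping in `std` or any other summary changes amplitudes and ordinal patterns of the coarse-grained series, which is why symbolic estimators are sensitive to the choice of `f`.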
> I think the docs won't build because ChaosTools relies on the old Entropies. I'll push a commit now that makes it rely on the ComplexityMeasures.
There was a typo in your latest commit. I fixed it. The documentation now builds successfully.
GitHub Actions is still not linking to the deployed documentation, though, which is very annoying. But we can fix this later.
I also updated the link to the docs in the "about" section on the main repo page.
Shall we merge and register?
> I care about this number to quote it in the update message
For the update message: a manual count gives 13 probability estimators × 6 entropies = 78 discrete entropy variants. With 4 complexity measures, that's 82 unique discrete measures at the moment. On top of that, we have 8 estimators of differential Shannon entropy.
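Just to double-check the arithmetic (numbers taken from the manual count above):

```julia
# Sanity check of the manual count: 13 probability estimators × 6 entropies,
# plus 4 complexity measures, plus 8 differential Shannon entropy estimators.
n_discrete_entropies = 13 * 6                    # 78 discrete entropy variants
n_discrete_measures = n_discrete_entropies + 4   # 82 unique discrete measures
n_total = n_discrete_measures + 8                # 90 measures overall
```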
Okay, you've convinced me of the multiscale argument, but since it is not yet in the package, we shouldn't do the multiplication yet. So we stay at the 82 or so for the update message, and we can write the larger estimate in the paper.
> Okay, you've convinced me of the multiscale argument, but since it is not yet in the package, we shouldn't do the multiplication yet. So we stay at the 82 or so for the update message, and we can write the larger estimate in the paper.
Ok, let's stick with the conservative estimate here.
The git diff is massive but the changes are pretty minimal:
Closes #198