Closed Datseris closed 1 year ago
I think the docs won't build until I merge. Locally everything is fine. @kahaaga what is our current estimate of "X", the number of quantities this package can compute?

> what is our current estimate of "X", the number of quantities this package can compute?
Let's estimate a lower bound:
```julia
# Recursively count all concrete subtypes of `T`.
# A variant of https://discourse.julialang.org/t/counting-all-subtypes/38797/6
using InteractiveUtils: subtypes

function countsubtypes(T::Type; recursive = true)
    s = subtypes(T)
    abstract_types = filter(isabstracttype, s)
    concrete_types = filter(x -> x ∉ abstract_types, s)
    n = length(concrete_types)
    if !recursive || isempty(abstract_types)
        return n
    end
    # Recurse only into the abstract subtypes; concrete types have no subtypes.
    return n + sum(countsubtypes(A; recursive = true) for A in abstract_types)
end

function measurecount_lowerbound()
    # Discrete quantities
    n_multiscale = countsubtypes(ComplexityMeasures.MultiScaleAlgorithm)
    n_complexity_measures = countsubtypes(ComplexityMeasure) * n_multiscale
    n_probests = countsubtypes(ProbabilitiesEstimator)
    n_entropy_types = countsubtypes(EntropyDefinition)
    n_entropies = n_entropy_types * n_probests * n_multiscale
    # Differential entropies (these only work for Shannon for now, so no multiplying)
    n_diffentests = countsubtypes(DifferentialEntropyEstimator)
    return n_entropies, n_diffentests, n_probests
end
```
```julia
julia> measurecount_lowerbound() |> sum
178
```
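To illustrate the recursion on something self-contained, here is a small sketch on a made-up type hierarchy (the `Measure`/`EntropyLike` names below are invented for the example, not part of the package):

```julia
# Made-up toy hierarchy, only to illustrate how the recursion counts concrete leaves.
using InteractiveUtils: subtypes

abstract type Measure end
abstract type EntropyLike <: Measure end   # abstract: not counted itself
struct ToyShannon <: EntropyLike end
struct ToyRenyi <: EntropyLike end
struct ToyLempelZiv <: Measure end

function countsubtypes(T::Type; recursive = true)
    s = subtypes(T)
    abstract_types = filter(isabstracttype, s)
    n = length(s) - length(abstract_types)  # concrete subtypes at this level
    (!recursive || isempty(abstract_types)) && return n
    return n + sum(countsubtypes(A) for A in abstract_types)
end

countsubtypes(Measure)  # ToyLempelZiv + (ToyShannon, ToyRenyi) = 3
```

With `recursive = false` only the top level is inspected, so the same call would return 1 (only `ToyLempelZiv` is concrete there).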
So currently 178 unique measures, when including each multiscale variant as its own "thing". EDIT: I have at least two more probabilities estimators as part of CausalityTools atm, so I think this number will be >200 once everything is ported here.
The real number is higher, because methods appear under different names and might be considered "unique" concepts based on types of input data (univariate/multivariate). Many of our estimators, through keyword arguments, can also be tweaked to behave in certain ways that might count as estimators of their own, if judging from scientific papers.
EDIT: we can probably present this information on the front page in a figure or something. It's possible to hide code blocks with Documenter, so we can just do the above calculation and show a figure summarizing the numbers.
> The git diff is massive but the changes are pretty minimal:
I created a separate PR (https://github.com/JuliaDynamics/ComplexityMeasures.jl/pull/235) for my review, so it is easier to see the changes I propose. Overall this looks good, but I did some minor changes and fixed some minor stuff:
- `ComplexityMeasure` suddenly appeared mid-sentence when talking about entropies.
- The `Project.toml` file for the documentation had the old UUID. I have fixed this in my PR.

> I think the docs won't build until I merge.

It should build now, once we register the package, I think. It works locally for me too.
Take-away: This is good to go when #235 is merged.
I don't think it's appropriate to multiply the number of multiscale algorithms with all measures and say we have more, because one first has to show that this would give a meaningful and useful result. Like, the multiscale of the `ValueHistogram` is clearly bad: coarse sampling the data strictly reduces the accuracy of the histogram and hence its entropy. Also, downscaling in this case means subsampling. Estimators that do not care about the temporal order of the data also don't appreciate downscaling. The same is true for symbolic permutation of a dataset (not a delay embedding).
> `n_probests = countsubtypes(ProbabilitiesEstimator)`

is counting too few, because some estimators (e.g., the symbolic ones) are subtypes of an intermediate supertype that is itself a subtype of `ProbabilitiesEstimator`.
Finally, multiscale isn't part of the v2 release, and I care about this number to quote it in the update message, so let's get the conservative estimate for now.
I think the docs won't build because ChaosTools relies on the old Entropies. I'll push a commit now that makes it rely on the ComplexityMeasures.
> I don't think it's appropriate to multiply the number of multiscale algorithms with all measures and say we have more, because one first has to show that this would give a meaningful and useful result.
We don't make any claims on the validity or usefulness of the results for arbitrary combinations of measures. What we do provide is the possibility of arbitrary combinations, many of which have yet to be explored.
We could never explore the usefulness of all the possible combinations that would exist here. It would simply be too large an effort. The only thing we promise is a modular design where you (in the future, when we resolve #223) can plug any complexity measure in with any multiscale sampling approach (and potentially other sampling schemes, if we decide to implement that).
This review on multiscale measures shows some of the possibilities that have been explored. There exist many variants of the presented multiscale schemes. Which of them are "useful" depends entirely on the data you have, the questions you ask, and the acceptable robustness of the results. It is not up to us to answer the question of usefulness a priori, because we can't.
> Like, the multiscale of the `ValueHistogram` is clearly bad: coarse sampling the data strictly reduces the accuracy of the histogram and hence its entropy.
"Bad" may not be the most appropriate word here. It is context and data dependent. If my time series are from process for which I've sampled 10^7 data points, and my intention is to compute some sort of multiscale entropy, then by virtue of just having a lot of data points, the histogram approach is perfectly fine to approximate the multiscale entropy of this process, provided I terminate the procedure at an appropriate scale level. If I only had 100 points, however, one could argue it would be a "bad" idea to apply the multiscale algorithm with a histogram approach beyond the second scale (or whatever scale).
I'd be perfectly fine applying any coarse graining procedure with any multiscale algorithm if I had a year of per-second-resolution data. I probably wouldn't be doing so if I had per-week data. This is a judgement that is entirely up to the user.
> Also, downscaling in this case means subsampling. Estimators that do not care about the temporal order of the data also don't appreciate downscaling. The same is true for symbolic permutation of a dataset (not a delay embedding).
I'm not sure what you mean by "appreciate downscaling".
The downsampling methods are not simply picking a subset of values from the original time series. They apply some function to subsets of the original data to create the coarse-grained data. This function can be completely arbitrary, and may alter amplitudes and ordinal patterns of the coarse-grained time series. `SymbolicPermutation` and friends would therefore be sensitive to the downsampling.

Common choices for this function are `mean` and `std`, but you could easily define arbitrary functions that transform the data in arbitrary ways, preserving (or not preserving) any given property of the data, which may or may not affect the result for any probabilities/complexity estimator.
Using `SymbolicPermutation` (or one of the other symbolic estimators) on plainly subsampled data would just be a slightly fancier way of doing `SymbolicPermutation` with an embedding lag greater than one. Nothing prevents you from doing so, but whether it is a good idea is context dependent.
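To make the kind of coarse graining discussed above concrete, here is a hedged sketch; the `coarsegrain` name and signature are made up for this illustration and are not the package's multiscale API:

```julia
# Hypothetical sketch of scale-`s` coarse graining: apply a summary function `f`
# to consecutive non-overlapping windows of length `s`. Not the actual API.
using Statistics: mean, std

function coarsegrain(x::AbstractVector, s::Integer, f = mean)
    n = div(length(x), s)
    return [f(view(x, (i - 1) * s + 1 : i * s)) for i in 1:n]
end

x = collect(1.0:10.0)
coarsegrain(x, 2)       # windowed means: [1.5, 3.5, 5.5, 7.5, 9.5]
coarsegrain(x, 5, std)  # windowed dispersion instead of the mean
```

With `f = mean` this is the classic mean coarse graining used in multiscale entropy; swapping in `std` or any other summary changes amplitudes and ordinal patterns of the coarse-grained series, which is why symbolic estimators are sensitive to the choice of `f`.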
> I think the docs won't build because ChaosTools relies on the old Entropies. I'll push a commit now that makes it rely on the ComplexityMeasures.
There was a typo in your latest commit. I fixed it. The documentation now builds successfully.
GitHub Actions is still not linking to the deployed documentation, though, which is very annoying. But we can fix this later.
I also updated the link to the docs in the "about" section on the main repo page.
Shall we merge and register?
> I care about this number to quote it in the update message
For the update message: a manual count gives 13 probability estimators × 6 entropies = 78 discrete entropy variants. With 4 complexity measures, that's 82 unique discrete measures at the moment. On top of that, we have 8 estimators of differential Shannon entropy.
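Just to double-check the arithmetic (numbers taken from the manual count above):

```julia
# Sanity check of the manual count: 13 probability estimators × 6 entropies,
# plus 4 complexity measures, plus 8 differential Shannon entropy estimators.
n_discrete_entropies = 13 * 6                    # 78 discrete entropy variants
n_discrete_measures = n_discrete_entropies + 4   # 82 unique discrete measures
n_total = n_discrete_measures + 8                # 90 measures overall
```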
Okay, you've convinced me of the multiscale argument, but since it is not yet in the package, we shouldn't do the multiplication yet. So we stay at the 82 or so for the update message, and we can write the larger estimate in the paper.
> Okay, you've convinced me of the multiscale argument, but since it is not yet in the package, we shouldn't do the multiplication yet. So we stay at the 82 or so for the update message, and we can write the larger estimate in the paper.
Ok, let's stick with the conservative estimate here.
The git diff is massive but the changes are pretty minimal:
Closes #198