JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems

Review of codebase and docs - Probabilities and Encodings - Datseris #213

Closed Datseris closed 1 year ago

Datseris commented 1 year ago

This is a review of all probabilities estimators except the Transfer Operator for v2. Once this PR is merged, all probabilities estimators except the Transfer Operator are "set in stone" for v2 and will not get any further changes. So it is important, @kahaaga, once we merge this, that we both agree that this is "done" and commit to this doneness, so that we can actually get some v2 out at some point during our lifespans, because if we don't commit, I am sure I will be finding things to change every week, given my history with the repo...

Alright, here I outline the major changes I made:

Datseris commented 1 year ago

The name sortperm is used in the docstring of the OrdinalPattern encoding but appears nowhere else in the docs. I'll probably re-think how all of this is exposed in this PR.

codecov[bot] commented 1 year ago

Codecov Report

Merging #213 (7f77993) into main (3de9294) will decrease coverage by 0.16%. The diff coverage is 91.66%.

@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   84.63%   84.47%   -0.17%     
==========================================
  Files          49       45       -4     
  Lines        1191     1140      -51     
==========================================
- Hits         1008      963      -45     
+ Misses        183      177       -6     
Impacted Files Coverage Δ
src/encoding/rectangular_binning.jl 88.52% <ø> (ø)
src/entropy.jl 79.31% <ø> (+11.66%) :arrow_up:
src/probabilities.jl 80.76% <ø> (ø)
.../probabilities_estimators/dispersion/dispersion.jl 89.28% <ø> (ø)
...abilities_estimators/histograms/value_histogram.jl 87.50% <ø> (ø)
src/probabilities_estimators/spatial/utils.jl 94.11% <ø> (ø)
src/encoding/ordinal_pattern.jl 80.00% <75.00%> (-6.05%) :arrow_down:
src/encoding/gaussian_cdf.jl 61.53% <80.00%> (+4.39%) :arrow_up:
.../spatial_permutation/SpatialSymbolicPermutation.jl 70.27% <81.25%> (-0.19%) :arrow_down:
...mators/permutation_ordinal/symbolic_permutation.jl 94.00% <94.00%> (ø)
... and 3 more


kahaaga commented 1 year ago

The name sortperm is used in the docstring of the OrdinalPattern encoding but appears nowhere else in the docs. I'll probably re-think how all of this is exposed in this PR.

Maybe just say Base.sortperm, and assume that a user would understand to look at Julia's Base docs?

Datseris commented 1 year ago

aaaaaaaaaaaaaaaaah haha my bad. I didn't think about it and thought it was the internal function that does the Lehmer code thingy...

Datseris commented 1 year ago

@kahaaga I would like to remove entropy!(s, e::Entropy, p::ProbEst, x) from the API. It just calls probabilities!(s, ...) and then calls the normal entropy function. It doesn't serve any actual optimization purpose. A user who cares about this small optimization can live with calling probabilities! and explicitly passing the result to entropy.
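
For concreteness, this is the two-call pattern such a user would write instead (a sketch; the argument order of probabilities! is assumed to mirror the entropy! signature above):

    # Fill the pre-allocated container `s` in-place, then compute the entropy
    # from the resulting probabilities.
    probs = probabilities!(s, est, x)  # est: a probabilities estimator
    h = entropy(e, probs)              # e: an entropy definition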

Datseris commented 1 year ago

I would also remove encodings_from_permutations; it is an internal function without purpose, as it just forwards to encodings_from_permutations! anyway. Only encodings_from_permutations! should exist. This is rather simple to remove.

Datseris commented 1 year ago

@kahaaga the OrdinalPatternEncoding still does not match the declared API. The API says that encode takes in an element of the input data. Here, it should take in an SVector, an element of the input dataset (or delay-embedded timeseries). But that's not what it takes in. I want to change this to do what the API says, which is the most useful for a user, who shouldn't have to care about creating the intermediate sortperm step. Which is also super expensive.

EDIT: In fact, I am convinced that once the encoding does as it's supposed to do, we will save a lot of lines of source code, just like we do in the histogram encoding. In essence, applying the estimator SymbolicPermutation will be a straightforward call of map(x -> encode(c, x), input_dataset), with c an instance of OrdinalPatternEncoding.


Here is my plan on how to change it. First, the entirety of the common.jl file is deleted. Then, perm = @MVector zeros(Int, m) becomes a field of OrdinalPatternEncoding. Then, we define:

function encode(encoding::OrdinalPatternEncoding, x::SVector{m}) where {m} # x is already an element of the input dataset
    # Sort in-place into the pre-allocated `perm` field, so no allocations happen here.
    perm = sortperm!(encoding.perm, x, lt = encoding.lt)
    # Compute the Lehmer code of the permutation via a Horner-like scheme
    # over the factorial number system.
    n = 0
    for i = 1:m-1
        for j = i+1:m
            n += perm[i] > perm[j] ? 1 : 0
        end
        n = (m-i)*n
    end
    # The Lehmer code actually results in 0 being an encoded symbol. Shift by 1, so that
    # encodings are positive integers.
    return n + 1
end

and BOOM, we saved 200 lines of code.


Hence, like in ValueHistogram, the source code of the estimator becomes: (1) initialize the encoding and make it your only field, (2) call encode, mapping over the elements of the input, (3) call probabilities(encoded_result).
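
In code, the whole estimator then reduces to something like this sketch (names follow the discussion above; the actual signatures may differ):

    # The estimator's only field is its encoding; probabilities is a
    # map-encode-count pipeline, just like in ValueHistogram.
    struct SymbolicPermutation{E <: OrdinalPatternEncoding}
        encoding::E
    end

    function probabilities(est::SymbolicPermutation, x)
        encodings = map(χ -> encode(est.encoding, χ), x)  # (2) encode each element
        return probabilities(encodings)                   # (3) count and normalize
    end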

kahaaga commented 1 year ago

I would like to remove entropy!(s, e::Entropy, p::ProbEst, x) from the API. It just calls probabilities!(s, ...) and then calls the normal entropy function. It doesn't serve any actual optimization purpose. A user who cares about this small optimization can live with calling probabilities! and explicitly passing the result to entropy.

Ok with me! We just need to remember to remove all references to entropy! from all docstrings.

I would also remove encodings_from_permutations; it is an internal function without purpose, as it just forwards to encodings_from_permutations! anyway. Only encodings_from_permutations! should exist. This is rather simple to remove.

Also fine with me.

... Here is my plan on how to change it. First, the entirety of the common.jl file is deleted. Then, perm = @MVector zeros(Int, m) becomes a field of OrdinalPatternEncoding. Then, we define:

Excellent. A much better solution. Go ahead!

Datseris commented 1 year ago

@kahaaga have a look at the new SymbolicPermutation.jl and ordinal_pattern files in the latest commit. I think the simplification is massive. The only thing left now is to think about how to incorporate the "weighted" or "amplitude-aware" versions into the picture.

Maybe the loop

    # TODO: The following loop can probably be parallelized!
    @inbounds for (i, χ) in enumerate(x)
        πs[i] = encode(est.encoding, χ)
    end

is altered to add weights to πs? Like, something as simple as πs[i] = encode(est.encoding, χ) * permweight(est, χ)? Or, if all of x is required, precompute the weights as pweights = permweights(est, x) and then multiply πs[i] with pweights[i]?

EDIT: Seems to me that the weight of any estimator is obtained from the data point itself and does not need additional data points. So, each data point χ generates its own weight without needing the remaining data points in the Dataset. So this is super easy: I just make a function permweight!(π, χ, est) that adds weight to π = πs[i] according to the estimator.
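
A sketch of that idea (permweight is the hypothetical per-point helper; for the plain SymbolicPermutation it would just return 1):

    # Each point χ yields a symbol and its own weight, independently of the
    # rest of the dataset, so both can be filled in a single pass.
    function encodings_and_weights!(πs, wts, est, x)
        @inbounds for (i, χ) in enumerate(x)
            πs[i] = encode(est.encoding, χ)
            wts[i] = permweight(est, χ)  # hypothetical helper
        end
        return πs, wts
    end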

kahaaga commented 1 year ago

@kahaaga have a look at the new SymbolicPermutation.jl and ordinal_pattern files in the latest commit. I think the simplification is massive. The only thing left now is to think about how to incorporate the "weighted" or "amplitude-aware" versions into the picture.

Maybe the loop

    # TODO: The following loop can probably be parallelized!
    @inbounds for (i, χ) in enumerate(x)
        πs[i] = encode(est.encoding, χ)
    end

is altered to add weights to πs? Like, something as simple as πs[i] = encode(est.encoding, χ) * permweight(est, χ)? Or, if all of x is required, precompute the weights as pweights = permweights(est, x) and then multiply πs[i] with pweights[i]?

EDIT: Seems to me that the weight of any estimator is obtained from the data point itself and does not need additional data points. So, each data point χ generates its own weight without needing the remaining data points in the Dataset. So this is super easy: I just make a function permweight!(π, χ, est) that adds weight to π = πs[i] according to the estimator.

Yes, the point of the weighted and amplitude-adjusted variants is that they use the same encoded symbols, but weight each symbol differently according to some property of the state vector from which the symbol was constructed.

Now, the weights are computed using the permutation_weights method (one in SymbolicAmplitudeAwarePermutation.jl and one in SymbolicWeightedPermutation.jl). These weights are then combined with the encoded symbols using the symprobs function in the utils.jl file.

I just make a function permweight!(π, χ, est) that adds weight to π = πs[i] according to the estimator.

Notice that there is sorting going on in symprobs. Any new method must respect the same ordering, so that there is correspondence between the outcomes and the probabilities returned by probabilities_and_outcomes(::SymbolicWeightedPermutation, x).
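
For illustration, the combination step can be pictured like this (a stand-in for the symprobs logic, not the package's actual code; note the sort that keeps outcomes and probabilities aligned):

    # Accumulate the total weight of each symbol, then normalize. Sorting the
    # symbols guarantees the i-th outcome matches the i-th probability.
    function weighted_probabilities(πs::AbstractVector{Int}, wts::AbstractVector{<:Real})
        acc = Dict{Int, Float64}()
        for (π, w) in zip(πs, wts)
            acc[π] = get(acc, π, 0.0) + w
        end
        outcomes = sort!(collect(keys(acc)))
        probs = [acc[π] for π in outcomes]
        return outcomes, probs ./ sum(probs)
    end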

Datseris commented 1 year ago

What's the point of the in-place method if the weights are recomputed in every call to it? One would need an in-place method with pre-allocated weight and symbol vectors...? E.g., I see that the current SymbolicWeightedPermutation doesn't even define an in-place method for probabilities!.

Datseris commented 1 year ago

An alternative to even defining an in-place probabilities! function is to require estimators that can do these in-place optimizations to be given the input data at construction time. E.g., we can define:

SymPerm(x::Vector, m, tau, ...)
SymPerm(x::Dataset, ...)

Inside, they create vectors for the symbol containers and the weight containers, and re-use these vectors whenever probabilities! is called.
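
A sketch of that construction-time pre-allocation (SymPerm and its fields are illustrative):

    # The constructor sizes the reusable containers from the input once, so
    # repeated probabilities! calls allocate nothing.
    struct SymPerm
        m::Int
        tau::Int
        πs::Vector{Int}       # pre-allocated symbol container
        wts::Vector{Float64}  # pre-allocated weight container
    end

    function SymPerm(x::AbstractVector, m::Int, tau::Int)
        n = length(x) - (m - 1)*tau  # number of delay vectors in x
        return SymPerm(m, tau, zeros(Int, n), zeros(n))
    end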

Datseris commented 1 year ago

Ok, I am done for the day. @kahaaga have a look at the new file SymbolicPermutation.jl. It contains all three methods. There are no extra files besides the fasthist.jl file. The code is now really polished. I only need to transfer the docstrings, which I will also reduce.

kahaaga commented 1 year ago

Ok, I am done for the day. @kahaaga have a look at the new file SymbolicPermutation.jl. It contains all three methods. There are no extra files besides the fasthist.jl file. The code is now really polished. I only need to transfer the docstrings, which I will also reduce.

Cool! Thanks for the effort. I will have a detailed look at all this tomorrow, or whenever you finish the docs.

Datseris commented 1 year ago

I've finished the docs as well, please have a look. I'll go fix tests now.

Datseris commented 1 year ago

GaussianCDFEncoding also doesn't respect the encode API, which says that an element χ of the input x is encoded individually.

EDIT: No, I guess it does? I am confused by all the broadcasting that happens on y in the encode of gaussian. It's a real number, right?

Datseris commented 1 year ago

Okay. I am almost done. The only thing that remains now is the Spatial estimators and the Transfer operator. I've decided I will do a review of entropies in a different PR and let this PR focus on the probabilities estimators.

Datseris commented 1 year ago

@kahaaga why are we using so many folders? What's the point of using a folder if its content is a single file? Doesn't this defeat the purpose of a folder in the first place? I would say folders that have only one file should be removed and the file extracted to the upper level. Most of our folders have 1 file. It unnecessarily clutters the tree navigation.

kahaaga commented 1 year ago

why are we using so many folders? What's the point of using a folder if its content is a single file?

The folders were more relevant back when there were more extra files (e.g. utils.jl) associated with every method. They may not be necessary any longer, since the codebase has been cleaned up quite significantly.

However, at this point, I think we should just be consistent: either all methods have a folder, or no methods have a folder. I'm fine with keeping it as is, because that stays tidy and doesn't break consistency if someone were to introduce a method that requires multiple files. But again, this is not an important issue, so if you want to do the work of changing to a folderless structure, go ahead.

kahaaga commented 1 year ago

EDIT: No, I guess it does? I am confused by all the broadcasting that happens on y in the encode of gaussian. It's a real number, right?

Yes, the returned value from encode(::GaussianCDFEncoding) should be a real number. The broadcasting is actually not needed anymore; it is a remnant from when we demanded encoding the entire input timeseries at the same time. Since we now pre-compute the empirical mean and standard deviation for the Gaussian CDF, encode can now operate on one element at a time, so we don't need to do elementwise operations in the last step.
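
A sketch of what the strict scalar version boils down to (illustrative only; μ, σ, and the number of categories c stand for the pre-computed fields):

    using SpecialFunctions: erf

    # Map x through the Gaussian CDF fitted to the input data, then
    # discretize the resulting value in (0, 1) into one of c bins.
    function encode_gaussian(μ::Real, σ::Real, c::Int, x::Real)
        y = 0.5 * (1 + erf((x - μ) / (σ * sqrt(2))))
        return min(c, floor(Int, y * c) + 1)  # bin index in 1:c
    end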

The question is: should we be strict (only allow scalar input, encode(::GaussianCDFEncoding, x::Real)), or should we keep it as is, which would also allow encode(::GaussianCDFEncoding, x::AbstractVector) (which would give elementwise mappings to the CDF)?

Datseris commented 1 year ago

The question is: should we be strict (only allow scalar input, encode(::GaussianCDFEncoding, x::Real)), or should we keep it as is, which would also allow encode(::GaussianCDFEncoding, x::AbstractVector) (which would give elementwise mappings to the CDF)?

No, just call map or broadcast manually with the . syntax.
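
I.e., a usage sketch (with encoding an instance of GaussianCDFEncoding and xs a timeseries):

    s  = encode(encoding, xs[1])            # scalar input: one encoded symbol
    ss = map(χ -> encode(encoding, χ), xs)  # elementwise, done explicitly
    ss = encode.(Ref(encoding), xs)         # or via broadcasting; Ref treats the encoding as a scalar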

kahaaga commented 1 year ago

No, just call map or broadcast manually with the . syntax.

I'm not sure if I follow. Am I right that you want to be strict about requiring scalar inputs, and require the user to broadcast/map manually? :D

Datseris commented 1 year ago

Either all methods have a folder, or no methods have a folder

Why? This sounds like an arbitrary requirement. My suggestion is that things should be in folders if they require more than 1 file, which is unlikely for a probability estimator generally speaking.

kahaaga commented 1 year ago

Why? This sounds like an arbitrary requirement. My suggestion is that things should be in folders if they require more than 1 file, which is unlikely for a probability estimator generally speaking.

It's just a preference I have, but I have no problems with not sticking to that. Your version is also a natural choice. Feel free to drop the folders if you like.

kahaaga commented 1 year ago

@Datseris Are you fixing the remaining tests (some fail after the OrdinalPatternEncoding refactoring), or shall I do it when reviewing the PR?

Datseris commented 1 year ago

I haven't finished yet; I still need to go through the Transfer Operator, and this will take a lot of time.

kahaaga commented 1 year ago

I haven't finished yet; I still need to go through the Transfer Operator, and this will take a lot of time.

As part of #55, I'll have to re-think the whole transfer operator estimator API anyways in light of all our recent updates. So I suggest we don't dive too deeply into it now, except cleaning up the documentation and cleaning up the source code.

However, I would be very hesitant to do any fundamental changes to the structure of the source code now (i.e. remove/add existing functions/types), because this code is directly used in scripts related to publications that are already out, and in other materials I use, which I won't have time to re-do for a while.

Datseris commented 1 year ago

As part of https://github.com/JuliaDynamics/Entropies.jl/pull/55, I'll have to re-think the whole transfer operator estimator API anyways in light of all our recent updates. So I suggest we don't dive too deeply into it now, except cleaning up the documentation and cleaning up the source code.

Okay, then I won't do TraOp at the moment. Well then, this PR is finished from my end and should be reviewed. Minor bugfixes for the tests can be corrected over time but shouldn't hold back the review.

However, I would be very hesitant to do any fundamental changes to the structure of the source code now (i.e. remove/add existing functions/types), because this code is directly used in scripts related to publications that are already out, and in other materials I use, which I won't have time to re-do for a while.

But the publications use Entropies v1.2, which is already out; not only that, but the modifications to the code will be published under a package with a different name! So why would this matter for your existing publications...?

Datseris commented 1 year ago

As you saw, many of the old estimators, such as ValueHistogram or the symbolic stuff, saw a complete rewrite that led to much simpler code. They are so easy to follow now. Probably worth doing the same with TE.

kahaaga commented 1 year ago

But the publications use Entropies v1.2, which is already out; not only that, but the modifications to the code will be published under a package with a different name! So why would this matter for your existing publications...?

The scripts mostly investigate the transfer operator in the context of causal inference. I still want users to be able to use the existing material, and still be able to use the latest stuff from CausalityTools v2 when it is released, without having to check out different versions (and remember differences in syntax) and manually use manifests. From experience, that will quickly become a mess, unless you know very well what you're doing.

As you saw, many of the old estimators, such as ValueHistogram or the symbolic stuff, saw a complete rewrite that led to much simpler code. They are so easy to follow now. Probably worth doing the same with TE.

Yes, we definitely should do something similar. But I'll think about that once I have time to address #55.

Well then, this PR is finished from my end and should be reviewed.

Ok, excellent. Thanks! I'll have a look at this as soon as possible and follow up with a last review of my own, but probably not until early next week. Some holiday days are incoming now. Merry Christmas, @Datseris!

Datseris commented 1 year ago

Sounds good to me, happy holidays!

Datseris commented 1 year ago

This is done now and all tests pass. I am re-writing the top-level comment to outline the changes. I don't remember all of them; there were some small changes regarding clarity here and there. I won't write them.

kahaaga commented 1 year ago

This is a review of all probabilities estimators except the Transfer Operator for v2. Once this PR is merged, all probabilities estimators except the Transfer Operator are "set in stone" for v2 and will not get any further changes. So it is important, @kahaaga, once we merge this, that we both agree that this is "done" and commit to this doneness, so that we can actually get some v2 out at some point during our lifespans, because if we don't commit, I am sure I will be finding things to change every week, given my history with the repo...

Alright, here I outline the major changes I made:

  • Improved (in my opinion) headings and outline in the docs
  • Simplified and reduced in length several docstrings, mainly for the estimators that use encodings
  • Removed the entropy! method completely
  • Added an Encodings documentation page
  • Adjusted the symbolic encoding: it now operates on the elements (the vectors from which to estimate the permutations) and in general obeys the encoding API as declared
  • Complete re-write of the symbolic permutation estimators. It is amazing how little code they occupy now. They are all in one file, and they have the ordinal pattern encoding as their field. They share almost all their lines of code besides the 4 lines that define the weights of the permutations. The simplification is insane!!!
  • The spatial symbolic estimator also has the encoding as a field now, similarly to the others
  • Similar rewrite of the Dispersion estimator and the Gaussian encoding; the estimator now has the encoding as its field
  • Fixed all tests

@Datseris I've now reviewed this PR in detail, and I must say your changes look very, very good! The simplification of the symbolic estimators is ridiculous :D

I only had some minor docstring changes (typos, slight clarifications/reformulations).

So it is important, @kahaaga, once we merge this, that we both agree that this is "done" and commit to this doneness, so that we can actually get some v2 out at some point during our lifespans, because if we don't commit, I am sure I will be finding things to change every week, given my history with the repo...

The existing probabilities estimators are now mature enough that I think we can guarantee their v2-stability. There comes a point where further optimisation at the expense of delayed publication becomes a bit silly. I don't think we're beyond the threshold yet, but we're approaching it :D

The only thing I'm a bit unsure of is the multiscale API, which may change. I still have to test whether it makes sense for upstream methods. However, there is no reason to delay Entropies v2 because of that. We can always release a minor breaking version later if necessary.

That said, I'm now merging, so we can get on with the rest of the review, and I can get on with solidifying the probabilities-based estimators in CausalityTools.

Datseris commented 1 year ago

We can always release a minor breaking version later if necessary.

There is no such thing as a minor breaking version. If you think multiscale isn't ready, it shouldn't go into v2.

kahaaga commented 1 year ago

There is no such thing as a minor breaking version. If you think multiscale isn't ready, it shouldn't go into v2.

For the applications in this package, it is ready. I can always take care of any future issues with a simple wrapper upstream.

Datseris commented 1 year ago

Well, in any case, I haven't reviewed that part yet, so we'll soon see how it is. I'm now going to make my entropies review PR.