WIP: Decouple `ProbabilitiesEstimator` from `OutcomeSpace`s

kahaaga commented 10 months ago

A thorough description will follow soon. No need to review this yet.

Fixes #313.
Fixes #314.
Fixes #288.
Fixes #312.

The diff is yet again huge, because API changes propagate through almost all files. However, changes are best viewed in the docs.

TODO:

[x] Ensure test coverage doesn't drop.
[x] Update documentation strings
[x] Update tutorial
[x] Update devdocs
[x] Update documentation examples
[x] Await new minor release for DimensionalData.jl. Docs won't build because of this issue, which I fixed in this PR and is now merged to the main branch. However, a new version is not released.
[x] Fix #317. Also depends on a new release of DimensionalData.jl with the mentioned fix.
[x] Temporarily install DimensionalData.jl to main branch during doc build, so that we can preview this PR.

Changes

Counting/probabilities

Introduced Counts struct. It is a wrapper around DimArray{<:Integer, N}.
- It can be constructed using the counts([o::OutcomeSpace], x) function.
- Its purpose is to store counts and outcomes that arise when calling counts(o::OutcomeSpace, x).
- It has named axes (default is x1 for the first dimension, x2 for the second, etc...), and it stores the outcomes as marginal labels. Outcomes can be accessed from a Counts instance using outcomes(cts::Counts).
- Only 1D estimation is implemented here. Higher-dimensional estimation will be covered in CausalityTools.jl.
Probabilities now wrap DimArray{T, N} instead of Array{T, N}.
- It can be constructed using either of the following methods
  - probabilities([est::ProbabilitiesEstimator], cts::Counts).
  - probabilities([est::ProbabilitiesEstimator], o::OutcomeSpace, x).
- It has named axes (default is x1 for the first dimension, x2 for the second, etc...), and it stores the outcomes as marginal labels. Outcomes can be accessed from a Probabilities instance using outcomes(p::Probabilities).
- Only 1D estimation is implemented here. Higher-dimensional estimation will be covered in CausalityTools.jl.

Consequences:

counts/allcounts and probabilities/allprobabilities are now the only functions one has to implement, since the outcomes are always stored as marginal labels to Counts and Probabilities. Thus, counts_and_outcomes and probabilities_and_outcomes come for free. As a result:
- Much fewer lines of code for many of the OutcomeSpaces.

Outcome spaces

Renamed ValueHistogram to ValueBinning, (as discussed in #306). Since we can compute histograms for all count-based outcome spaces using counts, it makes sense to rename ValueHistogram to something that actually represents better what the distinct categories for the histogram are: value bins.
Renamed CountOccurrences to UniqueElements (as discussed in #312). Deprecation added too.
symbolize function is now exported (previously it was only internal).
- symbolizeis implemented for all counting-based OutcomeSpaces.
- symbolize(o::OutcomeSpace, x) explicitly transforms the input data x into a Vector{Int} of encoded symbols. For efficiency purposes, this step is skipped for some of the outcome space here. However, it is mandatory for computing multivariate measures in CausalityTools.jl, so it is now exported.
- The purpose of symbolize is to distinguish between encoding state vector (using encode) and time series vectors (which are encoded using symbolize). This needs to be two separate functions, since OutcomeSpaces first apply some transformation to the data, while Encode encodes directly.

Encodings

New encoding UniqueElementsEncoding (as discussed in #312).

Minor changes

Reformatted tutorial.jl for consistent line length. Also rewrote some sentences for better flow (without taking away any of the meaning), and fixed some weird punctuation.

codecov[bot] commented 10 months ago

Codecov Report

Merging #316 (eb32804) into main (1d5967d) will increase coverage by 0.06%. The diff coverage is 88.06%.

:exclamation: Current head eb32804 differs from pull request most recent head a75b8f0. Consider uploading reports for the commit a75b8f0 to get more accurate results

@@            Coverage Diff             @@
##             main     #316      +/-   ##
==========================================
+ Coverage   87.96%   88.03%   +0.06%     
==========================================
  Files          76       78       +2     
  Lines        1920     2031     +111     
==========================================
+ Hits         1689     1788      +99     
- Misses        231      243      +12

Files	Coverage Δ
src/ComplexityMeasures.jl	`100.00% <ø> (ø)`
src/core/information_functions.jl	`92.30% <100.00%> (+0.30%)`	:arrow_up:
src/core/outcome_spaces.jl	`100.00% <ø> (ø)`
src/deprecations.jl	`100.00% <100.00%> (ø)`
src/discrete_info_estimators/chao_shen.jl	`88.23% <100.00%> (ø)`
src/discrete_info_estimators/horvitz_thompson.jl	`100.00% <100.00%> (ø)`
src/discrete_info_estimators/jackknife.jl	`100.00% <100.00%> (ø)`
src/discrete_info_estimators/miller_madow.jl	`100.00% <100.00%> (ø)`
src/discrete_info_estimators/plugin.jl	`100.00% <100.00%> (ø)`
src/discrete_info_estimators/schurmann.jl	`85.00% <100.00%> (ø)`
... and 23 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

kahaaga commented 10 months ago

@Datseris FYI: I'm mostly done with this, but we have to wait for a new breaking DimensionalData.jl release, which will contain relevant bug fixes for showing the DimArrays.

I'm also implementing relevant changes in CausalityTools.jl, to ensure that the new API here doesn't overly complicate anything upstream. I might have to do a few minor adjustments, so I'm holding off merging this until I can ensure smooth interoperability.

Datseris commented 10 months ago

It has named axes (default is x1 for the first dimension, x2 for the second, etc...), and it stores the outcomes as marginal labels

I think its axes should be named according to the outcome space. In fact, I would argue, it's dimensions should coincide in name with the outcome space.

Datseris commented 10 months ago

counts/allcounts and probabilities/allprobabilities are now the only functions one has to implement, since the outcomes are always stored as marginal labels to Counts and Probabilities. Thus, counts_and_outcomes and probabilities_and_outcomes come for free

What do you mean "come for free"? They should not exist at all. Now, only outcomes should exist, that given a Counts/Probabilities instance, it would return just a vector of the corresponding outcomes. What would counts_and_outcomes return?

Datseris commented 10 months ago

symbolize function is now exported (previously it was only internal).

I wonder if we can have a better name? I guess we can keep symbolize if we can't find something better, but given that we cast the data to a sequence of integers instead of sumbols maybe we can have a name that better alludes to that. codify perhaps, since encode is taken.

Datseris commented 10 months ago

symbolize(o::OutcomeSpace, x) explicitly transforms the input data x into a Vector{Int} of encoded symbols

Is there any structure to the returned integers? do they match 1-to-1 to the elements of outcome_space(o)? In the sense that the integer 5 would mean that the data were encoded to the 5-th element of Ω?

Datseris commented 9 months ago

@kahaaga what is the status quo here? Do you need support? I am dedicating my Friday to Hacktoberfest and I can finish up stuff.

kahaaga commented 9 months ago

@kahaaga what is the status quo here? Do you need support? I am dedicating my Friday to Hacktoberfest and I can finish up stuff.

I think this is close to being done. It is just a matter of ensuring that whatever we do here doesn't introduce unnecessary complications in CausalityTools.

Also depends on a new release of DimensionalData.jl with the mentioned fix.

v0.25 of DimensionalData.jl is now released, so I'll up the version and see if this then works.

kahaaga commented 9 months ago

@Datseris I think this is ready for a review now.

Datseris commented 9 months ago

Did you already adress the three comments I've left above?

kahaaga commented 9 months ago

Did you already adress the three comments I've left above?

Ah, no, I missed those. I'll deal with them tonight. Hopefully, this is ripe for Hacktoberfest-review tomorrow then.

kahaaga commented 9 months ago

I think its axes should be named according to the outcome space. In fact, I would argue, it's dimensions should coincide in name with the outcome space.

I have to illustrate this with an example from CausalityTools.jl that actually has several axes (so the code isn't runnable here). Currently, this is the output when using counts in CausalityTools on two input datasets:

julia> x = rand(['a', 'b'], 50); y = rand(1:3, 50);

julia> counts(x, y) 
Counts 2×3 DimArray{Int64,2} with dimensions:
  Dim{:x1} Categorical{Char} Char['a', 'b'] ForwardOrdered,
  Dim{:x2} Sampled{Int64} Int64[1, 2, 3] ForwardOrdered Irregular Points
        1   2  3
  'a'  13   7  5
  'b'   6  11  8

I failed to communicate above that we can actually estimate counts without an estimator too. That is the most common use case for contingency tables (counts) elsewhere too. Here, there's no outcome space to base the axis names on. But I guess one could define counts(x, y, ...) = counts(UniqueElements(), x, y, ...), since that's what happens when you don't provide a way to discretize.

How does you comment tie in to the example above? Would you replace x1 and x2, and with what specifically?

Datseris commented 9 months ago

we can actually estimate counts without an estimator too

Are you sure that is true? It sounds to me that this means that you are estimating counts with UniqueElements ;)

Would you replace x1 and x2, and with what specifically?

Create dimension types whose name is the same as the outcome space. If there is a name conflict then make :x1 to be the name of the outcome space.

kahaaga commented 9 months ago

Are you sure that is true? It sounds to me that this means that you are estimating counts with UniqueElements ;)

True dat. 👍

Create dimension types whose name is the same as the outcome space. If there is a name conflict then make :x1 to be the name of the outcome space.

Can we delay this naming convention until later, and just open an issue for it? It is a bit complicated upstream how this will work, because there's an additional layer of specification to the discretization: column-wise or row-wise. I need to figure out how to organise this in CausalityTools.jl.

kahaaga commented 9 months ago

I wonder if we can have a better name? I guess we can keep symbolize if we can't find something better, but given that we cast the data to a sequence of integers instead of sumbols maybe we can have a name that better alludes to that. codify perhaps, since encode is taken.

Yes, codify is good. We'll go with that.

Datseris commented 9 months ago

Can we delay this naming convention until later, and just open an issue for it?

I know how to do it as I have done it for ClimateBase.jl. I'll do it once you are done with this PR and I review it. I don't see how CausalityTools.jl will be affected by this. You have Dim{:x1} now, you will have Dim{:UniqueElements} after. Something tells me CausalityTools.jl won't be using the name of hte dimension anyways.

Datseris commented 9 months ago

Just tag me once this is good to go.

kahaaga commented 9 months ago

Is there any structure to the returned integers? do they match 1-to-1 to the elements of outcome_space(o)? In the sense that the integer 5 would mean that the data were encoded to the 5-th element of Ω?

codify is only implemented for OutcomeSpaces that uses some sort of Encoding under the hood. The returned integers is guaranteed to be among the integers 1, 2, ..., k that the particular encoding uses.

The procedure is:

Transform the input data according to the outcome space o (e.g. embed for OrdinalPatterns).
Create the relevant encoding (since it isn't stored explicitly as part of the outcome space)
Encode each transformed point (to an integer) according to this encoding.
Return the integers

I think for all the current OutcomeSpaces that implement codify, there is a 1-to-1 correspondence between these integers and the outcomes (unless we somewhere sort the outcomes, which would break this correspondence)

kahaaga commented 9 months ago

@Datseris I think this is good for a review now.

I'd suggest starting by having a look at the documentation and trying to run some relevant examples, because the diff is enormous.

Datseris commented 9 months ago

okay I am on it. I should be done with it by tonight... (fingers crossed)

Datseris commented 9 months ago

@kahaaga I am working on it... Is it really a wise decision to make probabilities and counts dimensional arrays...? What do we gain really? The outcomes function enables everything that we gain. And the crazy part is, any user would anyways use the outcomes function. So the user gains nothing.

There are large downsides though:

ComplexityMeasures.jl becomes strongly dependent to DimensionalData.jl, a package that doesn't appear anywhere near a 1.0 release.
The entrance threshold into ComplexityMeasures.jl becomes drastically larger. I can see this now in the docs. This semester I am using ComplexityMeasiures.jl with a group of Bachelor students. It is complete overkill to start explaining to the students dimensional data. Yet, with the current status, I need to explain dimensional data in the very second codeblock of the example.........

I can see now it was a big mistake to try and do "everything" in this PR: instead of doing the 1 thing, decoupling the estimators from outcome spaces, we did two things, also porting everythingg to dimensional data.

Personally, I am not sure about this change. I can see now, as I am re-writing the tutorial, that this adds a lot of complexity, and practically 0 benefit, as any practical usage that cares about the outcomes would anyways call the outcomes function. The fact that something is printed doesn't mean it is usable.

I don't know... What do you think?

For the future, lessons to be learned: it is a well established good practice that PRs should do one thing. I teach this in my good scientific code workshop, But we didn't really follow good practices here. Next time we should both keep this in mind and really tell this to each other: each time a PR does more than one thing, it should become two PRs....

Datseris commented 9 months ago

@kahaaga it is very diffuclt for me to revert the DimensionalData.jl changes. You're the one more capable to know which parts of the code are actually affected by this. Could you please, pleaseeeee revert the DimensionalData.jl so that we discuss this possibility in a different PR, where we can see more clearly what impact it has in the documentation? Right now there are vast changes from other aspects and makes it hard for us to make the best decision.

I am really sorry to ask for all this extra work, even though we both agreed together on dimensional data.jl... But can you pleaseeeee please do it?

kahaaga commented 9 months ago

For the future, lessons to be learned: it is a well established good practice that PRs should do one thing. I teach this in my good scientific code workshop, But we didn't really follow good practices here. Next time we should both keep this in mind and really tell this to each other: each time a PR does more than one thing, it should become two PRs....

Yep, agreed.

I am working on it... Is it really a wise decision to make probabilities and counts dimensional arrays...? What do we gain really?

In this package, nothing of value (except nice printing) is gained. However, for the higher-dimensional counts and probabilities, it is extremely valuable, because it makes writing good tutorials based on analytic examples much, much easier when you can keep track of the outcomes.

The entrance threshold into ComplexityMeasures.jl becomes drastically larger. I can see this now in the docs. This semester I am using ComplexityMeasiures.jl with a group of Bachelor students. It is complete overkill to start explaining to the students dimensional data. Yet, with the current status, I need to explain dimensional data in the very second codeblock of the example.........

Do you need to explain dimensional data at all? With ComplexityMeasures.jl, we're working with 1D data anyways, so it shouldn't be necessary to say anything beyond "a Counts/Probabilities instance is a vector - it has nice printing, so you can see what is being counted / what the probability is for".

For the multivariate measures in CausalityTools.jl, using DimensionalData.jl has the opposite effect, because it immediately becomes clear what the multi-sum formulas (e.g. for conditional mutual information) really mean when you can inspect the marginals like so:

julia> x = rand(['a', 'b', 'c'], 100); y = rand(1:4, 100);

julia> counts(UniqueElements(), x, y)
Counts 3×4 DimArray{Int64,2} with dimensions:
  Dim{:x1} Categorical{Char} Char['a', 'c', 'b'] Unordered,
  Dim{:x2} Sampled{Int64} Int64[4, 1, 3, 2] Unordered Irregular Points
       4  1   3   2
  'a'  9  9  10   4
  'c'  8  7   9  13
  'b'  8  6  11   6

It makes it clear that the sums go over outcomes, not raw data, and that you have to go through some transformation to get there.

I don't know... What do you think?

I agree that the dependency on DimensionalData is unfortunate. But I don't quite get where the huge learning curve comes from when it comes to dimensional data - we can just say in the tutorial and docs that it behaves as a regular vector/array with nice printing.

kahaaga commented 9 months ago

What if we simply remove the fancy printing that comes with DimensionalData.jl?

julia> counts(UniqueElements(), x, y)
Counts 3×4 DimArray{Int64,2} with dimensions:
  Dim{:x1} Categorical{Char} Char['a', 'c', 'b'] Unordered,
  Dim{:x2} Sampled{Int64} Int64[4, 1, 3, 2] Unordered Irregular Points
       4  1   3   2
  'a'  9  9  10   4
  'c'  8  7   9  13
  'b'  8  6  11   6

becomes

julia> counts(UniqueElements(), x, y)
3×4 Counts :
       4  1   3   2
  'a'  9  9  10   4
  'c'  8  7   9  13
  'b'  8  6  11   6

Then there is nothing to explain - the counts/probabilities are just matrices with labels in the margins.

Datseris commented 9 months ago

Do you know how to do this? To remove the fancy printing? This would be a big step forwards.

From re-writing the tutorial (I didn't push the changes), I do believe that it is a big burden for beginners. Mind you, most people that do timeseries analysis do not have any obvious relation with dimensional data. And when these people are also beginners in programming, you really want to focus minimizing the new concepts necessary to start learning a package.

The last problem is that, at least in the documentation (https://juliadynamics.github.io/ComplexityMeasures.jl/previews/PR316/tutorial/ ) the dimensional array is not printed with the left column grayed out. This makes it ambiguous which column is what. Hence it is definitely mandatory in my view to explain why you have two columns and what does each column mean...

Datseris commented 9 months ago

but wait

Datseris commented 9 months ago

For the multivariate measures in CausalityTools.jl, using DimensionalData.jl has the opposite effect, because it immediately becomes clear what the multi-sum formulas (e.g. for conditional mutual information) really mean when you can inspect the marginals like so:

How is this the case? You mean to visually inspect? Because to inspect programmatically, you anways have to call outcomes to obtain the actual outcome data.

Datseris commented 9 months ago

I think printing only the data with the marginals, and not the "dimensional data" info, is a compromise I can live with... But you should also know that I had headaches with ClimateBase.jl with DimensionalData.jl getting breaking releases all the time (version 0.25 means 24 breaking releases if you think about it). I have stopped updating ClimateBase.jl due to this headache.

kahaaga commented 9 months ago

Do you know how to do this? To remove the fancy printing? This would be a big step forwards.

Sure, it's no problem to modify the printing. I've already manually modified the default printing that DimensionalData.jl offers. If we want to do something more advanced when it comes to colors etc for the margins, I also kind of know how the internals of the printing work in DimensionalData.jl, because I committed a PR that fixed it for their v0.25 release.

We can also decide to just completely remove the margin labels for the 1D case, since they're not needed anyways, if that makes life easier.

How is this the case? You mean to visually inspect?

Yep, I mean nothing fancy - just to visually look at the counts/probabilities arrays, and compare with the formulas. You probably know the learning curve involved with understanding counts/outcome spaces/etc already in the 1D case. From experience, students have a really hard time understanding how to go from multiple raw time series to multivariate counts/probabilities, and relate them to the formulas for e.g. mutual information. The labels in the margins really help here, from a pedagogical standpoint.

Datseris commented 9 months ago

The problem is that no matter what we do to the color printing, Documenter.jl will render it the same as the main column. At least I do not know of a way to modify it.

But for now, can you just do the removal of the headers? So that it only prints Probabilities and then the data (with the outcome space marginals)? Ill have another critical view of the tutorial after that and see how more information one needs to say.

kahaaga commented 9 months ago

I think printing only the data with the marginals, and not the "dimensional data" info, is a compromise I can live with... But you should also know that I had headaches with ClimateBase.jl with DimensionalData.jl getting breaking releases all the time (version 0.25 means 24 breaking releases if you think about it). I have stopped updating ClimateBase.jl due to this headache.

Yeah, the many breaking releases is annoying. Have you had any communication with the package authors about the timeline of the package (i.e. if a stable release is anywhere on the horizon)?

We really don't need anything fancy. All I want(need, really, for pedagogical purposes) is to display labels on the marginals. An alternative could be to write a super-minimal package that does only fancy label printing, and depend on that instead. It shouldn't be too much work, but I really should prioritize getting this PR, and then CausalityTools.jl v3 which depends on it out first.

Datseris commented 9 months ago

So to make things concrete, we should get

70-element Probabilities{Float64, 1}

 [-4.15266301664966]  0.0001
 [-3.9526630166496597]  0.0001
 [-3.4526630166496597]  0.0001
 [-3.35266301664966]  0.0002
 [-3.25266301664966]  0.0004
 [-3.15266301664966]  0.0001
 [-3.0526630166496598]  0.0005
 ⋮           
 [2.647336983350341]   0.0011
 [2.7473369833503405]   0.0009
 [2.84733698335034]   0.0005
 [2.9473369833503407]   0.0004
 [3.0473369833503403]   0.0001
 [3.147336983350341]   0.0005
 [3.2473369833503405]   0.0002

kahaaga commented 9 months ago

So to make things concrete, we should get

Sure, no problem. Give me a few minutes, and I'll commit the change.

kahaaga commented 9 months ago

Sure, no problem. Give me a few minutes, and I'll commit the change.

I immediately see that I was a bit optimistic. I need to do some more digging into the source code for DimensionalData.jl again, but can get it done before the weekend is over. Maybe we just continue with the review under the assumption that we'll have nice printing before it's merged, and I'll have a deeper look tomorrow?

Datseris commented 9 months ago

ok

Datseris commented 9 months ago

If we stick with using DimensionalData.jl, the main change we need to do is that probabilities_and_outcomes needs to become the trivial function

probs = probabilities(...)
outs = outcomes(probs)
return probs, outs

instead of being the function used and called by outcomes.

kahaaga commented 9 months ago

Is this ok with you too, @Datseris?

julia> x = rand(1:5, 100);

julia> counts(x)
5-element Counts{Int64,1}
 1  21
 2  18
 3  20
 4  20
 5  21

julia> probabilities(x)
5-element Probabilities{Int64,1}
 1  0.21
 2  0.18
 3  0.2
 4  0.2
 5  0.21

kahaaga commented 9 months ago

the main change we need to do is that probabilities_and_outcomes needs to become the trivial function

So probabilities_and_outcomes doesn't really exist it we stick with DimensionalData, since the outcomes are contained in the marginals and can be gotten by calling outcomes(p::Probabilities), right?

Datseris commented 9 months ago

Yes, that is okay.

So probabilities_and_outcomes doesn't really exist it we stick with DimensionalData, since the outcomes are contained in the marginals and can be gotten by calling outcomes(p::Probabilities), right?

Yes, it only exists in deprecations.jl for bakwards compatibility.

Datseris commented 9 months ago

I see now we have a problem with ProbabilitiesEstimator. These types must wrap the outcome space, right? Otherwise we have problems downstream:

information(est_or_def, probest_or_def, x, y, ...)

isn't this the call signature we want in https://github.com/JuliaDynamics/CausalityTools.jl/issues/352 ?

In any case, this is becoming too complex. Multiple dispach should not be used to operate on more than 3 arguments. Not because it is not possible. But because it becomes too complex for the human brain, and incredibly difficult to track in code.

I'll open a new issue about this.

JuliaDynamics / ComplexityMeasures.jl