Closed: kahaaga closed this pull request 9 months ago
Merging #316 (eb32804) into main (1d5967d) will increase coverage by 0.06%. The diff coverage is 88.06%.
:exclamation: Current head eb32804 differs from pull request most recent head a75b8f0. Consider uploading reports for the commit a75b8f0 to get more accurate results.
```diff
@@            Coverage Diff             @@
##             main     #316      +/-   ##
==========================================
+ Coverage   87.96%   88.03%   +0.06%
==========================================
  Files          76       78       +2
  Lines        1920     2031     +111
==========================================
+ Hits         1689     1788      +99
- Misses        231      243      +12
```
| Files | Coverage Δ | |
|---|---|---|
| src/ComplexityMeasures.jl | 100.00% <ø> (ø) | |
| src/core/information_functions.jl | 92.30% <100.00%> (+0.30%) | :arrow_up: |
| src/core/outcome_spaces.jl | 100.00% <ø> (ø) | |
| src/deprecations.jl | 100.00% <100.00%> (ø) | |
| src/discrete_info_estimators/chao_shen.jl | 88.23% <100.00%> (ø) | |
| src/discrete_info_estimators/horvitz_thompson.jl | 100.00% <100.00%> (ø) | |
| src/discrete_info_estimators/jackknife.jl | 100.00% <100.00%> (ø) | |
| src/discrete_info_estimators/miller_madow.jl | 100.00% <100.00%> (ø) | |
| src/discrete_info_estimators/plugin.jl | 100.00% <100.00%> (ø) | |
| src/discrete_info_estimators/schurmann.jl | 85.00% <100.00%> (ø) | |
| ... and 23 more | | |
@Datseris FYI: I'm mostly done with this, but we have to wait for a new breaking DimensionalData.jl release, which will contain relevant bug fixes for showing the `DimArray`s.
I'm also implementing relevant changes in CausalityTools.jl, to ensure that the new API here doesn't overly complicate anything upstream. I might have to do a few minor adjustments, so I'm holding off merging this until I can ensure smooth interoperability.
- It has named axes (default is `x1` for the first dimension, `x2` for the second, etc.), and it stores the outcomes as marginal labels
I think its axes should be named according to the outcome space. In fact, I would argue, its dimensions should coincide in name with the outcome space.
`counts`/`allcounts` and `probabilities`/`allprobabilities` are now the only functions one has to implement, since the outcomes are always stored as marginal labels to `Counts` and `Probabilities`. Thus, `counts_and_outcomes` and `probabilities_and_outcomes` come for free
What do you mean "come for free"? They should not exist at all. Now, only `outcomes` should exist, which, given a `Counts`/`Probabilities` instance, would return just a vector of the corresponding outcomes. What would `counts_and_outcomes` return?
The `symbolize` function is now exported (previously it was only internal).
I wonder if we can have a better name? I guess we can keep `symbolize` if we can't find something better, but given that we cast the data to a sequence of integers instead of symbols, maybe we can have a name that better alludes to that. `codify` perhaps, since `encode` is taken.
`symbolize(o::OutcomeSpace, x)` explicitly transforms the input data `x` into a `Vector{Int}` of encoded symbols
Is there any structure to the returned integers? Do they match 1-to-1 to the elements of `outcome_space(o)`? In the sense that the integer `5` would mean that the data were encoded to the 5-th element of Ω?
@kahaaga what is the status quo here? Do you need support? I am dedicating my Friday to Hacktoberfest and I can finish up stuff.
> @kahaaga what is the status quo here? Do you need support? I am dedicating my Friday to Hacktoberfest and I can finish up stuff.
I think this is close to being done. It is just a matter of ensuring that whatever we do here doesn't introduce unnecessary complications in CausalityTools.
Also depends on a new release of DimensionalData.jl with the mentioned fix.
v0.25 of DimensionalData.jl is now released, so I'll up the version and see if this then works.
@Datseris I think this is ready for a review now.
Did you already address the three comments I've left above?
> Did you already address the three comments I've left above?
Ah, no, I missed those. I'll deal with them tonight. Hopefully, this is ripe for Hacktoberfest-review tomorrow then.
> I think its axes should be named according to the outcome space. In fact, I would argue, its dimensions should coincide in name with the outcome space.
I have to illustrate this with an example from CausalityTools.jl that actually has several axes (so the code isn't runnable here). Currently, this is the output when using `counts` in CausalityTools on two input datasets:
```julia
julia> x = rand(['a', 'b'], 50); y = rand(1:3, 50);

julia> counts(x, y)
Counts 2×3 DimArray{Int64,2} with dimensions:
  Dim{:x1} Categorical{Char} Char['a', 'b'] ForwardOrdered,
  Dim{:x2} Sampled{Int64} Int64[1, 2, 3] ForwardOrdered Irregular Points
      1   2  3
 'a'  13   7  5
 'b'   6  11  8
```
I failed to communicate above that we can actually estimate `counts` without an estimator too. That is the most common use case for contingency tables (counts) elsewhere too. Here, there's no outcome space to base the axis names on. But I guess one could define `counts(x, y, ...) = counts(UniqueElements(), x, y, ...)`, since that's what happens when you don't provide a way to discretize.
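The fallback described here (counting co-occurrences of unique elements when no discretization is given) can be sketched in plain Julia. This is a hedged illustration with a hypothetical helper name, not the package's implementation:

```julia
# Hypothetical sketch of counts(x, y) falling back to counting unique elements.
# `unique_element_counts` is an illustrative name, not part of the package.
function unique_element_counts(x, y)
    ox = sort!(unique(x))                          # outcomes along the first axis
    oy = sort!(unique(y))                          # outcomes along the second axis
    ix = Dict(o => i for (i, o) in enumerate(ox))  # outcome -> row index
    iy = Dict(o => j for (j, o) in enumerate(oy))  # outcome -> column index
    cts = zeros(Int, length(ox), length(oy))
    for (xi, yi) in zip(x, y)
        cts[ix[xi], iy[yi]] += 1
    end
    return cts, ox, oy                             # counts plus marginal labels
end

x = ['a', 'b', 'a', 'b', 'a']
y = [1, 2, 1, 1, 2]
cts, ox, oy = unique_element_counts(x, y)
# cts == [2 1; 1 1]: e.g. ('a', 1) occurs twice; every observation is counted once.
```

The marginal labels `ox`/`oy` are exactly the kind of information that would end up as axis labels on a `Counts` instance.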
How does your comment tie in to the example above? Would you replace `x1` and `x2`, and with what specifically?
> we can actually estimate `counts` without an estimator too

Are you sure that is true? It sounds to me that this means that you are estimating counts with `UniqueElements` ;)
> Would you replace `x1` and `x2`, and with what specifically?

Create dimension types whose name is the same as the outcome space. If there is a name conflict, then make `:x1` the name of the outcome space.
> Are you sure that is true? It sounds to me that this means that you are estimating counts with `UniqueElements` ;)
True dat. 👍
> Create dimension types whose name is the same as the outcome space. If there is a name conflict, then make `:x1` the name of the outcome space.
Can we delay this naming convention until later, and just open an issue for it? It is a bit complicated upstream how this will work, because there's an additional layer of specification to the discretization: column-wise or row-wise. I need to figure out how to organise this in CausalityTools.jl.
> I wonder if we can have a better name? I guess we can keep `symbolize` if we can't find something better, but given that we cast the data to a sequence of integers instead of symbols, maybe we can have a name that better alludes to that. `codify` perhaps, since `encode` is taken.
Yes, `codify` is good. We'll go with that.
> Can we delay this naming convention until later, and just open an issue for it?
I know how to do it, as I have done it for ClimateBase.jl. I'll do it once you are done with this PR and I review it. I don't see how CausalityTools.jl will be affected by this. You have `Dim{:x1}` now, you will have `Dim{:UniqueElements}` after. Something tells me CausalityTools.jl won't be using the name of the dimension anyways.
Just tag me once this is good to go.
> Is there any structure to the returned integers? Do they match 1-to-1 to the elements of `outcome_space(o)`? In the sense that the integer `5` would mean that the data were encoded to the 5-th element of Ω?
`codify` is only implemented for `OutcomeSpace`s that use some sort of `Encoding` under the hood. The returned integers are guaranteed to be among the integers `1, 2, ..., k` that the particular encoding uses.
The procedure is: first transform the input data according to `o` (e.g. embed for `OrdinalPatterns`), then encode the transformed data into integers. I think for all the current `OutcomeSpace`s that implement `codify`, there is a 1-to-1 correspondence between these integers and the outcomes (unless we somewhere sort the outcomes, which would break this correspondence)
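To make the claimed 1-to-1 correspondence concrete, here is a hedged plain-Julia sketch of a `codify`-like step for a `UniqueElements`-style outcome space (`codify_unique` is illustrative code, not the package's implementation):

```julia
# Sketch: map each data point to an integer in 1:k such that integer i
# corresponds to the i-th element of the outcome space Ω.
function codify_unique(x)
    Ω = sort!(unique(x))                           # the outcome space
    code = Dict(o => i for (i, o) in enumerate(Ω)) # outcome -> integer symbol
    return [code[xi] for xi in x], Ω
end

x = ['c', 'a', 'b', 'a', 'c']
symbols, Ω = codify_unique(x)
# symbols == [3, 1, 2, 1, 3]: symbols[j] == i means x[j] encodes to Ω[i],
# so decoding via Ω recovers the original data exactly.
@assert all(Ω[symbols[j]] == x[j] for j in eachindex(x))
```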
@Datseris I think this is good for a review now.
I'd suggest starting by having a look at the documentation and trying to run some relevant examples, because the diff is enormous.
okay I am on it. I should be done with it by tonight... (fingers crossed)
@kahaaga I am working on it... Is it really a wise decision to make probabilities and counts dimensional arrays...? What do we gain really? The `outcomes` function enables everything that we gain. And the crazy part is, any user would anyways use the `outcomes` function. So the user gains nothing.
There are large downsides though:
I can see now it was a big mistake to try and do "everything" in this PR: instead of doing the one thing, decoupling the estimators from outcome spaces, we did two things, also porting everything to dimensional data.
Personally, I am not sure about this change. I can see now, as I am re-writing the tutorial, that this adds a lot of complexity, and practically 0 benefit, as any practical usage that cares about the outcomes would anyways call the `outcomes` function. The fact that something is printed doesn't mean it is usable.
I don't know... What do you think?
For the future, lessons to be learned: it is a well-established good practice that PRs should do one thing. I teach this in my good scientific code workshop, but we didn't really follow good practices here. Next time we should both keep this in mind and really tell this to each other: each time a PR does more than one thing, it should become two PRs.
@kahaaga it is very difficult for me to revert the DimensionalData.jl changes. You're the one more capable of knowing which parts of the code are actually affected by this. Could you please, pleaseeeee revert the DimensionalData.jl changes so that we discuss this possibility in a different PR, where we can see more clearly what impact it has on the documentation? Right now there are vast changes from other aspects, which makes it hard for us to make the best decision.
I am really sorry to ask for all this extra work, even though we both agreed together on dimensional data.jl... But can you pleaseeeee please do it?
> For the future, lessons to be learned: it is a well-established good practice that PRs should do one thing. I teach this in my good scientific code workshop, but we didn't really follow good practices here. Next time we should both keep this in mind and really tell this to each other: each time a PR does more than one thing, it should become two PRs.
Yep, agreed.
> I am working on it... Is it really a wise decision to make probabilities and counts dimensional arrays...? What do we gain really?
In this package, nothing of value (except nice printing) is gained. However, for the higher-dimensional counts and probabilities, it is extremely valuable, because it makes writing good tutorials based on analytic examples much, much easier when you can keep track of the outcomes.
The entrance threshold into ComplexityMeasures.jl becomes drastically larger. I can see this now in the docs. This semester I am using ComplexityMeasures.jl with a group of Bachelor students. It is complete overkill to start explaining dimensional data to the students. Yet, with the current status, I need to explain dimensional data in the very second code block of the example...
Do you need to explain dimensional data at all? With ComplexityMeasures.jl, we're working with 1D data anyways, so it shouldn't be necessary to say anything beyond "a `Counts`/`Probabilities` instance is a vector - it has nice printing, so you can see what is being counted / what the probability is for".
For the multivariate measures in CausalityTools.jl, using DimensionalData.jl has the opposite effect, because it immediately becomes clear what the multi-sum formulas (e.g. for conditional mutual information) really mean when you can inspect the marginals like so:
```julia
julia> x = rand(['a', 'b', 'c'], 100); y = rand(1:4, 100);

julia> counts(UniqueElements(), x, y)
Counts 3×4 DimArray{Int64,2} with dimensions:
  Dim{:x1} Categorical{Char} Char['a', 'c', 'b'] Unordered,
  Dim{:x2} Sampled{Int64} Int64[4, 1, 3, 2] Unordered Irregular Points
      4  1   3   2
 'a'  9  9  10   4
 'c'  8  7   9  13
 'b'  8  6  11   6
```
It makes it clear that the sums go over outcomes, not raw data, and that you have to go through some transformation to get there.
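The multi-sum formulas operate directly on such a table. A hedged plain-Julia sketch, using the joint counts from the example output above (a naive plug-in mutual-information estimate, purely for illustration):

```julia
# Joint counts from the example above: rows are outcomes of x ('a', 'c', 'b'),
# columns are outcomes of y (4, 1, 3, 2). The sums below run over outcomes.
cts = [9 9 10  4;
       8 7  9 13;
       8 6 11  6]
N = sum(cts)                   # 100 observations
pxy = cts ./ N                 # joint probabilities over outcome pairs
px = vec(sum(pxy, dims = 2))   # marginal over x-outcomes (row sums)
py = vec(sum(pxy, dims = 1))   # marginal over y-outcomes (column sums)
# Naive plug-in mutual information: a double sum over outcome indices (i, j).
mi = sum(pxy[i, j] * log(pxy[i, j] / (px[i] * py[j]))
         for i in eachindex(px), j in eachindex(py) if pxy[i, j] > 0)
```

The indices `i` and `j` range over outcomes, not raw data points, which is exactly the distinction the margin labels make visible.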
I don't know... What do you think?
I agree that the dependency on DimensionalData is unfortunate. But I don't quite get where the huge learning curve comes from when it comes to dimensional data - we can just say in the tutorial and docs that it behaves as a regular vector/array with nice printing.
What if we simply remove the fancy printing that comes with DimensionalData.jl?
```julia
julia> counts(UniqueElements(), x, y)
Counts 3×4 DimArray{Int64,2} with dimensions:
  Dim{:x1} Categorical{Char} Char['a', 'c', 'b'] Unordered,
  Dim{:x2} Sampled{Int64} Int64[4, 1, 3, 2] Unordered Irregular Points
      4  1   3   2
 'a'  9  9  10   4
 'c'  8  7   9  13
 'b'  8  6  11   6
```
becomes
```julia
julia> counts(UniqueElements(), x, y)
3×4 Counts:
      4  1   3   2
 'a'  9  9  10   4
 'c'  8  7   9  13
 'b'  8  6  11   6
```
Then there is nothing to explain - the counts/probabilities are just matrices with labels in the margins.
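To show how little machinery such "labels in the margins" printing needs, here is a hedged plain-Julia sketch with a purely illustrative wrapper type (`MarginCounts` is not the package's `Counts`):

```julia
# Illustrative wrapper: a count matrix plus labels for each margin.
struct MarginCounts{T}
    data::Matrix{T}
    rowlabels::Vector
    collabels::Vector
end

# Minimal custom printing: size header, column labels, then labeled rows.
function Base.show(io::IO, ::MIME"text/plain", c::MarginCounts)
    nr, nc = size(c.data)
    println(io, nr, "×", nc, " Counts:")
    println(io, " "^6, join(lpad.(string.(c.collabels), 4)))
    for i in 1:nr
        println(io, lpad(repr(c.rowlabels[i]), 5), " ",
                join(lpad.(string.(c.data[i, :]), 4)))
    end
end

c = MarginCounts([9 9 10 4; 8 7 9 13; 8 6 11 6], ['a', 'c', 'b'], [4, 1, 3, 2])
show(stdout, MIME"text/plain"(), c)
```

A dedicated `show` method like this is all the customization that removing the DimensionalData.jl headers would require.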
Do you know how to do this? To remove the fancy printing? This would be a big step forwards.
From re-writing the tutorial (I didn't push the changes), I do believe that it is a big burden for beginners. Mind you, most people that do timeseries analysis do not have any obvious relation with dimensional data. And when these people are also beginners in programming, you really want to focus minimizing the new concepts necessary to start learning a package.
The last problem is that, at least in the documentation (https://juliadynamics.github.io/ComplexityMeasures.jl/previews/PR316/tutorial/), the dimensional array is not printed with the left column grayed out. This makes it ambiguous which column is what. Hence it is definitely mandatory, in my view, to explain why you have two columns and what each column means...
but wait
> For the multivariate measures in CausalityTools.jl, using DimensionalData.jl has the opposite effect, because it immediately becomes clear what the multi-sum formulas (e.g. for conditional mutual information) really mean when you can inspect the marginals like so:
How is this the case? You mean to visually inspect? Because to inspect programmatically, you anyways have to call `outcomes` to obtain the actual outcome data.
I think printing only the data with the marginals, and not the "dimensional data" info, is a compromise I can live with... But you should also know that I had headaches with ClimateBase.jl with DimensionalData.jl getting breaking releases all the time (version 0.25 means 24 breaking releases if you think about it). I have stopped updating ClimateBase.jl due to this headache.
> Do you know how to do this? To remove the fancy printing? This would be a big step forwards.
Sure, it's no problem to modify the printing. I've already manually modified the default printing that DimensionalData.jl offers. If we want to do something more advanced when it comes to colors etc for the margins, I also kind of know how the internals of the printing work in DimensionalData.jl, because I committed a PR that fixed it for their v0.25 release.
We can also decide to just completely remove the margin labels for the 1D case, since they're not needed anyways, if that makes life easier.
> How is this the case? You mean to visually inspect?
Yep, I mean nothing fancy - just to visually look at the counts/probabilities arrays, and compare with the formulas. You probably know the learning curve involved with understanding counts/outcome spaces/etc already in the 1D case. From experience, students have a really hard time understanding how to go from multiple raw time series to multivariate counts/probabilities, and relate them to the formulas for e.g. mutual information. The labels in the margins really help here, from a pedagogical standpoint.
The problem is that no matter what we do to the color printing, Documenter.jl will render it the same as the main column. At least I do not know of a way to modify it.
But for now, can you just do the removal of the headers? So that it only prints `Probabilities` and then the data (with the outcome space marginals)? I'll have another critical view of the tutorial after that and see how much more information one needs to say.
> I think printing only the data with the marginals, and not the "dimensional data" info, is a compromise I can live with... But you should also know that I had headaches with ClimateBase.jl with DimensionalData.jl getting breaking releases all the time (version 0.25 means 24 breaking releases if you think about it). I have stopped updating ClimateBase.jl due to this headache.
Yeah, the many breaking releases are annoying. Have you had any communication with the package authors about the timeline of the package (i.e. if a stable release is anywhere on the horizon)?
We really don't need anything fancy. All I want (need, really, for pedagogical purposes) is to display labels on the marginals. An alternative could be to write a super-minimal package that does only fancy label printing, and depend on that instead. It shouldn't be too much work, but I really should prioritize getting this PR out first, and then CausalityTools.jl v3, which depends on it.
So to make things concrete, we should get
```julia
70-element Probabilities{Float64, 1}
 [-4.15266301664966]    0.0001
 [-3.9526630166496597]  0.0001
 [-3.4526630166496597]  0.0001
 [-3.35266301664966]    0.0002
 [-3.25266301664966]    0.0004
 [-3.15266301664966]    0.0001
 [-3.0526630166496598]  0.0005
 ⋮
 [2.647336983350341]    0.0011
 [2.7473369833503405]   0.0009
 [2.84733698335034]     0.0005
 [2.9473369833503407]   0.0004
 [3.0473369833503403]   0.0001
 [3.147336983350341]    0.0005
 [3.2473369833503405]   0.0002
```
> So to make things concrete, we should get
Sure, no problem. Give me a few minutes, and I'll commit the change.
> Sure, no problem. Give me a few minutes, and I'll commit the change.
I immediately see that I was a bit optimistic. I need to do some more digging into the source code for DimensionalData.jl again, but can get it done before the weekend is over. Maybe we just continue with the review under the assumption that we'll have nice printing before it's merged, and I'll have a deeper look tomorrow?
ok
If we stick with using DimensionalData.jl, the main change we need to do is that `probabilities_and_outcomes` needs to become the trivial function

```julia
probs = probabilities(...)
outs = outcomes(probs)
return probs, outs
```

instead of being the function used and called by `outcomes`.
Is this ok with you too, @Datseris?
```julia
julia> x = rand(1:5, 100);

julia> counts(x)
5-element Counts{Int64,1}
 1  21
 2  18
 3  20
 4  20
 5  21

julia> probabilities(x)
5-element Probabilities{Int64,1}
 1  0.21
 2  0.18
 3  0.2
 4  0.2
 5  0.21
```
> the main change we need to do is that `probabilities_and_outcomes` needs to become the trivial function
So `probabilities_and_outcomes` doesn't really exist if we stick with DimensionalData, since the outcomes are contained in the marginals and can be gotten by calling `outcomes(p::Probabilities)`, right?
Yes, that is okay.
> So `probabilities_and_outcomes` doesn't really exist if we stick with DimensionalData, since the outcomes are contained in the marginals and can be gotten by calling `outcomes(p::Probabilities)`, right?

Yes, it only exists in deprecations.jl for backwards compatibility.
I see now we have a problem with `ProbabilitiesEstimator`. These types must wrap the outcome space, right? Otherwise we have problems downstream:

`information(est_or_def, probest_or_def, x, y, ...)`

isn't this the call signature we want in https://github.com/JuliaDynamics/CausalityTools.jl/issues/352 ?
In any case, this is becoming too complex. Multiple dispatch should not be used to operate on more than 3 arguments. Not because it is not possible, but because it becomes too complex for the human brain, and incredibly difficult to track in code.
I'll open a new issue about this.
A thorough description will follow soon. No need to review this yet.
The diff is yet again huge, because API changes propagate through almost all files. However, changes are best viewed in the docs.
TODO:
Changes

Counting/probabilities

- New `Counts` struct. It is a wrapper around `DimArray{<:Integer, N}`.
- New `counts([o::OutcomeSpace], x)` function. Counts are computed with `counts(o::OutcomeSpace, x)`. It has named axes (default is `x1` for the first dimension, `x2` for the second, etc.), and it stores the outcomes as marginal labels. Outcomes can be accessed from a `Counts` instance using `outcomes(cts::Counts)`.
- `Probabilities` now wrap `DimArray{T, N}` instead of `Array{T, N}`. Probabilities are computed with `probabilities([est::ProbabilitiesEstimator], cts::Counts)` or `probabilities([est::ProbabilitiesEstimator], o::OutcomeSpace, x)`. It has named axes (default is `x1` for the first dimension, `x2` for the second, etc.), and it stores the outcomes as marginal labels. Outcomes can be accessed from a `Probabilities` instance using `outcomes(p::Probabilities)`.

Consequences:
- `counts`/`allcounts` and `probabilities`/`allprobabilities` are now the only functions one has to implement, since the outcomes are always stored as marginal labels to `Counts` and `Probabilities`. Thus, `counts_and_outcomes` and `probabilities_and_outcomes` come for free. As a result, they no longer need to be implemented for individual `OutcomeSpace`s.

Outcome spaces
- Renamed `ValueHistogram` to `ValueBinning` (as discussed in #306). Since we can compute histograms for all count-based outcome spaces using `counts`, it makes sense to rename `ValueHistogram` to something that better represents what the distinct categories for the histogram are: value bins.
- Renamed `CountOccurrences` to `UniqueElements` (as discussed in #312). Deprecation added too.
- The `symbolize` function is now exported (previously it was only internal). `symbolize` is implemented for all counting-based `OutcomeSpace`s. `symbolize(o::OutcomeSpace, x)` explicitly transforms the input data `x` into a `Vector{Int}` of encoded symbols. For efficiency purposes, this step is skipped for some of the outcome spaces here. However, it is mandatory for computing multivariate measures in CausalityTools.jl, so it is now exported.
- The point of `symbolize` is to distinguish between encoding state vectors (using `encode`) and time series vectors (which are encoded using `symbolize`). These need to be two separate functions, since `OutcomeSpace`s first apply some transformation to the data, while `Encoding`s encode directly.

Encodings

- `UniqueElementsEncoding` (as discussed in #312).

Minor changes
- Reformatted `tutorial.jl` for consistent line length. Also rewrote some sentences for better flow (without taking away any of the meaning), and fixed some weird punctuation.