Add `Counts` type, and allow *any* `AbstractArray` type for both `Counts` and `Probabilities`

JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems

MIT License

60 stars 14 forks source link

Add `Counts` type, and allow any `AbstractArray` type for both `Counts` and `Probabilities` #314

Closed kahaaga closed 1 year ago

kahaaga commented 1 year ago

In CausalityTools, my current implementation of ContingencyTable uses NamedArray from NamedArrays.jl to store the counts. This allows naming the the indices of the count array.


n = 100
x = rand(["a", "b", "c", 2], n)
y = StateSpaceSet(rand(rng, n, 2))
encodings = PerPointEncoding(CategoricalEncoding(x), OrdinalPatternEncoding(3))
c = contingency_table(encodings, x, y)

z = StateSpaceSet(randn(n, 2))
encodings = PerPointEncoding(
    CategoricalEncoding(x),
    OrdinalPatternEncoding(2),
    GaussianCDFEncoding{2}(μ = 0.0, σ = 0.3; c = 3)
)
c = contingency_table(encodings, x, y, z);

We can then do use either regular indices or outcomes to index the counts.

julia> c[1:end-1, :, Name(c.outcomes[3][3])]
3×2 Named Matrix{Int64}
A ╲ B │ [2, 1]  [1, 2]
──────┼───────────────
b     │      3       2
a     │      3       1
2     │      0       3

IMO this this is far easier to work with than needing to call outcomes separately and manually matching the indices with the outcomes. The named marginals make it immediately clear what is counted. There is also no performance penalties when it comes to computing with the named arrays - just some extra memory due to allocating the marginal names upon construction.

More generally

This also leads me to think that perhaps we can implement the Counts type too, analogous to Probabilities.

Counts{N} -> an N-dimensional counts array
Probabilities{N} -> an N-dimensional probabilities array, which is computed from a Counts instance using some ProbabilitiesEstimator.

Defining these two types in ComplexityMeasures.jl completely eliminates the need for custom ContingencyTable (counts) and ContingencyMatrix (probabilities) upstream, because ContingencyTable is just a 2-or-higher-dimensional Counts array, and ContingencyMatrix is just a 2-or-higher-dimensional Probabilities.

Datseris commented 1 year ago

I am not a fan of even more API changes, because I don't find convincing arguments for them so far.

I don't see how c[1:end-1, :, Name(c.outcomes[3][3])] is simpler than c[1:end-1, :, 3]. It appears to me that you want to keep referencing things that do not have a conceptual referencing reason: just as probabilities is agnostic to where the counts came from, the same the contingency table should be agnostic to where the counts came from.

I don't see a reason to use Counts{N} instead of Array{Integer, N} (in fact, why do you need to contigency table type in the first place?).

Probabilities is already parameterized by N so that shouldn't be a problem if you want to put N as type parameter of Probbabilities.

kahaaga commented 1 year ago

Rephrased suggestion

Sorry about the initial issue description - I realise now that it was a lot of rambling and little about what I actually wanted to do. My suggestion here, when applied to the changes agreed on in #313, is as follows:

struct Probabilities{T, N, AT} <: AbstractArray{T, N}
    probs::AT
    function Probabilities(x::AbstractArray{T, N}) where {T, N}
        return new{T, N, typeof(x)}(x)
    end
end

struct Counts{T <: Integer, N, AT} <: AbstractArray{T, N}
    cts::AT
    function Counts(x::AbstractArray{T, N}) where {T, N}
        return new{T, N, typeof(x)}(x)
    end
end

"""
    counts(o::CountBasedOutcomeSpace, x) -> cts::Counts{1}

Compute the counts  (a one-dimensional contingency table) over `o` given `x`.
"""
function counts(::OutcomeSpace, x) end

""" 
    probabilities(est::ProbabilitiesEstimator, counts::Counts) -> p::Probabilities

Convert given `counts` to `Probabilities` by using the given `ProbabilitiesEstimator`.

If `counts` is `N`-dimensional, then `p` will be `N`-dimensional. This is useful
when computing multivariate measures such as conditional mutual information.
"""
function probabilities(est::ProbabilitiesEstimator, x::Counts) end

# Not needed in ComplexityMeasures.jl but needed in CausalityTools for multidimensional probability estimation
"""
    counts(os::PerPointEncoding{N}, x::Vararg{T, N}) where {T, N} -> cts::Counts{N}
    counts(os::PerVariableEncoding{N}, x::Vararg{T, N}) where {T, N} -> cts::Counts{N}

Compute the counts (an `N`-dimensional contingency table) over `o` given `x`.
"""
function counts(os::PerPointEncoding{N}, x::Vararg{T, N}) where {T, N}; end
function counts(os::PerVariableEncoding{N}, x::Vararg{T, N}) where {T, N}; end

This is as generic as it gets.

Other Julia packages provide functions for estimating probabilities-based measures. However, we provide an API for doing so, which is thoroughly thought out.

Other packages provide functionality for counting too. However, they essentially just provide what is our UniqueElements outcome space. We can provide an API for counting, compatible with any sort of discretization, which can be used upstream. Since we realised probabilities aren't the most fundamental quantity, but counts are, Counts should also be a type.

Both Counts and Probabilities should exist and work in any dimension (Probabilities already does).

why do you need to contingency table type in the first place I don't see a reason to use Counts{N} instead of Array{Integer, N}

We don't strictly need Counts or ContingencyTable (it's multivariate analogue). We also don't strictly need Probabilities or ContingencyMatrix (its multivariate analogue). We could just use Array{<:Integer} for counts and Array{<:Real} for probabilities. It would, however, lead to a much harder-to-understand codebase.

Some points for having both names:

The namesCounts and Probabilities both have conceptual meaning (map 1-to-1 to well-established concepts from the literature), and can be used for dispatch and easier-to-understand code.
There are multiple ways of estimating multidimensional probabilities (e.g. Fernandes et al., 2010) from contingency tables, depending on what you assume the dependency relationship between the marginals are (independent, or with some functional dependency). In the future, we could add such decided ProbabilitiesEstimators that are dedicated to higher-dimensional Counts, which takes into consideration assumed dependencies between marginals.
Avoiding code duplication in CausalityTools. Both concepts (counts and probabilities) are used in both packages. The only difference is that ComplexityMeasures.jl uses one-dimensional counts and probabilities, while CausalityTools.jl uses higher-dimensional counts and probabilities.
In CausalityTools.jl would be far easier for a user to only deal with only two generic types - Counts + Probabilities, rather than AbstractVector{<:Int} + Probabilities + ContingencyTable + ContingencyMatrix (the latter two are needed upstream anyways, so why not merge).

I don't see a reason to use Counts{N} instead of Array{Integer, N}

Input data can be Array{<:Integer, N}. We need a way to distinguish between the case where the input is to be treated as a contingency table, versus when the input is to be processed as-is.

References

Fernandes, A. D., & Gloor, G. B. (2010). Mutual information is critically dependent on prior assumptions: would the correct estimate of mutual information please identify itself?. Bioinformatics, 26(9), 1135-1139.

kahaaga commented 1 year ago

It appears to me that you want to keep referencing things that do not have a conceptual referencing reason: just as probabilities is agnostic to where the counts came from, the same the contingency table should be agnostic to where the counts came from.

As hinted to in my answer above, the contingency table can be agnostic to where the counts came from. However, it isn't very useful for practical purposes. See for example https://en.wikipedia.org/wiki/Contingency_table. The names of the outcomes are written in the marginals, so that one can reason about the counts with regular language. The same goes for probabilities arrays.

Since we estimate counts from outcome spaces with known elements, it is only natural that this information is contained along with the counts/probabilities. A NamedArray is a good way of doing so, because it has essentially no performance penalty.

I don't see how c[1:end-1, :, Name(c.outcomes[3][3])] is simpler than c[1:end-1, :, 3]

You're absolutely right, it isn't. you right about that. But c[:, :, SVector(1, 1)] is easier than c[:, :, findfirst(x -> x == SVector(1, 1), outs)]. And the indexing isn't really relevant here - it is a convenience you get if you use NamedArrays. There no reason one has to do so. Using Array works just fine for the purpose of computing pmf functionals, as we've already demonstrated with the existing Probabilities. It is, however, extremely convenient when visualising what's actually beeing computed:

# I immediately see that `x[2, 3]` is the number of times `"a"` and `[1, 2]` occurs together. 
# I get this information for free while constructing the counts table.
julia> x
3×2 Named Matrix{Int64}
A ╲ B │ [2, 1]  [1, 2]
──────┼───────────────
b     │      3       2
a     │      3       1
2     │      0       3

# Contrast that to the following.  What is `x[2, 3]`? Okay, it is a 1 count. But what outcomes
#  does that correspond to? I don't know. I'd have to retrace my steps, compute outcomes
# manually, and use `findfirst` individually for each outcome set (once per dimension). 
julia> x
3×2 Matrix{Int64}:
 3  2
 3  1
 0  3

Datseris commented 1 year ago

Since we realised probabilities aren't the most fundamental quantity, but counts are, Counts should also be a type.

This isn't an argument by itself. Something should be a new type in only one of two cases: 1) if there is no existing data that offers identical functionality, 2) to avoid type piracy. Neither happens here because array{Integer} has identical functionality, and because we have our own functions we cannot commit type piracy even if we wanted to.

We also don't strictly need Probabilities or ContingencyMatrix (its multivariate analogue).

No we do. We do need a type that ensures and guarantees that the probabilities sum to 1. There is no existing type we can use for this.

In CausalityTools.jl would be far easier for a user to only deal with only two generic types - Counts + Probabilities, rather than AbstractVector{<:Int} + Probabilities + ContingencyTable + ContingencyMatrix (the latter two are needed upstream anyways, so why not merge).

Sure, there is no problem with having the contigency table be a named array. The problem is, why stop there? Why wouldn't "probabiliies" be a named array as well? With the outcomes being the names of the rows?

Well, lol, this would allow us to completely remove the probabilities_and_outcomes function. We would only hacve a probabilities function ,as the "named array" would also have the outcomes. Similarly, the counts_and_outcomes function would not exist anymore. And the outcomes function is a trivial wrapper that just gives the values of the dimension of the probabilities/counts .

My only problem is that in Julia, unlike in Python, the community has not yet settled down to a dimensional data package... It seems that NamedArrays.jl and DimensionalData.jl are the two main contenders. Have you made any considerations when choosing NamedArrays.jl or was it by chance?

Datseris commented 1 year ago

NamedArrays.jl is simpler and with better printing, but much less performant and featureful than DimensionalData.jl. DimensionalData.jl uses type inference to make indexing fast. (this means that compilation time is higher for dimensional data.jl).

kahaaga commented 1 year ago

My only problem is that in Julia, unlike in Python, the community has not yet settled down to a dimensional data package... It seems that NamedArrays.jl and DimensionalData.jl are the two main contenders. Have you made any considerations when choosing NamedArrays.jl or was it by chance?

I had no particular reason for using NamedArrays - it's just the first thing I found that looked well-maintained and did the job of having labels on the marginals when printing.

DimensionalData.jl also looks good (better?). I need to try it out a bit and see how it does for the higher-dimensional arrays.

Anyways, the point of my suggestion is is to make the implementation of Counts/Probabilities generic, so we can use any array type. All code, until now, uses regular indexing anyways. So both Array, NamedArray and DimensionalData would work. It just that the latter two array types give the convenience of having indeable axis labels.

This isn't an argument by itself. Something should be a new type in only one of two cases: 1) if there is no existing data that offers identical functionality, 2) to avoid type piracy. Neither happens here because array{Integer} has identical functionality, and because we have our own functions we cannot commit type piracy even if we wanted to.

Yep, you're right. However, what we don't have is a way of distinguishing arrays of counts (to be treated as contingency tables) versus arrays of integers (to be treated arbitrarily, depending on the algorithm).

No we do. We do need a type that ensures and guarantees that the probabilities sum to 1. There is no existing type we can use for this.

That's true. Good point.

Sure, there is no problem with having the contigency table be a named array. The problem is, why stop there? Why wouldn't "probabiliies" be a named array as well? With the outcomes being the names of the rows?

Exactly! Why shouldn't it be? I think it should. If a Counts array is constructed from a Probabilities array, then the resulting Probabilities inherits the dimensional information.

EDIT: sorry, the other way around. If a Probabilities array is constructed from a Counts array, then it inherits dimensional information.

Well, lol, this would allow us to completely remove the probabilities_and_outcomes function. We would only have a probabilities function ,as the "named array" would also have the outcomes. Similarly, the counts_and_outcomes function would not exist anymore. And the outcomes function is a trivial wrapper that just gives the values of the dimension of the probabilities/counts.

Yes!!! Much simpler. This will remove a bunch of functions from the existing codebase. allprobabilities will need a small adjustment, but we get the rest essentially for free, since the code for getting the outcomes is already there in counts_and_outcomes/probabilities_and_outcomes.

kahaaga commented 1 year ago

I like DimArray better, but printing fails (for reasons I can't immediately decipher) for use cases we'd commonly see.

For example, assume we discretise three input variables using OrdinalPatterns for the first variable, UniqueElements for the second variable, and ValueHistogram for the third variable.

DimArray can be constructed, but printing to console fails.

julia> outs_x, outs_y, outs_z = SVector.([(1, 1), (1, 2), (2, 1), (2,2)]), ['a', 2], SVector.([(i/10,) for i = 1:9])
(SVector{2, Int64}[[1, 1], [1, 2], [2, 1], [2, 2]], Any['a', 2], 1:9)

julia> r = DimArray(rand(4, 2, 9), (x = outs_x, y = outs_y, z = outs_z)); # works

julia > r # showing `r` fails

NamedArray works as expected:

julia> NamedArray(rand(4, 2, 9), (outs_x, outs_y, outs_z))
4×2×9 Named Array{Float64, 3}

[:, :, C=[0.1]] =
 A ╲ B │        a         2
───────┼───────────────────
[1, 1] │ 0.332165  0.535625
⋮               ⋮         ⋮
[2, 2] │  0.27622  0.397477

[:, :, C=[0.2]] =
 A ╲ B │        a         2
───────┼───────────────────
[1, 1] │ 0.536456  0.708727
⋮               ⋮         ⋮
[2, 2] │ 0.993424  0.555581

[:, :, C=[0.3]] =
 A ╲ B │        a         2
───────┼───────────────────
[1, 1] │ 0.435666  0.834049
⋮               ⋮         ⋮
[2, 2] │ 0.301785  0.625569

[:, :, C=[0.4]] =
 A ╲ B │        a         2
───────┼───────────────────
[1, 1] │ 0.926434  0.626023
⋮               ⋮         ⋮
[2, 2] │  0.41063  0.227371

[:, :, C=[0.5]] =
 A ╲ B │        a         2
───────┼───────────────────
[1, 1] │  0.35522  0.373788
⋮               ⋮         ⋮
[2, 2] │ 0.716308  0.678094
⋮

Datseris commented 1 year ago

what's the error message?

kahaaga commented 1 year ago

julia> r
4×2×9 DimArray{Float64,3} with dimensions:
  Dim{:x} Sampled{SVector{2, Int64}} SVector{2, Int64}[[1, 1], [1, 2], [2, 1], [2, 2]] ForwardOrdered Irregular Points,
  Dim{:y} Categorical{Any} Any['a', 2] Unordered,
  Dim{:z} Sampled{Int64} 1:9 ForwardOrdered Regular Points
[:, :, 1]
           'a'      2
 Error showing value of type DimArray{Float64, 3, Tuple{Dim{:x, DimensionalData.Dimensions.LookupArrays.Sampled{SVector{2, Int64}, Vector{SVector{2, Int64}}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Irregular{Tuple{Nothing, Nothing}}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:y, DimensionalData.Dimensions.LookupArrays.Categorical{Any, Vector{Any}, DimensionalData.Dimensions.LookupArrays.Unordered, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:z, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, Array{Float64, 3}, DimensionalData.NoName, DimensionalData.Dimensions.LookupArrays.NoMetadata}:
ERROR: MethodError: no method matching iterate(::DimensionalData.ShowWith)

Closest candidates are:
  iterate(::AbstractString, ::Integer)
   @ Base strings/basic.jl:157
  iterate(::Base.RegexMatchIterator)
   @ Base regex.jl:686
  iterate(::Base.RegexMatchIterator, ::Any)
   @ Base regex.jl:686
  ...

Stacktrace:
  [1] Stateful
    @ Base ./iterators.jl:1412 [inlined]
  [2] escape_string(io::IOContext{IOBuffer}, s::DimensionalData.ShowWith, esc::Tuple{Char, Char}; keep::Tuple{})
    @ Base ./strings/io.jl:407
  [3] escape_string(io::IOContext{IOBuffer}, s::DimensionalData.ShowWith, esc::Tuple{Char, Char})
    @ Base ./strings/io.jl:406
  [4] print_quoted(io::IOContext{IOBuffer}, s::DimensionalData.ShowWith)
    @ Base ./strings/io.jl:440
  [5] show(io::IOContext{IOBuffer}, s::DimensionalData.ShowWith)
    @ Base ./strings/io.jl:197
  [6] sprint(f::Function, args::DimensionalData.ShowWith; context::IOContext{Base.TTY}, sizehint::Int64)
    @ Base ./strings/io.jl:112
  [7] print_matrix_row(io::IOContext{…}, X::AbstractVecOrMat, A::Vector{…}, i::Int64, cols::Vector{…}, sep::String, idxlast::Int64)
    @ Base ./arrayshow.jl:112
  [8] _print_matrix(io::IOContext{…}, X::AbstractVecOrMat, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64, rowsA::UnitRange{…}, colsA::UnitRange{…})
    @ Base ./arrayshow.jl:213
  [9] print_matrix(io::IOContext{…}, X::Matrix{…}, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64)
    @ Base ./arrayshow.jl:171
 [10]
    @ Base ./arrayshow.jl:171 [inlined]
 [11] _print_matrix(io::IOContext{…}, A::SubArray{…}, lookups::Tuple{…})
    @ DimensionalData ~/.julia/packages/DimensionalData/4TpBG/src/array/show.jl:125
 [12] print_array(io::IOContext{…}, mime::MIME{…}, A::DimArray{…})
    @ DimensionalData ~/.julia/packages/DimensionalData/4TpBG/src/array/show.jl:52
 [13] show_after
    @ DimensionalData ~/.julia/packages/DimensionalData/4TpBG/src/array/show.jl:35 [inlined]
 [14] show(io::IOContext{…}, mime::MIME{…}, A::DimArray{…})
    @ DimensionalData ~/.julia/packages/DimensionalData/4TpBG/src/array/show.jl:27
 [15] (::OhMyREPL.var"#15#16"{REPL.REPLDisplay{…}, MIME{…}, Base.RefValue{…}})(io::IOContext{Base.TTY})
    @ OhMyREPL ~/.julia/packages/OhMyREPL/h1QCu/src/output_prompt_overwrite.jl:23
 [16] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:554
 [17] display
    @ REPL ~/.julia/packages/OhMyREPL/h1QCu/src/output_prompt_overwrite.jl:6 [inlined]
 [18] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
    @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:278 [inlined]
 [19] display(x::Any)
    @ Base.Multimedia ./multimedia.jl:340
 [20] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
    @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:0
 [21] (::REPL.var"#57#58"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
    @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:284
 [22] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
    @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:554
 [23] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
    @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:282
 [24] (::REPL.var"#do_respond#80"{…})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
    @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:896
 [25] #invokelatest#2
    @ Base ./essentials.jl:887 [inlined]
 [26] invokelatest
    @ Base ./essentials.jl:884 [inlined]
 [27] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
    @ REPL.LineEdit /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/LineEdit.jl:2656
 [28] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
    @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:1297
 [29] (::REPL.var"#62#68"{REPL.LineEditREPL, REPL.REPLBackendRef})()
    @ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:373
Some type information was truncated. Use `show(err)` to see complete types.

Datseris commented 1 year ago

yeah bug in the package, probably open an issue there

kahaaga commented 1 year ago

Follow the issue here: https://github.com/rafaqz/DimensionalData.jl/issues/525

kahaaga commented 1 year ago

If going for some sort of named array in favor or counts_and_outcomes/probabilities_and_outcomes, there's going to a trade-off between using NamedArrays and DimensionalData. The first is printing.

Printing

DimArray prints terser outputs (just showing a single slice of the higher-dimensinal array), while NamedArray shows multiple, very-nicely-summarized slices.
DimArray lacks the convenient dimension-name indicator in the top-left corner that NamedArray has. Showing the dimension names like that is extremely common in the world of contingency tables, so here I favor NamedArray.
DimArray doesn't print lines, NamedArray does. I'm not sure which alternative I like the best.

julia> DimArray(rand(X(['a', 'b', 'c']), Y(2:6), Z([1, 2, 6, 7])))
3×5×4 DimArray{Float64,3} with dimensions:
  X Categorical{Char} Char['a', 'b', 'c'] ForwardOrdered,
  Y Sampled{Int64} 2:6 ForwardOrdered Regular Points,
  Z Sampled{Int64} Int64[1, 2, 6, 7] ForwardOrdered Irregular Points
[:, :, 1]
       2         3         4         5         6
  'a'  0.976237  0.498344  0.560187  0.419395  0.660773
  'b'  0.961241  0.606697  0.193405  0.781055  0.59499
  'c'  0.76248   0.784371  0.165449  0.617995  0.530311
[and 3 more slices...]

julia> NamedArray(rand(3, 5, 4), (['a', 'b', 'c'], 2:6, [1, 2, 6, 7]))
3×5×4 Named Array{Float64, 3}

[:, :, C=1] =
A ╲ B │         2          3          4          5          6
──────┼──────────────────────────────────────────────────────
a     │  0.221985   0.443299   0.172597   0.297715   0.250977
b     │  0.782418   0.134765   0.639052  0.0848428   0.173058
c     │  0.241489   0.348311   0.981556   0.299584   0.884155

[:, :, C=2] =
A ╲ B │        2         3         4         5         6
──────┼─────────────────────────────────────────────────
a     │ 0.570002  0.638562   0.89299  0.304665  0.709846
b     │ 0.981045  0.251941  0.840485  0.923528  0.841089
c     │ 0.392437  0.828573  0.759668  0.933569  0.360121

[:, :, C=6] =
A ╲ B │         2          3          4          5          6
──────┼──────────────────────────────────────────────────────
a     │   0.32179   0.906445   0.720825   0.972123   0.410204
b     │  0.516054   0.162833    0.24036   0.671541  0.0449559
c     │  0.410153   0.375167   0.244368  0.0386752   0.740858

[:, :, C=7] =
A ╲ B │        2         3         4         5         6
──────┼─────────────────────────────────────────────────
a     │ 0.541841  0.814869  0.795814  0.160063   0.93049
b     │ 0.260391  0.177302   0.21647  0.577456  0.955054
c     │ 0.871144  0.233323  0.289378  0.013904  0.584646

Datseris commented 1 year ago

Sure, named array is fine I guess. We should really try to use these data types as little as possible though.

kahaaga commented 1 year ago

I just did some performance testing too. Post will follow in a few minutes.

kahaaga commented 1 year ago

Performance

# simulate computing some three-variable probability-based measure, e.g. cmi
# this typically involves summing over certain dimensions multiple times, and just
# simple counting
function test_sum(c::AbstractArray{T, 3}) where {T}
    cmi = 0.0
    ds = size(c)
    @inbounds for i = 1:ds[1]
        tmp = 0.0
        for j = 1:ds[2]
            for k = 1:ds[3]
                cmi += c[i, j, k]
            end
            tmp = @views sum(c[i, j, :])
        end
        cmi += tmp
    end
    return cmi
end

# Simulate the worst-possible scenario where at least one marginal has element type `Any` 
# (due to e.g. using `UniqueElements` on some poorly preprocessed data)
julia> outs_x, outs_y, outs_z = SVector.([(1, 1), (1, 2), (2, 1), (2,2)]), ['a', 2, 'c', 'd', 'e'], SVector.([(i/10,) for i = 1:9]);

julia> using StableRNGs; rng = StableRNG(1234); x = rand(rng, 4, 5, 9);

julia> d = DimArray(x, (x = outs_x, y = outs_y, z = outs_z));

julia> n = NamedArray(x, (outs_x, outs_y, outs_z));

julia> julia> test_sum(n); @btime test_sum($n)
  75.042 μs (680 allocations: 52.19 KiB)
103.44971515371522

julia> test_sum(d); @btime test_sum($d)
  188.517 ns (0 allocations: 0 bytes)
103.44971515371522

# EDIT: for regular arrays
julia> a = Array(d); test_sum(a); @btime test_sum($a)
  179.107 ns (0 allocations: 0 bytes)
103.44971515371522

kahaaga commented 1 year ago

For this (relatively realistic) use case there's a ~500 times performance benefit to using DimArray. I think this fact should trump any printing preferences.

kahaaga commented 1 year ago

See my edit of the performance test above: using DimArray is essentially as performant as using Array, so using DimArray for Counts or Probabilities should be fine.

Datseris commented 1 year ago

Yes, this is well known, sorry, I should have told you: DimArray package has a miniscule overhead of one a couple more function calls when it comes to array operations. It is a strongly typed package. It really is "just arrays" wrapped in a dimensional representation. But wow, I didn't know NamedArrays was so bad.

Datseris commented 1 year ago

well, anyways, it is clear that we will be using DimArray. This is convenient for me, I use the same package in ClimateBase.jl.

Datseris commented 1 year ago

however, I think contigency table calculations shold not be in this package, but rather in causality tools. I mean: the Counts type still is here, but there isn't an API-based way to obrtain multi-dimensional counts.

kahaaga commented 1 year ago

well, anyways, it is clear that we will be using DimArray. This is convenient for me, I use the same package in ClimateBase.jl. however, I think contigency table calculations shold not be in this package, but rather in causality tools. I mean: the Counts type still is here, but there isn't an API-based way to obtain multi-dimensional counts.

Do we agree on the following summary?

Counts{T, N} is defined in ComplexityMeasures.jl . It is just a simple wrapper of AbstractArray{<:Int, N}.
Probabilities{T, N} is defined in ComplexityMeasures.jl (with no changes), and is also used as-is in CausalityTools.jl
Counts and Probabilities are defined as in comment https://github.com/JuliaDynamics/ComplexityMeasures.jl/issues/314#issuecomment-1696458053
ComplexityMeasures.jl defines counts and probabilities for one-dimensional data.
CausalityTools.jl extends counts and probabilities for multi-dimensional data.
counts_and_outcomes and probabilities_and_outcomes are dropped in ComplexityMeasures.jl in favor of using DimArrays where the outcome are the marginal labels.

Datseris commented 1 year ago

Sure, but the wrapped array should be DimensionalArray, not just Array.

* `counts_and_outcomes` and `probabilities_and_outcomes` are dropped in ComplexityMeasures.jl  in favor of using `DimArray`s where the outcome are the marginal labels.

They should be deprecated, not dropped. A generic deprecated function definition should firsst throw a wraning and then return probabilities(..), outcomes(..).

kahaaga commented 1 year ago

Sure, but the wrapped array should be DimensionalArray, not just Array. They should be deprecated, not dropped. A generic deprecated function definition should firsst throw a wraning and then return probabilities(..), outcomes(..).

Yep, you're right about both!

kahaaga commented 1 year ago

I'll then implement these changes as part of solving #313, since this anyways require some major refactoring, and I don't want to refactor twice.

kahaaga commented 1 year ago

@Datseris, What's your preference on how the Counts/Probabilities are shown? Here are two alternatives:

5×5 Counts{Int, 2}:
     5   6   7   8   9
 1  72   3  34  56  76
 2  50  34  93  36  10
 3  85  69  83  43  70
 4  66  98  77  98  78
 5  94  82  74  28  86

Counts, represented as 5×5 DimArray{Int64,2} with dimensions:
  Dim{:x} Sampled{Int64} 1:5 ForwardOrdered Regular Points,
  Dim{:y} Sampled{Int64} 5:9 ForwardOrdered Regular Points
     5   6   7   8   9
 1  72   3  34  56  76
 2  50  34  93  36  10
 3  85  69  83  43  70
 4  66  98  77  98  78
 5  94  82  74  28  86

The formatting is lost here; the marginal symbols are colored in a darker color than the matrix entries.

DimArrays displays some information that I think is unnecessary for us to care about (what type of internal representation the marginal labels have in DimensionalData.jl). We define in OutcomeSpace/Encoding what the type of the outcomes is, so I think it is redundant to show this information. I therefore favour the first alternative.

Datseris commented 1 year ago

first option please!

JuliaDynamics / ComplexityMeasures.jl

Add `Counts` type, and allow *any* `AbstractArray` type for both `Counts` and `Probabilities` #314

More generally

Rephrased suggestion

References

Printing

Performance

Add `Counts` type, and allow any `AbstractArray` type for both `Counts` and `Probabilities` #314