Closed kahaaga closed 1 year ago
I am not a fan of even more API changes, because I don't find convincing arguments for them so far.
I don't see how c[1:end-1, :, Name(c.outcomes[3][3])]
is simpler than c[1:end-1, :, 3]
. It appears to me that you want to keep referencing things that do not have a conceptual referencing reason: just as probabilities is agnostic to where the counts came from, the same the contingency table should be agnostic to where the counts came from.
I don't see a reason to use Counts{N}
instead of Array{Integer, N}
(in fact, why do you need to contigency table type in the first place?).
Probabilities
is already parameterized by N
so that shouldn't be a problem if you want to put N
as type parameter of Probbabilities
.
Sorry about the initial issue description - I realise now that it was a lot of rambling and little about what I actually wanted to do. My suggestion here, when applied to the changes agreed on in #313, is as follows:
struct Probabilities{T, N, AT} <: AbstractArray{T, N}
probs::AT
function Probabilities(x::AbstractArray{T, N}) where {T, N}
return new{T, N, typeof(x)}(x)
end
end
struct Counts{T <: Integer, N, AT} <: AbstractArray{T, N}
cts::AT
function Counts(x::AbstractArray{T, N}) where {T, N}
return new{T, N, typeof(x)}(x)
end
end
"""
counts(o::CountBasedOutcomeSpace, x) -> cts::Counts{1}
Compute the counts (a one-dimensional contingency table) over `o` given `x`.
"""
function counts(::OutcomeSpace, x) end
"""
probabilities(est::ProbabilitiesEstimator, counts::Counts) -> p::Probabilities
Convert given `counts` to `Probabilities` by using the given `ProbabilitiesEstimator`.
If `counts` is `N`-dimensional, then `p` will be `N`-dimensional. This is useful
when computing multivariate measures such as conditional mutual information.
"""
function probabilities(est::ProbabilitiesEstimator, x::Counts) end
# Not needed in ComplexityMeasures.jl but needed in CausalityTools for multidimensional probability estimation
"""
counts(os::PerPointEncoding{N}, x::Vararg{T, N}) where {T, N} -> cts::Counts{N}
counts(os::PerVariableEncoding{N}, x::Vararg{T, N}) where {T, N} -> cts::Counts{N}
Compute the counts (an `N`-dimensional contingency table) over `o` given `x`.
"""
function counts(os::PerPointEncoding{N}, x::Vararg{T, N}) where {T, N}; end
function counts(os::PerVariableEncoding{N}, x::Vararg{T, N}) where {T, N}; end
This is as generic as it gets.
Other Julia packages provide functions for estimating probabilities-based measures. However, we provide an API for doing so, which is thoroughly thought out.
Other packages provide functionality for counting too. However, they essentially just provide what is our UniqueElements
outcome space. We can provide an API for counting, compatible with any sort of discretization, which can be used upstream. Since we realised probabilities aren't the most fundamental quantity, but counts are, Counts
should also be a type.
Both Counts
and Probabilities
should exist and work in any dimension (Probabilities
already does).
why do you need to contingency table type in the first place I don't see a reason to use Counts{N} instead of Array{Integer, N}
We don't strictly need Counts
or ContingencyTable
(it's multivariate analogue). We also don't strictly need Probabilities
or ContingencyMatrix
(its multivariate analogue). We could just use Array{<:Integer}
for counts and Array{<:Real}
for probabilities. It would, however, lead to a much harder-to-understand codebase.
Some points for having both names:
Counts
and Probabilities
both have conceptual meaning (map 1-to-1 to well-established concepts from the literature), and can be used for dispatch and easier-to-understand code.ProbabilitiesEstimator
s that are dedicated to higher-dimensional Counts
, which takes into consideration assumed dependencies between marginals.Counts
+ Probabilities
, rather than AbstractVector{<:Int}
+ Probabilities
+ ContingencyTable
+ ContingencyMatrix
(the latter two are needed upstream anyways, so why not merge).I don't see a reason to use
Counts{N}
instead ofArray{Integer, N}
Input data can be Array{<:Integer, N}
. We need a way to distinguish between the case where the input is to be treated as a contingency table, versus when the input is to be processed as-is.
Fernandes, A. D., & Gloor, G. B. (2010). Mutual information is critically dependent on prior assumptions: would the correct estimate of mutual information please identify itself?. Bioinformatics, 26(9), 1135-1139.
It appears to me that you want to keep referencing things that do not have a conceptual referencing reason: just as probabilities is agnostic to where the counts came from, the same the contingency table should be agnostic to where the counts came from.
As hinted to in my answer above, the contingency table can be agnostic to where the counts came from. However, it isn't very useful for practical purposes. See for example https://en.wikipedia.org/wiki/Contingency_table. The names of the outcomes are written in the marginals, so that one can reason about the counts with regular language. The same goes for probabilities arrays.
Since we estimate counts from outcome spaces with known elements, it is only natural that this information is contained along with the counts/probabilities. A NamedArray
is a good way of doing so, because it has essentially no performance penalty.
I don't see how c[1:end-1, :, Name(c.outcomes[3][3])] is simpler than c[1:end-1, :, 3]
You're absolutely right, it isn't. you right about that. But c[:, :, SVector(1, 1)]
is easier than c[:, :, findfirst(x -> x == SVector(1, 1), outs)]
. And the indexing isn't really relevant here - it is a convenience you get if you use NamedArray
s. There no reason one has to do so. Using Array
works just fine for the purpose of computing pmf functionals, as we've already demonstrated with the existing Probabilities
. It is, however, extremely convenient when visualising what's actually beeing computed:
# I immediately see that `x[2, 3]` is the number of times `"a"` and `[1, 2]` occurs together.
# I get this information for free while constructing the counts table.
julia> x
3×2 Named Matrix{Int64}
A ╲ B │ [2, 1] [1, 2]
──────┼───────────────
b │ 3 2
a │ 3 1
2 │ 0 3
# Contrast that to the following. What is `x[2, 3]`? Okay, it is a 1 count. But what outcomes
# does that correspond to? I don't know. I'd have to retrace my steps, compute outcomes
# manually, and use `findfirst` individually for each outcome set (once per dimension).
julia> x
3×2 Matrix{Int64}:
3 2
3 1
0 3
Since we realised probabilities aren't the most fundamental quantity, but counts are,
Counts
should also be a type.
This isn't an argument by itself. Something should be a new type in only one of two cases: 1) if there is no existing data that offers identical functionality, 2) to avoid type piracy. Neither happens here because array{Integer}
has identical functionality, and because we have our own functions we cannot commit type piracy even if we wanted to.
We also don't strictly need
Probabilities
orContingencyMatrix
(its multivariate analogue).
No we do. We do need a type that ensures and guarantees that the probabilities sum to 1. There is no existing type we can use for this.
In CausalityTools.jl would be far easier for a user to only deal with only two generic types -
Counts
+Probabilities
, rather thanAbstractVector{<:Int}
+Probabilities
+ContingencyTable
+ContingencyMatrix
(the latter two are needed upstream anyways, so why not merge).
Sure, there is no problem with having the contigency table be a named array. The problem is, why stop there? Why wouldn't "probabiliies" be a named array as well? With the outcomes being the names of the rows?
Well, lol, this would allow us to completely remove the probabilities_and_outcomes
function. We would only hacve a probabilities
function ,as the "named array" would also have the outcomes. Similarly, the counts_and_outcomes
function would not exist anymore. And the outcomes
function is a trivial wrapper that just gives the values of the dimension of the probabilities/counts .
My only problem is that in Julia, unlike in Python, the community has not yet settled down to a dimensional data package... It seems that NamedArrays.jl and DimensionalData.jl are the two main contenders. Have you made any considerations when choosing NamedArrays.jl or was it by chance?
NamedArrays.jl is simpler and with better printing, but much less performant and featureful than DimensionalData.jl. DimensionalData.jl uses type inference to make indexing fast. (this means that compilation time is higher for dimensional data.jl).
My only problem is that in Julia, unlike in Python, the community has not yet settled down to a dimensional data package... It seems that NamedArrays.jl and DimensionalData.jl are the two main contenders. Have you made any considerations when choosing NamedArrays.jl or was it by chance?
I had no particular reason for using NamedArray
s - it's just the first thing I found that looked well-maintained and did the job of having labels on the marginals when printing.
DimensionalData.jl also looks good (better?). I need to try it out a bit and see how it does for the higher-dimensional arrays.
Anyways, the point of my suggestion is is to make the implementation of Counts
/Probabilities
generic, so we can use any array type. All code, until now, uses regular indexing anyways. So both Array
, NamedArray
and DimensionalData
would work. It just that the latter two array types give the convenience of having indeable axis labels.
This isn't an argument by itself. Something should be a new type in only one of two cases: 1) if there is no existing data that offers identical functionality, 2) to avoid type piracy. Neither happens here because array{Integer} has identical functionality, and because we have our own functions we cannot commit type piracy even if we wanted to.
Yep, you're right. However, what we don't have is a way of distinguishing arrays of counts (to be treated as contingency tables) versus arrays of integers (to be treated arbitrarily, depending on the algorithm).
No we do. We do need a type that ensures and guarantees that the probabilities sum to 1. There is no existing type we can use for this.
That's true. Good point.
Sure, there is no problem with having the contigency table be a named array. The problem is, why stop there? Why wouldn't "probabiliies" be a named array as well? With the outcomes being the names of the rows?
Exactly! Why shouldn't it be? I think it should. If a Counts
array is constructed from a Probabilities
array, then the resulting Probabilities
inherits the dimensional information.
EDIT: sorry, the other way around. If a Probabilities
array is constructed from a Counts
array, then it inherits dimensional information.
Well, lol, this would allow us to completely remove the probabilities_and_outcomes function. We would only have a probabilities function ,as the "named array" would also have the outcomes. Similarly, the counts_and_outcomes function would not exist anymore. And the outcomes function is a trivial wrapper that just gives the values of the dimension of the probabilities/counts.
Yes!!! Much simpler. This will remove a bunch of functions from the existing codebase. allprobabilities
will need a small adjustment, but we get the rest essentially for free, since the code for getting the outcomes is already there in counts_and_outcomes
/probabilities_and_outcomes
.
I like DimArray
better, but printing fails (for reasons I can't immediately decipher) for use cases we'd commonly see.
For example, assume we discretise three input variables using OrdinalPatterns
for the first variable, UniqueElements
for the second variable, and ValueHistogram
for the third variable.
DimArray
can be constructed, but printing to console fails.
julia> outs_x, outs_y, outs_z = SVector.([(1, 1), (1, 2), (2, 1), (2,2)]), ['a', 2], SVector.([(i/10,) for i = 1:9])
(SVector{2, Int64}[[1, 1], [1, 2], [2, 1], [2, 2]], Any['a', 2], 1:9)
julia> r = DimArray(rand(4, 2, 9), (x = outs_x, y = outs_y, z = outs_z)); # works
julia > r # showing `r` fails
NamedArray
works as expected:
julia> NamedArray(rand(4, 2, 9), (outs_x, outs_y, outs_z))
4×2×9 Named Array{Float64, 3}
[:, :, C=[0.1]] =
A ╲ B │ a 2
───────┼───────────────────
[1, 1] │ 0.332165 0.535625
⋮ ⋮ ⋮
[2, 2] │ 0.27622 0.397477
[:, :, C=[0.2]] =
A ╲ B │ a 2
───────┼───────────────────
[1, 1] │ 0.536456 0.708727
⋮ ⋮ ⋮
[2, 2] │ 0.993424 0.555581
[:, :, C=[0.3]] =
A ╲ B │ a 2
───────┼───────────────────
[1, 1] │ 0.435666 0.834049
⋮ ⋮ ⋮
[2, 2] │ 0.301785 0.625569
[:, :, C=[0.4]] =
A ╲ B │ a 2
───────┼───────────────────
[1, 1] │ 0.926434 0.626023
⋮ ⋮ ⋮
[2, 2] │ 0.41063 0.227371
[:, :, C=[0.5]] =
A ╲ B │ a 2
───────┼───────────────────
[1, 1] │ 0.35522 0.373788
⋮ ⋮ ⋮
[2, 2] │ 0.716308 0.678094
⋮
what's the error message?
julia> r
4×2×9 DimArray{Float64,3} with dimensions:
Dim{:x} Sampled{SVector{2, Int64}} SVector{2, Int64}[[1, 1], [1, 2], [2, 1], [2, 2]] ForwardOrdered Irregular Points,
Dim{:y} Categorical{Any} Any['a', 2] Unordered,
Dim{:z} Sampled{Int64} 1:9 ForwardOrdered Regular Points
[:, :, 1]
'a' 2
Error showing value of type DimArray{Float64, 3, Tuple{Dim{:x, DimensionalData.Dimensions.LookupArrays.Sampled{SVector{2, Int64}, Vector{SVector{2, Int64}}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Irregular{Tuple{Nothing, Nothing}}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:y, DimensionalData.Dimensions.LookupArrays.Categorical{Any, Vector{Any}, DimensionalData.Dimensions.LookupArrays.Unordered, DimensionalData.Dimensions.LookupArrays.NoMetadata}}, Dim{:z, DimensionalData.Dimensions.LookupArrays.Sampled{Int64, UnitRange{Int64}, DimensionalData.Dimensions.LookupArrays.ForwardOrdered, DimensionalData.Dimensions.LookupArrays.Regular{Int64}, DimensionalData.Dimensions.LookupArrays.Points, DimensionalData.Dimensions.LookupArrays.NoMetadata}}}, Tuple{}, Array{Float64, 3}, DimensionalData.NoName, DimensionalData.Dimensions.LookupArrays.NoMetadata}:
ERROR: MethodError: no method matching iterate(::DimensionalData.ShowWith)
Closest candidates are:
iterate(::AbstractString, ::Integer)
@ Base strings/basic.jl:157
iterate(::Base.RegexMatchIterator)
@ Base regex.jl:686
iterate(::Base.RegexMatchIterator, ::Any)
@ Base regex.jl:686
...
Stacktrace:
[1] Stateful
@ Base ./iterators.jl:1412 [inlined]
[2] escape_string(io::IOContext{IOBuffer}, s::DimensionalData.ShowWith, esc::Tuple{Char, Char}; keep::Tuple{})
@ Base ./strings/io.jl:407
[3] escape_string(io::IOContext{IOBuffer}, s::DimensionalData.ShowWith, esc::Tuple{Char, Char})
@ Base ./strings/io.jl:406
[4] print_quoted(io::IOContext{IOBuffer}, s::DimensionalData.ShowWith)
@ Base ./strings/io.jl:440
[5] show(io::IOContext{IOBuffer}, s::DimensionalData.ShowWith)
@ Base ./strings/io.jl:197
[6] sprint(f::Function, args::DimensionalData.ShowWith; context::IOContext{Base.TTY}, sizehint::Int64)
@ Base ./strings/io.jl:112
[7] print_matrix_row(io::IOContext{…}, X::AbstractVecOrMat, A::Vector{…}, i::Int64, cols::Vector{…}, sep::String, idxlast::Int64)
@ Base ./arrayshow.jl:112
[8] _print_matrix(io::IOContext{…}, X::AbstractVecOrMat, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64, rowsA::UnitRange{…}, colsA::UnitRange{…})
@ Base ./arrayshow.jl:213
[9] print_matrix(io::IOContext{…}, X::Matrix{…}, pre::String, sep::String, post::String, hdots::String, vdots::String, ddots::String, hmod::Int64, vmod::Int64)
@ Base ./arrayshow.jl:171
[10]
@ Base ./arrayshow.jl:171 [inlined]
[11] _print_matrix(io::IOContext{…}, A::SubArray{…}, lookups::Tuple{…})
@ DimensionalData ~/.julia/packages/DimensionalData/4TpBG/src/array/show.jl:125
[12] print_array(io::IOContext{…}, mime::MIME{…}, A::DimArray{…})
@ DimensionalData ~/.julia/packages/DimensionalData/4TpBG/src/array/show.jl:52
[13] show_after
@ DimensionalData ~/.julia/packages/DimensionalData/4TpBG/src/array/show.jl:35 [inlined]
[14] show(io::IOContext{…}, mime::MIME{…}, A::DimArray{…})
@ DimensionalData ~/.julia/packages/DimensionalData/4TpBG/src/array/show.jl:27
[15] (::OhMyREPL.var"#15#16"{REPL.REPLDisplay{…}, MIME{…}, Base.RefValue{…}})(io::IOContext{Base.TTY})
@ OhMyREPL ~/.julia/packages/OhMyREPL/h1QCu/src/output_prompt_overwrite.jl:23
[16] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
@ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:554
[17] display
@ REPL ~/.julia/packages/OhMyREPL/h1QCu/src/output_prompt_overwrite.jl:6 [inlined]
[18] display(d::REPL.REPLDisplay, mime::MIME{Symbol("text/plain")}, x::Any)
@ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:278 [inlined]
[19] display(x::Any)
@ Base.Multimedia ./multimedia.jl:340
[20] print_response(errio::IO, response::Any, show_value::Bool, have_color::Bool, specialdisplay::Union{Nothing, AbstractDisplay})
@ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:0
[21] (::REPL.var"#57#58"{REPL.LineEditREPL, Pair{Any, Bool}, Bool, Bool})(io::Any)
@ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:284
[22] with_repl_linfo(f::Any, repl::REPL.LineEditREPL)
@ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:554
[23] print_response(repl::REPL.AbstractREPL, response::Any, show_value::Bool, have_color::Bool)
@ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:282
[24] (::REPL.var"#do_respond#80"{…})(s::REPL.LineEdit.MIState, buf::Any, ok::Bool)
@ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:896
[25] #invokelatest#2
@ Base ./essentials.jl:887 [inlined]
[26] invokelatest
@ Base ./essentials.jl:884 [inlined]
[27] run_interface(terminal::REPL.Terminals.TextTerminal, m::REPL.LineEdit.ModalInterface, s::REPL.LineEdit.MIState)
@ REPL.LineEdit /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/LineEdit.jl:2656
[28] run_frontend(repl::REPL.LineEditREPL, backend::REPL.REPLBackendRef)
@ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:1297
[29] (::REPL.var"#62#68"{REPL.LineEditREPL, REPL.REPLBackendRef})()
@ REPL /Applications/Julia-1.10.app/Contents/Resources/julia/share/julia/stdlib/v1.10/REPL/src/REPL.jl:373
Some type information was truncated. Use `show(err)` to see complete types.
yeah bug in the package, probably open an issue there
Follow the issue here: https://github.com/rafaqz/DimensionalData.jl/issues/525
If going for some sort of named array in favor or counts_and_outcomes
/probabilities_and_outcomes
, there's going to a trade-off between using NamedArrays
and DimensionalData
. The first is printing.
DimArray
prints terser outputs (just showing a single slice of the higher-dimensinal array), while NamedArray
shows multiple, very-nicely-summarized slices. DimArray
lacks the convenient dimension-name indicator in the top-left corner that NamedArray
has. Showing the dimension names like that is extremely common in the world of contingency tables, so here I favor NamedArray
.DimArray
doesn't print lines, NamedArray
does. I'm not sure which alternative I like the best. julia> DimArray(rand(X(['a', 'b', 'c']), Y(2:6), Z([1, 2, 6, 7])))
3×5×4 DimArray{Float64,3} with dimensions:
X Categorical{Char} Char['a', 'b', 'c'] ForwardOrdered,
Y Sampled{Int64} 2:6 ForwardOrdered Regular Points,
Z Sampled{Int64} Int64[1, 2, 6, 7] ForwardOrdered Irregular Points
[:, :, 1]
2 3 4 5 6
'a' 0.976237 0.498344 0.560187 0.419395 0.660773
'b' 0.961241 0.606697 0.193405 0.781055 0.59499
'c' 0.76248 0.784371 0.165449 0.617995 0.530311
[and 3 more slices...]
julia> NamedArray(rand(3, 5, 4), (['a', 'b', 'c'], 2:6, [1, 2, 6, 7]))
3×5×4 Named Array{Float64, 3}
[:, :, C=1] =
A ╲ B │ 2 3 4 5 6
──────┼──────────────────────────────────────────────────────
a │ 0.221985 0.443299 0.172597 0.297715 0.250977
b │ 0.782418 0.134765 0.639052 0.0848428 0.173058
c │ 0.241489 0.348311 0.981556 0.299584 0.884155
[:, :, C=2] =
A ╲ B │ 2 3 4 5 6
──────┼─────────────────────────────────────────────────
a │ 0.570002 0.638562 0.89299 0.304665 0.709846
b │ 0.981045 0.251941 0.840485 0.923528 0.841089
c │ 0.392437 0.828573 0.759668 0.933569 0.360121
[:, :, C=6] =
A ╲ B │ 2 3 4 5 6
──────┼──────────────────────────────────────────────────────
a │ 0.32179 0.906445 0.720825 0.972123 0.410204
b │ 0.516054 0.162833 0.24036 0.671541 0.0449559
c │ 0.410153 0.375167 0.244368 0.0386752 0.740858
[:, :, C=7] =
A ╲ B │ 2 3 4 5 6
──────┼─────────────────────────────────────────────────
a │ 0.541841 0.814869 0.795814 0.160063 0.93049
b │ 0.260391 0.177302 0.21647 0.577456 0.955054
c │ 0.871144 0.233323 0.289378 0.013904 0.584646
Sure, named array is fine I guess. We should really try to use these data types as little as possible though.
I just did some performance testing too. Post will follow in a few minutes.
# simulate computing some three-variable probability-based measure, e.g. cmi
# this typically involves summing over certain dimensions multiple times, and just
# simple counting
function test_sum(c::AbstractArray{T, 3}) where {T}
cmi = 0.0
ds = size(c)
@inbounds for i = 1:ds[1]
tmp = 0.0
for j = 1:ds[2]
for k = 1:ds[3]
cmi += c[i, j, k]
end
tmp = @views sum(c[i, j, :])
end
cmi += tmp
end
return cmi
end
# Simulate the worst-possible scenario where at least one marginal has element type `Any`
# (due to e.g. using `UniqueElements` on some poorly preprocessed data)
julia> outs_x, outs_y, outs_z = SVector.([(1, 1), (1, 2), (2, 1), (2,2)]), ['a', 2, 'c', 'd', 'e'], SVector.([(i/10,) for i = 1:9]);
julia> using StableRNGs; rng = StableRNG(1234); x = rand(rng, 4, 5, 9);
julia> d = DimArray(x, (x = outs_x, y = outs_y, z = outs_z));
julia> n = NamedArray(x, (outs_x, outs_y, outs_z));
julia> julia> test_sum(n); @btime test_sum($n)
75.042 μs (680 allocations: 52.19 KiB)
103.44971515371522
julia> test_sum(d); @btime test_sum($d)
188.517 ns (0 allocations: 0 bytes)
103.44971515371522
# EDIT: for regular arrays
julia> a = Array(d); test_sum(a); @btime test_sum($a)
179.107 ns (0 allocations: 0 bytes)
103.44971515371522
For this (relatively realistic) use case there's a ~500 times performance benefit to using DimArray
. I think this fact should trump any printing preferences.
See my edit of the performance test above: using DimArray
is essentially as performant as using Array
, so using DimArray
for Counts
or Probabilities
should be fine.
Yes, this is well known, sorry, I should have told you: DimArray package has a miniscule overhead of one a couple more function calls when it comes to array operations. It is a strongly typed package. It really is "just arrays" wrapped in a dimensional representation. But wow, I didn't know NamedArrays was so bad.
well, anyways, it is clear that we will be using DimArray. This is convenient for me, I use the same package in ClimateBase.jl.
however, I think contigency table calculations shold not be in this package, but rather in causality tools. I mean: the Counts type still is here, but there isn't an API-based way to obrtain multi-dimensional counts.
well, anyways, it is clear that we will be using DimArray. This is convenient for me, I use the same package in ClimateBase.jl. however, I think contigency table calculations shold not be in this package, but rather in causality tools. I mean: the Counts type still is here, but there isn't an API-based way to obtain multi-dimensional counts.
Do we agree on the following summary?
Counts{T, N}
is defined in ComplexityMeasures.jl . It is just a simple wrapper of AbstractArray{<:Int, N}
. Probabilities{T, N}
is defined in ComplexityMeasures.jl (with no changes), and is also used as-is in CausalityTools.jlCounts
and Probabilities
are defined as in comment https://github.com/JuliaDynamics/ComplexityMeasures.jl/issues/314#issuecomment-1696458053counts
and probabilities
for one-dimensional data.counts
and probabilities
for multi-dimensional data.counts_and_outcomes
and probabilities_and_outcomes
are dropped in ComplexityMeasures.jl in favor of using DimArray
s where the outcome are the marginal labels.Sure, but the wrapped array should be DimensionalArray, not just Array.
* `counts_and_outcomes` and `probabilities_and_outcomes` are dropped in ComplexityMeasures.jl in favor of using `DimArray`s where the outcome are the marginal labels.
They should be deprecated, not dropped. A generic deprecated function definition should firsst throw a wraning and then return probabilities(..), outcomes(..)
.
Sure, but the wrapped array should be DimensionalArray, not just Array. They should be deprecated, not dropped. A generic deprecated function definition should firsst throw a wraning and then return probabilities(..), outcomes(..).
Yep, you're right about both!
I'll then implement these changes as part of solving #313, since this anyways require some major refactoring, and I don't want to refactor twice.
@Datseris, What's your preference on how the Counts
/Probabilities
are shown? Here are two alternatives:
5×5 Counts{Int, 2}:
5 6 7 8 9
1 72 3 34 56 76
2 50 34 93 36 10
3 85 69 83 43 70
4 66 98 77 98 78
5 94 82 74 28 86
or
Counts, represented as 5×5 DimArray{Int64,2} with dimensions:
Dim{:x} Sampled{Int64} 1:5 ForwardOrdered Regular Points,
Dim{:y} Sampled{Int64} 5:9 ForwardOrdered Regular Points
5 6 7 8 9
1 72 3 34 56 76
2 50 34 93 36 10
3 85 69 83 43 70
4 66 98 77 98 78
5 94 82 74 28 86
The formatting is lost here; the marginal symbols are colored in a darker color than the matrix entries.
DimArray
s displays some information that I think is unnecessary for us to care about (what type of internal representation the marginal labels have in DimensionalData.jl). We define in OutcomeSpace
/Encoding
what the type of the outcomes is, so I think it is redundant to show this information. I therefore favour the first alternative.
first option please!
In CausalityTools, my current implementation of
ContingencyTable
usesNamedArray
from NamedArrays.jl to store the counts. This allows naming the the indices of the count array.We can then do use either regular indices or outcomes to index the counts.
IMO this this is far easier to work with than needing to call
outcomes
separately and manually matching the indices with the outcomes. The named marginals make it immediately clear what is counted. There is also no performance penalties when it comes to computing with the named arrays - just some extra memory due to allocating the marginal names upon construction.More generally
This also leads me to think that perhaps we can implement the
Counts
type too, analogous toProbabilities
.Counts{N}
-> anN
-dimensional counts arrayProbabilities{N}
-> anN
-dimensional probabilities array, which is computed from aCounts
instance using someProbabilitiesEstimator
.Defining these two types in ComplexityMeasures.jl completely eliminates the need for custom
ContingencyTable
(counts) andContingencyMatrix
(probabilities) upstream, becauseContingencyTable
is just a 2-or-higher-dimensionalCounts
array, andContingencyMatrix
is just a 2-or-higher-dimensionalProbabilities
.