JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems
MIT License
48 stars 11 forks source link

Remove DimensionalData.jl dep #341

Closed kahaaga closed 6 months ago

kahaaga commented 6 months ago

Fixes #326.

Changes

TODO

codecov[bot] commented 6 months ago

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (ecd13a7) 87.71% compared to head (9b0b0c2) 88.73%.

Files Patch % Lines
src/core/outcome.jl 50.00% 6 Missing :warning:
src/core/counts.jl 92.30% 3 Missing :warning:
src/core/probabilities.jl 93.18% 3 Missing :warning:
.../spatial/spatial_ordinal/SpatialOrdinalPatterns.jl 83.33% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #341 +/- ## ========================================== + Coverage 87.71% 88.73% +1.02% ========================================== Files 78 79 +1 Lines 2019 2078 +59 ========================================== + Hits 1771 1844 +73 + Misses 248 234 -14 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Datseris commented 6 months ago

i think in this PR we should also make it possible to initialize probabilities without outcomes. In this scenario the outcomes become just the integers corresponding to the probabilities, the same as eachindex(p). So that users can still use the interface with having alreayd probabilities from elsewhere.

kahaaga commented 6 months ago

i think in this PR we should also make it possible to initialize probabilities without outcomes. In this scenario the outcomes become just the integers corresponding to the probabilities, the same as eachindex(p). So that users can still use the interface with having alreayd probabilities from elsewhere.

Yep - I'm already working on it.

The nice thing about using integers as the outcomes when no outcomes are specified is that we don't actually need to allocate n_outcome-length vectors for the outcomes on initialization of the Counts or Probabilities. We simply initialize them as ranges 1:n_outcomes, which are collected only when needed: either when printing, or in allprobabilities, where we need to sort them (which should be quite rare).

This addresses the performance issue mentioned in a comment you made in #326 about initializing millions of instances of Probabilities.

I've created a small wrapper type around the integers to simplify pretty printing (we want it to be clear when printing whether these outcomes are actual integers, or if they've been generically assigned). But lacking a clear extendable API as for AbstractArray, it is a bit more tricky to extend the integers than it is to extend an array (not such which methods to extent to allow the necessary behavior). So I might go for just using integers anyways to reduce the work needed.

kahaaga commented 6 months ago

@Datseris, would it be okay to have the following behaviour?

The behaviour would be analogous for probabilities/probabilities_and_outcomes.

This way, the user can decide whether to use the fast or slower version, depending on whether they need the outcomes or not.

kahaaga commented 6 months ago

Except the pretty printing, I'm done here according to the specification in my latest comment. I'll hold off further work here until #336 is merged to avoid ballooning the merge conflict too much.

kahaaga commented 6 months ago

@Datseris This is ready for review (but should not be merged until my comment below is addressed). The diff is quite large, since using Documenter v1 now errors on what user to only give warnings during the doc build. I've fixed every warning and every error in this PR.

Comments

kahaaga commented 6 months ago

Note: I put the multiscale stuff temporarily in a .txt file, so that Documenter doesn't include it in the doc build.

Datseris commented 6 months ago

@Datseris, would it be okay to have the following behaviour?

* `counts(o::OutcomeSpace, x)` returns counts whose outcomes are the generic range `1:n_outcomes` (to save computation time)

* `counts_and_outcomes(o::OutcomeSpace, x)` returns counts whose outcomes are explicitly computed from the data (slightly slower due to the extra mapping from data space to outcome space).

The behaviour would be analogous for probabilities/probabilities_and_outcomes.

This way, the user can decide whether to use the fast or slower version, depending on whether they need the outcomes or not.

This would create a huge amount of code change though, right? What's the status quo now? probabilities/counts always estimates the outcomes, and stores them in the returned type, right?

Datseris commented 6 months ago

I guess your proposal would make some methods faster by skipping the outcomes. But it would require us to write every singe function twice, once for probabilities, once for _and_outcomes. Realisticially, how many methods actually benefit from this by becoming faster...?

kahaaga commented 6 months ago

What's the status quo now? probabilities/counts always estimates the outcomes, and stores them in the returned type, right?

The current situation in this PR is that probabilities/counts estimates outcomes if they are near-free to compute. If the outcomes are expensive to compute, e.g. for ValueBinning, generic integer outcomes are assigned.

Currently, in order to guarantee that you get the proper outcomes, you have to use counts_and_outcomes and probabilities_and_outcomes. counts/probabilities may skip computing the outcomes if it is expensive to do so.

This would create a huge amount of code change though, right?

Compared to the changes we've made earlier, this wasn't too big of a change. Anyways: it's already done in this proposal.

I guess your proposal would make some methods faster by skipping the outcomes.

Yes, this is a solution to a comment you had about wanting to compute probabilities from binnings without computing the outcomes. For many data points and many outcomes, the effect can be quite huge in terms of computation time.

But it would require us to write every singe function twice, once for probabilities, once for _and_outcomes. Realisticially, how many methods actually benefit from this by becoming faster...?

We only need two generic methods:

function probabilities_and_outcomes(o::OutcomeSpace, x)
    probs = probabilities(o, x)
    return probs, outcomes(probs)
end

function probabilities_and_outcomes(o::CountBasedOutcomeSpace, x)
    cts, outs = counts_and_outcomes(o, x)
    probs = Probabilities(cts, outs)
    return probs, outcomes(probs)
end

Some methods override it, but we don't have that many outcome spaces, so I'm fine with a few lines of repeated code in those cases.

Datseris commented 6 months ago

Yes, this is a solution to a comment you had about wanting to compute probabilities from binnings without computing the outcomes. For many data points and many outcomes, the effect can be quite huge in terms of computation time

I am sure this was overstated, computing the bins has negligible performance cost. Morever, the bins are anyways computed by the way of computing the histogram. that's what the fasthist function does.

We only need two generic methods:

function probabilities_and_outcomes(o::OutcomeSpace, x)
    probs = probabilities(o, x)
    return probs, outcomes(probs)
end

function probabilities_and_outcomes(o::CountBasedOutcomeSpace, x)
    cts, outs = counts_and_outcomes(o, x)
    probs = Probabilities(cts, outs)
    return probs, outcomes(probs)
end

I don't understand why the second method exists. Shouldn't the probabilities function already cover whatever outcome space it gets?

Some methods override it, but we don't have that many outcome spaces, so I'm fine with a few lines of repeated code in those cases.

I still don't understand what is the status quo of current source code and there is no way I can go through 50+ file changes. What is the default function that we extend in the source? is it probabilities? or probabilities_or_outcomes? which methods ovverride it that you are referring above, so I can look at their source file specifically?

kahaaga commented 6 months ago

I don't understand why the second method exists. Shouldn't the probabilities function already cover whatever outcome space it gets?

It exists because Probabilities can wrap an array of any dimension N. If p isa Probabilities, then p.outcomes contains an N-tuple of outcomes. This means that the outcomes are not returned directly, but the tuple is.

Here, outcomes has a method that dispatches on Probabilities{T, 1} that ensures that outcomes are returned directly if working with 1D probabilities (which is always the case in this package, but not in CausalityTools).

kahaaga commented 6 months ago

Morever, the bins are anyways computed by the way of computing the histogram. that's what the fasthist function does.

The bins are computed when computing the histogram, but the decoded bins are not (which is what the user wants).

kahaaga commented 6 months ago

what is the status quo of current source code

The changes I have proposed above is implemented here now, regarding:

is it probabilities? or probabilities_or_outcomes? which methods ovverride it that you are referring above, so I can look at their source file specifically?

The files to look at is:

The remainder of the PR is fixing references, cross-references, exports and other things that Documenter complained about.

kahaaga commented 6 months ago

which methods override it that you are referring above

I'll have to log off for tonight, I can give a list of concrete methods tomorrow.

Datseris commented 6 months ago

@kahaaga I'm reviewing now and improving docs when possible. One thing I am not happy with is the naming of the dimensions. I don't like that they use x which is the symbol we use universally for input data. If anything they should use o.

Datseris commented 6 months ago

@kahaaga I am still not totally sure how the source code is set up. Here is my current understanding:

By default, all outcome spaces should extend counts_and_outcomes (or probabilities_and if they are not count based). If the outcomes do not come for free, then instead they can extend counts and then explicitly add another method for counts_and_outcomes that calls counts first and then decodes the outcomes. In this way, we in fact need only one generic fallback method:

function probabilities(o::OutcomeSpace, x)
    return probabilities_and_outcomes(o, x)[1]
end

Yet. that's not what I see in the source code. E.g., for ordinal pattertns we have:


function counts(est::OrdinalPatterns{m}, x) where m
    cts, πs = counts_and_symbols(est, x)
    outcomes = map(ω -> decode(est.encoding, ω), unique!(πs))
    return Counts(cts, outcomes)
end

so counts also computes the outcomes, while I think your whole purpose of doing this refactoring was to allow for the "faster" methods that do not compute outcomes.

Can you please confirm if my understanding is correct and you simply forgot about the ordinal patterns, or if my understanding is incorrect in which case then please explain to me...?

kahaaga commented 6 months ago

Can you please confirm if my understanding is correct and you simply forgot about the ordinal patterns, or if my understanding is incorrect in which case then please explain to me...?

You understanding is correct. I simply forgot about the ordinal patterns, or messed up the thought process while implementing it. For the ordinal outcome spaces it should be

EDIT: fixed the latter method

# Generic outcomes.
function counts(est::OrdinalPatterns{m}, x) where m
    cts, πs = counts_and_symbols(est, x)
    return Counts(cts, Outcome(1):1:Outcome(length(cts)))
end

# Decoded outcomes (expensive, so we override `counts_and_outcomes`).
function counts_and_outcomes(est::OrdinalPatterns{m}, x) where m
    cts, πs = counts_and_symbols(est, x)
    outcomes = map(ω -> decode(est.encoding, ω), unique!(πs))
    c = Counts(cts, outcomes)
    return c, outcomes(c)
end
kahaaga commented 6 months ago

your whole purpose of doing this refactoring was to allow for the "faster" methods that do not compute outcomes.

Yes, that is correct. I am doing some minor refactoring again to get everything aligned with your comment, which is how we should do it

Datseris commented 6 months ago

ok so I wait for your push on this branch, right?

kahaaga commented 6 months ago

ok so I wait for your push on this branch, right?

Yes, I'll make a few more commits here

kahaaga commented 6 months ago

I have to do some other things until tonight. What remains here that I can think of is:

kahaaga commented 6 months ago

@Datseris I'm done here for now.

Datseris commented 6 months ago

@kahaaga pretty printing is not yet implemented, right? I get


julia> probs, outs = probabilities_and_outcomes(PowerSpectrum(), x);

julia> probs
501-element Probabilities{Float64,1}:
 0.0029193965432375087
 0.0031064587636179856
 0.0018789482535108155
 0.0002387955608234389
 0.001257342758101825
kahaaga commented 6 months ago

julia> probs, outs = probabilities_and_outcomes(PowerSpectrum(), x);

julia> probs 501-element Probabilities{Float64,1}: 0.0029193965432375087 0.0031064587636179856 0.0018789482535108155 0.0002387955608234389 0.001257342758101825

Correct. Everything related to outcomes works, but pretty printing is not implemented. It will be based on the file that is now commented out called print_counts_probs.jl, so don't delete it for now. I'll deal with it in a separate PR.

kahaaga commented 6 months ago

I think this is fantastic. The tutorial is a bit rough around the edges, but I will improve it in a different PR.

Yes, we deal with anything that is not related to getting rid of DimensionalData and getting Probabilities/Counts to work in separate PRs.

Datseris commented 6 months ago

Okay, then let's merge this, I can work in a tutorial PR in parallel. There is one thing that I didn't fully undersatnf. What do these methods do:


function outcomes(c::Counts{<:Integer, N}) where N
    return map(i -> c.outcomes[i], tuple(1:N...))
end

function outcomes(c::Counts{<:Integer, N}, idxs) where N
    map(i -> c.outcomes[i], tuple(idxs...))
end

the first method, why doesn't it just returl c.outcomes?