Closed kahaaga closed 6 months ago
Attention: 13 lines
in your changes are missing coverage. Please review.
Comparison is base (
ecd13a7
) 87.71% compared to head (9b0b0c2
) 88.73%.
Files | Patch % | Lines |
---|---|---|
src/core/outcome.jl | 50.00% | 6 Missing :warning: |
src/core/counts.jl | 92.30% | 3 Missing :warning: |
src/core/probabilities.jl | 93.18% | 3 Missing :warning: |
.../spatial/spatial_ordinal/SpatialOrdinalPatterns.jl | 83.33% | 1 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
i think in this PR we should also make it possible to initialize probabilities without outcomes. In this scenario the outcomes become just the integers corresponding to the probabilities, the same as eachindex(p)
. So that users can still use the interface with having alreayd probabilities from elsewhere.
i think in this PR we should also make it possible to initialize probabilities without outcomes. In this scenario the outcomes become just the integers corresponding to the probabilities, the same as eachindex(p). So that users can still use the interface with having alreayd probabilities from elsewhere.
Yep - I'm already working on it.
The nice thing about using integers as the outcomes when no outcomes are specified is that we don't actually need to allocate n_outcome
-length vectors for the outcomes on initialization of the Counts
or Probabilities
. We simply initialize them as ranges 1:n_outcomes
, which are collect
ed only when needed: either when printing, or in allprobabilities
, where we need to sort them (which should be quite rare).
This addresses the performance issue mentioned in a comment you made in #326 about initializing millions of instances of Probabilities
.
I've created a small wrapper type around the integers to simplify pretty printing (we want it to be clear when printing whether these outcomes are actual integers, or if they've been generically assigned). But lacking a clear extendable API as for AbstractArray
, it is a bit more tricky to extend the integers than it is to extend an array (not such which methods to extent to allow the necessary behavior). So I might go for just using integers anyways to reduce the work needed.
@Datseris, would it be okay to have the following behaviour?
counts(o::OutcomeSpace, x)
returns counts whose outcomes are the generic range 1:n_outcomes
(to save computation time)counts_and_outcomes(o::OutcomeSpace, x)
returns counts whose outcomes are explicitly computed from the data (slightly slower due to the extra mapping from data space to outcome space).The behaviour would be analogous for probabilities
/probabilities_and_outcomes
.
This way, the user can decide whether to use the fast or slower version, depending on whether they need the outcomes or not.
Except the pretty printing, I'm done here according to the specification in my latest comment. I'll hold off further work here until #336 is merged to avoid ballooning the merge conflict too much.
@Datseris This is ready for review (but should not be merged until my comment below is addressed). The diff is quite large, since using Documenter v1 now errors on what user to only give warnings during the doc build. I've fixed every warning and every error in this PR.
Comments
https://github.com/JuliaDynamics/doctheme/
. This should be updated when https://github.com/JuliaDynamics/doctheme/pull/3 is merged.Note: I put the multiscale stuff temporarily in a .txt
file, so that Documenter doesn't include it in the doc build.
@Datseris, would it be okay to have the following behaviour?
* `counts(o::OutcomeSpace, x)` returns counts whose outcomes are the generic range `1:n_outcomes` (to save computation time) * `counts_and_outcomes(o::OutcomeSpace, x)` returns counts whose outcomes are explicitly computed from the data (slightly slower due to the extra mapping from data space to outcome space).
The behaviour would be analogous for
probabilities
/probabilities_and_outcomes
.This way, the user can decide whether to use the fast or slower version, depending on whether they need the outcomes or not.
This would create a huge amount of code change though, right? What's the status quo now? probabilities/counts
always estimates the outcomes, and stores them in the returned type, right?
I guess your proposal would make some methods faster by skipping the outcomes. But it would require us to write every singe function twice, once for probabilities, once for _and_outcomes. Realisticially, how many methods actually benefit from this by becoming faster...?
What's the status quo now? probabilities/counts always estimates the outcomes, and stores them in the returned type, right?
The current situation in this PR is that probabilities
/counts
estimates outcomes if they are near-free to compute. If the outcomes are expensive to compute, e.g. for ValueBinning
, generic integer outcomes are assigned.
Currently, in order to guarantee that you get the proper outcomes, you have to use counts_and_outcomes
and probabilities_and_outcomes
. counts
/probabilities
may skip computing the outcomes if it is expensive to do so.
This would create a huge amount of code change though, right?
Compared to the changes we've made earlier, this wasn't too big of a change. Anyways: it's already done in this proposal.
I guess your proposal would make some methods faster by skipping the outcomes.
Yes, this is a solution to a comment you had about wanting to compute probabilities from binnings without computing the outcomes. For many data points and many outcomes, the effect can be quite huge in terms of computation time.
But it would require us to write every singe function twice, once for probabilities, once for _and_outcomes. Realisticially, how many methods actually benefit from this by becoming faster...?
We only need two generic methods:
function probabilities_and_outcomes(o::OutcomeSpace, x)
probs = probabilities(o, x)
return probs, outcomes(probs)
end
function probabilities_and_outcomes(o::CountBasedOutcomeSpace, x)
cts, outs = counts_and_outcomes(o, x)
probs = Probabilities(cts, outs)
return probs, outcomes(probs)
end
Some methods override it, but we don't have that many outcome spaces, so I'm fine with a few lines of repeated code in those cases.
Yes, this is a solution to a comment you had about wanting to compute probabilities from binnings without computing the outcomes. For many data points and many outcomes, the effect can be quite huge in terms of computation time
I am sure this was overstated, computing the bins has negligible performance cost. Morever, the bins are anyways computed by the way of computing the histogram. that's what the fasthist
function does.
We only need two generic methods:
function probabilities_and_outcomes(o::OutcomeSpace, x) probs = probabilities(o, x) return probs, outcomes(probs) end function probabilities_and_outcomes(o::CountBasedOutcomeSpace, x) cts, outs = counts_and_outcomes(o, x) probs = Probabilities(cts, outs) return probs, outcomes(probs) end
I don't understand why the second method exists. Shouldn't the probabilities
function already cover whatever outcome space it gets?
Some methods override it, but we don't have that many outcome spaces, so I'm fine with a few lines of repeated code in those cases.
I still don't understand what is the status quo of current source code and there is no way I can go through 50+ file changes. What is the default function that we extend in the source? is it probabilities
? or probabilities_or_outcomes
? which methods ovverride it that you are referring above, so I can look at their source file specifically?
I don't understand why the second method exists. Shouldn't the probabilities function already cover whatever outcome space it gets?
It exists because Probabilities
can wrap an array of any dimension N
. If p isa Probabilities
, then p.outcomes
contains an N
-tuple of outcomes. This means that the outcomes are not returned directly, but the tuple is.
Here, outcomes
has a method that dispatches on Probabilities{T, 1}
that ensures that outcomes are returned directly if working with 1D probabilities (which is always the case in this package, but not in CausalityTools).
Morever, the bins are anyways computed by the way of computing the histogram. that's what the fasthist function does.
The bins are computed when computing the histogram, but the decoded bins are not (which is what the user wants).
what is the status quo of current source code
The changes I have proposed above is implemented here now, regarding:
counts
/probabilities
/counts_and_outcomes
/probabilities_and_outcomes
is it probabilities? or probabilities_or_outcomes? which methods ovverride it that you are referring above, so I can look at their source file specifically?
The files to look at is:
core/counts.jl
core/probabilities.jl
outcome_spaces/
- small changes each fileThe remainder of the PR is fixing references, cross-references, exports and other things that Documenter complained about.
which methods override it that you are referring above
I'll have to log off for tonight, I can give a list of concrete methods tomorrow.
@kahaaga I'm reviewing now and improving docs when possible. One thing I am not happy with is the naming of the dimensions. I don't like that they use x
which is the symbol we use universally for input data. If anything they should use o
.
@kahaaga I am still not totally sure how the source code is set up. Here is my current understanding:
By default, all outcome spaces should extend counts_and_outcomes
(or probabilities_and
if they are not count based). If the outcomes do not come for free, then instead they can extend counts
and then explicitly add another method for counts_and_outcomes
that calls counts
first and then decodes the outcomes. In this way, we in fact need only one generic fallback method:
function probabilities(o::OutcomeSpace, x)
return probabilities_and_outcomes(o, x)[1]
end
Yet. that's not what I see in the source code. E.g., for ordinal pattertns we have:
function counts(est::OrdinalPatterns{m}, x) where m
cts, πs = counts_and_symbols(est, x)
outcomes = map(ω -> decode(est.encoding, ω), unique!(πs))
return Counts(cts, outcomes)
end
so counts also computes the outcomes, while I think your whole purpose of doing this refactoring was to allow for the "faster" methods that do not compute outcomes.
Can you please confirm if my understanding is correct and you simply forgot about the ordinal patterns, or if my understanding is incorrect in which case then please explain to me...?
Can you please confirm if my understanding is correct and you simply forgot about the ordinal patterns, or if my understanding is incorrect in which case then please explain to me...?
You understanding is correct. I simply forgot about the ordinal patterns, or messed up the thought process while implementing it. For the ordinal outcome spaces it should be
EDIT: fixed the latter method
# Generic outcomes.
function counts(est::OrdinalPatterns{m}, x) where m
cts, πs = counts_and_symbols(est, x)
return Counts(cts, Outcome(1):1:Outcome(length(cts)))
end
# Decoded outcomes (expensive, so we override `counts_and_outcomes`).
function counts_and_outcomes(est::OrdinalPatterns{m}, x) where m
cts, πs = counts_and_symbols(est, x)
outcomes = map(ω -> decode(est.encoding, ω), unique!(πs))
c = Counts(cts, outcomes)
return c, outcomes(c)
end
your whole purpose of doing this refactoring was to allow for the "faster" methods that do not compute outcomes.
Yes, that is correct. I am doing some minor refactoring again to get everything aligned with your comment, which is how we should do it
ok so I wait for your push on this branch, right?
ok so I wait for your push on this branch, right?
Yes, I'll make a few more commits here
I have to do some other things until tonight. What remains here that I can think of is:
counts
/counts_and_probabilities
and the same for the probabilities variants since my changes. EDIT: doneCounts
/Probabilities
that ensures that the number of outcomes along each dimension is the same as the size of the input array along that dimension (each count/probability must have a 1:1-mapping to an outcome).@Datseris I'm done here for now.
@kahaaga pretty printing is not yet implemented, right? I get
julia> probs, outs = probabilities_and_outcomes(PowerSpectrum(), x);
julia> probs
501-element Probabilities{Float64,1}:
0.0029193965432375087
0.0031064587636179856
0.0018789482535108155
0.0002387955608234389
0.001257342758101825
julia> probs, outs = probabilities_and_outcomes(PowerSpectrum(), x);
julia> probs 501-element Probabilities{Float64,1}: 0.0029193965432375087 0.0031064587636179856 0.0018789482535108155 0.0002387955608234389 0.001257342758101825
Correct. Everything related to outcomes works, but pretty printing is not implemented. It will be based on the file that is now commented out called print_counts_probs.jl
, so don't delete it for now. I'll deal with it in a separate PR.
I think this is fantastic. The tutorial is a bit rough around the edges, but I will improve it in a different PR.
Yes, we deal with anything that is not related to getting rid of DimensionalData and getting Probabilities
/Counts
to work in separate PRs.
Okay, then let's merge this, I can work in a tutorial PR in parallel. There is one thing that I didn't fully undersatnf. What do these methods do:
function outcomes(c::Counts{<:Integer, N}) where N
return map(i -> c.outcomes[i], tuple(1:N...))
end
function outcomes(c::Counts{<:Integer, N}, idxs) where N
map(i -> c.outcomes[i], tuple(idxs...))
end
the first method, why doesn't it just returl c.outcomes
?
Fixes #326.
Changes
outcomes
is now explicitly a field for bothCounts
andProbabilities
.counts_and_outcomes
andprobabilities_and_outcomes
returns a tuple of the counts/probabilities and the outcomes, for backwards compatibility (strictly speaking, this isn't necessary now, since we explicitly store the outcomes)Counts(cts, outs)
andProbabilities(cts, outs)
. The public constructor isCounts(cts)
andProbabilities(p)
, which generically names the outcomes:outcome1, outcome2, ...
for each dimension.TODO