Simplify MDP and missing outcomes

JuliaDynamics / ComplexityMeasures.jl

Estimators for probabilities, entropies, and other complexity measures derived from data in the context of nonlinear dynamics and complex systems

MIT License

48 stars 11 forks source link

Simplify MDP and missing outcomes #386

Closed Datseris closed 5 months ago

Datseris commented 5 months ago

This PR simplifies the docstring and source codes of missing dispersion patterns and missing outcomes.

It also sets the all keyword of missing_outomces to false. Why would this be true? It is computationally more expensive and leads to the same outcome. Furthermore, I would even argue, why would all ever be used? This keyword appears useless at first glance. Perhaps we should deprecate it?

Datseris commented 5 months ago

Now that we have finally landed 3.0, we should strive for minimal PRs (one feature, one bug fix, one change, etc) to simplify reviewing. Therefore: deprecating should be dealt with in a separate PR. Perhaps you could open an issue to discuss a potential deprecation there?

This should happen in this PR I think, it results in 0 additional file changes. It doesn't make sense to make 1 PR to change all and another to deprecate it... Just imagine that this PR deprecates all, it doesn't change it.

kahaaga commented 5 months ago

This should happen in this PR I think, it results in 0 additional file changes. It doesn't make sense to make 1 PR to change all and another to deprecate it... Just imagine that this PR deprecates all, it doesn't change it.

Sure, we can do that.

But before deprecating, I just need to understand why did we introduce the all keyword to begin with, which I'm struggling with. When using allprobabilities and probabilities, we just compare the observed outcomes to all possible outcomes? We do the same for both cases?

kahaaga commented 5 months ago

Ah, I think I remember now. Some ProbabilitiesEstimators have the property that the probabilities assigned depend on the number of outcomes considered. So using probabilities and allprobabilities will yield different results.

See the comment in the BayesianRegularization source code:

# We need to implement `probabilities_and_outcomes` and `allprobabilities_and_outcomes` separately,
# because the number of elements in the outcome space determines the factor `A`, since
# A = sum(aₖ). Explicitly modelling the entire outcome space, instead of considering
# only the observed outcomes, will therefore affect the estimated probabilities.
function probabilities_and_outcomes(est::BayesianRegularization, outcomemodel::OutcomeSpace, x)
    ... 
end

Since these estimators may assign different probabilities (and hence a different number of observed outcomes, in the sense that they get assigned a non-zero probability) depending on whether allprobabilities or probabilities is used, the all keyword is a necessary parameter.

Datseris commented 5 months ago

Right, yes. But isn't this all wrong in the first place? missing_outcomes shouldn't be about the probabilities, but the counts. It should only be compatible with count based outcome spaces and use counts. Whether an outcome exists in the data or not is not a matter of regularization, at least not for discrete outcome spaces.

Given what you said I think the implementation of missing_outcomes is incorrect and needs to be fixed by using only count based outcome spaces and the counts function.

For non-count outcome spaces it can use probabilities internally, but not probability estimators per se.

kahaaga commented 5 months ago

Right, yes. But isn't this all wrong in the first place? missing_outcomes shouldn't be about the probabilities, but the counts. It should only be compatible with count based outcome spaces and use counts. Whether an outcome exists in the data or not is not a matter of regularization, at least not for discrete outcome spaces. Given what you said I think the implementation of missing_outcomes is incorrect and needs to be fixed by using only count based outcome spaces and the counts function. For non-count outcome spaces it can use probabilities internally, but not probability estimators per se.

Without having done a deep dive into the literature, I think some of this is new territory.

I think the correct approach here would be to have two separate functions:

missing_outcomes(::OutcomeSpace, x). Counts the number of missing outcomes, in terms of whether they occur in the encoded data or not.
missing_probabilities(::ProbabilitiesEstimator, ::OutcomeSpace, x. Counts the number of missing missing, in terms of whether the outcome gets assigned a zero-probability given the ProbabilitiesEstimator. I don't think this has been done before, but it should definitely be possible to do. We can mention it as one of the ways this software can accelerate research I the paper, even

Datseris commented 5 months ago

Alright, I will add this deprecation in this PR because this PR started dealing with the ambiguity of all. The probabilities version will always use the probabilties, no more allprobabilities.

kahaaga commented 5 months ago

Alright, I will add this deprecation in this PR because this PR started dealing with the ambiguity of all. The probabilities version will always use the probabilties, no more allprobabilities.

Ok!

codecov[bot] commented 5 months ago

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (dc0237b) 88.68% compared to head (2e2ab0f) 88.63%.

Files	Patch %	Lines
src/core/probabilities.jl	91.66%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #386 +/- ## ========================================== - Coverage 88.68% 88.63% -0.05% ========================================== Files 80 80 Lines 2359 2358 -1 ========================================== - Hits 2092 2090 -2 - Misses 267 268 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.