Closed Datseris closed 5 months ago
Now that we have finally landed 3.0, we should strive for minimal PRs (one feature, one bug fix, one change, etc) to simplify reviewing. Therefore: deprecating should be dealt with in a separate PR. Perhaps you could open an issue to discuss a potential deprecation there?
This should happen in this PR I think, it results in 0 additional file changes. It doesn't make sense to make 1 PR to change all
and another to deprecate it... Just imagine that this PR deprecates all
, it doesn't change it.
This should happen in this PR I think, it results in 0 additional file changes. It doesn't make sense to make 1 PR to change all and another to deprecate it... Just imagine that this PR deprecates all, it doesn't change it.
Sure, we can do that.
But before deprecating, I just need to understand why did we introduce the all
keyword to begin with, which I'm struggling with. When using allprobabilities
and probabilities
, we just compare the observed outcomes to all possible outcomes? We do the same for both cases?
Ah, I think I remember now. Some ProbabilitiesEstimator
s have the property that the probabilities assigned depend on the number of outcomes considered. So using probabilities
and allprobabilities
will yield different results.
See the comment in the BayesianRegularization
source code:
# We need to implement `probabilities_and_outcomes` and `allprobabilities_and_outcomes` separately,
# because the number of elements in the outcome space determines the factor `A`, since
# A = sum(aₖ). Explicitly modelling the entire outcome space, instead of considering
# only the observed outcomes, will therefore affect the estimated probabilities.
function probabilities_and_outcomes(est::BayesianRegularization, outcomemodel::OutcomeSpace, x)
...
end
Since these estimators may assign different probabilities (and hence a different number of observed outcomes, in the sense that they get assigned a non-zero probability) depending on whether allprobabilities
or probabilities
is used, the all
keyword is a necessary parameter.
Right, yes. But isn't this all wrong in the first place? missing_outcomes
shouldn't be about the probabilities, but the counts. It should only be compatible with count based outcome spaces and use counts
. Whether an outcome exists in the data or not is not a matter of regularization, at least not for discrete outcome spaces.
Given what you said I think the implementation of missing_outcomes
is incorrect and needs to be fixed by using only count based outcome spaces and the counts
function.
For non-count outcome spaces it can use probabilities
internally, but not probability estimators per se.
Right, yes. But isn't this all wrong in the first place? missing_outcomes shouldn't be about the probabilities, but the counts. It should only be compatible with count based outcome spaces and use counts. Whether an outcome exists in the data or not is not a matter of regularization, at least not for discrete outcome spaces. Given what you said I think the implementation of missing_outcomes is incorrect and needs to be fixed by using only count based outcome spaces and the counts function. For non-count outcome spaces it can use probabilities internally, but not probability estimators per se.
Without having done a deep dive into the literature, I think some of this is new territory.
I think the correct approach here would be to have two separate functions:
missing_outcomes(::OutcomeSpace, x)
. Counts the number of missing outcomes, in terms of whether they occur in the encoded data or not.missing_probabilities(::ProbabilitiesEstimator, ::OutcomeSpace, x
. Counts the number of missing missing, in terms of whether the outcome gets assigned a zero-probability given the ProbabilitiesEstimator
. I don't think this has been done before, but it should definitely be possible to do. We can mention it as one of the ways this software can accelerate research I the paper, evenAlright, I will add this deprecation in this PR because this PR started dealing with the ambiguity of all
. The probabilities version will always use the probabilties
, no more allprobabilities
.
Alright, I will add this deprecation in this PR because this PR started dealing with the ambiguity of all. The probabilities version will always use the probabilties, no more allprobabilities.
Ok!
Attention: 1 lines
in your changes are missing coverage. Please review.
Comparison is base (
dc0237b
) 88.68% compared to head (2e2ab0f
) 88.63%.
Files | Patch % | Lines |
---|---|---|
src/core/probabilities.jl | 91.66% | 1 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
This PR simplifies the docstring and source codes of missing dispersion patterns and missing outcomes.
It also sets the
all
keyword ofmissing_outomces
tofalse
. Why would this betrue
? It is computationally more expensive and leads to the same outcome. Furthermore, I would even argue, why wouldall
ever be used? This keyword appears useless at first glance. Perhaps we should deprecate it?