@sbfnk in #600 you mentioned you didn't think the distinction between "score" (output of a scoring rule) and "scoring rule" (function to compute the score) made much sense. What's your current thinking on this? I think, for example, it doesn't make much sense to talk of correlations between different scoring rules. If we want to omit that distinction, then I guess it would make more sense to call everything a "score".
Pinging @nickreich, @elray1 for a pair of fresh eyes on the whole subject. We've been going in circles a bit in the past...
I spent a few minutes trying to catch up on these discussions. I think @nikosbosse's suggestions above make pretty good sense. My one quibble is that the distinction between "score name" and "scoring rule" feels fuzzy. Like, couldn't you just enforce that the name of each `scoring_rule` function be clear enough that it could stand in for the `score_name`?
@nickreich so a user would be able to pass in their own names. `score()` currently expects a named list of functions. Those names will then be used as the column names in the output. That's why the output has an attribute that tracks the names of the scoring rules used.
I think in practice the names of the scores are identical to the names of the scoring rules; it's more that you wouldn't compute correlations between scoring rules, but between the scores they produced. Does that address your point?
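For illustration, a rough sketch of that behaviour (the rule functions, the data object and the argument position are placeholders, not the actual API):

```r
# Minimal sketch of the behaviour described above: score() takes a named
# list of functions, and the list names become the score column names.
# The rule functions and `binary_forecast_data` are made up for illustration.
my_rules <- list(
  brier_score = function(observed, predicted) (observed - predicted)^2,
  log_score   = function(observed, predicted) {
    -log(ifelse(observed == 1, predicted, 1 - predicted))
  }
)

scores <- score(binary_forecast_data, my_rules)

# `scores` then has columns "brier_score" and "log_score", plus an
# attribute recording which scoring rules were used:
attr(scores, "score_names")
#> [1] "brier_score" "log_score"
```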
> Everyone ready for another round of naming discussions?
Round and round we go.
`something_scoring_rules` vs `something_rules`

As we are in a package called scoringutils, do we need to name everything scoring-this-and-that? There is some clarity benefit, but it also makes all of the variable and argument names longer (in what is already a very verbose package). I vote for using `rules` as a shorthand for scoring rules in places where that is obvious, and clearly documenting this choice.

`score.forecast_binary(data, scoring_rules = , ...)`

This, for example, makes me a tired man.
Something we have mentioned elsewhere is what happens when a scoring rule has multiple outputs, for example the WIS decomposition. I think there is an argument that it's not a score and so should have a totally different name, and there is an argument that we just want to make a distinction between the thing we use to get scores (the rule) and the scores themselves (as this can then include the WIS decomposition etc.).
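To make the multi-output case concrete, here is a rough sketch (a single-interval version only, not the actual WIS implementation in the package): one rule returning several named components, each of which would become its own score column.

```r
# Rough sketch only: single-interval interval score with its decomposition.
# One "rule", several named outputs -- each component would end up as its
# own column of scores.
interval_score_decomposed <- function(observed, lower, upper, interval_range) {
  alpha <- (100 - interval_range) / 100
  dispersion      <- upper - lower
  overprediction  <- 2 / alpha * pmax(0, lower - observed)  # observed below interval
  underprediction <- 2 / alpha * pmax(0, observed - upper)  # observed above interval
  data.frame(
    interval_score  = dispersion + overprediction + underprediction,
    dispersion      = dispersion,
    overprediction  = overprediction,
    underprediction = underprediction
  )
}

interval_score_decomposed(observed = 12, lower = 5, upper = 10, interval_range = 90)
#>   interval_score dispersion overprediction underprediction
#> 1             45          5              0              40
```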
I'm fine with using `rules` consistently. (EDIT: instead of `scoring_rules`)
Then I think the remaining question is whether we should replace `score_names` by `rules` as well. This boils down to a choice between:
| Option A | Option B |
|---|---|
| `get_rules()` | `get_score_names()` |
| `correlation(scores, rules, digits)` | `correlation(scores, score_names, digits)` |
| `pairwise_comparison(scores, by, rule, ...)` | `pairwise_comparison(scores, by, score_name, ...)` |
| Attribute `rules` (names of scoring rules used) | Attribute `score_names` (column names of scores produced) |
In a way, it makes more sense to compute a correlation between scores than between rules. But I also tend to agree with @nickreich (and I think @sbfnk) that adding another term to distinguish between the name of a score and the name of a scoring rule is maybe not all that helpful.
An additional option for the pairwise comparisons would be `pairwise_comparison(scores, by, relative_skill, ...)`.
> I'm fine with using rules consistently.
This wasn't my suggestion (I was arguing we keep it where we have it to refer to scoring rules and continue to use score/score name for the output) but am happy if that is the way we go.
> adding another term to distinguish between the name of a score and the name of a scoring rule is maybe not all that helpful.
😢 this was your idea in the first place...
> This wasn't my suggestion

I meant using `rules` instead of `scoring_rules`.

> 😢 this was your idea in the first place...
I know 😢. And to me it makes sense - but apparently, it's confusing to others...
> you mentioned you didn't think the distinction between "score" (output of a scoring rule) and "scoring rule" (function to compute the score) made much sense. What's your current thinking on this?
I didn't mean to say this, I meant to say that the distinction between `score_name` and `scoring_rule` was potentially not very helpful.
> In a way, it makes more sense to compute a correlation between scores than between rules. But I also tend to agree with @nickreich (and I think @sbfnk) that adding another term to distinguish between the name of a score and the name of a scoring rule is maybe not all that helpful.
But you are passing `scores` to `correlation()` (the table holding the outcomes of applying scoring rules), and you're specifying which rules' scores you're comparing. In this sense, `correlation(scores, rules, digits)` makes sense, I think.
> I think there is an argument it's not a score and so should have a totally different name and there is an argument that we just want to make a distinction between the thing we use to get scores (the rule)
I guess this is where `metric` as a more general concept than a scoring rule came from? But at the same time, scoring rules are so central to the concept of forecast evaluation that we want to name them as such when they are scoring rules? Maybe `metric` wasn't so bad after all...
As a wise man once said:
> Maybe metric wasn't so bad after all...
This might come as a shock to all, but we (I..? 👀) might be overthinking this a bit. After some more overthinking, here is what I ended up with. It seems to me that "metric" can potentially mean different things:

- the functions that are passed to `score()`, i.e. "metrics" can mean the functions that are used for scoring
- the output of those functions, i.e. the scores themselves

I think in practice this is probably not much of a problem as the meaning should be quite clear from the context. If it is not, we could make it clearer by talking about "metrics and scoring rules" when we mean the functions. We could also aim to generally use the term "score" when we talk about the output of a metric/scoring rule to avoid that confusion.
I therefore now suggest the following function and argument names:
- `apply_metrics()`: function that applies the metrics to a data.table
- `metrics`: attribute that holds the names of the metrics used during scoring
- `get_metrics()`: access the contents of the `metrics` attribute
- `validate_metrics()`: function to check whether the list of metrics to be passed to `score()` meets requirements
- `metrics_point()` etc.: names of the default metrics for each forecast type
- `select_metrics()`: helper function to manipulate that default list of metrics
- `correlation(scores, metrics = NULL, digits = NULL)`
- `pairwise_comparison(scores, by = "model", metric, baseline = NULL, ...)`
- `add_pairwise_comparison(scores, by = "model", metric, baseline = NULL, ...)`
- `score.forecast_binary(data, metrics = metrics_binary(), ...)`
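For concreteness, roughly how this could read in use (just a sketch of the proposed names fitting together; the data object and the metric names in the output comment are made up):

```r
# Sketch of the proposed interface above -- illustrative, not working code.
scores <- score(binary_forecast_data, metrics = metrics_binary())

# the names of the metrics used during scoring are stored as an attribute
get_metrics(scores)
#> e.g. c("brier_score", "log_score")

# ... and can be reused downstream, e.g. to pick which score columns to
# correlate, or to do pairwise comparisons based on a single metric
correlation(scores, metrics = c("brier_score", "log_score"), digits = 2)
add_pairwise_comparison(scores, by = "model", metric = "brier_score")
```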
What do you think? To reduce emotional stress induced by this discussion thread, here is a baby seal:
Agree.
Spinning out a few issues:
I think these should cover everything
Replaces #476 and #401. Everyone ready for another round of naming discussions? 🙈
Here is a list of places that use the words `scores`, `metrics`, `rules`, `score_names` or something similar.

Functions
- `available_metrics()` --> should be removed entirely
- `get_score_names()`: "Get Names Of The Scoring Rules That Were Used For Scoring"
- `validate_metrics()` --> should probably be `validate_scoring_rules()`
- `apply_rules()`
- `rules_point()` etc., + `select_rules()`
Function arguments

- [ ] `correlation()`: "Correlation Between Metrics" --> Should be "Correlation between scores"
- [ ] `pairwise_comparison()`
- [ ] `add_pairwise_comparison()`
- [ ] `score()` methods, e.g.

Other
`score_names` is the attribute that holds the names of the scores/scoring rules used for scoring in `score()`. (That's what's returned by `get_score_names()`.)
Proposal
I suggest the following:
This would result in the following function names/argument names:
- `get_score_names()` - Returns the names of the scoring rules that were used for scoring - which correspond to the names of the columns that hold the scores
- `validate_rules()` ~~`validate_scoring_rules()`~~ - validates the scoring rules passed to `score()`
- `apply_rules()` - applies the scoring rules inside `score()`. ~~Could also rename to `apply_scoring_rules()`~~
- default scoring rules like `rules_point()` etc., + `select_rules()` to select them
- `correlation(scores, score_names, digits)` - computes correlations between scores (the output of the scoring rules); `score_names` denotes the column names of those scores for which a correlation should be computed
- `pairwise_comparison(scores, by, score_name, baseline, ...)` - computes pairwise comparisons between two models based on the scores they achieved
- `add_pairwise_comparison(scores, by, score_name, baseline, ...)`
- `score.forecast_binary(data, rules = , ...)` etc. ~~(or alternatively, `score.forecast_binary(data, scoring_rules = , ...)`)~~

Since we call the output of `pairwise_comparison()` something like `wis_relative_skill`, maybe the argument could also just be `relative_skill`.