@sbfnk in #600 you mentioned you didn't think the distinction between "score" (output of a scoring rule) and "scoring rule" (function to compute the score) made much sense. What's your current thinking on this? I think, for example, it doesn't make much sense to talk of correlations between different scoring rules. If we want to omit that distinction, then I guess it would make more sense to call everything a "score".
Pinging @nickreich, @elray1 for a pair of fresh eyes on the whole subject. We've been going in circles a bit in the past...
I spent a few minutes trying to catch up on these discussions. I think @nikosbosse's suggestions above make pretty good sense. My one quibble is that the distinction between "score name" and "scoring rule" feels fuzzy. Like, couldn't you just enforce that the name of each `scoring_rule` function be clear enough that it could stand in for the `score_name`?
@nickreich so a user would be able to pass in their own names. `score()` currently expects a named list of functions. Those names will then be used as the column names in the output. That's why the output has an attribute that tracks the names of the scoring rules used.
I think in practice the names of the scores are identical to the names of the scoring rules; it's more that you wouldn't compute correlations between scoring rules, but between the scores they produced. Does that address your point?
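For illustration, a rough sketch of that behaviour (the rule functions, the data object and the argument position are placeholders, not the actual API):

```r
# Minimal sketch of the behaviour described above: score() takes a named
# list of functions, and the list names become the score column names.
# The rule functions and `binary_forecast_data` are made up for illustration.
my_rules <- list(
  brier_score = function(observed, predicted) (observed - predicted)^2,
  log_score   = function(observed, predicted) {
    -log(ifelse(observed == 1, predicted, 1 - predicted))
  }
)

scores <- score(binary_forecast_data, my_rules)

# `scores` then has columns "brier_score" and "log_score", plus an
# attribute recording which scoring rules were used:
attr(scores, "score_names")
#> [1] "brier_score" "log_score"
```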
> Everyone ready for another round of naming discussions?
Round and round we go.
`something_scoring_rules` vs `something_rules`

As we are in a package called scoringutils, do we need to name everything scoring-this-and-that? There is some clarity benefit, but it also makes all of the variable and argument names longer (in what is already a very verbose package). I vote for using `rules` as a shorthand for scoring rules in places where that is obvious, and clearly documenting this choice.

`score.forecast_binary(data, scoring_rules = , ...)`

This, for example, makes me a tired man.
Something we have mentioned elsewhere is what happens when a scoring rule has multiple outputs, for example the WIS decomposition. I think there is an argument that it's not a score and so should have a totally different name, and there is an argument that we just want to make a distinction between the thing we use to get scores (the rule) and the scores themselves (as this can then include the WIS decomposition etc.).
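To make the multi-output case concrete, here is a rough sketch (a single-interval version only, not the actual WIS implementation in the package): one rule returning several named components, each of which would become its own score column.

```r
# Rough sketch only: single-interval interval score with its decomposition.
# One "rule", several named outputs -- each component would end up as its
# own column of scores.
interval_score_decomposed <- function(observed, lower, upper, interval_range) {
  alpha <- (100 - interval_range) / 100
  dispersion      <- upper - lower
  overprediction  <- 2 / alpha * pmax(0, lower - observed)  # observed below interval
  underprediction <- 2 / alpha * pmax(0, observed - upper)  # observed above interval
  data.frame(
    interval_score  = dispersion + overprediction + underprediction,
    dispersion      = dispersion,
    overprediction  = overprediction,
    underprediction = underprediction
  )
}

interval_score_decomposed(observed = 12, lower = 5, upper = 10, interval_range = 90)
#>   interval_score dispersion overprediction underprediction
#> 1             45          5              0              40
```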
I'm fine with using `rules` consistently. (EDIT: instead of `scoring_rules`)
Then I think the remaining question is whether we should replace `score_names` by `rules` as well. This boils down to a choice between:
| Option A | Option B |
|---|---|
| `get_rules()` | `get_score_names()` |
| `correlation(scores, rules, digits)` | `correlation(scores, score_names, digits)` |
| `pairwise_comparison(scores, by, rule, ...)` | `pairwise_comparison(scores, by, score_name, ...)` |
| Attribute `rules` (names of scoring rules used) | Attribute `score_names` (column names of scores produced) |
In a way, it makes more sense to compute a correlation between scores than between rules. But I also tend to agree with @nickreich (and I think @sbfnk) that adding another term to distinguish between the name of a score and the name of a scoring rule is maybe not all that helpful.
An additional option for the pairwise comparisons would be `pairwise_comparison(scores, by, relative_skill, ...)`.
> I'm fine with using rules consistently.
This wasn't my suggestion (I was arguing we keep it where we have it to refer to scoring rules and continue to use score/score name for the output) but am happy if that is the way we go.
> adding another term to distinguish between the name of a score and the name of a scoring rule is maybe not all that helpful.
😢 this was your idea in the first place...
> This wasn't my suggestion

I meant using `rules` instead of `scoring_rules`.

> 😢 this was your idea in the first place...
I know 😢. And to me it makes sense - but apparently, it's confusing to others...
> you mentioned you didn't think the distinction between "score" (output of a scoring rule) and "scoring rule" (function to compute the score) made much sense. What's your current thinking on this?
I didn't mean to say this, I meant to say that the distinction between `score_name` and `scoring_rule` was potentially not very helpful.
> In a way, it makes more sense to compute a correlation between scores than between rules. But I also tend to agree with @nickreich (and I think @sbfnk) that adding another term to distinguish between the name of a score and the name of a scoring rule is maybe not all that helpful.
But you are passing `scores` to `correlation()` (the table holding the outcomes of applying scoring rules), and you're specifying which rules' scores you're comparing. In this sense, `correlation(scores, rules, digits)` makes sense, I think.
> I think there is an argument it's not a score and so should have a totally different name and there is an argument that we just want to make a distinction between the thing we use to get scores (the rule)
I guess this is where `metric` as a more general concept than a scoring rule came from? But at the same time, scoring rules are so central to the concept of forecast evaluation that we want to name them as such when they are scoring rules? Maybe `metric` wasn't so bad after all...
As a wise man once said:
> Maybe metric wasn't so bad after all...
This might come as a shock to all, but we (I..? 👀) might be overthinking this a bit. After some more overthinking, here is what I ended up with. It seems to me that "metric" can potentially mean different things:

- the functions that are passed to `score()`, i.e. "metrics" can mean the functions that are used for scoring
- the output of those functions, i.e. the scores themselves

I think in practice this is probably not much of a problem as the meaning should be quite clear from the context. If it is not, we could make it clearer by talking about "metrics and scoring rules" when we mean the functions. We could also aim to generally use the term "score" when we talk about the output of a metric/scoring rule to avoid that confusion.
I therefore now suggest the following function and argument names:
- `apply_metrics()`: function that applies the metrics to a data.table
- `metrics`: attribute that holds the names of the metrics used during scoring
- `get_metrics()`: access the contents of the `metrics` attribute
- `validate_metrics()`: function to check whether the list of metrics to be passed to `score()` meets requirements
- `metrics_point()` etc.: names of the default metrics for each forecast type
- `select_metrics()`: helper function to manipulate that default list of metrics
- `correlation(scores, metrics = NULL, digits = NULL)`
- `pairwise_comparison(scores, by = "model", metric, baseline = NULL, ...)`
- `add_pairwise_comparison(scores, by = "model", metric, baseline = NULL, ...)`
- `score.forecast_binary(data, metrics = metrics_binary(), ...)`
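For concreteness, roughly how this could read in use (just a sketch of the proposed names fitting together; the data object and the metric names in the output comment are made up):

```r
# Sketch of the proposed interface above -- illustrative, not working code.
scores <- score(binary_forecast_data, metrics = metrics_binary())

# the names of the metrics used during scoring are stored as an attribute
get_metrics(scores)
#> e.g. c("brier_score", "log_score")

# ... and can be reused downstream, e.g. to pick which score columns to
# correlate, or to do pairwise comparisons based on a single metric
correlation(scores, metrics = c("brier_score", "log_score"), digits = 2)
add_pairwise_comparison(scores, by = "model", metric = "brier_score")
```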
What do you think? To reduce emotional stress induced by this discussion thread, here is a baby seal:
Agree.
Spinning out a few issues:
I think these should cover everything
Replaces #476 and #401. Everyone ready for another round of naming discussions? 🙈
Here is a list of places that use the words `scores`, `metrics`, `rules`, `score_names` or something similar.

Functions
- `available_metrics()` --> should be removed entirely
- `get_score_names()`: "Get Names Of The Scoring Rules That Were Used For Scoring"
- `validate_metrics()` --> should probably be `validate_scoring_rules()`
- `apply_rules()`
- `rules_point()` etc., + `select_rules()`
Function arguments

- [ ] `correlation()`: "Correlation Between Metrics" --> Should be "Correlation between scores"
- [ ] `pairwise_comparison()`
- [ ] `add_pairwise_comparison()`
- [ ] `score()` methods, e.g.

Other
`score_names` is the attribute that holds the names of the scores/scoring rules used for scoring in `score()`. (That's what's returned by `get_score_names()`.)
Proposal
I suggest the following:
This would result in the following function names/argument names:
- `get_score_names()` - Returns the names of the scoring rules that were used for scoring - which correspond to the names of the columns that hold the scores
- `validate_rules()` ~~`validate_scoring_rules()`~~ - validates the scoring rules passed to `score()`
- `apply_rules()` - applies the scoring rules inside `score()`. ~~Could also rename to `apply_scoring_rules()`~~
- default scoring rules like `rules_point()` etc., + `select_rules()` to select them
- `correlation(scores, score_names, digits)` - computes correlations between scores (the output of the scoring rules); `score_names` denotes the column names of those scores for which a correlation should be computed
- `pairwise_comparison(scores, by, score_name, baseline, ...)` - computes pairwise comparisons between two models based on the scores they achieved
- `add_pairwise_comparison(scores, by, score_name, baseline, ...)`
- `score.forecast_binary(data, rules = , ...)` etc. ~~(or alternatively, `score.forecast_binary(data, scoring_rules = , ...)`)~~

Since we call the output of `pairwise_comparison()` something like `wis_relative_skill`, maybe the argument could also just be `relative_skill`.