epiforecasts / scoringutils

Utilities for Scoring and Assessing Predictions
https://epiforecasts.io/scoringutils/

Implement Consistent naming of scoring rules/metrics #610

Closed nikosbosse closed 8 months ago

nikosbosse commented 9 months ago

Replaces #476 and #401. Everyone ready for another round of naming discussions? 🙈

Here is a list of places that use the words scores, metrics, rules, score_names, or something similar:

Functions

Function arguments

Other

Proposal

I suggest the following:

This would result in the following function names/argument names:

Since we call the output of pairwise_comparison() something like wis_relative_skill, maybe the argument could also just be relative_skill.

nikosbosse commented 9 months ago

@sbfnk in #600 you mentioned you didn't think the distinction between "score" (output of a scoring rule) and "scoring rule" (function to compute the score) made much sense. What's your current thinking on this? I think, for example, it doesn't make much sense to talk of correlations between different scoring rules. If we want to omit that distinction, then I guess it would make more sense to call everything a "score".

Pinging @nickreich, @elray1 for a pair of fresh eyes on the whole subject. We've been going in circles a bit in the past...

nickreich commented 9 months ago

I spent a few minutes trying to catch up on these discussions. I think @nikosbosse's suggestions above make pretty good sense. My one quibble is that the distinction between "score name" and "scoring rule" feels fuzzy. Like, couldn't you just enforce that the name of each scoring rule function is clear enough that it could stand in for the score_name?

nikosbosse commented 9 months ago

@nickreich so a user is able to pass in their own names: score() currently expects a named list of functions, and those names are then used as the column names in the output. That's why the output has an attribute that tracks the names of the scoring rules used. In practice the names of the scores are usually identical to the names of the scoring rules; the point is rather that you wouldn't compute correlations between scoring rules, but between the scores they produce. Does that address your point?
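For concreteness, here is a minimal sketch of that mechanism. It assumes the sample-based rule functions crps_sample() and bias_sample(), the bundled example_sample_continuous data, and the metrics argument/attribute name used in later releases; those exact names are part of what this thread is debating.

```r
library(scoringutils)

# A named list of scoring rules; the names become the score column
# names in the output of score().
rules <- list(
  crps = crps_sample,  # continuous ranked probability score
  bias = bias_sample   # sample-based bias
)

scores <- score(example_sample_continuous, metrics = rules)

# The names travel with the result: "crps" and "bias" appear as
# columns, and an attribute records which rules produced them.
colnames(scores)
attr(scores, "metrics")
```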

seabbs commented 9 months ago

> Everyone ready for another round of naming discussions?

Round and round we go.

something_scoring_rules vs something_rules

As we are in a package called scoringutils, do we need to name everything scoring-this and scoring-that? There is some clarity benefit, but it also makes all of the variable and argument names longer (in what is already a very verbose package). I vote for using rules as a shorthand for scoring rules wherever that is obvious, and clearly documenting this choice.

score.forecast_binary(data, scoring_rules =, ...)

This for example makes me a tired man.

Something we have mentioned elsewhere is what happens when a scoring rule has multiple outputs, for example the WIS decomposition. There is an argument that it's not a score and so should have a totally different name, and there is an argument that we just want to make a distinction between the thing we use to get scores (the rule) and the scores themselves (as this can then include the WIS decomposition etc.); see the sketch below.
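As a concrete instance of the multi-output case, here is a hedged sketch using wis() with its separate_results flag (signature as in later releases): one rule, several score values.

```r
library(scoringutils)

# One observation and its predictive quantiles
observed <- 30
predicted <- matrix(c(15, 25, 30, 35, 45), nrow = 1)
quantile_level <- c(0.1, 0.25, 0.5, 0.75, 0.9)

# A single scoring rule returning several values: the WIS itself plus
# its dispersion / underprediction / overprediction decomposition.
wis(observed, predicted, quantile_level, separate_results = TRUE)
```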

nikosbosse commented 9 months ago

I'm fine with using rules consistently (EDIT: instead of scoring_rules). Then I think the remaining question is whether we should replace score_names with rules as well. This boils down to a choice between:

| Option A | Option B |
| --- | --- |
| get_rules() | get_score_names() |
| correlation(scores, rules, digits) | correlation(scores, score_names, digits) |
| pairwise_comparison(scores, by, rule, ...) | pairwise_comparison(scores, by, score_name, ...) |
| Attribute rules (names of scoring rules used) | Attribute score_names (column names of scores produced) |

In a way, it makes more sense to compute a correlation between scores than between rules. But I also tend to agree with @nickreich (and I think @sbfnk) that adding another term to distinguish between the name of a score and the name of a scoring rule is maybe not all that helpful.

An additional option for the pairwise comparisons would be pairwise_comparison(scores, by, relative_skill, ...).
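For reference, a hedged sketch of the released 1.x call that both options would rename, where the argument was still called metric (rule, score_name and relative_skill are the candidates under discussion):

```r
library(scoringutils)

# Score the bundled example forecasts, then compare models pairwise
# on one chosen score.
scores <- score(example_quantile)
pairwise <- pairwise_comparison(
  scores,
  by = "model",
  metric = "interval_score"  # the argument this thread may rename
)
head(pairwise)
```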

seabbs commented 9 months ago

> I'm fine with using rules consistently.

This wasn't my suggestion (I was arguing we keep rules where we already use it, i.e. for scoring rules, and continue to use score/score_name for the output), but I am happy if that is the way we go.

> adding another term to distinguish between the name of a score and the name of a scoring rule is maybe not all that helpful.

😢 this was your idea in the first place...

nikosbosse commented 9 months ago

> This wasn't my suggestion

I meant using rules instead of scoring_rules.

> 😢 this was your idea in the first place...

I know 😢. And to me it makes sense - but apparently, it's confusing to others...

sbfnk commented 9 months ago

> you mentioned you didn't think the distinction between "score" (output of a scoring rule) and "scoring rule" (function to compute the score) made much sense. What's your current thinking on this?

I didn't mean to say this, I meant to say the distinction between score_name and scoring_rule was potentially not very helpful.

sbfnk commented 9 months ago

> In a way, it makes more sense to compute a correlation between scores than between rules. But I also tend to agree with @nickreich (and I think @sbfnk) that adding another term to distinguish between the name of a score and the name of a scoring rule is maybe not all that helpful.

But you are passing scores to correlation() (the table holding the outcomes of applying scoring rules), and you're specifying which rules' scores you want to compare. In this sense, correlation(scores, rules, digits) makes sense, I think.
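To make the call shape concrete, a hedged sketch following the 1.x release, where the argument was named metrics (renaming it to rules or score_names is exactly what is being decided here):

```r
library(scoringutils)

# Scores table: one column per scoring rule applied.
scores <- score(example_quantile)

# Correlate two of the score columns; the second argument names which
# rules' scores enter the correlation matrix.
correlation(scores, metrics = c("interval_score", "dispersion"), digits = 2)
```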

> I think there is an argument it's not a score and so should have a totally different name and there is an argument that we just want to make a distinction between the thing we use to get scores (the rule)

I guess this is where metric as a more general concept than a scoring rule came from? But at the same time, scoring rules are so central to the concept of forecast evaluation that we want to name them as such when they are scoring rules. Maybe metric wasn't so bad after all...

nikosbosse commented 8 months ago

As a wise man once said:

> Maybe metric wasn't so bad after all...

This might come as a shock to all, but we (I..? 👀) might be overthinking this a bit. After some more overthinking, here is where I ended up. It seems to me that

"metric" can potentially mean different things:

I think in practice this is probably not much of a problem, as the meaning should be quite clear from context. Where it is not, we could make it clearer by talking about "metrics and scoring rules" when we mean the functions. We could also aim to generally use the term "score" for the output of a metric/scoring rule to avoid that confusion.

I therefore now suggest the following function and argument names:

What do you think? To reduce the emotional stress induced by this discussion thread, here is a baby seal:

[image: baby seal]

seabbs commented 8 months ago

Agree.

nikosbosse commented 8 months ago

Spinning out a few issues: