Define input format for categorical forecasts

nikosbosse commented 7 months ago

As mentioned in #604, we would like to be able to score categorical/multiclass forecasts (or whatever the name should be, see #607). What should the expected format be?

Data.frame

Are we fine with the following? Columns predicted, observed, and predicted_class, with both observed and predicted_class being factors with the same levels.

Alternative names instead of predicted_class:

predicted_category
category
predicted_outcome

Vector format

seabbs commented 7 months ago

All seems reasonable. I have only come across these discussed as categorical forecasts but I don't have a strong feeling that that is the right way to go.

seabbs commented 7 months ago

Had a chat with ChatGPT. Point about categorical being a good overall name and multiclass being useful to distinguish between binary and non-binary seems like a good one.

Prompt

Multiclass forecasts and categorical forecasts are related concepts but not identical.

Answer

Multiclass Forecasts: These refer to prediction tasks where the outcome can belong to one of multiple classes. The key point is that each instance (or observation) is classified into exactly one out of several possible categories. Common in machine learning, examples include image classification where each image is categorized into one class (like 'cat', 'dog', 'bird').
Categorical Forecasts: This term is more general. It encompasses any forecasting where outcomes are categories rather than numerical values. This includes multiclass situations but can also refer to binary classification (where there are only two categories) or even scenarios with unordered categories (like predicting types of cuisine).

As for which is more commonly used, it depends on the context:

In general machine learning applications, multiclass classification is very common, especially in areas like computer vision and natural language processing.
In statistical modeling and some business applications, simpler categorical forecasts (like binary classification) are often more prevalent.

The choice between them typically depends on the specific requirements of the task at hand. In some scenarios, the distinction might not be significant, while in others, the specific nature of the categories can greatly influence the modeling approach and evaluation metrics used.

nikosbosse commented 7 months ago

ok sounds good. Then I suggest the following.

n: number of observations, N: number of possible categories of the outcome

The data.frame input format will be

observed: factor with N levels
predicted: numeric between 0 and 1
somename: factor with N unordered factor levels

One forecast comprises N rows: each possible factor level must have a prediction, and the predictions must sum to 1.
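To make the proposal concrete, here is a minimal base R sketch of one forecast in this format. `predicted_label` is only a placeholder for the still-unnamed `somename` column, and `check_one_forecast()` is a hypothetical helper written for this comment, not an existing scoringutils function.

```r
# One forecast with N = 3 possible outcomes.
# `predicted_label` stands in for `somename` (name to be decided in #607).
lvls <- c("low", "medium", "high")
forecast <- data.frame(
  observed = factor(rep("medium", 3), levels = lvls),
  predicted_label = factor(lvls, levels = lvls),
  predicted = c(0.2, 0.5, 0.3)
)

# Hypothetical check of the rules above for a single forecast:
# same levels, one row per level, probabilities in [0, 1] that sum to 1.
check_one_forecast <- function(df) {
  labels <- df$predicted_label
  same_levels <- identical(levels(df$observed), levels(labels))
  one_row_per_level <- nrow(df) == nlevels(labels) &&
    setequal(as.character(labels), levels(labels))
  in_range <- all(df$predicted >= 0 & df$predicted <= 1)
  sums_to_one <- isTRUE(all.equal(sum(df$predicted), 1))
  same_levels && one_row_per_level && in_range && sums_to_one
}

check_one_forecast(forecast)        # all rules hold
check_one_forecast(forecast[-1, ])  # fails: the row for "low" is missing
```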

The vector/matrix format will be

observed: factor of length n with N unordered factor levels
predicted: n x N matrix, rows are observations, columns are categories. If n = 1, this can also be a vector of length N.
somename: factor of length N with N levels, representing the columns of predicted.
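The same idea for the vector/matrix input, again as a rough base R sketch. `predicted_label` is a placeholder for `somename`, and `check_matrix_input()` is a hypothetical helper, not part of scoringutils. It assumes the columns of `predicted` are in the order of the factor levels.

```r
lvls <- c("low", "medium", "high")               # N = 3
observed <- factor(c("medium", "high"), levels = lvls)  # n = 2
predicted <- matrix(
  c(0.2, 0.5, 0.3,
    0.1, 0.1, 0.8),
  nrow = 2, byrow = TRUE
)
predicted_label <- factor(lvls, levels = lvls)

# Hypothetical check: dimensions match n and N, columns follow the
# factor levels, and every row is a probability vector.
check_matrix_input <- function(observed, predicted, predicted_label) {
  if (is.null(dim(predicted))) {
    # n = 1: allow a plain vector of length N
    predicted <- matrix(predicted, nrow = 1)
  }
  nrow(predicted) == length(observed) &&
    ncol(predicted) == nlevels(observed) &&
    identical(levels(observed), as.character(predicted_label)) &&
    all(predicted >= 0 & predicted <= 1) &&
    isTRUE(all.equal(unname(rowSums(predicted)), rep(1, nrow(predicted))))
}

check_matrix_input(observed, predicted, predicted_label)
# n = 1 with a vector instead of a matrix
check_matrix_input(observed[1], predicted[1, ], predicted_label)
```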

I also suggest moving the naming of somename to #607.

Pinging @nickreich and @sbfnk in case you want to weigh in

nikosbosse commented 7 months ago

@nickreich just raised a good point: Do we want to enforce N rows for every forecast? Say you're predicting who wins the US presidency. You have 30 candidates, but you only assign a probability > 0 to 6 of them. Do you then have to have 24 rows with zeros?

I can see several options:

A: we enforce this strictly
- makes the format very clear
- we should probably give users a helper function that expands their data.frame and creates rows with a predicted value of zero for every missing category label
B: we don't enforce this
- allows users to save storage space + interact with the function without having to do any additional formatting
- can affect scoring in undesirable ways. Let's say there were initially 30 candidates, but 10 dropped out and only 20 were left at the time you made your forecast. You prepared your data much earlier, so your factor has 30 levels even though in reality there were only 20 options. If you use the Brier score, it can make a difference whether you had 20 or 30 levels to begin with.
- We could potentially address this by printing helpful messages / running some checks whether the data makes sense
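Whether the number of levels changes the score depends on how the multiclass Brier score is normalised. The sketch below compares two hand-written conventions (sum over categories vs. mean over categories) for the 20 vs. 30 candidate example. Both functions and the probabilities are made up for illustration and are not how scoringutils computes anything.

```r
# Two illustrative conventions for a single forecast:
# p = predicted probabilities, o = 0/1 indicator of the observed category.
brier_sum  <- function(p, o) sum((p - o)^2)
brier_mean <- function(p, o) mean((p - o)^2)

# 20 candidates still in the race; candidate 1 wins.
p20 <- c(0.6, 0.4, rep(0, 18))
o20 <- c(1, rep(0, 19))

# Same forecast, but the factor still carries the 10 dropped-out
# candidates as extra levels with probability 0.
p30 <- c(p20, rep(0, 10))
o30 <- c(o20, rep(0, 10))

# The sum form ignores the extra zero-probability levels ...
c(brier_sum(p20, o20), brier_sum(p30, o30))
# ... while the mean form shrinks by a factor of 20 / 30.
c(brier_mean(p20, o20), brier_mean(p30, o30))
```

So under the mean convention (or anything else that divides by N), stale factor levels silently change the score, which is the kind of thing a check or message could flag.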

Note that the vector/matrix paradigm enforces this implicitly anyway: the prediction matrix has to be rectangular. In the above example, though, you'd end up with an n x 20 matrix even though your factor had 30 levels, and the function would then have to decide whether to take its N from the number of factor levels or from the dimensions of the prediction matrix.

I'm personally leaning slightly towards strict enforcement + helper function to get there from a more liberal format that omits rows with a predicted probability of 0. What do others think? Also pinging @elray1 in case you have thoughts
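A rough base R sketch of what such a helper could look like. The function name `expand_missing_categories()` and the column names `forecast_id` and `predicted_label` are assumptions made for this example (the real unit-of-forecast columns and the name of the label column are still open), so treat it as an illustration of the idea rather than a proposed implementation.

```r
# Hypothetical helper: add a row with predicted = 0 for every factor
# level that a forecast omits. Assumes one `observed` value per forecast.
expand_missing_categories <- function(df) {
  lvls <- levels(df$predicted_label)
  pieces <- lapply(split(df, df$forecast_id), function(one) {
    predicted <- numeric(length(lvls))  # start with all zeros
    predicted[match(as.character(one$predicted_label), lvls)] <- one$predicted
    data.frame(
      forecast_id = one$forecast_id[1],
      observed = one$observed[1],
      predicted_label = factor(lvls, levels = lvls),
      predicted = predicted
    )
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# "Liberal" input: rows with probability 0 are left out.
lvls <- c("A", "B", "C", "D")
sparse <- data.frame(
  forecast_id = c(1, 1, 2),
  observed = factor(c("A", "A", "C"), levels = lvls),
  predicted_label = factor(c("A", "B", "C"), levels = lvls),
  predicted = c(0.7, 0.3, 1)
)

full <- expand_missing_categories(sparse)
full  # 2 forecasts x 4 levels = 8 rows, omitted levels filled with 0
```

Note that the helper takes N from the factor levels, so stale levels would still be expanded into zero rows.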

seabbs commented 7 months ago

> I'm personally leaning slightly towards strict enforcement + helper function to get there from a more liberal format that omits rows with a predicted probability of 0.

Yes, I think this makes sense. We could potentially run this for people within the as_forecast method, but maybe not if it's overly complicated.

epiforecasts / scoringutils

Define input format for categorical forecasts #608