dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Add baseline metrics to trainer results #6796

Open torronen opened 1 year ago

torronen commented 1 year ago

Datasets are almost never perfectly balanced. That means an impressive-sounding metric, say 80% binary classification accuracy, could actually be worse than always selecting the most common category. Therefore, I would like ML.NET to automatically calculate baseline metrics, for example null accuracy (the accuracy of a trivial model that always predicts the majority class).
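To make the point concrete, here is a minimal sketch of null accuracy in plain Python (not the ML.NET API, which is exactly what this issue proposes adding):

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy of a trivial model that always predicts the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# 80 negatives and 20 positives: always predicting "negative" already scores 0.80,
# so a trained model reporting 80% accuracy has learned nothing beyond the class prior.
labels = [0] * 80 + [1] * 20
print(null_accuracy(labels))  # 0.8
```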

Ideally, I would be able to read the baseline metrics as easily as the training metrics. I also think this should be actively promoted: during the current AI summer I see many developers quickly building AI-enabled apps with little or no validation, where whatever the GPT model says is accepted without comparison to ground truth or even real prompt engineering, and whatever Model Builder reports is treated as the final validation.

I think this would also be in line with promoting fairness values: results that are equally incorrect for everyone are not fair. Model Builder could probably even show graphically how much better the model performs on the test dataset than random or null accuracy.

First, I think it would be important to document the ideal baseline for each metric. For example, comparison to null accuracy is probably appropriate for binary classification but not for other tasks; for ranking, there seem to be a few options. The implementation itself is probably not terribly difficult.

  1. Decide baseline metrics for each ML task
  2. Decide programming interface and where to calculate the metrics
  3. Implementation
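The end result of the steps above could be a report like the following. This is only a Python sketch of the idea under the assumption that the binary classification baseline is majority-class (null) accuracy; the names `baseline_report` and `lift` are hypothetical and do not correspond to any existing ML.NET surface:

```python
from collections import Counter

def baseline_report(test_labels, model_accuracy):
    """Compare a model's test accuracy against the majority-class baseline.

    Hypothetical report shape for illustration only; the real programming
    interface is what step 2 of this issue would need to decide.
    """
    null_acc = Counter(test_labels).most_common(1)[0][1] / len(test_labels)
    return {
        "null_accuracy": null_acc,
        "model_accuracy": model_accuracy,
        "lift": model_accuracy - null_acc,  # how much the model beats the trivial baseline
    }

# 90/10 class split: a model at 0.92 accuracy has a lift of only ~0.02
# over the 0.90 baseline, which flags it as nearly trivial.
report = baseline_report([0] * 90 + [1] * 10, model_accuracy=0.92)
print(report)
```

Exposing something like `lift` alongside the existing evaluation metrics would let Model Builder render the model-vs-baseline comparison graphically, as suggested above.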