name, aliases cleanup for measures

ablaom commented 3 years ago

tl;dr: see proposals A, B, C below.

I realize that measures is due for a thorough review and expansion (see, in particular https://github.com/alan-turing-institute/MLJBase.jl/issues/299) but this is a moving target (see eg, the very substantial https://github.com/alan-turing-institute/MLJBase.jl/pull/430) and I would like to fix a couple of minor but annoying things before things run away from us:

The main issue is a sad lack of consistency about the name attribute of a measure and a confusion about what it precisely represents. This makes programatic selection of measures buggy.

(name = area_under_curve, ...)
 (name = accuracy, ...)
 (name = balanced_accuracy, ...)
 (name = cross_entropy, ...)
 (name = FScore, ...)
 (name = false_discovery_rate, ...)
 (name = false_negative, ...)
 (name = false_negative_rate, ...)
 (name = false_positive, ...)
 (name = false_positive_rate, ...)
 (name = l1, ...)
 (name = l2, ...)
 (name = log_cosh, ...)
 (name = mae, ...)
 (name = mape, ...)
 (name = matthews_correlation, ...)
 (name = misclassification_rate, ...)
 (name = negative_predictive_value, ...)
 ⋮
 (name = L1HingeLoss(), ...)
 (name = L2HingeLoss(), ...)
 (name = L2MarginLoss(), ...)
 (name = LogitMarginLoss(), ...)
 (name = ModifiedHuberLoss(), ...)
 (name = PerceptronLoss(), ...)
 (name = SigmoidLoss(), ...)
 (name = SmoothedL1HingeLoss(), ...)
 (name = ZeroOneLoss(), ...)
 (name = HuberLoss(), ...)
 (name = L1EpsilonInsLoss(), ...)
 (name = L2EpsilonInsLoss(), ...)
 (name = LPDistLoss(), ...)
 (name = LogitDistLoss(), ...)
 (name = PeriodicLoss(), ...)
 (name = QuantileLoss(), ...)
 (name = confusion_matrix, ...)

Note we already have the docstring trait, which has (mostly) included a list of aliases for common instances of a measure. Recall that a measure is a type and that some measures have fields; for example CrossEntropy has eps, the cutoff threshold to prevent NaN.

In MLJ, models also have a name attribute and this is the string version of the model type; it is not literally the model type because model code is loaded on demand only. So far, we load all measures into scope. However, this might change in the future; think of measures for probabilistic predictions created by MCMC. My first proposal is:

[x] A. Rather than assigning the name attribute for each measure, this should just be a "derived" trait that just returns the string version of the type name (as for models), as in "AUC", "FScore", "CrossEntropy" (better would be "LogLoss"; see (https://github.com/alan-turing-institute/MLJBase.jl/issues/299#issuecomment-721520032), "ZeroOneLoss", etc.

Next, rather than throwing aliases for default instances into the doc-string, we should:

[x] B. Add an new trait called aliases or instance_aliases; this returns a vector of strings, whose corresponding Symbol expressions can be evaluated when relevant code has been loaded. For example TruePositiveRate gets the value ["true_positive_rate", "truepositive_rate", "tpr", "sensitivity", "recall", "hit_rate"]. This trait would be listed second in the "registry" (what you print out with measures()) to maximise discoverability.

The final clean-up item would be:

[x] C. Ensure that every measure has a default keyword constructor with defaults for all arguments. So FScore() should just work; it currently doesn't.

These changes are technically breaking and would require a new release, but I can't imagine they would not be that disruptive.

Thoughts anyone?

edit

And I forgot:

[x] D. Make sure all current measure type names are exported, as well as their aliases.

ablaom commented 3 years ago

cc @ven-k @OkonSamuel @tlienart @azev77

ablaom commented 3 years ago

[x] Also: ~~import LossFunctions methods instead using them, and re-purpose~~ make LossFunction instances callable to ~~the names to~~ give them same behaviour as the other measures. So, for example, I can do ZeroOneLoss()(yhat, y) where yhat is a vector of UnivariateFinite distributions (rather than floats, as in the LossFunctions API).

ablaom commented 3 years ago

Get rid of beta as type parameter in FScore.

tlienart commented 3 years ago

For what it's worth I just read through your proposal and it seems very reasonable to me; I think (B) in particular is very useful to make it more user friendly :+1:

JuliaAI / MLJBase.jl

name, aliases cleanup for measures #450