JuliaAI / MLJBase.jl

Core functionality for the MLJ machine learning framework
MIT License
160 stars 45 forks source link

name, aliases cleanup for measures #450

Closed ablaom closed 3 years ago

ablaom commented 3 years ago

tl;dr: see proposals A, B, C below.

I realize that measures is due for a thorough review and expansion (see, in particular https://github.com/alan-turing-institute/MLJBase.jl/issues/299) but this is a moving target (see eg, the very substantial https://github.com/alan-turing-institute/MLJBase.jl/pull/430) and I would like to fix a couple of minor but annoying things before things run away from us:

The main issue is a sad lack of consistency about the name attribute of a measure and a confusion about what it precisely represents. This makes programatic selection of measures buggy.

(name = area_under_curve, ...)
 (name = accuracy, ...)
 (name = balanced_accuracy, ...)
 (name = cross_entropy, ...)
 (name = FScore, ...)
 (name = false_discovery_rate, ...)
 (name = false_negative, ...)
 (name = false_negative_rate, ...)
 (name = false_positive, ...)
 (name = false_positive_rate, ...)
 (name = l1, ...)
 (name = l2, ...)
 (name = log_cosh, ...)
 (name = mae, ...)
 (name = mape, ...)
 (name = matthews_correlation, ...)
 (name = misclassification_rate, ...)
 (name = negative_predictive_value, ...)
 ⋮
 (name = L1HingeLoss(), ...)
 (name = L2HingeLoss(), ...)
 (name = L2MarginLoss(), ...)
 (name = LogitMarginLoss(), ...)
 (name = ModifiedHuberLoss(), ...)
 (name = PerceptronLoss(), ...)
 (name = SigmoidLoss(), ...)
 (name = SmoothedL1HingeLoss(), ...)
 (name = ZeroOneLoss(), ...)
 (name = HuberLoss(), ...)
 (name = L1EpsilonInsLoss(), ...)
 (name = L2EpsilonInsLoss(), ...)
 (name = LPDistLoss(), ...)
 (name = LogitDistLoss(), ...)
 (name = PeriodicLoss(), ...)
 (name = QuantileLoss(), ...)
 (name = confusion_matrix, ...)

Note we already have the docstring trait, which has (mostly) included a list of aliases for common instances of a measure. Recall that a measure is a type and that some measures have fields; for example CrossEntropy has eps, the cutoff threshold to prevent NaN.

In MLJ, models also have a name attribute and this is the string version of the model type; it is not literally the model type because model code is loaded on demand only. So far, we load all measures into scope. However, this might change in the future; think of measures for probabilistic predictions created by MCMC. My first proposal is:

Next, rather than throwing aliases for default instances into the doc-string, we should:

The final clean-up item would be:

These changes are technically breaking and would require a new release, but I can't imagine they would not be that disruptive.

Thoughts anyone?

edit

And I forgot:

ablaom commented 3 years ago

cc @ven-k @OkonSamuel @tlienart @azev77

ablaom commented 3 years ago
ablaom commented 3 years ago

Get rid of beta as type parameter in FScore.

tlienart commented 3 years ago

For what it's worth I just read through your proposal and it seems very reasonable to me; I think (B) in particular is very useful to make it more user friendly :+1: