JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Taking loss functions seriously #450

Closed juliohm closed 4 years ago

juliohm commented 4 years ago

In the tradition of Julia, this issue follows the "Taking X seriously" convention where "X" here represents loss functions in statistical learning.

The current state of affairs of loss functions (or, more generally, "measures" in MLJ) is not ideal. There is a lot of code repetition that could be avoided, and a lot of machinery that could be reused across different measures. In particular, the weighting machinery varies between measures, and as discussed in #445 it does not serve cost-sensitive learning or, more generally, transfer learning. Additionally, measure implementations are not necessarily ready for automatic differentiation, nor are they ready for computation on GPUs.

I would like to redesign the measures in MLJ to include all important use cases, and to facilitate future additions. For that, I need your help. Before we dive into specific questions about the current traits implemented for measures, I would like to share what I think should be the high-level abstraction for measures. The definitions below are heavily inspired by the LossFunctions.jl documentation, and by a more theoretical view on empirical risk minimization.

Let's concentrate our attention on supervised loss functions, i.e. functions L(yhat, y) that operate on scalar objects yhat and y. By scalar object I only mean an object with 0 dimensions (e.g. numbers on the real line). For now I will assume that these scalar objects are <:Real, but if you feel that, for example, yhat should include other objects like distributions, please motivate your claim that loss functions should be the mechanism to compare numbers y with distributions yhat. It is not necessarily clear that a loss function should support this comparison.

For a supervised loss function L, we should be able to perform at least two operations:

  1. Evaluate the loss at a pair (yhat, y)
  2. Estimate the expected loss E[L] using a sample of n pairs: E[L] ~ (1/n) * sum(L(yhat_i, y_i))

In the second operation, we can also introduce a weighting function:

  3. Weighted expected loss is given by E[W*L] ~ (1/n) * sum(w_i * L(yhat_i, y_i))

where each pair has a different weight in the final estimate. This mechanism is quite important in transfer learning, where the weights are given by the ratio of the test and train distributions w(x) = p_test(x) / p_train(x). We've formalised the process of estimating these weights in DensityRatioEstimation.jl, and we need to make sure that the loss functions API consumes them correctly.
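To make the two operations and the weighted estimate concrete, here is a minimal sketch in plain Julia; the scalar loss l2loss and the weights below are purely illustrative:

using Statistics

# An illustrative scalar loss: squared error on a single (yhat, y) pair.
l2loss(yhat, y) = abs2(yhat - y)

yhat = [1.2, 0.7, 3.1]
y    = [1.0, 1.0, 3.0]

# 1. Evaluate the loss at a single pair.
l2loss(yhat[1], y[1])            # ≈ 0.04

# 2. Estimate the expected loss E[L] over the sample.
mean(l2loss.(yhat, y))

# Weighted expected loss, with w_i = p_test(x_i) / p_train(x_i)
# (e.g. importance weights estimated with DensityRatioEstimation.jl).
w = [0.5, 2.0, 1.0]
mean(w .* l2loss.(yhat, y))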

To start this discussion, I would like to go over the existing traits for measures. First, I would like to understand how each trait is currently used in other parts of MLJ.jl. Below is the full list of traits I could find:

is_measure_type

const MEASURE_TRAITS =
    [:name, :target_scitype, :supports_weights, :prediction_type, :orientation,
     :reports_each_observation, :aggregation, :is_feature_dependent, :docstring,
     :distribution_type]

I appreciate your time replying to all these questions, and apologise in advance if my words appear harsh. I am not a native English speaker, so I write with a reduced vocabulary that sometimes may sound aggressive to some.

If you can take a careful look at all these points, that would be extremely helpful. My current research depends on this, and the sooner I get your feedback, the faster I will be able to contribute.

tlienart commented 4 years ago

Thanks for taking the time to put this together. Anthony is the one who has dedicated the most thought to measures, so he's best placed to answer. Note that there is an eventual plan of having a separate MLJMeasures package, which would be a good occasion to generally improve the interface; so it's a good time to discuss this and your feedback is welcome!

juliohm commented 4 years ago

Thank you @tlienart for the feedback. A separate package would be great 💯 If you feel like adding me to the organization, I could work on the proposal therein already, otherwise I can submit PRs to the repository.

tlienart commented 4 years ago

Oh, it's not there yet, so I think here is a good place to discuss what it could look like. Thanks for the support!

juliohm commented 4 years ago

cc: @ablaom

ablaom commented 4 years ago

@juliohm

Thanks for your helpful review of the measure API. I appreciate this takes some time and effort. Thanks also for the offer to help out in an area where I agree there is room for improvement.

My first impression is that your requirements are more specialized than the needs of the general MLJ user. I hope that, despite this, you will appreciate that, in a broader context, the original goals of the API are generally worthwhile, and that you remain willing to contribute. Let me do my best to respond to your post. I'm sorry for not responding to your comments in the same order they were made.

Additionally, measure implementations are not necessarily ready for automatic differentiation, nor are they ready for computation on GPUs.

I agree these are worthwhile goals. It would be helpful if you could provide examples of the shortcomings, thanks.

In another thread you mentioned type instabilities. It would likewise be helpful if you could flag examples. (I'm more concerned with evaluation of the measures here than with instantiation thereof.)

Probabilistic predictors

for example yhat should include other objects like distributions, please motivate your claim that loss functions should be the mechanism to compare numbers y with distributions yhat. It is not necessarily clear that a loss function should support this comparison.

...

Am I correct to say that the existence of target_scitype and prediction_type is due to the fact that loss functions currently compare objects of different type?

Yes.

Should it be that way? Is it there to cover the comparison between yhat = a distribution and y = a number? My opinion at this moment favors a simple interface L(yhat, y) where yhat and y are scalars of the same scientific type. I understand that yhat = f(x) is the output of a learning model with target_scitype and prediction_type, but propagating this type information seems unnecessarily complex.

The output of probabilistic predictors is varied. The predicted distributions need not be parametric or even have analytic representations (e.g., generated by MCMC). For uniformity of interface, it was decided that probabilistic models in MLJ should always predict a distribution, rather than model/domain specific, ambiguously ordered, probabilities or parameters.

An important class of performance measures for probabilistic predictions are the proper scoring rules. See, e.g., this article. Some of these rules are very general in the sense that one formula, defined in terms of the pdf, defines a loss that can be applied to large families of distribution types simultaneously. An example is the Brier score, which applies not just to finite distributions but to any distribution whose pdf is suitably well-behaved. So it is very natural to implement loss functions that operate on a distribution, rather than on some representation the provider and consumer must agree upon case by case.

Here "distribution" is a little vague; if it's finite, it should be UnivariatFinite, any other parametric distribution should generally be a Distribution.Distribution object; it should at least implement rand and if possible pdf.

I understand that limitations in the main ML platforms (scikit-learn, MLR, etc.) around the performance evaluation of probabilistic predictors are a source of some frustration in the Bayesian / probabilistic programming community, and consequently a source of fragmentation between the various paradigms. The package skpro (in which yhat is allowed to be a distribution) is one response to this issue which has informed MLJ's design. See also this related article.

At present we do not implement a large number of proper scoring rules but we should like to do so at some point.

So, for our purposes, I don't agree that yhat should be restricted to a number.

I like the distinction, represented by the trait prediction_type, which ensures that deterministic measures are always applied to deterministic predictions, while probabilistic measures (e.g., cross_entropy) are always applied to probabilistic predictions. It eliminates confusion and provides extra interface points for the user. If you really want to apply a deterministic measure to a probabilistic prediction, you must specify precisely how you want this to be done. Are you computing the median? The mode? Or perhaps you are going to use a weighted mode whose weighting is learned, etc. There are convenience methods like predict_mode to deal with common use cases when evaluating a model.
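Roughly how this looks in practice (a sketch assuming the DecisionTree.jl glue package is installed; the measure and model names are standard MLJ exports):

using MLJ

X, y = @load_iris
Tree = @load DecisionTreeClassifier pkg=DecisionTree
mach = machine(Tree(), X, y) |> fit!

# A probabilistic model predicts distributions ...
yhat_prob = predict(mach, X)              # vector of UnivariateFinite

# ... which a probabilistic measure consumes directly:
cross_entropy(yhat_prob, y)

# To apply a deterministic measure, collapse the distribution explicitly,
# e.g. via the mode:
yhat_point = predict_mode(mach, X)
misclassification_rate(yhat_point, y)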

distribution_type

The distribution_type trait seems to be another trait that is the result of allowing losses between objects yhat and y of different kind. Could you please elaborate on what is the meaning of this trait and how it relates to target_scitype and prediction_type?

The distribution_type was a late addition and is not currently used anywhere in the stack. It declares the type of probability distribution that can be plugged in as yhat when evaluating the measure (e.g., UnivariateFinite for cross_entropy). It is missing when prediction_type is not :probabilistic, in which case it has no meaning. The trait target_scitype says nothing about the nature of the probability distributions predicted by a model; it concerns the target observations, rather than the predictions.

Given that the type of the distribution (e.g., an MCMC-generated object) might not be accessible, this trait may not be universally useful. On the other hand, I don't see that it does any harm.

What is a loss function?

For a supervised loss function L, we should be able to perform at least two operations:

  1. Evaluate the loss at a pair (yhat, y)
  2. Estimate the expected loss E[L] using a sample of n pairs: E[L] ~ (1/n) * sum(L(yhat_i, y_i))

In the second operation, we can also introduce a weighting function:

  3. Weighted expected loss is given by E[W*L] ~ (1/n) * sum(w_i * L(yhat_i, y_i))

Yes, I agree that it would be nice if all performance measures in common use were defined as the mean of a per-observation measure, from both the theoretical and practical points of view. But many entrenched performance measures (absent from LossFunctions.jl) don't satisfy this criterion. Examples include rms and its many cousins, the area under the ROC curve, and F_β scores. (Of course one could use sums of squares instead of rms, but general users won't want to do this.) More benign examples are things like true_positive, which count instead of average the per-observation measurements (as they are conventionally defined). You may criticise the use of these measures on theoretical grounds, but you surely know they are ubiquitous.
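A tiny plain-Julia check of the rms case — the sample value is a monotone transform of a mean, not the mean itself:

using Statistics

yhat = [2.0, 4.0, 6.0]
y    = [1.0, 1.0, 1.0]

per_obs = abs2.(yhat .- y)     # per-observation squared errors: [1.0, 9.0, 25.0]

mean(per_obs)                  # ≈ 11.67 — aggregation by mean (the mse view)
sqrt(mean(per_obs))            # ≈ 3.42  — rms: the aggregation is not a plain mean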

We consequently take a more general point of view than you propose: A measure is a function applied to a sample, and we do not require that it be the aggregate of any function applied to individual observations.

In those cases where a measure applied to the sample can be recovered by aggregating its applications to the observations in isolation, one is allowed (and we generally should, but currently don't always) to implement reports_each_observation as true, which indicates that the corresponding measure method returns a vector of the per-observation measurements instead of a single value. If the reports_each_observation trait is false, a single value is expected.

aggregation

Measures that report_each_observation are aggregated outside of the measures API, and so we require the aggregation trait to declare how the per-observation measurements are to be aggregated to obtain the correct value. Aggregation is not always by mean; rms and true_positive are two of many counterexamples. Furthermore, for any measure, further aggregation occurs in resampling (e.g., CV) when aggregates from multiple samples are themselves aggregated.

Can you please elaborate on how [aggregation] is being used elsewhere in the stack?

When a model's performance is evaluated (using evaluate! or evaluate) one or more performance measures are applied to each observation in resampling (where you have a collection of train/test pairs of row indices, as in CV, for example). These per_observation measurements are aggregated to form a per_fold measurement (across the test set) and the per_fold measurements are in turn aggregated to obtain an overall measurement. For measures like auc, which do not report_each_observation, the first step is skipped (and missing reported). It is worth noting here that the per_observation scores are not discarded after aggregation, as some tuning strategies (Bayesian) make use of them. The evaluate!/evaluate methods return a named tuple with keys per_observation, per_fold, and measurement.
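A sketch of how this surfaces to the user (assuming MLJLinearModels is available as the model provider; the data below is made up):

using MLJ

X = (x1 = rand(100), x2 = rand(100))
y = 2 .* X.x1 .+ 0.1 .* rand(100)

Ridge = @load RidgeRegressor pkg=MLJLinearModels
mach = machine(Ridge(), X, y)

e = evaluate!(mach, resampling=CV(nfolds=6), measure=[rms, mae])

e.measurement        # overall aggregate, one entry per measure
e.per_fold           # per-fold aggregates, one vector per measure
e.per_observation    # per-observation values, or missing (e.g. for auc)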

If you think it's worthwhile, I would be happy to allow the user to specify an alternative aggregation method at time of instantiation of a measure, with the trait specifying a default value.

orientation

I personally find the orientation trait suboptimal. I understand the desire to include multiple concepts (loss, score, etc.) under the same umbrella, but we lose expressivity doing so. There will be traits in the future that only make sense when orientation=:loss or orientation=:score. You already know that my vote goes for deprecating this trait and working on separate concepts for losses, scores, etc. It doesn't mean that we need to have different trait names for these concepts; it just means that we won't be thinking about them as a single generic concept called measures. I would like to be able, for example, to replace is_measure in my user code by more specific traits like is_loss or is_score. Code that consumes losses does not necessarily consume scores, and vice versa. So, in summary, my suggestion would be to deprecate orientation, introduce is_loss, is_score, etc., and finally define a new is_measure(x) = is_loss(x) || is_score(x) for the generic check.

Sorry, I guess I'm missing some use cases here. For me, any loss function becomes a scoring function if I multiply by minus one, and vice versa. I suppose it's common to suppose a loss returns a value between 0 and 1, with 1 optimal, but I was not aware this was a universal convention or used essentially anywhere. Can you provide me with an example of an algorithm that consumes loss functions but cannot also consume scores by simply multiplying the evaluations by minus one (after testing the orientation trait)?

We also want to include as "measures" functions that are neither losses nor scores. One user already requested that confusion_matrix be admissible in performance evaluation, and this has been implemented. Its orientation is :other, which means, for example, that it cannot be used in hyperparameter optimization.

reports_each_observation

  • I understand that the trait reports_each_observation tells whether a loss is returned for the whole sample or per pair in the sample. This doesn't make much sense to me in the context of loss functions, given the definitions above about expected losses over samples. Can you please elaborate on how this trait is being used elsewhere in MLJ? I see that the L1 and L2 losses, for example, report the values for each observation, but wouldn't it be simpler to just broadcast the equivalent scalar losses? To me this reports_each_observation trait could be deprecated as well.

The definition of this trait is given in "What is a loss function?" above.

Several MLJ measures that don't currently report each observation could do so (especially in MLJBase/src/continuous.jl) and I am happy for them to be re-factored.

If a loss function reports_each_observation, then currently it implements both a scalar and a vector version, which I agree is suboptimal. In those cases, I agree it makes sense to require only an implementation of the scalar case, and to use trait dispatch to reduce the vector methods to the scalar case. Of course, when reports_each_observation is false, only a vector method needs to be implemented.
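A rough sketch of that trait-dispatch idea (not the current MLJBase code; the types and trait stand-ins below are hypothetical):

abstract type Measure end

# Hypothetical trait, mirroring the existing reports_each_observation.
reports_each_observation(::Type{<:Measure}) = false

struct L2Loss <: Measure end
reports_each_observation(::Type{L2Loss}) = true

# The measure author implements only the scalar case ...
(m::L2Loss)(yhat::Real, y::Real) = abs2(yhat - y)

# ... and the vector method is supplied generically, guarded by the trait:
function (m::Measure)(yhat::AbstractVector, y::AbstractVector)
    reports_each_observation(typeof(m)) ||
        error("no generic vector fallback for this measure")
    return m.(yhat, y)
end

L2Loss()([1.0, 2.0], [1.5, 2.0])    # [0.25, 0.0]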

From the definition I shared above, every loss function should support weights. The weights are not a property of the loss function itself, but a property of the expectation operator. I would just deprecate supports_weights and implement the weighting mechanism outside the losses.

Yes, but your definition, as noted earlier, is too restrictive for our purposes.

Here is a proposal: We define supports_weights(m) == reports_each_observation(m) && aggregation(m) <: Union{Sum, Mean}. Pros: No need for measures to implement supports_weights; less code, more easily maintained. Cons: Considerable refactoring. No way to specify weights for general measures, such as auc and F_β-scores.

This proposal presupposes that all measures that can implement reports_each_observation indeed do so.
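A sketch of the derived trait in the proposal; the Sum/Mean aggregation types and the stand-in traits below are illustrative only:

# Hypothetical aggregation types, as named in the proposal above.
abstract type AggregationMode end
struct Sum  <: AggregationMode end
struct Mean <: AggregationMode end
struct RootMeanSquare <: AggregationMode end

# Illustrative stand-ins for the existing per-measure traits:
reports_each_observation(measure) = true
aggregation(measure) = Mean            # returns a type, per the proposal

# The derived trait: no per-measure implementation needed.
supports_weights(measure) =
    reports_each_observation(measure) && aggregation(measure) <: Union{Sum,Mean}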

is_feature_dependent

Some problem-specific performance measures depend on the features X as well as on y and yhat. For example, in this data science competition, losses for perishable grocery items are weighted more heavily than non-perishables (and the weighting is non-linear). We provide the is_feature_dependent trait as a mechanism for communicating that a custom performance measure depends on X (so that MLJBase.value(m, yhat, X, y, w) gets dispatched properly). See the MLJ docs for an example of user interaction.

Yes, this trait would be false for all built-in measures.
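Purely as an illustration of the idea (plain Julia, not the MLJ custom-measure API; the perishable column and the weighting are made up):

using Statistics

# A made-up feature-dependent measure: perishable items get triple weight.
function perishable_weighted_l1(yhat, X, y)
    w = [row.perishable ? 3.0 : 1.0 for row in X]   # weights depend on X
    return mean(w .* abs.(yhat .- y))
end

X = [(item = "milk", perishable = true), (item = "rice", perishable = false)]
perishable_weighted_l1([10.0, 5.0], X, [12.0, 5.0])   # 3.0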

is_measure trait

  • I understand that is_measure_type checks if a type is a measure type. In my opinion, the more useful trait is is_measure, which operates on instances. How is the trait on the type being used? Can't we just rename it to is_measure and cover both cases (type + instance)?

Yes, this is a bit untidy. The is_measure_type trait is needed for model inspection. There are two facilities for this:

The is_measure trait (which can be deduced from the other, of course) is not used elsewhere in the stack in any essential way.

Our options would appear to be:

  1. keep is_measure_type and simplify the current code to require implementation of is_measure_type only

  2. re-factor to have only the is_measure trait (acting on instances) and lose the inspection functionality.

I can't think of a reason to prefer 2 over 1. Why do you say is_measure is more useful?

Summary

In summary:

juliohm commented 4 years ago

The output of probabilistic predictors is varied. The predicted distributions need not be parametric or even have analytic representations (e.g., generated by MCMC). For uniformity of interface, it was decided that probabilistic models in MLJ should always predict a distribution, rather than model/domain specific, ambiguously ordered, probabilities or parameters.

This is ok, but it is not an argument in favor of the current API for losses.

An important class of performance measures for probabilistic predictions are the proper scoring rules. See, e.g., this article. Some of these rules are very general in the sense that one formula, defined in terms of the pdf, defines a loss that can be applied to large families of distribution types simultaneously. An example is the Brier score, which applies not just to finite distributions but to any distribution whose pdf is suitably well-behaved. So it is very natural to implement loss functions that operate on a distribution, rather than on some representation the provider and consumer must agree upon case by case.

I disagree with this view. The fact that scoring rules can be used to track performance doesn't mean they fit the concept of a loss as traditionally used.

I understand that limitations in the main ML platforms (scikit-learn, MLR, etc.) around the performance evaluation of probabilistic predictors are a source of some frustration in the Bayesian / probabilistic programming community, and consequently a source of fragmentation between the various paradigms. The package skpro (in which yhat is allowed to be a distribution) is one response to this issue which has informed MLJ's design. See also this related article.

At present we do not implement a large number of proper scoring rules but we should like to do so at some point.

So, for our purposes, I don't agree that yhat should be restricted to a number.

In the referenced article the authors introduce a new concept called probabilistic loss functionals, which is something different from traditional loss functions, and they make that clear. These should be two separate concepts, and this attempt to make everything fit in the same bag is the issue that I am raising. I am discussing the API of traditional supervised loss functions, and in this case it doesn't make sense to allow yhat to be a distribution.

I like the distinction, represented by the trait prediction_type, which ensures that deterministic measures are always applied to deterministic predictions, while probabilistic measures (e.g., cross_entropy) are always applied to probabilistic predictions. It eliminates confusion and provides extra interface points for the user.

I disagree. The current interface is confusing for the end user who is not interested in all the kinds of performance metrics one can possibly conceive of as a "measure". I only wish to evaluate my models with traditional supervised losses for a paper, and now I have to learn a complex trait system to work out which are the losses, which are the scores, which are the probabilistic functionals, what outputs the model produces, and so on. This is unnecessarily complex.

Yes, I agree that it would be nice if all performance measures in common use were defined as the mean of a per-observation measure, from both the theoretical and practical points of view. But many entrenched performance measures (absent from LossFunctions.jl) don't satisfy this criterion. Examples include rms and its many cousins, the area under the ROC curve, and F_β scores.

Exactly. And that is why we shouldn't be talking about rms as if it were a supervised loss as defined above (and in LossFunctions.jl). Something that doesn't fit the definition above deserves a separate API and set of traits.

We consequently take a more general point of view than you propose: A measure is a function applied to a sample, and we do not require that it be the aggregate of any function applied to individual observations.

This general view is useless in practice, because I need to know the nature of the function that I am applying to a sample. If I know that the function, for example, satisfies the definition I gave above, I can expect certain properties to hold. Now we have a generic thing called "measure" that lumps a bunch of different concepts into the same bag. The user is now terrified because they don't know which combination of traits to use to filter things out.

Sorry, I guess I'm missing some use cases here. For me, any loss function becomes a scoring function if I multiply by minus one, and vice versa. I suppose it's common to suppose a loss returns a value between 0 and 1, with 1 optimal, but I was not aware this was a universal convention or used essentially anywhere. Can you provide me with an example of an algorithm that consumes loss functions but cannot also consume scores by simply multiplying the evaluations by minus one (after testing the orientation trait)?

For example, as I defined above, all losses for me are "weightable" because this is a property of the expectation operator and not of the loss. As you mentioned, there are scoring rules which are not computed on a per-sample basis and are not aggregated with an expectation operator. So I cannot use those.

Yes, but your definition, as noted earlier, is too restrictive for our purposes.

Again, I am not proposing a redefinition of measure, I am proposing a specific definition of loss. As I understand you have loss + scoring rules + whatever = performance measure, but I don't care about the rest of the list at this moment. Just the loss functions.

Summary

Unfortunately we have views of the world that are too different when it comes to software design. I am always willing to contribute to the MLJ stack, but I realize that it is very difficult to do so given that my research needs are not being addressed by the current design. I could try to adapt my viewpoint to contribute, but that is not efficient, because the proposal you have, where yhat and y have different types, does not seem right to me conceptually, and only makes things more complex than strictly necessary. In that scenario, where I have already tried to clarify my concerns with a GitHub issue as usual, I think the most productive path forward is to just fork the concepts that I am not satisfied with, as I've been doing in GeoStats.jl.

If for some reason we change our minds in the future about this design, we can try to reconcile the codebases.

juliohm commented 4 years ago

I've actually just discovered that LossFunctions.jl does the weighting correctly: https://juliaml.github.io/LossFunctions.jl/stable/user/aggregate/ Sharing in case someone stumbles on the same bug here.

ParadaCarleton commented 1 year ago

@juliohm could you summarize the main issues you have with this interface? None of the issues here seem irreconcilable, and I really don't want to fragment the Julia ML ecosystem the way other interfaces and ecosystems (like named arrays or automatic differentiation) have been. There may be some places where we have to create different packages, but as much as possible I think we should try to make sure everything is interoperable.

To try and give a summary of the main issues I've found:

First, it looks like you want to focus on the narrower category of proper loss functions, rather than generic loss functionals. How about we create a new type called something like "Separable loss functions" that contains only losses that can be expressed as f(mean(loss(yhat, y))), where f is monotonic and equal to the identity by default? (f is there because sometimes, adding one final function call can make the resulting loss function easier to interpret, as in RMS; however, this doesn't make a difference as long as f is monotonic.)
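A sketch of what such a wrapper might look like (the SeparableLoss name and its fields are hypothetical):

using Statistics

# Hypothetical wrapper: a loss expressible as f(mean(loss.(yhat, y))),
# with f monotonic and equal to the identity by default.
struct SeparableLoss{L,F}
    perobs::L     # scalar loss on a single (yhat, y) pair
    f::F          # final monotonic transform
end
SeparableLoss(perobs) = SeparableLoss(perobs, identity)

function (m::SeparableLoss)(yhat, y)
    l = m.perobs
    return m.f(mean(l.(yhat, y)))
end

# rms fits this pattern with perobs = squared error and f = sqrt:
rms_like = SeparableLoss((yhat, y) -> abs2(yhat - y), sqrt)
rms_like([2.0, 4.0], [1.0, 1.0])    # sqrt(mean([1.0, 9.0])) ≈ 2.236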

This way we can still allow generalized loss functionals. Or, if you'd like, we could split this package into two packages, one for separable+proper loss functions and one for more "unusual" losses.

Unless we have a use case for a different aggregation method that is not the sample mean, this trait is also unnecessary.

I believe this is just a convenience for computational efficiency. It's always possible to find a function f such that applying the inverse of f to sum(f, x) recovers the aggregated value — for example, using the logarithm to convert products into sums. I think this could be deprecated in theory, or just pushed into some hidden corner of the documentation with a default of mean (to avoid bothering new users implementing this interface).
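For instance, a product-style aggregation (the geometric mean) can be recovered from sums of logs — a minimal plain-Julia check:

using Statistics

x = [2.0, 8.0]

prod(x)^(1 / length(x))    # geometric-mean aggregation: 4.0
exp(mean(log.(x)))         # same value via sums of logs:  4.0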

I could try to adapt my viewpoint to contribute, but that is not efficient because the proposal you have where yhat and y have different type does not seem right conceptually, and only makes things more complex than strictly necessary.

Can you clarify what you'd propose as an alternative interface here?

juliohm commented 1 year ago

I really don't want to fragment the Julia ML ecosystem the way other interfaces and ecosystems

I sympathize with this feeling, but please understand that I had done my homework before moving forward with the development of alternative packages. Thank you for trying to revive this issue though.

How about we create a new type called something like "Separable loss functions" that contains only losses that can be expressed as f(mean(loss(yhat, y))), where f is monotonic and equal to the identity by default?

That is JuliaML/LossFunctions.jl (I am the main maintainer nowadays).

Can you clarify what you'd propose as an alternative interface here?

I disagree with many design decisions that have been made in the project, but I respect them. I don't have any intention of brainstorming MLJ interfaces at this point in time. As I mentioned in another issue, we are not using the project in our industrial applications anymore.

ablaom commented 1 year ago

In case it is useful, MLJBase measures were recently moved out to StatisticalMeasures.jl. These are based on a modified system of traits that are part of StatisticalMeasuresBase.jl.

ParadaCarleton commented 1 year ago

In case it is useful, MLJBase measures were recently moved out to StatisticalMeasures.jl. These are based on a modified system of traits that are part of StatisticalMeasuresBase.jl.

Oh, this is great, it looks like the two interfaces are compatible now, so I can just use StatisticalMeasures.jl with LossFunctions.jl measures. Thank you for the hard work on this, Anthony!