JuliaML / LossFunctions.jl

Julia package of loss functions for machine learning.
https://juliaml.github.io/LossFunctions.jl/stable

Loss Functions #5

Closed · Evizero closed 8 years ago

Evizero commented 8 years ago

This issue continues the discussion started at #4 concerning loss functions.

I thought a lot about our discussions so far and did a lot of targeted reading on the subject. Since the author of EmpiricalRisks seems very busy at the moment, I am starting to agree that, at least for now, we should do our own loss functions (we can always merge efforts later). I really want to keep the momentum going.

I suggest we simply go ahead and establish a package that is only concerned with loss functions but does them well. At the very least, it should serve all our current use cases.

In fact, I have taken the first steps and sketched out a fair amount of code at LossFunctions.jl. The implementation is heavily based on the in-depth treatment of loss functions by Ingo Steinwart et al.


This is what I did so far:

Essentially I'd like to keep the type hierarchy small, but the following abstract tree seems essential to me. Note how this allows dispatching on whether or not a loss acts as a classifier (which I need for SVMs):

Cost,
  Loss,
    SupervisedLoss,
      MarginBasedLoss,
      DistanceBasedLoss,
    UnsupervisedLoss,
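
Roughly, that tree could be declared as follows. This is only a sketch in current Julia syntax (the proposal predates that syntax), not the package's actual source:

# sketch only: the abstract tree from above, plus one example of dispatching on it
abstract type Cost end
abstract type Loss <: Cost end
abstract type SupervisedLoss <: Loss end
abstract type MarginBasedLoss <: SupervisedLoss end
abstract type DistanceBasedLoss <: SupervisedLoss end
abstract type UnsupervisedLoss <: Loss end

# e.g. margin-based losses identify themselves via dispatch; everything else falls back to false
ismarginbased(::MarginBasedLoss) = true
ismarginbased(::SupervisedLoss)  = false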

Furthermore, I have included quite a collection of verbs from the literature that seem useful. Not every loss has to support every verb:

    value,
    deriv,
    deriv2,
    value_deriv,

    value_fun,
    deriv_fun,
    deriv2_fun,
    value_deriv_fun,
    representing_fun,
    representing_deriv_fun,

    isminimizable,
    isdifferentiable,
    isconvex,
    isnemitski,
    isunivfishercons, # useful for multivariate
    isfishercons,
    islipschitzcont,
    islocallylipschitzcont,
    isclipable,
    ismarginbased,
    isclasscalibrated,
    isdistancebased,
    issymmetric
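
To give a flavor of how "not every verb for every loss" could work, here is a rough sketch: conservative fallbacks on the abstract type that individual losses override. This is not necessarily the final approach, and the L2DistLoss below is only a placeholder definition:

abstract type SupervisedLoss end

# conservative defaults: a loss has a property only if it explicitly says so
isconvex(::SupervisedLoss) = false
isdifferentiable(::SupervisedLoss) = false

# a concrete loss opts in to the properties it actually has
struct L2DistLoss <: SupervisedLoss end
isconvex(::L2DistLoss) = true
isdifferentiable(::L2DistLoss) = true

isconvex(L2DistLoss())  # true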

Concerning the losses themselves, I actually want to provide two ways to interact with the most common ones.

  1. Using loss functors, which is the most flexible and general approach
myloss = LogitLoss()
myloss(x) # for margin- and distance- based
myloss'(x) # for margin- and distance- based
value(myloss, ...)
deriv(myloss, ...)
deriv2(myloss, ...)
value_deriv(myloss, ...)

f = value_fun(myloss) # also possible
g = deriv_fun(myloss)


  2. Using plain functions for simple use cases (at least for the common losses). EDIT: I removed that for now to avoid confusion.
logit_loss(...)
logit_deriv(...)
logit_deriv2(...)
logit_value_deriv(...)

For now I have just implemented LogitLoss, L1Loss, and L2Loss as examples. But once we come to some agreement I will happily implement every loss I can think of.
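
To make point 1 above concrete, here is a rough sketch (in current Julia syntax, not the package's actual code) of how a margin-based LogitLoss could support those calls:

# sketch of the functor interface from point 1; names and layout are illustrative only
abstract type SupervisedLoss end
abstract type MarginBasedLoss <: SupervisedLoss end

struct LogitLoss <: MarginBasedLoss end

# value/deriv/deriv2 in terms of the margin a = y*t
value(::LogitLoss, a::Real)  = log1p(exp(-a))
deriv(::LogitLoss, a::Real)  = -1 / (1 + exp(a))
deriv2(::LogitLoss, a::Real) = exp(a) / (1 + exp(a))^2
value_deriv(l::LogitLoss, a::Real) = (value(l, a), deriv(l, a))

# make the loss callable, and let `myloss'` hand back the derivative as a function
(l::LogitLoss)(a::Real) = value(l, a)
Base.adjoint(l::LogitLoss) = a -> deriv(l, a)

value_fun(l::LogitLoss) = a -> value(l, a)
deriv_fun(l::LogitLoss) = a -> deriv(l, a)

myloss = LogitLoss()
myloss(0.5)   # ≈ 0.474
myloss'(0.5)  # ≈ -0.378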


What do you think? Any suggestions or criticism?

tbreloff commented 8 years ago

Cool... glad you kicked this off! I agree that if we're going to be thinking about design it will be much easier to have a separate repo and consider merging later if it makes sense.

I'm a little worried about the naming, but until I dive into it a little more I'm not sure what to suggest. The names just aren't intuitive to me, but if there's good precedent for those names then I could be convinced.

Is there a time when the *_fun methods are necessary? Or are they just conveniences?

How would cross entropy loss be represented?

Final note... I don't love the idea of having lots of different packages. Maybe if LearnBase.jl gets massive we could consider breaking it apart, but I don't think we should start fragmented. It adds unnecessary complexity, makes it harder to maintain and keep versions in sync, etc.

datnamer commented 8 years ago

So how would pipelining etc. work without a common type tree? If learners implement certain verbs, they would work in a pipeline... kinda like duck typing?

Evizero commented 8 years ago

My bet is that it will probably end up being a mix of a flat-ish hierarchy, duck typing, and usage conventions. Without interfaces and multiple inheritance I currently don't really see a better way that's comparable in terms of simplicity and flexibility.

I'll soon go into detail about what I did in KSVM over the past few days and why. I hope that will highlight why some approaches might be better or worse than others. At least it should provide new grounds for discussion.

But if you do have opinions on the subject right now, please share them in #2. There is always room for ideas.

Evizero commented 8 years ago

Concerning cross entropy: I spent some time thinking and reading about how to incorporate cross entropy sensibly into the above framework. Interestingly, the two never really seem to be discussed in detail in the same context.

The gist of it: the cross-entropy loss and the already implemented logistic loss are related via the negative log-likelihood (see Murphy's book, p. 249). They are usually used in different contexts. This got me thinking whether it would make sense to tie it to the LogitLoss class, which would probably mean we stray into GLM territory.

However, to keep things simple, I think we should create a separate category ProbBasedLoss <: SupervisedLoss under which we simply create a CrossentropyLoss. The targets there are expected to be in {0, 1} instead of {-1, 1} as in the margin-based case.
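
In sketch form (names as proposed above, but by no means a final layout):

# sketch only: the proposed probability-based category and its cross entropy loss
abstract type SupervisedLoss end
abstract type ProbBasedLoss <: SupervisedLoss end

struct CrossentropyLoss <: ProbBasedLoss end

# targets y in {0, 1}, predictions t in (0, 1)
value(::CrossentropyLoss, y::Real, t::Real) = -y * log(t) - (1 - y) * log(1 - t)

value(CrossentropyLoss(), 1, 0.9)  # ≈ 0.105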

Bottom line: I think performance and usability should be the highest priorities of the loss framework.

thoughts?

Evizero commented 8 years ago

So I sketched out the CrossentropyLoss. Even though conceptually it is pretty much the same thing as the LogitLoss, I do agree that it deserves its own type:

f(x) = crossentropy_loss(1, sigmoid(x))
g(x) = logit_loss(1, x)
# f and g are equivalent
@assert all([abs(f(x) - g(x)) <= 1e-14 for x = -50:0.00001:50])

Of course the CrossentropyLoss assumes that it is used in combination with a sigmoid squashing function. Otherwise the very beautiful derivative is simply wrong.

function crossentropy_loss(y::Real, t::Real)
  if y == 1
    -log(t)
  elseif y == 0
    -log(1 - t)
  else
    -y*log(t)-(1-y)*log(1-t)
  end
end
crossentropy_deriv(y::Real, t::Real) = t - y
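
For the equivalence check above to actually run, sigmoid and logit_loss need definitions as well; one plausible pairing (assumed here, not spelled out above) would be:

# assumed helpers: the logistic sigmoid and the margin form of the logistic loss
sigmoid(x::Real) = 1 / (1 + exp(-x))
logit_loss(y::Real, t::Real) = log1p(exp(-y * t))

f(x) = crossentropy_loss(1, sigmoid(x))  # crossentropy_loss as defined above
g(x) = logit_loss(1, x)
@assert abs(f(2.0) - g(2.0)) <= 1e-14    # spot check of the identity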


tbreloff commented 8 years ago

Just to be clear, cross entropy and logit should be the same for the 2 class case... CE is the generalized form of logit used for multi-class classification. I still need to look through this in more detail so take that with a grain of salt.

My overall impression, just on a first pass, is that I find the naming sort of confusing and long-winded. I'm also still hesitant about the need for strict type hierarchies. If you can, avoid making a type until you have a strong need for it.

If you have A <: B <: C and D <: B <: C, try getting rid of B (and maybe C as well) until you start repeating code without them or you really need them to solve a dispatch problem. I think you might find B doesn't actually change the code beyond making the type trees more complex. (Ymmv... Just want you to keep that in your head... Minimal is better)


Evizero commented 8 years ago

Just to be clear, cross entropy and logit should be the same for the 2 class case... CE is the generalized form of logit used for multi-class classification

I am not sure I follow. I am no mathematician so I could be wrong (and please please tell me if I am) but here is how I understand it:

When I say LogitLoss I am talking about L(y,t) = ln(1 + exp(-y*t)) where y in {-1, 1}. From your post I am guessing you are thinking of the crossentropy function L(y,t) = -y*ln(t) - (1-y)*ln(1-t) where y in [0, 1] as the LogitLoss. When you look at those two simply as functions, you see that they depend on t differently and are just different functions. I think this code shows nicely how they are related:

f(x) = crossentropy_loss(1, sigmoid(x))
g(x) = logit_loss(1, x)
# f and g are equivalent
@assert all([abs(f(x) - g(x)) <= 1e-14 for x = -50:0.00001:50])

They are used in different contexts. Where one uses t = w'x for the LogitLoss in empirical risk minimization (there is no sigmoid involved there and the output does not give me probabilities), one uses t = sigmoid(w'x) for the crossentropy loss in neural networks or for Bernoulli-type logistic regression, which does give me class probabilities.

Furthermore, the LogitLoss I am thinking of is already universally Fisher-consistent and appropriate for the multicategory classification problem, although Zou, Zhu, and Hastie [1] state that from the likelihood point of view, the multinomial likelihood should be used.

[1] Zou, Hui, Ji Zhu, and Trevor Hastie. "New multicategory boosting algorithms based on multicategory fisher-consistent losses." The Annals of Applied Statistics 2.4 (2008): 1290-1306.


My overall impression, just by first pass, is that I find the naming sort of confusing and long-winded.

Fair enough. I am just trying to avoid ambiguous abbreviations. What kind of names are you thinking of?


I'm also still hesitant about the need for the strict type hierarchies. If you can, see if you can avoid making a type until you have a strong need for it.

I agree with that statement in general, but I would also argue that the hierarchy as it is now makes sense both theoretically (since it reflects the theory 1-to-1) and from a programmer's perspective.

A Cost is of the general form L(x, y, a), where a does not need to be related to x. A Loss assumes the form L(x, y, f(x)). I could be convinced to squash those into one, but then again I'd like to keep Cost as an abstract base class for radically different things like artificial life simulations.

A SupervisedLoss does not depend on x, and it suffices to define L(y, t). An UnsupervisedLoss does not depend on y, and it suffices to define L(x, t).

MarginBasedLoss and DistanceBasedLoss are the most useful base classes to me because they give you the power to dispatch on the interpretation of the loss (regression vs. classification). They are the biggest reason why I started my own loss classes, because I want that for KSVM.jl. I think it's wasteful to define support vector classification and support vector regression as conceptually different, because in the end the loss determines what you are doing. To me it seems like the cleanest, most extensible solution. Also, these two classes make it nice to define sub-classes of the same kind, since a MarginBasedLoss assumes that L(y,t) = f(y*t) and a DistanceBasedLoss assumes L(y,t) = f(y - t), where f is the representing function.
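
To illustrate, a rough sketch of how the representing function could let us define the two-argument value just once per category (the concrete losses here are only placeholders, not the package's code):

# sketch only: value written once per category in terms of the representing function f
abstract type SupervisedLoss end
abstract type MarginBasedLoss <: SupervisedLoss end
abstract type DistanceBasedLoss <: SupervisedLoss end

# each concrete loss only supplies its representing function
representing_fun(l::SupervisedLoss) = error("no representing function defined")

value(l::MarginBasedLoss, y::Real, t::Real)   = representing_fun(l)(y * t)
value(l::DistanceBasedLoss, y::Real, t::Real) = representing_fun(l)(y - t)

struct HingeLoss <: MarginBasedLoss end
representing_fun(::HingeLoss) = a -> max(0, 1 - a)

struct L1DistLoss <: DistanceBasedLoss end
representing_fun(::L1DistLoss) = abs

value(HingeLoss(), 1, 0.3)   # 0.7
value(L1DistLoss(), 3, 2.5)  # 0.5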

So I think the hierarchy has a good reason to exist in that it is useful. These classes reduce repeated code, induce properties on subclasses, and simplify definitions. Of course I am open to being persuaded (which wouldn't be the first time in our discussions).

What do you think would be the benefit in removing them?

tbreloff commented 8 years ago

If the goal is a clean and general interface for all things machine learning, then terms like logit are always going to cause confusion (does it mean the function, the inverse, the regression? Is it {0,1} or {-1,1}?). I certainly tend to think in "probability space" more than "margin space", and so groupings and distinctions that seem reasonable to you may not make much sense to me. I think what I'm advocating is that you spend more time on verbs and functionality right now, and little time on types. Make everything directly subtype from "abstract Loss" and see how far you can get. If you've implemented lots of things from a few different perspectives, then I think the right type tree (if any) will flow naturally. It's a LOT easier to add types than remove them, and you shouldn't add them until you NEED them to solve a problem.

Again I need to review more fully still, these are just design goals that I think will help the final product, and avoid the problem of "coding ourselves into a corner".


Evizero commented 8 years ago

Fair enough, I will keep that in mind moving forward.

I tend to think in a use-case-oriented way, and since the types are currently useful to me, I will go the route of removing them once we encounter a problem or inconsistency with the hierarchy, rather than the other way around. So far it has been really helpful. I am currently in the process of deriving all the loss properties and implementing them. Maybe that will shed light on things or unearth unforeseen problems.

I do see the LogitLoss confusion and naming-convention clashes, though. Since I want to add the distance-based version as well, I might include that in the name: something like LogitMarginLoss, LogitDistLoss, and LogitProbLoss, and maybe make typealias CrossentropyLoss LogitProbLoss. Same for L2DistLoss and L2MarginLoss.
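
Roughly like this (LogitProbLoss as a placeholder type; typealias was the syntax at the time, a const assignment is the current equivalent):

# sketch of the naming idea; the types here are only placeholder definitions
struct LogitMarginLoss end  # margin-based: L(y,t) = ln(1 + exp(-y*t)), y in {-1, 1}
struct LogitProbLoss end    # probability-based: the cross entropy, y in {0, 1}

# the cross entropy name becomes an alias for the probability-based version
# (written `typealias CrossentropyLoss LogitProbLoss` back then)
const CrossentropyLoss = LogitProbLoss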

tbreloff commented 8 years ago

Ok I think I'll have time for a more complete review in a couple weeks, at which time I might sketch out how I'd organize. If our implementations are close, great, otherwise we can have a review of the pros and cons of different styles.


Evizero commented 8 years ago

Love that idea!

Evizero commented 8 years ago

I moved the code here. Basic tests are in place, but overall it's still a WIP.

There are still some more losses I know that I need to implement, and I haven't derived all properties for each loss yet.

Evizero commented 8 years ago

Just a small update: I am still actively working on this. Work just keeps me really busy currently.

I am pretty happy with the state of the losses themselves. Some common ones such as the Huber loss are still missing and I haven't implemented all the properties, but these are easy additions. There are a lot of tests in place to make sure they are working correctly. One thing I still have to do is make sure that sparse vectors and matrices are handled efficiently as well.

There is still some way to go with the risks and penalties (they will likely change), but it's starting to take shape. I am already using them in KSVM.jl to get a feel for what I like and dislike.

Evizero commented 8 years ago

losses done, rest outsourced