MaxHalford / maxhalford.github.io

:house_with_garden: Personal website
https://maxhalford.github.io
MIT License

blog/lightgbm-focal-loss/ #12

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Focal loss implementation for LightGBM - Max Halford

Motivation If you’re reading this blog post, then you’re likely to be aware of LightGBM. The latter is a best of breed gradient boosting library. As of 2020, it’s still the go-to machine learning model for tabular data. It’s also ubiquitous in competitive machine learning. One of LightGBM’s nice features is that you can provide it with a custom loss function. Depending on what you’re doing, this may have a big positive impact.

https://maxhalford.github.io/blog/lightgbm-focal-loss/

babinu-uthup-4JESUS commented 3 years ago

Hi Max, thank you for this amazing writeup. I especially loved the way you explained each step in detail.

I have a couple of points to note though:

  1. In the latest version of LightGBM (I tested on version 3, which I believe is the latest), you do not need to add the initialization score to the raw predictions; the values match even without it. You may want to update that here.

  2. You missed the following import in the section where you pasted the full code:

    from sklearn import metrics

Thank you once again for your work!

babinu-uthup-4JESUS commented 3 years ago

I do have a clarification to make about the first point I mentioned above. Specifically, the following line is to be avoided:

y_pred = special.expit(logloss_init_score(y_fit) + model.predict(X_test))

In fact, it should be replaced by the line below:

y_pred = special.expit(model.predict(X_test))

The part where we initialize the fit and validation values appropriately using init_score while creating the LightGBM datasets is to be retained, and is critical to obtaining correct results.
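
For reference, the dataset setup being referred to looks roughly like this (a sketch with hypothetical X_fit / y_fit / X_val / y_val, assuming logloss_init_score returns the log-odds of the mean label):

    import numpy as np
    import lightgbm as lgb

    def logloss_init_score(y):
        # Log-odds of the average label, used as a constant init score.
        p = np.mean(y)
        return np.log(p / (1 - p))

    def make_datasets(X_fit, y_fit, X_val, y_val):
        # Attach the init_score to both datasets; this is the part to keep.
        init = logloss_init_score(y_fit)
        fit = lgb.Dataset(X_fit, y_fit, init_score=np.full_like(y_fit, init, dtype=float))
        val = lgb.Dataset(X_val, y_val, init_score=np.full_like(y_val, init, dtype=float), reference=fit)
        return fit, val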

Thanks

MaxHalford commented 3 years ago

Hey @babinu-uthup-4JESUS, thanks a lot for your comment(s)!

In the latest version of LightGBM (I tested on version 3, which I believe is the latest), you do not need to add the initialization score to the raw predictions; the values match even without it. You may want to update that here.

I've just checked against version 3.1.1 and this is not true. I did update the blog post because there was a change in the gradient sign between versions 2 and 3, but that's it. I've double-checked and the custom binary logloss outputs align with LightGBM's default settings.

You missed the following import in the section where you pasted the full code

Cheers, it's fixed now!

yairdata commented 3 years ago

Hi Max, thanks for the excellent article! I was wondering if there is a possibility to enhance the focal loss function so that it weights samples according to an attribute from the dataset (say, the transaction amount in a fraud detection problem).

MaxHalford commented 3 years ago

Hello @yairdata, cheers! I believe that what you're hinting at is a weighted focal loss. That's definitely doable :). If I ever find some time, I'll write a separate blog post to discuss this aspect. In the meantime, feel free to have a go yourself.
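
In the meantime, a very rough sketch of what a weighted focal loss objective could look like (this is not code from the post; the per-sample weights, e.g. transaction amounts, are assumed to be attached with lgb.Dataset(..., weight=...), and the gradients are approximated numerically to keep things short):

    import numpy as np
    from scipy import special

    def weighted_focal_loss_lgb(y_pred, dtrain, alpha=0.25, gamma=2.0):
        # Sketch of a weighted focal loss objective for lgb.train(fobj=...).
        # Per-sample weights (e.g. transaction amounts) are read from the Dataset,
        # where they would have been set via lgb.Dataset(..., weight=...).
        y_true = dtrain.get_label()
        w = dtrain.get_weight()
        if w is None:
            w = np.ones_like(y_true, dtype=float)

        def fl(z, t):
            # Focal loss of raw scores z against labels t.
            p = special.expit(z)
            pt = t * p + (1 - t) * (1 - p)
            at = t * alpha + (1 - t) * (1 - alpha)
            return -at * (1 - pt) ** gamma * np.log(pt)

        # Crude numerical gradient and hessian, good enough for a sketch.
        eps = 1e-3
        grad = (fl(y_pred + eps, y_true) - fl(y_pred - eps, y_true)) / (2 * eps)
        hess = (fl(y_pred + eps, y_true) - 2 * fl(y_pred, y_true) + fl(y_pred - eps, y_true)) / eps ** 2
        return w * grad, w * hess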

GinWu commented 3 years ago

Hi Max, thanks for your wonderful writing. I imitated your work in a former project on a multi-class task and ran into a problem with init_score: LightGBMError: Number of class for initial score error. I supposed it was because of the multi-class setting, so I then used an ndarray of shape (n_samples, num_class), but got another error saying init_score only accepts a 1D array.

I was confused: is init_score not available for multi-class problems? Could you figure out how to deal with this issue?

Thanks a lot!

MaxHalford commented 3 years ago

Hello @GinWu. Alas, I haven't looked into the multi-class case. Maybe LightGBM uses some weird convention where the init scores, which are supposed to be a 2D array, are expected to be flattened into a 1D array?
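
Something along these lines might work (an untested sketch with hypothetical X_train / y_train, using log class priors as the init scores):

    import numpy as np
    import lightgbm as lgb

    def make_multiclass_dataset(X_train, y_train, num_class):
        # Per-class log-prior init scores, shape (n_samples, num_class).
        priors = np.bincount(y_train, minlength=num_class) / len(y_train)
        init_2d = np.tile(np.log(priors), (len(y_train), 1))
        # LightGBM wants the init scores as a 1D array of length
        # n_samples * num_class; double-check the flattening order
        # expected by your version.
        return lgb.Dataset(X_train, y_train, init_score=init_2d.ravel())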

GinWu commented 3 years ago

Oh, you're right. I tried your suggestion to flatten the 2D array to 1D and the problem is solved.

Thank you very much!

hfzarslan commented 3 years ago

What would the derivative of the multi-class focal loss with softmax be?

MaxHalford commented 3 years ago

@hfzarslan you'll need to work that out for yourself. I don't have the time to answer that question in a comment. I might write another blog post about the multi-class case, but I can't make any promises.

igorkf commented 3 years ago

Hi, this was a nice post! In the case of log loss, when the classes are balanced, a "dummy" loss would be -log(1/N), where N is the number of classes.

How could I make this analogy for focal loss? Is there a "dummy" value for focal loss against which I could compare my training, to see if my estimator "beats" the dummy focal loss?

MaxHalford commented 3 years ago

I'm not sure @igorkf, but I guess it would boil down to -log(1 / N) if you assume the classes are balanced 🤷🏼‍♂️
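
For what it's worth, plugging a constant uniform prediction into the standard focal loss formula gives a baseline along these lines (a sketch, not something from the post):

    import math

    def dummy_focal_loss(n_classes, gamma=2.0, alpha=1.0):
        # Focal loss of a constant uniform prediction p_t = 1 / n_classes.
        # With gamma = 0 this reduces to the plain -log(1 / N) dummy log loss.
        p_t = 1 / n_classes
        return -alpha * (1 - p_t) ** gamma * math.log(p_t)

    print(dummy_focal_loss(2))  # ≈ 0.173 for the binary case with gamma = 2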

nitinmnsn commented 3 years ago

Thank you for the blog, very helpful! One question: don't you have to augment the predictions you are computing in the loss function and metric function with the init_score as well?

MaxHalford commented 3 years ago

@nitinmnsn no you don't! They're already included. It's not a well-documented behavior and it's one of the caveats to be aware of. It's one of the reasons why I wanted to write this blog post.
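
To make the caveat concrete, a custom metric can use the raw scores as-is, along the lines of this sketch (not the exact code from the post):

    import numpy as np
    from scipy import special

    def custom_logloss_eval(y_pred, dtrain):
        # y_pred is the raw score and already includes the init_score set on
        # the Dataset, so expit can be applied directly.
        y_true = dtrain.get_label()
        p = special.expit(y_pred)
        loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
        return 'custom_logloss', loss, False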

JiaxiangBU commented 3 years ago

Thanks for your helpful post. I found it through Google while searching for how to set a 'base_score' in LightGBM like in XGBoost. I also learned about custom loss functions with the focal loss.

Totally agree with you about the base score trick: using the average label value is a good starting point for running a baseline with XGBoost or LightGBM.

SylvanLiu commented 2 years ago

I really appreciate your spirit of exploration, and this gorgeous essay solves a dozen problems I have been working through recently. One point I want to add is that there is also a predict_proba method, on lightgbm.LGBMClassifier (the scikit-learn API) rather than on the Booster, which is different from .predict. But it does not support custom loss functions of any kind xD. More details can be checked here:

https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier.predict_proba

SylvanLiu commented 2 years ago

There is also one thing I think was not considered in this essay: is the sigmoid, with the current settings (its decision threshold sitting at 0.5), really appropriate in all conditions? From my point of view there is an important issue: the trees do not take the sigmoid into account while training, which is very different from a sigmoid ANN. The model outputs raw margin scores rather than real probability-like scores. So is it sound to simply apply the sigmoid on top of that output?

Most people who use the focal loss aim to solve class-imbalance problems. However, after training the model, the confusion matrix on the validation set may look strange: [[326050, 38234], [118804, 24031]]. For me the negative-to-positive class ratio is nearly 3:1, and the result shows high recall and high precision for the negative class, whereas the positive class gets medium precision and very low recall.

My explanation for this result: we know that the sigmoid crosses 0.5 exactly when the raw score crosses 0, and we know that LightGBM outputs raw margin scores rather than ANN-like probabilities. The point where the prediction flips can be regarded as the position of the model's decision boundary, and the model's performance (precision and recall for both classes) reflects the shape of that boundary. So here, one class performs perfectly and the other does not. I think this is because we put the trained decision boundary in the wrong position; in other words, it should not be 0.5, and we can correct it with a new round of hyper-parameter search (maybe with Optuna), e.g. trial.suggest_float("turnaround_point", 1e-8, 1 - 1e-8, log=True). We can then replace the sigmoid with a simple threshold function, which performs much better.

SylvanLiu commented 2 years ago

To follow up on my comment above: there was something apparently wrong in it. LightGBM does take the sigmoid into account during training and validation, which can be checked here: https://github.com/microsoft/LightGBM/blob/0701a32da9fae7b2de8a01d702fa7d6abf36e836/src/objective/binary_objective.hpp in the GetGradients() implementation for the binary objective, since the focal loss can also be regarded as an improvement of the BCE loss and only applies to binary objectives.

But what I was trying to express is that we can construct a new threshold function on top of the raw margin scores. This can be regarded as a data-based debiasing step applied after the model has been trained, based on a hyper-parameter search for the location of the turnaround point.
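
A rough sketch of that idea, searching for the turnaround point on the raw scores with Optuna (raw_scores and y_val are hypothetical, and the F1 score is just one possible criterion):

    import numpy as np
    import optuna
    from sklearn import metrics

    def make_objective(raw_scores, y_val):
        # Search for a decision threshold on the raw margin scores,
        # instead of fixing it at 0 (i.e. 0.5 after the sigmoid).
        def objective(trial):
            threshold = trial.suggest_float('threshold', float(raw_scores.min()), float(raw_scores.max()))
            y_hat = (raw_scores > threshold).astype(int)
            return metrics.f1_score(y_val, y_hat)
        return objective

    # usage sketch (assuming raw scores from model.predict and validation labels y_val):
    # study = optuna.create_study(direction='maximize')
    # study.optimize(make_objective(raw_scores, y_val), n_trials=100)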

Best regards.

luisgustavob78 commented 2 years ago

Hey Max, how are you? First of all, thank you for sharing this article, it has helped me a lot! I just have an issue when I try to calculate the F1 score and the confusion matrix. I get the error message: "classification metrics can't handle a mix of binary and continuous targets"

How can I convert my output predictions into the binary classes that I need to classify my data?

MaxHalford commented 2 years ago

Hello @luisgustavob78! I'm good, thanks for asking.

Not too sure without looking at your data, but it's likely that your predictions are floats and not integers. Maybe you're using probabilities? You have to pass classes, not probabilities, to a confusion matrix. So you have to pick a threshold and turn your probabilities into classes. Hope that helps!
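
For instance, something along these lines (a sketch with hypothetical y_pred probabilities and y_test labels):

    import numpy as np
    from sklearn import metrics

    def to_classes(probabilities, threshold=0.5):
        # Turn predicted probabilities into hard 0/1 classes.
        return (np.asarray(probabilities) > threshold).astype(int)

    # usage sketch (assuming y_test and probabilities from expit(model.predict(X_test))):
    # y_class = to_classes(y_pred)
    # print(metrics.confusion_matrix(y_test, y_class))
    # print(metrics.f1_score(y_test, y_class))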

pinouche commented 2 years ago

Great blog, thanks a lot!!

Small detail, you're using logarithmic loss in some places instead of logistic loss.

all the best

lcrmorin commented 2 years ago

I keep coming back to this blog post. First it was because I was building custom losses; now it is because I'm trying to use a logistic regression to build the initial score. I can't really find a good source, so I thought I might as well ask here...

The score should be initialised with the log-odds of the sklearn output, right? Regarding predictions, I face a similar problem as mentioned above: the result does seem better without adding the initial score... (but it is still worse than not using score initialisation at all). Finally, I'm trying to explain the predictions (feature importance, Shapley values). Do you know how to use / add the initial score in feature importances / Shapley values? (I've made a small Kaggle notebook in case my questions spark some interest: https://www.kaggle.com/code/lucasmorin/using-base-score-lgbm/)

MaxHalford commented 2 years ago

@pinouche

Small detail, you're using logarithmic loss in some places instead of logistic loss.

Thanks, fixed! Best.

@lcrmorin

The score should be initialised with the log-odds of the sklearn output, right?

I have no idea! Intuitively I don't see why bootstrapping with a logistic regression would provide better results. But why not :)

Finally I try to explain the prediction (Feature importance, Shapley values). Do you know how to use / add the initial score to feature importance / shapley values ?

It's on my list of things to explore. I found this blog post inspiring.

sushmit-goyal commented 2 years ago

Hey @MaxHalford, thanks for the amazing writeup, it seems very handy to me. I am trying to implement a similar custom loss function, but in each boosting round I need to train a NN with the features extracted from the trees, and I need to calculate the gradients and Hessians by combining the custom loss with the loss from the NN. I was wondering where I need to make the necessary changes. I am a newbie, so any help would be greatly appreciated.

For context I'm trying to implement the technique in this paper.

Thanks again!

MaxHalford commented 2 years ago

Hey @sushmit-goyal! That sounds too complicated to answer off the top of my head. And sadly I don't have any time to dig into it. Sorry, but good luck!

victoriachz commented 1 year ago

Hello @MaxHalford :) Thank you for this article! I am trying to perform hyperparameter tuning for LightGBM with Optuna, and I used your code to change the objective function. However, during optimization I get the same value for the focal loss at each trial... Can someone help me? 😅

MaxHalford commented 1 year ago

Hey @victoriachz. I won't be able to help you out without some code to look at. Feel free to send me an email if you want to discuss.

victoriachz commented 1 year ago

Hi again @MaxHalford . I just sent you an email :)

Vikram12301 commented 8 months ago

I saw the confusion matrix for log loss; it is as below:

           0      1
    0  71081     26
    1      8     87

And for focal loss it is:

           0      1
    0  71081      8
    1     27     86

So, is log loss better than focal loss in some cases? Or is it because of the dataset?