
blog/target-encoding/ #8

utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Target encoding done the right way - Max Halford

When you're doing supervised learning, you often have to deal with categorical variables. That is, variables which don't have a natural numerical representation. The problem is that most machine learning algorithms require the input data to be numerical. At some point or another a data science pipeline will require converting categorical variables to numerical variables. There are many ways to do so:

- Label encoding, where you choose an arbitrary number for each category
- One-hot encoding, where you create one binary column per category
- Vector representation a.

https://maxhalford.github.io/blog/target-encoding/

masinde70 commented 3 years ago

Thanks!

Mo-Saif commented 3 years ago

Great article, thank you.

kmsandeep commented 3 years ago

Awesome explanation. Thank you.

hristochr commented 3 years ago

This has been very helpful and I have applied this method with success. However, once I have a trained model based on data encoded this way: how do I correctly encode brand new data in the same way? For new data in a production scenario I surely don't have a target vector to average over. Would you just substitute the incoming categorical values with the target encoding values from the training set?

MaxHalford commented 3 years ago

@hristochr yeah, that's the idea. In fact, that's the whole point of introducing a prior: it prevents overfitting the mean target value and gives you a sensible fallback for new, previously unseen categories.
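
To make that concrete, here is a minimal pandas sketch of the fit/apply split, using the smoothed mean from the post; the function names and the `m=10` default are illustrative, not from the post:

```python
import pandas as pd

def fit_smooth_mean(df, by, on, m=10):
    """Learn the smoothed target encoding on the training set."""
    prior = df[on].mean()  # global mean, used as the prior
    agg = df.groupby(by)[on].agg(['count', 'mean'])
    smooth = (agg['count'] * agg['mean'] + m * prior) / (agg['count'] + m)
    return smooth.to_dict(), prior

def apply_smooth_mean(series, mapping, prior):
    """Encode new data; unseen categories fall back to the prior."""
    return series.map(mapping).fillna(prior)
```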

bnriiitb commented 3 years ago

Excellent article 👏

y0203j commented 2 years ago

Great post. Just wondering, how did you come across frequency encoding? Thanks

MaxHalford commented 2 years ago

Hey @y0203j. I can't recall exactly, but I remember using it quite a lot during Kaggle competitions. For instance, a breakthrough was made in the Santander competition thanks to frequency encoding.
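
For readers who haven't seen it: frequency encoding simply replaces each category with how often it occurs in the training set. A minimal pandas sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'b', 'c', 'c', 'c']})

# Map each category to its relative frequency in the training set
freqs = df['x'].value_counts(normalize=True)
df['x_freq'] = df['x'].map(freqs)
```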

rm17tink commented 2 years ago

When I do the sample above, df.groupby('x_1')['y'].mean(), I get [.55, 0], not [.44, 0].

MaxHalford commented 2 years ago

Good catch @rm17tink! Just fixed it.

lcrmorin commented 2 years ago

I like encoding with Weight of Evidence. It gives a very stable and additive baseline.
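
For context, Weight of Evidence compares how each category splits between the two classes of a binary target: WoE(c) = ln(P(x = c | y = 1) / P(x = c | y = 0)). A rough sketch, where the `eps` smoothing term is my own addition to avoid empty cells:

```python
import numpy as np
import pandas as pd

def woe_encode(x: pd.Series, y: pd.Series, eps: float = 0.5) -> pd.Series:
    """Weight of Evidence encoding for a binary target y (0/1)."""
    grouped = y.groupby(x)
    events = grouped.sum() + eps                      # positives per category
    non_events = grouped.count() - grouped.sum() + eps  # negatives per category
    woe = np.log((events / events.sum()) / (non_events / non_events.sum()))
    return x.map(woe)
```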

shiv1991 commented 1 year ago

Hey, thanks for the details. How would I decode back to my original values after encoding? I want it for post-prediction analysis.

MaxHalford commented 1 year ago

Hey @shiv1991, I'm not sure what you mean 😅. You want to convert the encoded column(s) back to the original categorical values? That sounds a bit out of scope for this post. I'm sure you can do that with some pandas karate.
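
One possible bit of karate, assuming you kept the category-to-encoding mapping learned at training time and that the encoded values are unique per category (all names here are hypothetical):

```python
# Invert the mapping learned at training time (assumes unique encoded values)
inverse = {encoded: category for category, encoded in mapping.items()}
df['x_decoded'] = df['x_encoded'].map(inverse)
```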

Nevermetyou65 commented 1 year ago

Hi sir, I really appreciate the content. It's so good. But I found a related article about target encoding, and its formula for computing the encoded value is different from yours.

Could you clarify the differences for me, please? Here is the aforementioned article: https://towardsdatascience.com/dealing-with-categorical-variables-by-using-target-encoder-a0f1733a4c69

MaxHalford commented 1 year ago

Hey @Nevermetyou65. Thanks for the kind words.

The idea behind the method you linked is the same: we want to "smooth" the mean to counteract the fact that it is biased by the low number of samples. The method you linked is based on this paper. I'm sure you would get similar results if you trained an ML classifier using either method. However, I prefer my method because it only has one (intuitive) parameter to tune. The other method has two parameters (f and k). If you have time, I would suggest trying out both methods. I hope that helps!
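
To make the difference concrete, here is a rough side-by-side sketch of the two smoothing schemes; the `k=20` and `f=10` defaults are arbitrary placeholders, not values from either source:

```python
import numpy as np

def smooth_additive(n, cat_mean, prior, m=10):
    """This post's method: one smoothing parameter, m."""
    return (n * cat_mean + m * prior) / (n + m)

def smooth_sigmoid(n, cat_mean, prior, k=20, f=10):
    """The linked method: a sigmoid blend with two parameters, k and f."""
    lam = 1 / (1 + np.exp(-(n - k) / f))
    return lam * cat_mean + (1 - lam) * prior
```

Both pull rare categories towards the prior and converge to the raw category mean as the count grows.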

achilela commented 9 months ago

Sharp and to the point. An easy read for following and grasping the concepts. I will give it a try, following the DataCamp course on CI/CD in Machine Learning.

parthadh commented 2 months ago

Sorry, this link is broken: https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf

Could you relink the discussion?

lcrmorin commented 2 months ago

Nowadays the state of the art would be optbinning. At its core it is a binning library, but it provides solutions for encoding. It is generally well rounded and easy to use, with lots of quality-of-life tools (plots).
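
For instance, a quick sketch based on my reading of optbinning's documented API (the toy data is made up):

```python
import pandas as pd
from optbinning import OptimalBinning

df = pd.DataFrame({'x': ['a', 'a', 'b', 'c', 'c', 'c'],
                   'y': [1, 0, 0, 1, 1, 0]})

# Fit an optimal binning of the categorical variable against the target
optb = OptimalBinning(name='x', dtype='categorical')
optb.fit(df['x'], df['y'])

# Replace each category with the Weight of Evidence of its bin
df['x_woe'] = optb.transform(df['x'], metric='woe')
```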