GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License

Strip Accents option for Tokenizers #270

Open ertugrul-dmr opened 3 years ago

ertugrul-dmr commented 3 years ago

While doing error analysis, I noticed that some texts are written with accent-stripped versions of Turkish characters (çok > cok, değil > degil, ağaç > agac, etc.), while others are not. This leads to several different tokens for the same word in some vectorizers.

I believe it is worth testing whether stripping accents helps.

I'll be working on this, and if I get satisfactory test results I'll open a pull request for it.

For this purpose:

husnusensoy commented 3 years ago

Please check how strip_accents works in sklearn; maybe we can offer the same capability. But first, prove that it really improves some model.
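For reference, the `strip_accents="unicode"` mode of scikit-learn's text vectorizers is, as far as I know, based on NFKD normalization followed by dropping the combining marks. The sketch below reproduces that idea with the standard library only; the helper name `strip_accents_nfkd` is made up for illustration and is not part of sklearn or sadedegel.

```python
import unicodedata


def strip_accents_nfkd(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining marks.

    Illustrative helper, not a sadedegel or sklearn function.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["çok", "değil", "ağaç", "ışık"]:
    print(word, "->", strip_accents_nfkd(word))
```

One Turkish-specific caveat: the dotless ı (U+0131) has no Unicode decomposition, so this approach leaves it untouched ("ışık" becomes "ısık", not "isik"). An explicit character mapping would be needed if ı should become i.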

ertugrul-dmr commented 3 years ago

Results:

I implemented a similar function to preprocess texts and tested it on the prebuilt models. On average, it decreased our results:

| Prebuilt Model | Original Result | Preprocessed Result |
|---|---|---|
| Tweet Sentiment Classification | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8613 | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8637 |
| Movie Review Sentiment Classification | F-1: 0.8258 | F-1: 0.7816 |
| Telco Tweet Sentiment Classification | F-1: 0.6871, Accuracy: 0.6925 | F-1: 0.694, Accuracy: 0.699 |
| Turkish Customer Reviews Classification | F-1: 0.851 | F-1: 0.8132 |

Conclusion:

My observations: this function gives little to no improvement for models built with HashVectorizer. Meanwhile, it degrades the tf-idf models, which I believe is because it increases the number of OOV tokens a lot...
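One way the OOV effect could arise, assuming the tf-idf vocabulary was fit on text that keeps the Turkish characters (the thread does not confirm this), is sketched below. The mapping table, the vocabulary, and the sample sentence are toy values chosen for illustration, not data from the prebuilt models.

```python
# Hypothetical mapping and toy data; not taken from the sadedegel models.
TR_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def oov_rate(tokens, vocabulary):
    """Fraction of tokens that are missing from the vocabulary."""
    missing = [t for t in tokens if t not in vocabulary]
    return len(missing) / len(tokens)


# Vocabulary "fit" on text that keeps Turkish characters.
vocabulary = {"çok", "güzel", "değil", "film"}

tokens = "film çok güzel değil".split()
stripped = [t.translate(TR_TO_ASCII) for t in tokens]

print(oov_rate(tokens, vocabulary))    # 0.0
print(oov_rate(stripped, vocabulary))  # 0.75
```

A hashing-based vectorizer has no fixed vocabulary, so it would not show this particular failure mode, which may be why the two model families react differently.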

I suggest we add it as an option, since it might bring improvements in future uses, depending on the raw text and the vectorizer used...

husnusensoy commented 3 years ago

Now you are talking ...

husnusensoy commented 3 years ago

How did you perform these tests? Sadedegel does not support accent stripping for now, does it?

ertugrul-dmr commented 3 years ago

I added the strip_accents function locally and implemented it in the same way as the emoji, hashtag, and mention options. Then I tested the models with strip_accents=True.
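A minimal sketch of what such an opt-in flag might look like is given below. `ToyTokenizer` and its whitespace splitting are purely hypothetical and do not reflect sadedegel's tokenizer classes or how the emoji, hashtag, and mention options are actually implemented; only the flag name `strip_accents` comes from the thread.

```python
# Explicit Turkish-to-ASCII mapping (assumed for illustration).
TR_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


class ToyTokenizer:
    """Hypothetical whitespace tokenizer; not sadedegel's implementation."""

    def __init__(self, strip_accents: bool = False):
        # Off by default, so existing behaviour is unchanged.
        self.strip_accents = strip_accents

    def __call__(self, text: str) -> list:
        if self.strip_accents:
            text = text.translate(TR_TO_ASCII)
        return text.split()


print(ToyTokenizer()("Ağaç çok güzel"))
print(ToyTokenizer(strip_accents=True)("Ağaç çok güzel"))
```

Keeping the flag off by default matches the suggestion above to make stripping optional, since it helped some models and hurt others.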