Strip Accents option for Tokenizers

GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish

http://sadedegel.ai

MIT License


Strip Accents option for Tokenizers #270

Open ertugrul-dmr opened 3 years ago

ertugrul-dmr commented 3 years ago

While doing error analysis I noticed that some texts are written with accent-stripped versions of Turkish characters (çok > cok, değil > degil, ağaç > agac, etc.) while others are not. For some vectorizers this leads to several different tokens for the same word.

I believe it is worth testing whether stripping accents helps.

I'll be working on this, and if I get satisfactory test results I'll open a pull request for it.

For this purpose:

Create a function that strips accents,
Test that function on some generated sentences that include specific accents,
Integrate it into the code if the results look promising (Note: a previously spotted bug might need to be fixed first),
Then test it on several prebuilt models and analyze the metrics,
If it passes all of the above, open a pull request for it.
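
A minimal sketch of the first step, using only the Python standard library. This is an illustration, not the code that was actually tested in this issue; the function name `strip_accents` and the explicit handling of dotless "ı" are assumptions.

```python
import unicodedata

# Dotless "ı" (U+0131) has no Unicode decomposition, so NFKD alone leaves it
# untouched; map it explicitly (assumption: it should become a plain "i").
_EXTRA = str.maketrans({"ı": "i"})


def strip_accents(text: str) -> str:
    """Return text with combining marks removed, e.g. 'çok' -> 'cok'."""
    # Split each accented character into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(_EXTRA)


for word in ["çok", "değil", "ağaç", "ılık"]:
    print(word, ">", strip_accents(word))
```

Applied to the examples from the issue description, this maps çok, değil and ağaç to cok, degil and agac.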

husnusensoy commented 3 years ago

Please check how strip_accents works in sklearn; maybe we can offer the same capability. But first, prove that it really improves some model.
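
For reference, scikit-learn exposes this through the `strip_accents` parameter of its text vectorizers (`"ascii"`, `"unicode"`, a callable, or `None`, the default). A small sketch with `CountVectorizer`; the two sample sentences are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The same sentence written with and without Turkish accents.
docs = ["çok güzel", "cok guzel"]

# Default behaviour: accented and unaccented spellings are separate tokens.
plain = CountVectorizer().fit(docs)

# strip_accents="unicode" removes combining marks after NFKD normalization,
# so both spellings collapse onto the same tokens.
stripped = CountVectorizer(strip_accents="unicode").fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(stripped.vocabulary_))
```

Note that the `"unicode"` mode works through Unicode decomposition, so characters without a decomposition, such as dotless "ı", are left unchanged.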

ertugrul-dmr commented 3 years ago

Results:

I implemented a similar function to preprocess the texts and tested it on the prebuilt models. On average it decreased our results:

| Prebuilt Model | Original Result | Preprocessed Result |
| --- | --- | --- |
| Tweet Sentiment Classification | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8613 | 3-Fold F-1: 0.8587, 5-Fold F-1: 0.8637 |
| Movie Review Sentiment Classification | F-1: 0.8258 | F-1: 0.7816 |
| Telco Tweet Sentiment Classification | F-1: 0.6871, Accuracy: 0.6925 | F-1: 0.694, Accuracy: 0.699 |
| Turkish Customer Reviews Classification | F-1: 0.851 | F-1: 0.8132 |

Conclusion:

My observations: this function gives little to no improvement for models built with HashVectorizer. Meanwhile, it degrades the tf-idf models, which I believe is because it greatly increases the number of OOV tokens.

I suggest we add it as an option; depending on the raw text and the vectorizer in use, it might bring improvements in future use cases.
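
One possible way to read the OOV remark, sketched with a toy example: if a vocabulary was learned from accented text, the stripped spellings of the same words no longer match it. The vocabulary and sentence below are hypothetical and are not taken from the prebuilt models.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # NFKD-decompose, drop combining marks, and map dotless "ı" to "i".
    decomposed = unicodedata.normalize("NFKD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return kept.replace("ı", "i")


def oov_tokens(text: str, vocab: set) -> list:
    """Whitespace-tokenize text and return the tokens missing from vocab."""
    return [tok for tok in text.split() if tok not in vocab]


# Hypothetical vocabulary learned from accented training text.
vocabulary = {"çok", "güzel", "değil"}

raw = "çok güzel değil"
oov_before = oov_tokens(raw, vocabulary)
oov_after = oov_tokens(strip_accents(raw), vocabulary)

print(len(oov_before), len(oov_after))
```

Under this assumption, every accented word turns into an out-of-vocabulary token once the input is stripped, unless the vocabulary itself is rebuilt from stripped text.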

husnusensoy commented 3 years ago

Now you are talking ...

husnusensoy commented 3 years ago

How did you perform these tests? Sadedegel does not support accept_stripping for now, does it?

ertugrul-dmr commented 3 years ago

I added a strip_accents function locally and wired it in the same way as emoji, hashtag, and mention. Then I tested the models with strip_accents = True.