GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish
http://sadedegel.ai
MIT License

Adding Stemming and Lemmatization #281

Open irmakyucel opened 3 years ago

irmakyucel commented 3 years ago

Adding an option for stemming and/or lemmatization is important when using count, hash, and tf-idf vectorizers, because it shrinks the vocabulary by mapping words that share the same root (stemming) or the same lemma (lemmatization) to a single token. It also makes the patterns within a dataset more visible to the model.
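To illustrate the vocabulary-shrinking effect, here is a minimal sketch in plain Python. The `toy_stem` function and its suffix list are hypothetical and deliberately tiny. They are not a real Turkish stemmer and not part of sadedegel; a real library would be plugged in through the `normalize` argument.

```python
from collections import Counter

# Hypothetical, deliberately tiny suffix list for illustration only.
# A real Turkish stemmer handles far more suffixes plus vowel harmony.
SUFFIXES = ("lerden", "lardan", "lerde", "larda",
            "ler", "lar", "den", "dan", "de", "da", "te", "ta")


def toy_stem(token, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            return token[:-len(suffix)]
    return token


def bag_of_words(docs, normalize=None):
    """Count whitespace-separated tokens, optionally normalizing each one first."""
    counts = Counter()
    for doc in docs:
        for token in doc.split():
            counts[normalize(token) if normalize else token] += 1
    return counts


docs = ["evler evde", "evlerden ev", "kitaplar kitap kitapta"]

raw = bag_of_words(docs)
stemmed = bag_of_words(docs, normalize=toy_stem)

print(len(raw), sorted(raw))          # one vocabulary entry per surface form
print(len(stemmed), dict(stemmed))    # surface forms collapsed onto shared stems
```

On this toy input the seven surface forms collapse into two stems, which is the kind of reduction a count or tf-idf vectorizer would see in its feature space.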

Stemming

Lemmatization

I believe it would be good to start with stemming and then move on to lemmatization.

irmakyucel commented 3 years ago

I tested the idea of adding a stemmer using two libraries that support Turkish: TurkishStemmer (Snowball) and SimpleLemma. I evaluated both on the TELCO Review and Tweet Sentiment datasets. I chose these datasets because the TELCO Review model performs poorly and the Tweet Sentiment model performs well, so comparing them shows how a stemmer behaves in both a weak and a strong model. All versions (with and without stemmers) were optimized using Optuna. Results are reported as F1 macro scores and shown below:

Results

| Dataset | Previous Score | w/TurkishStemmer | w/SimpleLemma |
| --- | --- | --- | --- |
| TELCO Review | 0.6833 | 0.6820 | 0.6755 |
| Tweet Sentiment | 0.8565 | 0.8486 | 0.8489 |
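
For readers unfamiliar with the metric, the F1 macro score used in the table above is the unweighted mean of the per-class F1 scores. The sketch below computes it in plain Python. The labels and predictions are made up for illustration and are not taken from either dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical labels, for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]

score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

Because every class counts equally, a model that does poorly on a small class is penalized more under F1 macro than under plain accuracy.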

The results show that adding a stemmer changes little: the score either stays about the same or decreases by up to ~0.01. Because the change is so small, I also analyzed the cross-validation scores, which are reported as accuracy and shown below:

| Dataset | Previous Score | w/TurkishStemmer | w/SimpleLemma |
| --- | --- | --- | --- |
| TELCO Review | 0.6103 | 0.6109 | 0.5975 |
| Tweet Sentiment | 0.8208 | 0.8106 | 0.8148 |
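
To make the size of these changes easier to read, the sketch below computes each variant's difference from the previous (no-stemmer) score. The numbers are copied from the two tables in this comment; the dictionary layout and the `deltas` helper are just one illustrative way to arrange them.

```python
# Scores copied from the tables above.
f1_macro = {
    "TELCO Review": {"baseline": 0.6833, "TurkishStemmer": 0.6820, "SimpleLemma": 0.6755},
    "Tweet Sentiment": {"baseline": 0.8565, "TurkishStemmer": 0.8486, "SimpleLemma": 0.8489},
}
cv_accuracy = {
    "TELCO Review": {"baseline": 0.6103, "TurkishStemmer": 0.6109, "SimpleLemma": 0.5975},
    "Tweet Sentiment": {"baseline": 0.8208, "TurkishStemmer": 0.8106, "SimpleLemma": 0.8148},
}


def deltas(scores):
    """Return each variant's score minus the baseline, rounded to 4 decimals."""
    return {
        dataset: {
            name: round(value - row["baseline"], 4)
            for name, value in row.items()
            if name != "baseline"
        }
        for dataset, row in scores.items()
    }


f1_deltas = deltas(f1_macro)
cv_deltas = deltas(cv_accuracy)

for title, table in (("F1 macro", f1_deltas), ("CV accuracy", cv_deltas)):
    for dataset, row in table.items():
        print(title, dataset, row)
```

The only positive difference is TurkishStemmer on TELCO Review in cross-validation accuracy (+0.0006); every other difference is negative, with the largest drop being SimpleLemma on TELCO Review in cross-validation accuracy (-0.0128).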