Open irmakyucel opened 3 years ago
I tested the idea of adding a stemmer using two libraries that support Turkish: TurkishStemmer (Snowball) and SimpleLemma. I evaluated both on the TELCO Review and Tweet Sentiment datasets. I chose these because the TELCO Review model performs poorly and the Tweet Sentiment model performs well, so comparing them shows how a stemmer behaves on both a weak and a strong model. Every variant (with and without a stemmer) was optimized using Optuna. Results are reported as F1 Macro scores and are shown below:
Results
Dataset | Previous Score (no stemmer) | w/TurkishStemmer | w/SimpleLemma |
---|---|---|---|
TELCO Review | 0.6833 | 0.6820 | 0.6755 |
Tweet Sentiment | 0.8565 | 0.8486 | 0.8489 |
The results show that adding a stemmer changes little: the score either stays roughly the same or drops by up to ~0.01. Because the change is so small, I also looked at the cross-validation scores, which are reported as accuracy and shown below:
Dataset | Previous Score (no stemmer) | w/TurkishStemmer | w/SimpleLemma |
---|---|---|---|
TELCO Review | 0.6103 | 0.6109 | 0.5975 |
Tweet Sentiment | 0.8208 | 0.8106 | 0.8148 |
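To make the size of these differences easier to see, here is a small sketch that computes each variant's change against the previous (no stemmer) score. The numbers are copied from the two tables in this comment; the `deltas` helper and the dictionary names are just illustrative.

```python
# Scores copied from the two tables in this comment:
# (previous score, w/TurkishStemmer, w/SimpleLemma)
f1_macro = {
    "TELCO Review": (0.6833, 0.6820, 0.6755),
    "Tweet Sentiment": (0.8565, 0.8486, 0.8489),
}
cv_accuracy = {
    "TELCO Review": (0.6103, 0.6109, 0.5975),
    "Tweet Sentiment": (0.8208, 0.8106, 0.8148),
}


def deltas(scores):
    """Return {dataset: (stemmer - previous, lemma - previous)}, rounded to 4 decimals."""
    return {
        name: (round(stem - base, 4), round(lemma - base, 4))
        for name, (base, stem, lemma) in scores.items()
    }


for metric, scores in (("F1 Macro", f1_macro), ("CV Accuracy", cv_accuracy)):
    for name, (d_stem, d_lemma) in deltas(scores).items():
        print(f"{metric:12} {name:16} TurkishStemmer {d_stem:+.4f}  SimpleLemma {d_lemma:+.4f}")
```

The only positive delta in either table is TurkishStemmer on the TELCO Review cross-validation accuracy, and it is well below 0.001.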
Adding an option for stemming and/or lemmatization is important when using count, hash, and tf-idf vectorizers, because it shrinks the vocabulary by mapping words that share a root (stemming) or a lemma (lemmatization) to a single token. It can also make the patterns within a dataset more visible to the model.
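A minimal sketch of the vocabulary-shrinking effect, using only the standard library. `toy_stem`, `TOY_SUFFIXES`, and `build_vocabulary` are hypothetical names made up for illustration, and the suffix list is far too crude for real Turkish morphology. In practice the `normalize` callable would presumably be a real stemmer or lemmatizer, plugged into the vectorizer through something like scikit-learn's `tokenizer` or `analyzer` parameter.

```python
from typing import Callable, Iterable, Optional, Set

# Hypothetical suffix list, for illustration only. Longer suffixes come first
# so that "lardan" is tried before "lar".
TOY_SUFFIXES = ("lerden", "lardan", "ler", "lar", "de", "da")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 2 characters of root."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def build_vocabulary(
    docs: Iterable[str],
    normalize: Optional[Callable[[str], str]] = None,
) -> Set[str]:
    """Collect the set of whitespace-separated tokens, optionally normalized."""
    vocab = set()
    for doc in docs:
        for token in doc.lower().split():
            vocab.add(normalize(token) if normalize else token)
    return vocab


docs = ["kitap kitaplar kitaplardan", "ev evler evde"]
raw_vocab = build_vocabulary(docs)
stemmed_vocab = build_vocabulary(docs, normalize=toy_stem)
print(len(raw_vocab), "->", len(stemmed_vocab))  # 6 -> 2
print(sorted(stemmed_vocab))                     # ['ev', 'kitap']
```

Here six surface forms collapse to two vocabulary entries, which is the effect a count or tf-idf vectorizer would see as a smaller feature space.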
Stemming
Lemmatization
I believe it would be best to start with stemming and then move on to lemmatization.