Adding Stemming and Lemmatization

GlobalMaksimum / sadedegel

A General Purpose NLP library for Turkish

MIT License

An option for stemming and/or lemmatization matters when using the count, hash, and tf-idf vectorizers: it shrinks the vocabulary by mapping words that share a root (stemming) or a lemma (lemmatization) to a single token. It can also make patterns within a dataset more visible to the model.

Stemming

Stemming can be done with simple rule-based methods. Because it only aims to find the root of a word, the result does not have to be a meaningful word.
Stems/roots are produced by stripping the suffixes or prefixes attached to a word.
One such rule-based stemmer is already used in sadedegel's hash vectorizer.
By combining it with other rule-based methods, we could build a Stemmer class that can be used flexibly across sadedegel.
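As a rough illustration of the rule-based idea above, here is a minimal suffix-stripping sketch. The class name `SuffixStripStemmer`, its `min_stem_len` parameter, and the tiny suffix list are all hypothetical and are not sadedegel's API or the rule set used in its hash vectorizer. A real Turkish stemmer would also need vowel harmony, consonant alternation, and far more suffixes.

```python
from typing import Iterable, List, Set

# Hypothetical, deliberately tiny suffix list for illustration only.
SUFFIXES = (
    "lardan", "lerden", "ların", "lerin",
    "lar", "ler", "dan", "den",
    "da", "de", "ın", "in", "ı", "i",
)


class SuffixStripStemmer:
    """Toy rule-based stemmer: strips at most one suffix from a word."""

    def __init__(self, suffixes: Iterable[str] = SUFFIXES, min_stem_len: int = 2):
        # Try longer suffixes first so "lardan" wins over "dan".
        self.suffixes: List[str] = sorted(suffixes, key=len, reverse=True)
        self.min_stem_len = min_stem_len

    def stem(self, word: str) -> str:
        # Input is assumed to be lowercased already; Turkish-aware casing
        # (I/ı, İ/i) is out of scope for this sketch.
        for suffix in self.suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= self.min_stem_len:
                return word[: -len(suffix)]
        return word


def vocabulary(tokens: Iterable[str]) -> Set[str]:
    """Distinct tokens, i.e. what a count vectorizer would index."""
    return set(tokens)


stemmer = SuffixStripStemmer()
tokens = ["kitap", "kitaplar", "kitaplardan", "ev", "evler", "evde"]
raw_vocab = vocabulary(tokens)
stemmed_vocab = vocabulary(stemmer.stem(t) for t in tokens)
print(len(raw_vocab), "->", len(stemmed_vocab))
```

On this toy input the six surface forms collapse to the two stems `kitap` and `ev`. Note that `stemmer.stem("kitabı")` yields `kitab`, which is not a valid word; this is the kind of non-meaningful stem the description above allows.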

Lemmatization

Lemmatization finds the lemma (Turkish: başsözcük) of a word, i.e. its dictionary form. This is usually done with a lookup in a database. (For instance, NLTK has a WordNet lemmatizer that uses the WordNet database for lemma lookup.)
For this we would need to find a lemma database for Turkish (if one exists) and transform words by looking them up in it.
Lemmatization is usually more computationally expensive than stemming.
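The lookup approach described above could be sketched roughly as follows. The `LookupLemmatizer` class and the four-entry `LEMMA_TABLE` are hypothetical stand-ins for a real Turkish lemma database, which this issue has not yet identified. Unknown words are returned unchanged, which is one possible fallback among several.

```python
from typing import Dict, List, Mapping

# Hypothetical toy lexicon: surface form -> lemma (dictionary headword).
LEMMA_TABLE: Dict[str, str] = {
    "kitaplar": "kitap",
    "kitabı": "kitap",
    "gidiyorum": "gitmek",
    "gittim": "gitmek",
}


class LookupLemmatizer:
    """Toy lemmatizer backed by an in-memory surface-form table."""

    def __init__(self, table: Mapping[str, str] = LEMMA_TABLE):
        self.table = dict(table)

    def lemmatize(self, word: str) -> str:
        # Fall back to the word itself when it is not in the table.
        return self.table.get(word, word)

    def lemmatize_all(self, words: List[str]) -> List[str]:
        return [self.lemmatize(w) for w in words]


lemmatizer = LookupLemmatizer()
print(lemmatizer.lemmatize_all(["kitabı", "gittim", "sadedegel"]))
```

Unlike suffix stripping, a lookup can map `kitabı` to the valid word `kitap` and irregular forms such as `gidiyorum` to `gitmek`, but only for forms the table actually contains, so coverage depends entirely on the database.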

I believe it would be good to start with stemming and then move on to lemmatization.

Dataset	Previous Score	w/TurkishStemmer	w/SimpleLemma
TELCO Review	0.6833	0.6820	0.6755
Tweet Sentiment	0.8565	0.8486	0.8489

Dataset	Previous Score	w/TurkishStemmer	w/SimpleLemma
TELCO Review	0.6103	0.6109	0.5975
Tweet Sentiment	0.8208	0.8106	0.8148

Adding Stemming and Lemmatization #281