Research - Stemming

GlennRicaud commented 5 years ago

This is a research task.

The goal is to find out which Java libraries are commonly used for 'stemming'. (Stemming is basically finding the root of a word. For example, 'likes', 'liked', 'like' and 'liking' will all return 'like'.)

List them and gather information about each (website, license, whether it is still maintained, supported languages, ...)
Try them out (write some code using them)

A quick search gives us Snowball and Lucene.

ashklianko commented 5 years ago

Some info after research:

Porter's algorithm is the most commonly used (Porter himself participated in the development of the Snowball framework)
Lucene analyzers use the Porter stemmer
Snowball has stemmers for English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian (Bokmål), Danish, Russian, and Finnish

Example output of the Snowball Porter English stemmer (I downloaded the Java code from Snowball):

"likes"->"like" "liked"->"like" "like"->"like" "liking"->"like"
"organize"->"organ" "organizes"->"organ" "organizing"->"organ"
"love"->"love" "loving"->"love" "lovingly"->"love" "loved"->"love" "lover"->"lover" "lovely"->"love" "love"->"love"
"consist"->"consist" "consisted"->"consist" "consistency"->"consist" "consistent"->"consist" "consistently"->"consist" "consisting"->"consist" "consists"->"consist"

So the main point about stemmers is removing suffixes correctly.
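To illustrate why removing suffixes correctly is the hard part, here is a minimal sketch of a naive suffix stripper in Java. It is not Porter's algorithm and not Snowball code; the class name ToySuffixStemmer, the suffix list and the minimum stem length are all made up for illustration.

```java
import java.util.List;
import java.util.Locale;

/** Toy suffix stripper. NOT Porter's algorithm, only an illustration. */
final class ToySuffixStemmer {

    // Hypothetical rule list, checked in order; longer suffixes come first.
    private static final List<String> SUFFIXES =
            List.of("ingly", "edly", "ing", "ed", "es", "ly", "s");

    // Keep at least this many leading characters so short words are left alone.
    private static final int MIN_STEM_LENGTH = 3;

    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            boolean stemLongEnough = w.length() - suffix.length() >= MIN_STEM_LENGTH;
            if (w.endsWith(suffix) && stemLongEnough) {
                // Strip the first matching suffix and stop.
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w; // no rule applied
    }

    public static void main(String[] args) {
        for (String w : List.of("consisted", "consisting", "consists", "likes", "liking", "like")) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

With this rule list 'consisted', 'consisting' and 'consists' all reduce to 'consist', but 'likes' and 'liking' become 'lik' while 'like' stays 'like', so the forms no longer share a stem. A real stemmer needs additional rules to handle cases like this.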

According to https://www.geeksforgeeks.org/introduction-to-stemming/, Porter's stemmer algorithm has the following properties:

Advantage: it produces the best output compared to other stemmers and has a lower error rate.
Limitation: the morphological variants produced are not always real words.

Snowball license: BSD

sigdestad commented 5 years ago

What about Java libraries?

ashklianko commented 5 years ago

The Java classes required to run the stemmer are available for download at http://snowball.tartarus.org/download.html (I used those to produce the sample output). I will search further to see whether other libraries are available.

sigdestad commented 5 years ago

Is this the same being used by Lucene?

ashklianko commented 5 years ago

For English it uses the same Porter algorithm, but I'm not sure whether it uses Snowball directly. I'll continue checking how Lucene stemming works.

ashklianko commented 5 years ago

From Lucene analyzers Readme.txt:

This project provides pre-compiled version of the Snowball stemmers based on revision 502 of the Tartarus Snowball repository, now located at https://github.com/snowballstem/snowball/tree/e103b5c257383ee94a96e7fc58cab3c567bf079b (GitHub), together with classes integrating them with the Lucene search engine.

A few changes has been made to the static Snowball code and compiled stemmers: ...

ashklianko commented 5 years ago

To summarize:

Several major text-analysis frameworks include stemming functionality: Lucene (contains Snowball) and StanfordNLP. Both are heavyweight, and StanfordNLP has limited language support and seems to offer stemming only for English.
Snowball is the cornerstone for anyone who needs pure stemming without text analysis; it also implements algorithms for languages other than English.
The main problem with Snowball is that it is a framework written in its own scripting language; the Java code is generated and is hard to read and maintain.
No other major frameworks were found, only standalone implementations of Porter's algorithm.
Just to mention: a task similar to stemming is lemmatization, i.e. reducing word forms to their dictionary form. It is done with lemmatization dictionaries, but it is much more resource-intensive than stemming, requires maintaining dictionaries, etc. StanfordNLP does that.
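To make the difference between stemming and lemmatization concrete, here is a minimal sketch of a dictionary-based lemmatizer in Java. It is not how StanfordNLP implements it; the class name ToyLemmatizer and the tiny hand-made dictionary are assumptions for illustration only.

```java
import java.util.Locale;
import java.util.Map;

/** Toy dictionary lemmatizer: looks each word form up in a fixed table. */
final class ToyLemmatizer {

    // Hypothetical hand-made dictionary; a real one would need to cover
    // the vocabulary of every supported language and be kept up to date.
    private static final Map<String, String> LEMMAS = Map.of(
            "likes", "like",
            "liked", "like",
            "liking", "like",
            "went", "go",
            "better", "good",
            "mice", "mouse");

    static String lemma(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        // Unknown words are returned unchanged.
        return LEMMAS.getOrDefault(w, w);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"liking", "went", "mice", "stemming"}) {
            System.out.println(w + " -> " + lemma(w));
        }
    }
}
```

Irregular forms such as 'went' or 'mice' can only be resolved by lookup, not by stripping suffixes, which is why this approach depends on maintained dictionaries.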

enonic / xp

Research - Stemming #6852