enonic / xp

Enonic XP
https://enonic.com
GNU General Public License v3.0

Research - Stemming #6852

Closed GlennRicaud closed 5 years ago

GlennRicaud commented 5 years ago

This is a research task.

The goal is to find out which Java libraries are commonly used for stemming. (Stemming is basically finding the root of a word. For example, 'likes', 'liked', 'like' and 'liking' will all return 'like'.)

A quick search gives us Snowball and Lucene.

ashklianko commented 5 years ago

Some info after research:

Example of Snowball Porter English stemmer output (I downloaded the Java code from Snowball):

So the main point about stemmers is removing suffixes correctly.
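To make the suffix-removal idea concrete, here is a toy sketch in plain Java. It is not the Porter algorithm and not Snowball code: the class name `ToyStemmer`, the suffix list and the minimum stem length of three characters are all made up for illustration. It only shows the general shape of "strip a known suffix, keep what is left".

```java
import java.util.Locale;

public class ToyStemmer {
    // Hypothetical suffix list, checked in this order (longer suffixes first).
    private static final String[] SUFFIXES = {"ing", "ed", "es", "e", "s"};

    // Strips the first matching suffix, keeping at least three characters
    // so that short words such as "sing" or "red" are left untouched.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            int stemLength = w.length() - suffix.length();
            if (w.endsWith(suffix) && stemLength >= 3) {
                return w.substring(0, stemLength);
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"likes", "liked", "like", "liking"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

With these made-up rules all four words collapse to the same stem, `lik`, which is not a real word. A real stemmer such as Porter applies many more rules and conditions, so its output can differ.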

According to https://www.geeksforgeeks.org/introduction-to-stemming/, for Porter’s Stemmer algorithm:

Advantage: it produces the best output compared to other stemmers and has a lower error rate.
Limitation: the morphological variants produced are not always real words.

Snowball license: BSD

sigdestad commented 5 years ago

What about java libraries?

ashklianko commented 5 years ago

The Java classes required to run the stemmer are available for download at http://snowball.tartarus.org/download.html (I used those to produce the sample output). I will search further to find out whether other libraries are available.
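For reference, a rough sketch of how those downloaded classes can be called. It assumes the stemmer class is named `org.tartarus.snowball.ext.englishStemmer` and exposes `setCurrent(String)`, `stem()` and `getCurrent()`, as in the Snowball Java distribution I am aware of; newer releases may use different class names, so treat the name as an assumption. Reflection is used only so that the sketch compiles and runs without the Snowball classes on the classpath; `SnowballProbe` is a hypothetical helper name.

```java
import java.lang.reflect.Method;
import java.util.Optional;

public class SnowballProbe {
    // Stems a word with the given Snowball stemmer class if it can be loaded;
    // returns Optional.empty() when the class or its methods are not available.
    static Optional<String> stem(String stemmerClassName, String word) {
        try {
            Class<?> cls = Class.forName(stemmerClassName);
            Object stemmer = cls.getDeclaredConstructor().newInstance();
            Method setCurrent = cls.getMethod("setCurrent", String.class);
            Method doStem = cls.getMethod("stem");
            Method getCurrent = cls.getMethod("getCurrent");
            setCurrent.invoke(stemmer, word);   // load the word into the stemmer
            doStem.invoke(stemmer);             // run the stemming rules
            return Optional.of((String) getCurrent.invoke(stemmer));
        } catch (ReflectiveOperationException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Assumed class name; adjust to whatever the downloaded sources contain.
        String cls = "org.tartarus.snowball.ext.englishStemmer";
        for (String w : new String[] {"likes", "liked", "like", "liking"}) {
            System.out.println(w + " -> "
                + stem(cls, w).orElse("(Snowball classes not on classpath)"));
        }
    }
}
```

With the Snowball classes on the classpath, the same three calls can be made directly on a stemmer instance, without reflection.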

sigdestad commented 5 years ago

Is this the same being used by Lucene?

ashklianko commented 5 years ago

For English it uses the same Porter algorithm, but I'm not sure whether it uses Snowball directly. I'll continue checking how Lucene stemming works.

ashklianko commented 5 years ago

From Lucene analyzers Readme.txt:

This project provides pre-compiled version of the Snowball stemmers based on revision 502 of the Tartarus Snowball repository, now located at https://github.com/snowballstem/snowball/tree/e103b5c257383ee94a96e7fc58cab3c567bf079b (GitHub), together with classes integrating them with the Lucene search engine.

A few changes has been made to the static Snowball code and compiled stemmers: ...

ashklianko commented 5 years ago

To summarize: