GlennRicaud closed this 5 years ago
Some info after research:
Example of Snowball Porter English stemmer output (I downloaded the Java code from Snowball):
So the main point about stemmers is removing suffixes correctly.
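To illustrate why correct suffix removal is the hard part, here is a minimal sketch of a naive suffix stripper. It is not the Porter algorithm and not Snowball code; the class name, the suffix list and the minimum stem length are all hypothetical choices made for this example only.

```java
// Toy example only: NOT the Porter algorithm and NOT Snowball code.
class NaiveSuffixStripper {
    // Hypothetical, deliberately simplistic suffix list, checked in this order.
    private static final String[] SUFFIXES = {"ing", "ed", "s"};

    // Strips the first matching suffix, provided at least 3 characters remain.
    static String strip(String word) {
        for (String suffix : SUFFIXES) {
            int stemLength = word.length() - suffix.length();
            if (word.endsWith(suffix) && stemLength >= 3) {
                return word.substring(0, stemLength);
            }
        }
        return word; // no suffix matched, keep the word unchanged
    }

    public static void main(String[] args) {
        String[] words = {"likes", "liked", "like", "liking"};
        for (String word : words) {
            System.out.println(word + " -> " + strip(word));
        }
    }
}
```

With these rules 'likes' and 'like' give 'like', but 'liked' and 'liking' give 'lik', so the four forms do not end up with one common stem. A real stemmer such as Porter adds further ordered steps and conditions on the remaining stem to handle cases like this.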
According to https://www.geeksforgeeks.org/introduction-to-stemming/ Porter’s Stemmer algorithm:
Advantage: It produces the best output as compared to other stemmers and it has less error rate. Limitation: Morphological variants produced are not always real words.
What about Java libraries?
The Java classes required to run the stemmer are available for download at http://snowball.tartarus.org/download.html (I used those to produce the sample output). I will search further to see whether other libraries are available.
Is this the same one that Lucene uses?
For English it uses the same Porter algorithm, but I am not sure whether it uses Snowball directly. I'll continue checking how Lucene stemming works.
From Lucene analyzers Readme.txt:
This project provides pre-compiled version of the Snowball stemmers based on revision 502 of the Tartarus Snowball repository, now located at https://github.com/snowballstem/snowball/tree/e103b5c257383ee94a96e7fc58cab3c567bf079b (GitHub), together with classes integrating them with the Lucene search engine.
A few changes has been made to the static Snowball code and compiled stemmers: ...
To summarize:
This is a research task.
The goal is to find out which Java libraries are commonly used for 'stemming' (stemming is basically finding the root of words; for example, 'likes', 'liked', 'like' and 'liking' will all return 'like').
A quick search gives us Snowball and Lucene.