light/minimal stemming for euro languages [LUCENE-2503]

asfimport commented 14 years ago

The Snowball stemmers are very aggressive, and it would be nice to have lighter alternatives.

Some applications may want to perform less aggressive stemming, for example: http://www.lucidimagination.com/search/document/5d16391e21ca6faf/plural_only_stemmer

Good, relevance-tested algorithms exist, and I think we should provide them as alternatives.

Migrated from LUCENE-2503 by Robert Muir (@rmuir), resolved Jul 14 2010. Attachments: LUCENE-2503_modules_analysis_testdata.zip, LUCENE-2503.patch (versions: 2)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Patch attached, not ready for committing. Only some of the stemmers are ready; the others still need tests (I intentionally put a fail() placeholder in those to indicate they are still untested).

Also, I didn't implement the Finnish one yet, but the patch contains various implementations for 9 European languages.

asfimport commented 14 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Man, are you fast! Does the English one deal with women/woman and foci/focus type stuff?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

> Man are you fast!

Not really, I've been working on it for a while, but since someone asked I figured I would create the issue. Testing isn't done, but English, French, and Portuguese are OK, I think. The others need a lot of tests and probably have bugs.

> Does the English one deal with women/ woman and foci / focus type stuff?

Nope, the English one is the Harman "s-stemming" algorithm.

It's very simple:

if final is '-ies' but not '-eies' or '-aies' then
    replace '-ies' by '-y', return;
if final is '-es' but not '-aes', '-ees' or '-oes' then
    replace '-es' by '-e', return;
if final is '-s' but not '-us' or '-ss' then
    remove '-s';
return.
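
The three rules above can be sketched in plain Java as follows. This is only an illustration, not the code from the patch: the class name `SStemmerSketch` is hypothetical, it works on a `String` for readability, and it assumes the rules are checked one after another, so a word that hits an exception in one rule is still tested against the next. The committed stemmer may resolve those edge cases differently.

```java
/** Illustrative sketch of the three s-stemming rules; not the code from the patch. */
final class SStemmerSketch {
    private SStemmerSketch() {}

    /** Returns the stem of a lowercase word, applying the first rule that fires. */
    static String stem(String word) {
        // Rule 1: '-ies' becomes '-y', unless the word ends in '-eies' or '-aies'.
        if (word.endsWith("ies") && !word.endsWith("eies") && !word.endsWith("aies")) {
            return word.substring(0, word.length() - 3) + "y";
        }
        // Rule 2: '-es' becomes '-e', unless the word ends in '-aes', '-ees' or '-oes'.
        if (word.endsWith("es") && !word.endsWith("aes")
                && !word.endsWith("ees") && !word.endsWith("oes")) {
            return word.substring(0, word.length() - 1);
        }
        // Rule 3: a final '-s' is removed, unless the word ends in '-us' or '-ss'.
        if (word.endsWith("s") && !word.endsWith("us") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        // No rule applies (for example "women" or "focus"): leave the word alone.
        return word;
    }
}
```

With this sketch, `SStemmerSketch.stem("ponies")` gives `pony` and `SStemmerSketch.stem("horses")` gives `horse`, while `focus`, `glass`, and irregular plurals such as `women` come back unchanged. The sketch has no minimum-length guard, so very short words like `is` are stripped too; a real filter would probably want one.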

For special cases like the ones you mentioned (if you want them), I would recommend adding these customizations yourself, as documented here: http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming

Just make a tab-separated file of word/stem pairs and put a StemmerOverrideFilter(Factory) before the stemmer in the stream.

I think this alone provides a lot of flexibility. If it isn't enough, these stemmers are also much simpler to modify, if you want to go that route :)
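
To make the override idea concrete, here is a minimal plain-Java sketch of "dictionary first, stemmer second". It deliberately does not use the Lucene or Solr classes: `StemOverrideSketch` and its constructor are hypothetical names, and the wrapped stemmer is passed in as an ordinary function. It only shows how tab-separated word/stem lines can take precedence over whatever the algorithmic stemmer would do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Illustrative only: applies word/stem overrides before falling back to a stemmer. */
final class StemOverrideSketch {
    private final Map<String, String> overrides = new HashMap<>();
    private final UnaryOperator<String> stemmer;

    /**
     * @param tabSeparatedLines lines of the form "word<TAB>stem"
     * @param stemmer           fallback used for words without an override
     */
    StemOverrideSketch(Iterable<String> tabSeparatedLines, UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
        for (String line : tabSeparatedLines) {
            String[] parts = line.split("\t");
            // Ignore blank or malformed lines rather than failing.
            if (parts.length == 2) {
                overrides.put(parts[0].trim(), parts[1].trim());
            }
        }
    }

    /** Returns the override if one exists; otherwise runs the fallback stemmer. */
    String stem(String word) {
        String override = overrides.get(word);
        return override != null ? override : stemmer.apply(word);
    }
}
```

Given the lines `women<TAB>woman` and `foci<TAB>focus`, `stem("women")` returns `woman` and `stem("foci")` returns `focus`, and every other word goes through the fallback stemmer untouched by the dictionary.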

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I updated the patch; I think this is ready to go:

added Finnish
created vocabulary tests from the reference C, Perl, whatever impls, and found/fixed bugs in every language but en, pt, fr (as promised in my last comment)
created a VocabularyAssert JUnit util class, and refactored the existing Snowball, Porter, German, and Russian tests to use it, too.
refactored a bunch of utility stuff that was duplicated everywhere, such as endsWith()/delete(), and put it in StemmerUtil.
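
For readers who have not seen this style of stemmer code, here is a rough sketch of what shared helpers like endsWith()/delete() might look like when stemmers work in place on a char buffer plus a length. The class name `StemmerUtilSketch` and the exact signatures are assumptions made for illustration; check the committed StemmerUtil for the real ones.

```java
/** Illustrative sketch of char-buffer helpers; signatures are assumed, not taken from the patch. */
final class StemmerUtilSketch {
    private StemmerUtilSketch() {}

    /** Returns true if the first len chars of s end with the given suffix. */
    static boolean endsWith(char[] s, int len, String suffix) {
        int suffixLen = suffix.length();
        if (suffixLen > len) {
            return false;
        }
        // Compare from the last character of the suffix backwards.
        for (int i = suffixLen - 1; i >= 0; i--) {
            if (s[len - (suffixLen - i)] != suffix.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Deletes the char at pos (0 <= pos < len) by shifting the tail left.
     * Returns the new logical length of the buffer.
     */
    static int delete(char[] s, int pos, int len) {
        if (pos < len - 1) {
            System.arraycopy(s, pos + 1, s, pos, len - pos - 1);
        }
        return len - 1;
    }
}
```

Working on a buffer and a length lets a stemmer shorten a term by returning a smaller length instead of allocating a new string per token, which is presumably why this kind of code ended up duplicated across stemmers.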

To try it out, first apply the patch, then unzip the zip file containing the vocabulary tests (LUCENE-2503_modules_analysis_testdata.zip) from the modules/analysis/common dir.

If no one objects, I'll commit in a few days.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

zip file containing the vocab test zipfiles, relevant to modules/analysis

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Committed revision 964019 (trunk) / 964034 (3x)

asfimport commented 13 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Bulk close for 3.1
