apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Latin Stemmer for lucene [LUCENE-6462] #7521

Open asfimport opened 9 years ago

asfimport commented 9 years ago

The latest Lucene package has no stemmer for the Latin language. I have written a rule-based Latin stemmer, built on the grammar and rules of Latin.


Migrated from LUCENE-6462 by Niki, updated May 12 2015 Attachments: LatinStemmer.java

asfimport commented 9 years ago

Niki (migrated from JIRA)

When searching for a LatinStemmer, I found this one, linked from Lucene/Solr: https://github.com/scherziglu/solr/blob/master/solr-analysis/src/main/java/org/apache/lucene/analysis/la/LatinStemmer.java. It does not stem most words properly and, among other things, unnecessarily adds an 'i'. I modified that code to accommodate the rules of stemming in Latin.

asfimport commented 9 years ago

Niki (migrated from JIRA)

This file replaces the previous LatinStemmer and extends its rules to cover different verbs and nouns.

asfimport commented 9 years ago

Niki (migrated from JIRA)

Hi Professor Chris A. Mattmann,

As we discussed earlier, I am submitting a request to use my modified Latin Stemmer. Let me know if you need more information about what the code does.

Thanks.

asfimport commented 9 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Niki, the Latin Stemmer was originally written by Markus Klose. I would prefer that you submit your patch to his GitHub repository: https://github.com/scherziglu/solr

If Markus wants to donate his whole code (including the TokenFilters) to Lucene, he should do this himself so that his work is properly attributed. The stemmer alone (as attached to this issue) is not so helpful.

In general, stemmers do not need to produce "correct" forms; they just need to "normalize" terms to something that can be compared with other terms during query execution. So before making changes to a stemmer, it is very important to test those changes against a corpus of Latin texts and compare the results of queries on them. For search engines, stemmers should also be light, so that they do not remove too much information.
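One way to picture the "normalize, then compare" idea is to group a word list by stem and check which forms end up conflated. The sketch below is only an illustration: `StemConflationSketch`, `toyStem`, `groupByStem`, and the suffix table are hypothetical names and rules invented here, not the rules of the attached LatinStemmer.java or of Markus Klose's code, and the sample words stand in for a real corpus.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

final class StemConflationSketch {
    // Hypothetical, deliberately small suffix table (longer suffixes first).
    // It is NOT a complete or linguistically accurate set of Latin endings.
    private static final String[] SUFFIXES = {
        "ibus", "arum", "orum", "ae", "am", "as", "is", "os", "um", "us", "a", "e", "i", "o"
    };

    // Toy "light" stemmer: strip the first matching suffix, keeping at least two characters.
    static String toyStem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Group words by the stem a given stemmer assigns them, so two stemmers
    // can be compared by looking at which surface forms each one conflates.
    static Map<String, TreeSet<String>> groupByStem(UnaryOperator<String> stemmer, List<String> words) {
        Map<String, TreeSet<String>> groups = new TreeMap<>();
        for (String word : words) {
            groups.computeIfAbsent(stemmer.apply(word), k -> new TreeSet<>()).add(word);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample words standing in for a real corpus of Latin texts.
        List<String> words = Arrays.asList("rosa", "rosae", "rosam", "dominus", "domini", "dominorum");
        System.out.println(groupByStem(StemConflationSketch::toyStem, words));
    }
}
```

Running the same grouping with the old and the modified stemmer over a real corpus would show where the two disagree; the stems themselves do not have to be valid Latin words, they only have to be the same for forms that should match at query time.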

In addition, this code has several problems: Why does it look up the -que forms in a List instead of a CharArraySet?
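To illustrate the point about the lookup: `List.contains` scans the entries one by one, while a hash-based set answers in roughly constant time. Lucene's `CharArraySet` is hash-based and can additionally test a slice of a `char[]` term buffer without first building a `String`. The sketch below uses `java.util.HashSet` as a stand-in so that it runs without Lucene on the classpath; `QueLookupSketch`, `stripQue`, and the sample word list are hypothetical and are not taken from the attached stemmer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class QueLookupSketch {
    // Hypothetical sample of words that end in "que" without carrying the
    // enclitic "-que"; the attached stemmer's actual list may differ.
    private static final Set<String> QUE_EXCEPTIONS = new HashSet<>(
        Arrays.asList("atque", "quoque", "neque", "itaque", "usque", "denique"));

    // Strip a trailing "que" unless the whole term is a listed exception.
    // The exception check is a hashed lookup, not a linear scan of a List.
    static String stripQue(String term) {
        if (term.length() <= 3 || !term.endsWith("que") || QUE_EXCEPTIONS.contains(term)) {
            return term;
        }
        return term.substring(0, term.length() - 3);
    }
}
```

Inside a Lucene TokenFilter the same check would normally run against the term attribute's `char[]` buffer, which is where `CharArraySet` avoids the extra `String` allocation per token.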

asfimport commented 9 years ago

Chris A. Mattmann (migrated from JIRA)

Thanks for your comments, Uwe. I encouraged Niki to submit her code to Lucene since, in her work at USC, she found it more useful on Latin corpora than the default stemmer provided. Thanks for the suggestions, and I hope Niki takes you up on them - they are spot on.