MihaiValentin / lunr-languages

A collection of language stemmers and stopwords for the Lunr JavaScript library

Accented letter ê should be replaced by e in the french stemmer #68

Open ggrossetie opened 3 years ago

ggrossetie commented 3 years ago

Currently, "empêchaient" (the verb "empêcher" conjugated in the imperfect tense) is indexed as "empêch" instead of "empech".

I'm not familiar with http://snowball.tartarus.org/ or with stemming algorithms, but according to http://snowball.tartarus.org/algorithms/french/stemmer.html this is the expected behavior. For instance, "maître" produces "maîtr", not "maitr". I find this odd, because most of the time French people do not type accented letters when searching (it's quicker to type, and most search engines replace accented letters anyway).

For reference, here's the Lucene implementation: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java

dhdaines commented 1 month ago

Hi, you can do this separately by doing Unicode folding, as detailed here: https://github.com/dhdaines/lunr.py/blob/fix_skip_docs/docs/languages.md#folding-to-ascii

Or by using lunr-folding
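
For illustration, here is a minimal sketch of what such folding can look like in plain JavaScript. The name `foldAccents` is hypothetical and is not part of lunr, lunr-languages, or lunr-folding. It only assumes the standard `String.prototype.normalize`, and it only handles characters that decompose into a base letter plus combining marks, so ligatures such as "œ" are left untouched.

```javascript
// Hypothetical helper, not part of any lunr package.
// Decomposes each character to NFD (base letter + combining marks),
// then removes the combining diacritical marks (U+0300 to U+036F).
function foldAccents(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Stems from the examples in this issue:
console.log(foldAccents("empêch")); // empech
console.log(foldAccents("maîtr"));  // maitr
```

For a query like "empechaient" to match, the same folding would presumably have to be applied both when indexing and when searching, for example as an extra step in each pipeline.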