elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

The Porter2 Stemmer Token Filter is just Porter Stemmer #2451

Closed imotov closed 11 years ago

imotov commented 12 years ago

Repro: https://gist.github.com/4165996

Examples of words that the Porter and Porter2 stemmers should stem differently:

input        porter       porter2
-----------  -----------  -------
consolingly  consolingli  consol
his          hi           his
knightly     knightli     knight
stayed       stai         stay

See also: https://groups.google.com/d/topic/elasticsearch/HEW3Q9F4ocM/discussion

speedplane commented 12 years ago

I think you have it right. Currently (without the fix), the porter and porter2 stemmers both map to the Porter stemmer, while the english stemmer maps to the Porter2 stemmer.

I also did a bit more investigation, and I think the lovins stemmer may have a problem too. The proper output of the lovins stemmer for the four words above is 'consol', 'hi', 'knight', 'stay'; however, I remember getting something different. I'll test it out and let you know.

imotov commented 12 years ago

After some discussion, we concluded that it would be safer to just remove the reference to the porter2 stemmer from the documentation. Changing the stemmer in Elasticsearch might adversely affect users who are currently using it. Anyone who really needs the Porter2 stemmer can simply use the english stemmer instead.
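For anyone following that advice, here is a rough sketch of index settings that select the english stemmer through a `stemmer` token filter. It assumes, as this thread describes for the Elasticsearch version under discussion, that `english` resolves to Porter2; other versions may map the names differently, so check the docs for the version you run. The filter name `my_stemmer`, the analyzer name `my_stemmed_text`, and the helper function are hypothetical names chosen for illustration.

```python
import json


def stemmer_analysis_settings(stemmer_name="english"):
    """Build an index-settings body with a custom analyzer that uses
    the given stemmer. All filter and analyzer names are illustrative."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Assumption: per this thread, "english" is backed by Porter2.
                    "my_stemmer": {"type": "stemmer", "language": stemmer_name},
                },
                "analyzer": {
                    "my_stemmed_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_stemmer"],
                    },
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating the index.
    print(json.dumps(stemmer_analysis_settings(), indent=2))
```

Keeping the stemmer name as a parameter makes it easy to switch algorithms in one place if the mapping turns out to differ on your version.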

speedplane commented 12 years ago

That seems like the right move. If you could add a sentence explaining that english is implemented by the porter2 stemmer, that would be nice.

As I am sure you already know, it's important to be clear about the implementation because sometimes you have to do stemming on the ES client side, and you need to be sure that the client stemmer matches the ES stemmer.
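One way to sanity-check a client-side stemmer is to run it over the four words from the table at the top of this issue and see which column it reproduces. This is only a sketch: `identify_stemmer` and `REFERENCE_STEMS` are hypothetical names, the reference data is just those four words, and matching them does not prove the two stemmers agree on every input.

```python
# Reference stems copied from the table at the top of this issue.
REFERENCE_STEMS = {
    "consolingly": {"porter": "consolingli", "porter2": "consol"},
    "his": {"porter": "hi", "porter2": "his"},
    "knightly": {"porter": "knightli", "porter2": "knight"},
    "stayed": {"porter": "stai", "porter2": "stay"},
}


def identify_stemmer(stem):
    """Return the algorithm names whose reference stems `stem` reproduces
    for every word in REFERENCE_STEMS. `stem` is any callable str -> str."""
    algorithms = ("porter", "porter2")
    return [
        name
        for name in algorithms
        if all(stem(word) == expected[name]
               for word, expected in REFERENCE_STEMS.items())
    ]


if __name__ == "__main__":
    # Stand-in for a real client-side stemmer: a lookup into the Porter2 column.
    def fake_client_stemmer(word):
        return REFERENCE_STEMS[word]["porter2"]

    print(identify_stemmer(fake_client_stemmer))
```

In practice you would pass in your real client stemmer's function and compare the result with the algorithm you expect the server-side filter to use.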

imotov commented 11 years ago

I added links to stemming algorithms. Closing this issue.