Open markharwood opened 5 years ago
Hello @markharwood. That's entirely possible, because I think I wrote my implementation by reading Lucene's, which should be the same one ES is using. Do you, by chance, have a link to, or a PDF of, the original article? As stated here, I could only find a paper referencing the algorithm and explaining its broad intentions.
No, I only saw the same paper as you. I've just tried sending an email to the original paper author - I'm sure she'd like to see her algorithm implemented correctly too.
I heard back from Donna, the paper author. She agrees that words like bees and employees should fall through to rule 3, which removes the S. However, that logic would make rule 2 redundant. Rule 1 also has some weird-looking exceptions that don't appear to relate to any common English words that I know of.
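To make the two readings concrete, here is a minimal Python sketch. The rule table is an assumption: it is the three S-stemmer rules as they are usually quoted from the paper, since this thread does not spell them out. The function name `s_stem` and the `fall_through` flag are hypothetical and exist only for illustration; this is not the code from Lucene, Elasticsearch, or this repository.

```python
# Assumed rule table (suffix, exception suffixes, replacement), as the
# S-stemmer rules are commonly quoted. Not copied from any real library.
RULES = [
    ("ies", ("eies", "aies"), "y"),      # rule 1
    ("es", ("aes", "ees", "oes"), "e"),  # rule 2
    ("s", ("us", "ss"), ""),             # rule 3
]


def s_stem(word: str, fall_through: bool) -> str:
    """Apply the assumed rules under one of two interpretations.

    fall_through=False: a word that matches a rule's suffix but hits one
    of its exceptions is left unchanged (the rule counts as "applied").
    fall_through=True: such a word is passed on to the next rule.
    """
    for suffix, exceptions, replacement in RULES:
        if not word.endswith(suffix):
            continue
        if not word.endswith(exceptions):
            return word[: -len(suffix)] + replacement
        if not fall_through:
            return word  # exception hit: stop without stemming
    return word


for w in ("bees", "employees", "tomatoes", "horses", "ponies"):
    print(w, s_stem(w, fall_through=False), s_stem(w, fall_through=True))
```

Under the strict reading, `bees` is left as `bees`; under the fall-through reading it reaches rule 3 and becomes `bee`. In this sketch the fall-through reading turns `tomatoes` into `tomatoe` rather than `tomato`. It also shows why rule 2 becomes redundant once exceptions fall through: replacing `es` with `e` gives the same result as rule 3 dropping the final `s`, as with `horses` becoming `horse` either way.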
The origins of the S-stemmer algorithm appear to be lost in time - Donna didn't author it and suggested the logic may be connected to the SMART system from way back when.
Rather than trying to resolve that, I've been working on an alternative plural stemmer for elasticsearch here
Cool. Could you let me know when you feel your stemmer is done and merged into ES? I can then replicate it here if you want. Or feel free to open a PR yourself if you'd rather do that.
I see that `bees` doesn't stem to `bee` and `tomatoes` doesn't stem to `tomato`. Is this misinterpreting the logic in the original paper? I ask because I work on elasticsearch and discovered that we have a similar issue. See https://github.com/elastic/elasticsearch/issues/42892#issuecomment-502736225 for my notes on the confusion.