Yomguithereal / talisman

Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
https://yomguithereal.github.io/talisman/
MIT License

s-stemmer deviates from paper? #157

Open markharwood opened 5 years ago

markharwood commented 5 years ago

I see that bees doesn't stem to bee and tomatoes doesn't stem to tomato.

Is the implementation misinterpreting the logic in the original paper? I ask because I work on Elasticsearch and discovered that we have a similar issue. See https://github.com/elastic/elasticsearch/issues/42892#issuecomment-502736225 for my notes on the confusion.
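For context, here is a minimal sketch of the two readings in question. The thread itself does not list the rules, so the rule table below is an assumption: it follows the three S-stemmer rules as they are commonly quoted from the paper. `stemStrict` and `stemFallThrough` are hypothetical names for illustration, not talisman's API.

```javascript
// Assumption: the three rules as commonly quoted from the paper.
// Rules are tried in order; each has a suffix, exception endings and a replacement.
const RULES = [
  {suffix: 'ies', exceptions: ['eies', 'aies'], replacement: 'y'},
  {suffix: 'es', exceptions: ['aes', 'ees', 'oes'], replacement: 'e'},
  {suffix: 's', exceptions: ['us', 'ss'], replacement: ''}
];

function endsWithAny(word, endings) {
  return endings.some(ending => word.endsWith(ending));
}

// Strict reading: the first rule whose suffix matches "claims" the word.
// If one of its exceptions applies, the word is returned unchanged.
function stemStrict(word) {
  for (const rule of RULES) {
    if (!word.endsWith(rule.suffix)) continue;
    if (endsWithAny(word, rule.exceptions)) return word;
    return word.slice(0, -rule.suffix.length) + rule.replacement;
  }
  return word;
}

// Fall-through reading: a rule blocked by an exception is skipped,
// and the next rule gets a chance to fire.
function stemFallThrough(word) {
  for (const rule of RULES) {
    if (word.endsWith(rule.suffix) && !endsWithAny(word, rule.exceptions)) {
      return word.slice(0, -rule.suffix.length) + rule.replacement;
    }
  }
  return word;
}

console.log(stemStrict('bees'), stemFallThrough('bees'));
console.log(stemStrict('tomatoes'), stemFallThrough('tomatoes'));
```

Under the strict reading, `bees` and `tomatoes` are left untouched, which matches the behavior reported above. Under the fall-through reading, `bees` becomes `bee`, while `tomatoes` becomes `tomatoe` rather than `tomato`, because rule 3 only strips the final `s`.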

Yomguithereal commented 5 years ago

Hello @markharwood. That's entirely possible, because I think I wrote my implementation while reading Lucene's, which should be the same one ES is using. Do you, by chance, have a link to, or a PDF of, the original article? As stated here, I could only find a paper referencing the algorithm and explaining its broad intentions.

markharwood commented 5 years ago

No, I only saw the same paper as you. I've just tried sending an email to the original paper's author. I'm sure she'd like to see her algorithm implemented correctly too.

markharwood commented 5 years ago

I heard back from Donna, the paper's author. She agrees that words like bees and employees should fall through to rule 3, which removes the S. However, that logic would make rule 2 redundant. Rule 1 also has some weird-looking exceptions which don't appear to relate to any common English words that I know of.
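The redundancy can be checked with a small sketch. It assumes the three rules as they are commonly quoted from the paper (the thread does not spell them out) and the fall-through reading described above. `makeFallThroughStemmer` and the sample word list are hypothetical and only serve the illustration.

```javascript
// Assumption: the three rules as commonly quoted from the paper.
const RULE_1 = {suffix: 'ies', exceptions: ['eies', 'aies'], replacement: 'y'};
const RULE_2 = {suffix: 'es', exceptions: ['aes', 'ees', 'oes'], replacement: 'e'};
const RULE_3 = {suffix: 's', exceptions: ['us', 'ss'], replacement: ''};

// Builds a stemmer in which a rule blocked by an exception is skipped
// and the next rule in the list is tried.
function makeFallThroughStemmer(rules) {
  return function stem(word) {
    for (const {suffix, exceptions, replacement} of rules) {
      const blocked = exceptions.some(ending => word.endsWith(ending));
      if (word.endsWith(suffix) && !blocked) {
        return word.slice(0, word.length - suffix.length) + replacement;
      }
    }
    return word;
  };
}

const withRule2 = makeFallThroughStemmer([RULE_1, RULE_2, RULE_3]);
const withoutRule2 = makeFallThroughStemmer([RULE_1, RULE_3]);

// Hypothetical sample; any word list can be substituted.
const SAMPLE = ['bees', 'employees', 'horses', 'tomatoes', 'ponies', 'glass', 'bus', 'cats'];

// Words for which dropping rule 2 changes the result.
const differing = SAMPLE.filter(word => withRule2(word) !== withoutRule2(word));
console.log(differing);
```

Replacing `es` with `e` is the same operation as dropping the final `s`, and a word ending in `es` can never end in `us` or `ss`, so once blocked words fall through, rule 3 produces everything rule 2 would have.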

The origins of the S-stemmer algorithm appear to be lost in time. Donna didn't author it and suggested the logic may be connected to the SMART system from way back when.

Rather than trying to resolve that, I've been working on an alternative plural stemmer for Elasticsearch here.

Yomguithereal commented 5 years ago

Cool. Can you let me know when you feel your stemmer is done and merged into ES? I will then be able to replicate it here if you want. Or feel free to open a PR yourself if you'd like.