Yomguithereal / talisman

Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
https://yomguithereal.github.io/talisman/
MIT License

UEA-Lite short word stemming breaks #131

Open etler opened 7 years ago

etler commented 7 years ago

This issue was originally raised in the Ruby project this stemmer was ported from:

https://github.com/ealdent/uea-stemmer/issues/3

Has there been a decision on how to resolve the issue? The suggested solution was simply to add all three-letter words to the problem word set. Maybe the problem word set could be exposed as an option so users could provide their own?
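To make the "expose the problem word set" idea concrete, here is a rough sketch. The `withProblemWords` name and the `baseStem` parameter are hypothetical and are not part of talisman's API; `baseStem` stands in for any stemming function.

```javascript
// Hypothetical wrapper, not part of talisman: returns a stemmer that leaves
// user-supplied problem words untouched and delegates everything else.
function withProblemWords(baseStem, problemWords = []) {
  const skip = new Set(problemWords);

  return function stem(word) {
    // Words in the problem set are returned as-is.
    if (skip.has(word)) return word;
    return baseStem(word);
  };
}

// Usage with a deliberately naive stand-in stemmer (an assumption for the demo).
const naiveStem = word => (word.endsWith('ed') ? word.slice(0, -2) : word);
const stem = withProblemWords(naiveStem, ['bed', 'red']);
```

With this shape, the default problem word list could stay as it is while users extend it for their own corpus.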

Another potential solution is to require the word to be more than one letter longer than the suffix being removed. In the example, removing "ed" from "bed" leaves a single letter. Since there are no single-consonant words, and no word should be reduced to a one-letter root, we can skip removing the suffix whenever the result would be a one-letter word. This would also fix other short words that happen to end in a substring matching a suffix, such as "sing", which is currently shortened to "se".
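A minimal sketch of that length guard, assuming a simple suffix-stripping helper. The names `stripSuffixGuarded` and `MIN_STEM_LENGTH` are made up for illustration, and the real UEA-Lite rules are regex-based and more involved than this.

```javascript
// Hypothetical constant: the shortest stem we allow after removing a suffix.
const MIN_STEM_LENGTH = 2;

// Hypothetical helper, not talisman's implementation: removes `suffix` from
// `word` only if at least MIN_STEM_LENGTH letters would remain.
function stripSuffixGuarded(word, suffix, replacement = '') {
  if (!word.endsWith(suffix)) return word;

  const remaining = word.length - suffix.length;

  // Skip the rule when it would leave a one-letter (or empty) stem.
  if (remaining < MIN_STEM_LENGTH) return word;

  return word.slice(0, remaining) + replacement;
}
```

Under this guard, `stripSuffixGuarded('bed', 'ed')` and `stripSuffixGuarded('sing', 'ing')` return the word unchanged, while `stripSuffixGuarded('walked', 'ed')` still returns `'walk'`.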

I also added a comment suggesting that solution on the issue linked above, since it would be preferable not to diverge from the source implementation.

Yomguithereal commented 7 years ago

Hello @etler, this does seem like a good solution. I don't remember whether I kept an option to disable @ealdent's improvements when using the algorithm, but I think we should have one, because I would like the original algorithm to remain available for historical/compatibility purposes.

That said, I'd be more than willing to review a PR implementing the improvements you deem useful, along with some unit tests demonstrating them. We could then, as we do for the phonetic algorithms for instance, expose a "revisited" or "improved" version of the original algorithm.