UEA-Lite short word stemming breaks

Yomguithereal / talisman

Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.

MIT License


This issue was first raised in the Ruby project that this stemmer was ported from:

https://github.com/ealdent/uea-stemmer/issues/3

Has there been a decision on how to resolve the issue? The solution suggested there was simply to add all three-letter words to the problem word set. Maybe the problem word set could be exposed as an option so that users can provide their own?
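To illustrate the idea of a user-supplied problem word set, here is a minimal sketch. It does not use talisman's actual API: `withProblemWords` and `naiveStem` are hypothetical names, and `naiveStem` is only a toy stand-in for the real stemmer.

```javascript
// Hypothetical wrapper: `stem` is any function mapping a word to its stem,
// and `problemWords` is a user-provided list of words to leave untouched.
function withProblemWords(stem, problemWords) {
  const exceptions = new Set(problemWords);

  return function (word) {
    // Words in the exception set are returned as-is.
    return exceptions.has(word) ? word : stem(word);
  };
}

// Toy stand-in for a real stemmer, for demonstration only.
function naiveStem(word) {
  return word.endsWith('ed') ? word.slice(0, -2) : word;
}

const safeStem = withProblemWords(naiveStem, ['bed', 'red']);
```

With this wrapper, `safeStem('bed')` leaves the word alone, while words outside the set still go through the underlying stemmer.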

Another potential solution is to require the word to be more than one letter longer than the suffix being removed. In the example, removing "ed" from "bed" leaves a single letter. Since there are no single-consonant words, and no word should be reduced to a one-letter root, the stemmer can skip removing a suffix whenever the result would be a one-letter word. This would also fix other short words that happen to end in letters matching a suffix, such as "sing", which is currently shortened to "se".
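The length check could look roughly like the sketch below. It is an assumption about how such a guard might be written, not the stemmer's actual rule code: `stripSuffix` and `minStemLength` are hypothetical names.

```javascript
// Hypothetical helper: removes `suffix` from `word` only if at least
// `minStemLength` letters would remain afterwards.
function stripSuffix(word, suffix, minStemLength = 2) {
  if (!word.endsWith(suffix)) return word;

  const stem = word.slice(0, word.length - suffix.length);

  // Skip the removal when the remaining stem would be too short.
  return stem.length >= minStemLength ? stem : word;
}
```

Under this rule, `stripSuffix('bed', 'ed')` and `stripSuffix('sing', 'ing')` both return the word unchanged, because only one letter would remain, while `stripSuffix('walked', 'ed')` still yields `'walk'`.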

I also added a comment suggesting that solution on the issue linked above, since it would be preferable not to diverge from the source implementation.


UEA-Lite short word stemming breaks #131