Closed Yomguithereal closed 7 years ago
Glad to hear you're making a port to JavaScript! I do think the stemmer should be tuned to fix this; "b" for "bed" is not very useful. We could add all the three-letter -ed words to #problem_word, since it would be a limited set. Otherwise, a regex rule to match them would probably work. Either way, we'd need to account for "led" -> "lead" separately.
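A rough sketch of the exception-list idea, in Python rather than the stemmer's own code. The table `SHORT_ED_WORDS`, its entries, and the function `stem_with_exceptions` are hypothetical names made up for illustration; this is not how #problem_word is actually implemented.

```python
from typing import Callable

# Hypothetical exception table (illustrative entries only): short -ed words
# that should not lose their suffix, plus the irregular "led" -> "lead".
SHORT_ED_WORDS = {
    "bed": "bed",
    "red": "red",
    "wed": "wed",
    "led": "lead",
}


def stem_with_exceptions(word: str, fallback: Callable[[str], str]) -> str:
    """Return the listed stem for known short words, else defer to `fallback`.

    `fallback` stands in for the regular rule-based stemmer.
    """
    lowered = word.lower()
    if lowered in SHORT_ED_WORDS:
        return SHORT_ED_WORDS[lowered]
    return fallback(word)
```

Because the set of three-letter -ed words is small, a plain lookup like this stays cheap and keeps the irregular cases in one place.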
I'm open to suggestions on what you think would be more useful. There are probably a ton of optimizations to be made here, but since you're porting I'll leave those particular considerations to you.
Another solution that might work is to check the word length to make sure it's more than one letter longer than the suffix being removed. Since there are no single-consonant words, and no words that should be shortened to a root that is only one letter long, we can skip removing the suffix whenever doing so would leave a one-letter word. It would also fix other short words that happen to include substrings that match a suffix, such as "sing", which currently is shortened to "se".
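A minimal sketch of that length check, in Python rather than the stemmer's own code. The rule table `RULES` and the function `strip_suffix` are hypothetical and cover only two toy suffixes; the real rules are more numerous and do more than strip characters.

```python
import re

# Hypothetical rule table: (suffix pattern, number of characters to strip).
# These two entries are illustrative, not the stemmer's actual rules.
RULES = [
    (re.compile(r"ing$"), 3),
    (re.compile(r"ed$"), 2),
]


def strip_suffix(word: str) -> str:
    """Strip the first matching suffix unless that leaves one letter or fewer."""
    for pattern, strip_len in RULES:
        if pattern.search(word):
            # Guard: the word must be more than one letter longer than the
            # suffix being removed; otherwise leave it untouched.
            if len(word) - strip_len <= 1:
                return word
            return word[:-strip_len]
    return word
```

With this guard, "bed" and "sing" come back unchanged, while "walked" still becomes "walk". In this sketch a five-letter word like "bring" would still be cut to "br", so longer cases need a stricter threshold or an exception list.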
Pushed a fix in version 0.10.2 that should fix short words like sing, ring, bring, etc.
Couple more fixes in 0.10.3.
Forgot the case of short words ending in -ed. Still not perfect, as "fed" and "bled" stem as themselves instead of "feed" and "bleed" respectively, but a step closer. Open to pull requests, or I will get around to it eventually.
Hello @ealdent. While creating a port of the stemmer for JavaScript here, I noticed that some short words are stemmed incorrectly, notably words that match a rule in their entirety. Rule 36 and the word "bed" are a good example of this phenomenon: the stemmer produces the stem "b".
So the question is, do you think this is fine or should we tune the stemmer a little bit to fix this?
Have a good day