ealdent / uea-stemmer

Ruby port of UEALite Stemmer - a conservative stemmer for search and indexing
http://www.uea.ac.uk/cmp/research/graphicsvisionspeech/speech/WordStemming
Apache License 2.0

Short words stemming #3

Closed. Yomguithereal closed this issue 7 years ago

Yomguithereal commented 8 years ago

Hello @ealdent. While porting the stemmer to JavaScript, I noticed that some short words are stemmed incorrectly, notably words that are almost entirely consumed by a rule's suffix. Rule 36 and the word "bed" are a good example of this: the resulting stem is "b".

So the question is: do you think this is fine, or should we tune the stemmer a little to fix it?

Have a good day

ealdent commented 8 years ago

Glad to hear you're making a port to JavaScript! I do think the stemmer should be tuned to fix this. "b" for "bed" is not very useful. We could add all the three-letter -ed words to #problem_word?, since it would be a limited set. Otherwise, a regex rule could probably match them. And then we'd need to account for "led" -> "lead" separately.

I'm open to suggestions on what you think would be more useful. There are probably a ton of optimizations to be made here, but since you're porting I'll leave those particular considerations to you.
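
For illustration, the word-list idea above could look roughly like the sketch below. The constants and the `short_ed_stem` method are hypothetical names invented for this example and are not part of uea-stemmer; the word list is deliberately incomplete.

```ruby
require "set"

# Hypothetical, deliberately incomplete list of three-letter words ending
# in "ed" that should be left untouched instead of losing the "ed" suffix.
# A regex such as /\A[a-z]ed\z/ could stand in for the explicit list.
SHORT_ED_WORDS = Set.new(%w[bed red wed]).freeze

# Hypothetical overrides for short words whose stem is a different word.
IRREGULAR_STEMS = { "led" => "lead" }.freeze

# Returns the stem for a known short -ed word, or nil when the word is not
# covered, so the caller can fall through to the regular suffix rules.
def short_ed_stem(word)
  return IRREGULAR_STEMS[word] if IRREGULAR_STEMS.key?(word)
  return word if SHORT_ED_WORDS.include?(word)

  nil
end

short_ed_stem("bed")    # returns "bed"
short_ed_stem("led")    # returns "lead"
short_ed_stem("walked") # returns nil
```

Checking the overrides first keeps "led" -> "lead" separate from the words that simply stem to themselves.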

etler commented 7 years ago

Another solution that might work is to check the word length and make sure the word is more than one letter longer than the suffix being removed. Since there are no single-consonant words, and no words that should be shortened to a one-letter root, we can skip removing the suffix whenever the result would be a one-letter word. This would also fix other short words that happen to end in a substring matching a suffix, such as "sing", which is currently shortened to "se".
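
A minimal sketch of that length guard follows. `strip_suffix_guarded` and `MIN_STEM_LENGTH` are hypothetical names used only for this example; the stemmer's real rule handling is not shown here and may differ.

```ruby
# Hypothetical minimum number of letters that must remain after removing
# a suffix; 2 means a one-letter stem is never produced.
MIN_STEM_LENGTH = 2

# Removes +suffix+ from +word+ only when enough of the word would be left.
# Otherwise the word is returned unchanged.
def strip_suffix_guarded(word, suffix, min_stem_length = MIN_STEM_LENGTH)
  return word unless word.end_with?(suffix)

  stem_length = word.length - suffix.length
  return word if stem_length < min_stem_length

  word[0, stem_length]
end

strip_suffix_guarded("bed", "ed")    # returns "bed"
strip_suffix_guarded("sing", "ing")  # returns "sing"
strip_suffix_guarded("walked", "ed") # returns "walk"
```

With a minimum of two letters, a word like "bring" would still lose its suffix and become "br", so longer cases would need a stricter threshold or their own handling.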

ealdent commented 7 years ago

Pushed a change in version 0.10.2 that should fix short words like "sing", "ring", "bring", etc.

ealdent commented 7 years ago

A couple more fixes in 0.10.3.

I had forgotten the case of short words ending in -ed. It's still not perfect, since "fed" and "bled" stem to themselves instead of "feed" and "bleed" respectively, but it's a step closer. Open to pull requests; otherwise I will get around to it eventually.