NaturalNode / natural

general natural language facilities for node
MIT License

Aggressive stemming #148

Closed: ghost closed this issue 10 years ago

ghost commented 10 years ago

Hello, I wonder whether this output is correct:

natural.PorterStemmer.attach();
"10".tokenizeAndStem()
['10']

natural.PorterStemmer.attach();
"9".tokenizeAndStem()
[]

What's wrong with '9'?

kkoch986 commented 10 years ago

This does seem like a bug. I tested against the NLTK Porter stemmer and got "10" and "9".

EDIT: I found the issue. The single character "9" tokenizes as ["9"] and "10" tokenizes as ["10"], so the tokenizer is not the problem.

The string "9" is on the stopwords list, "10" is not.

You can do this to avoid stripping out stopwords:

"9".tokenizeAndStem(true)
["9"]
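To make the behaviour above concrete, here is a minimal, self-contained sketch of the filtering step. It is not the library's code: the `STOPWORDS` set, the `tokenize` helper and the `tokenizeAndFilter` function are hypothetical names invented for illustration, the tokenizer regex is an assumption, and the stemming step is left out entirely. The only detail taken from this thread is that "9" is on the stopwords list and "10" is not.

```javascript
// Hypothetical stand-in for the stopwords list. Only the membership of
// "9" (present) and "10" (absent) comes from the thread; the rest is filler.
const STOPWORDS = new Set(["a", "the", "of", "9"]);

// Assumed tokenizer: lowercase, then split on anything that is not a
// letter or a digit, dropping empty strings.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Same shape as tokenizeAndStem(keepStops), minus the stemming step:
// stopwords are dropped unless keepStops is true.
function tokenizeAndFilter(text, keepStops = false) {
  const tokens = tokenize(text);
  return keepStops ? tokens : tokens.filter((t) => !STOPWORDS.has(t));
}

console.log(tokenizeAndFilter("10"));      // ["10"]: not a stopword, kept
console.log(tokenizeAndFilter("9"));       // []: "9" is filtered out
console.log(tokenizeAndFilter("9", true)); // ["9"]: keepStops skips the filter
```

With the filter on by default, a lone stopword produces an empty array, which is exactly the result reported for "9".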

I'm not going to classify this as a bug: we can't add all numbers to the stopwords list, and currently there's not a single function which does stopwords filtering.

-Ken