Closed ghost closed 10 years ago
This does seem like a bug. I tested against the NLTK Porter stemmer and got "10" and "9".
EDIT: I found the issue. The single character "9" tokenizes as ["9"] and "10" tokenizes as ["10"].
The string "9" is on the stopwords list; "10" is not.
You can pass true to tokenizeAndStem to avoid stripping out stopwords:
"9".tokenizeAndStem(true)
["9"]
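To make the mechanism concrete, here is a rough, self-contained sketch of a tokenize-then-filter step. It is not the library's actual code: the STOPWORDS set, tokenize and tokenizeAndFilter are hypothetical names, the stopword list is a tiny stand-in (only "9" being on the list is taken from this thread), and stemming is left out because it does not change digit-only tokens here.

```javascript
// Hypothetical stand-in for the stopwords list. The real list is much longer;
// the only entry taken from this thread is "9". Note that "10" is absent.
const STOPWORDS = new Set(['a', 'the', 'is', '9']);

// Simplified tokenizer (an assumption): lowercase, then split on anything
// that is not a letter or digit, dropping empty pieces.
function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

// Mimics the keepStops flag discussed above: when it is true, stopwords are
// kept; otherwise every token found in STOPWORDS is dropped.
function tokenizeAndFilter(text, keepStops = false) {
  const tokens = tokenize(text);
  return keepStops ? tokens : tokens.filter((token) => !STOPWORDS.has(token));
}

console.log(tokenizeAndFilter('10'));      // ['10']  (not a stopword)
console.log(tokenizeAndFilter('9'));       // []      (filtered as a stopword)
console.log(tokenizeAndFilter('9', true)); // ['9']   (stopwords kept)
```

With this shape, the empty result for "9" comes entirely from the filter step, which is why passing true brings the token back.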
I'm not going to classify this as a bug: we can't add all numbers to the stopwords list, and currently there's no single function that does stopwords filtering.
-Ken
Hello, I wonder if this is the correct output:
natural.PorterStemmer.attach(); "10".tokenizeAndStem()
['10']
natural.PorterStemmer.attach(); "9".tokenizeAndStem()
[]
What's wrong with '9'?