TfidfVectorizer produces incomplete words in its vectors.

PacktPublishing / Python-Natural-Language-Processing-Cookbook

Python Natural Language Processing Cookbook, published by Packt

MIT License

Stemming maps some related words, such as “happy” and “happiness”, to the same stem, “happi”, so they are grouped together as one feature. Converting word features to stem features in this way is effective in reducing the number of features. However, there are some problems:

The stemmer rules are manually crafted based on statistics, so they are not always correct when applied to a large sample vocabulary (Porter, 2001). Stems can also be meaningless strings that are not in any dictionary (e.g. “is” -> “i”, “happy” -> “happi”).

Source: https://tomelf.github.io/nlp/machine%20learning/text-preprocessing/
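To make the grouping described in the quoted passage concrete, here is a minimal sketch that uses only the standard library. The suffix rules, `toy_stem`, and `build_vocabulary` are hypothetical names invented for this illustration; the rules are a deliberately tiny stand-in and are not the Porter algorithm or anything taken from the cookbook's code.

```python
from collections import Counter

# Toy suffix rules for illustration only; this is NOT the Porter algorithm.
TOY_RULES = [("iness", "i"), ("y", "i")]


def toy_stem(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    for suffix, replacement in TOY_RULES:
        # Require at least two characters to remain before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word


def build_vocabulary(documents, stem=None):
    """Count features across documents, optionally stemming each token first."""
    counts = Counter()
    for doc in documents:
        for token in doc.lower().split():
            counts[stem(token) if stem else token] += 1
    return counts


docs = ["happy people share happiness", "happiness makes people happy"]

raw_vocab = build_vocabulary(docs)                    # one feature per distinct word
stemmed_vocab = build_vocabulary(docs, stem=toy_stem)  # one feature per distinct stem

print(sorted(raw_vocab))
print(sorted(stemmed_vocab))
```

With these rules, “happy” and “happiness” collapse into the single feature “happi”, which is not a dictionary word. The same effect is a likely explanation whenever tokens are stemmed before being handed to a vectorizer: the feature names it reports are then the stems rather than the original words.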

TfidfVectorizer produces incomplete words in its vectors. #5