PacktPublishing / Python-Natural-Language-Processing-Cookbook

Python Natural Language Processing Cookbook, published by Packt
MIT License

TfidfVectorizer produces incomplete words in its vectors. #5

Closed johnosbb closed 1 year ago

johnosbb commented 1 year ago

In the example on page 75 of the book, the output shows some truncated words: for example, "delic fine adjust" in place of "delicate fine adjust".

The text says: "The TfidfVectorizer class allows for all the functionality of CountVectorizer, except that it uses the TF-IDF algorithm to count the words instead of direct counts." If the TF-IDF algorithm counts words, why do we get sequences like "delic fine adjust"?

johnosbb commented 1 year ago

This looks like the effect of stemming rather than of TF-IDF itself. For words like “happy” and “happiness”, the stemming process converts both to the same stem, “happi”, and they are grouped together as one feature. Stemming turns word features into stem features, which is effective in reducing the number of features. However, there are some problems:

The stemmer rules are manually crafted based on statistics, so they are not always correct when given a large sample vocabulary (Porter, 2001).

Stems can be meaningless strings that are not in any dictionary (e.g. “is” -> “i”, “happy” -> “happi”).

Source: https://tomelf.github.io/nlp/machine%20learning/text-preprocessing/
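
To make the point concrete, here is a minimal sketch showing that TF-IDF only reweights whatever tokens it is handed, so truncated features have to come from the tokenizer. Everything in it is made up for illustration and is not the book's code: `toy_stem` and `TOY_SUFFIX_RULES` are a deliberately tiny stand-in for a real stemmer such as Porter or Snowball, `tfidf` is a hand-rolled, unnormalized TF-IDF with one common smoothed idf variant, and `DOCS` is a made-up corpus. It assumes the recipe passes a stemming callable to TfidfVectorizer (through its `tokenizer` parameter), which would explain the output on page 75.

```python
import math
from collections import Counter

# Hypothetical toy rules for illustration only; a real stemmer such as
# Porter or Snowball applies many more rules and conditions.
TOY_SUFFIX_RULES = [("iness", "i"), ("ment", ""), ("ate", ""), ("y", "i")]

# Made-up corpus, not taken from the book.
DOCS = ["delicate fine adjustment", "happy and fine", "happiness and adjustment"]


def toy_stem(word):
    """Apply the first matching suffix rule, keeping at least three characters."""
    for suffix, replacement in TOY_SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return stem + replacement
    return word


def plain_tokens(text):
    """Lowercase and split on whitespace; words stay whole."""
    return text.lower().split()


def stemmed_tokens(text):
    """Same split, but every token is replaced by its toy stem."""
    return [toy_stem(token) for token in plain_tokens(text)]


def tfidf(docs, tokenize):
    """Return one {term: weight} dict per document (tf * smoothed idf)."""
    tokenized = [tokenize(doc) for doc in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return weights


# The feature names are exactly the tokens the tokenizer produced.
plain_vocab = sorted({t for row in tfidf(DOCS, plain_tokens) for t in row})
stemmed_vocab = sorted({t for row in tfidf(DOCS, stemmed_tokens) for t in row})

print(plain_vocab)    # ['adjustment', 'and', 'delicate', 'fine', 'happiness', 'happy']
print(stemmed_vocab)  # ['adjust', 'and', 'delic', 'fine', 'happi']
```

With the plain tokenizer the features are whole words. With the stemming tokenizer, "delicate" shows up as "delic", and "happy" and "happiness" collapse into the single feature "happi". The weighting step is identical in both runs, so if whole words are wanted in the output, the thing to change is the tokenizer (for example, dropping the stemmer or using a lemmatizer instead), not the TF-IDF step.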