Closed johnosbb closed 1 year ago
For some words, like "happy" and "happiness", the stemming process converts both to the same stem, "happi", so they are grouped together as one feature. Stemming converts word features into stem features, which is effective at reducing the number of features. However, there are some problems:
The stemmer rules are manually crafted based on statistics, so they are not always correct when applied to a large vocabulary (Porter, 2001). Stems can be meaningless strings that are not in any dictionary (e.g. "is" -> "i", "happy" -> "happi").
Source: https://tomelf.github.io/nlp/machine%20learning/text-preprocessing/
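To make the suffix-stripping idea concrete, here is a toy illustration in pure Python. This is only a sketch of the general technique; the real Porter stemmer has many more rules and conditions, and libraries such as NLTK provide a full implementation:

```python
def toy_stem(word):
    """Toy suffix-stripping stemmer (illustration only; the actual
    Porter algorithm applies ordered rule steps with measure checks)."""
    # Strip a common derivational suffix first...
    if word.endswith("ness"):
        word = word[:-len("ness")]
    # ...then normalize a trailing "y" to "i", as Porter does.
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word

print(toy_stem("happy"))      # -> happi
print(toy_stem("happiness"))  # -> happi
```

Both words collapse to the non-dictionary stem "happi", which is exactly why stemmed output can look truncated.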
In the example on page 75 of the book, the output shows some truncated words: for example, "delic fine adjust" in place of "delicate fine adjust".
The text says: "The TfidfVectorizer class allows for all the functionality of CountVectorizer, except that it uses the TF-IDF algorithm to count the words instead of direct counts." If the TF-IDF algorithm counts words, why do we get sequences like "delic fine adjust"?
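My understanding is that the truncation does not come from TF-IDF at all: the stemmer runs first, and the vectorizer only weights whatever tokens it is handed. A stdlib-only sketch of that pipeline (the documents and the smoothed-IDF formula here are assumptions, loosely modelled on scikit-learn's default):

```python
import math

# Hypothetical corpus AFTER stemming: "delicate" has already been
# truncated to "delic" before any counting or weighting happens.
docs = [
    ["delic", "fine", "adjust"],
    ["fine", "tune", "adjust"],
]

def tfidf(docs):
    """Plain TF-IDF over pre-tokenized documents, using a smoothed IDF
    similar in spirit to scikit-learn's default formula."""
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    # Document frequency: in how many documents each term appears.
    df = {t: sum(t in d for d in docs) for t in vocab}
    rows = []
    for d in docs:
        row = {}
        for t in vocab:
            tf = d.count(t)
            idf = math.log((1 + n) / (1 + df[t])) + 1  # smoothed IDF
            row[t] = tf * idf
        rows.append(row)
    return rows

weights = tfidf(docs)
# The feature name is the stem "delic"; TF-IDF only assigns it a weight.
print(weights[0]["delic"])
```

So "delic fine adjust" is the stemmer's output; swapping CountVectorizer for TfidfVectorizer changes the numbers attached to those stems, not the stems themselves.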