Wrong explanation of Fig. 3.1.

deyanyosifov commented 8 months ago

"There are very long tails to the right for these novels (those extremely rare words!) that we have not shown in these plots." In fact, the extremely rare words have low n/total and sit at the leftmost side of the histogram. There are a lot of unique rare words that were used only once or twice in a book, which is why the first column of the histogram is so high. The common words are far fewer; they have high n/total and lie to the right. The most common words ("a", "the", prepositions) are not even on the histograms because the x-axis has been limited on the right. For "the" in Mansfield Park, n/total = 0.0386751, which is larger than 0.0009, the upper limit of the x-axis.
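The point about where rare and common words fall can be checked on a small example. The sketch below is only an illustration in Python, not the book's R code: the toy sentence is made up, and the only numbers taken from this thread are the n/total value for "the" in Mansfield Park and the 0.0009 x-axis limit.

```python
from collections import Counter

# Toy text, made up for illustration; not taken from any of the novels.
text = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the bird sang and the fox ran"
)

words = text.split()
total = len(words)
counts = Counter(words)

# Term frequency as plotted in the figure: n / total for each distinct word.
tf = {word: n / total for word, n in counts.items()}

# Words used exactly once have the smallest possible n/total (1 / total),
# so they pile up in the leftmost bin of a histogram of n/total.
rare = sorted(w for w, n in counts.items() if n == 1)
most_common_word, most_common_n = counts.most_common(1)[0]

print(f"total words: {total}")
print(f"distinct words used once: {len(rare)} of {len(counts)}")
print(f"n/total of a once-used word: {1 / total:.4f}")
print(f"n/total of '{most_common_word}': {tf[most_common_word]:.4f}")

# Values quoted in this issue: "the" in Mansfield Park vs. the x-axis limit.
the_tf_mansfield_park = 0.0386751
x_axis_limit = 0.0009
print(f"'the' falls outside the plotted range: {the_tf_mansfield_park > x_axis_limit}")
```

Even in this tiny text, most distinct words occur once and share the lowest n/total, while the single most frequent word has by far the highest value and would be the first to be cut off by a limit on the right of the x-axis.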

juliasilge commented 8 months ago

Wow @deyanyosifov this is a typo that has made it through 5 years of corrections from users, multiple rounds of copyediting, etc. Congratulations! 😆

I'll get this corrected and submitted to the errata.

juliasilge commented 8 months ago

Ah, this was already reported to the errata and I approved it! 🙈

https://www.oreilly.com/catalog/errata.csp?isbn=9781491981658

juliasilge commented 8 months ago

I fixed this in #112 and the new version is now deployed at https://www.tidytextmining.com/tfidf

Thanks again @deyanyosifov!

dgrtwo / tidy-text-mining

Wrong explanation of Fig. 3.1. #111