dgrtwo / tidy-text-mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson
http://tidytextmining.com

Wrong explanation of Fig. 3.1. #111

Closed: deyanyosifov closed this issue 8 months ago

deyanyosifov commented 8 months ago

"There are very long tails to the right for these novels (those extremely rare words!) that we have not shown in these plots." In fact, the extremely rare words have low n/total and they are at the leftmost side of the histogram. There are a lot of unique rare words that were used once or twice in a book, that's why the first column of the histogram is so high. The common words are not so many, they have high n/total and are to the right. The most common words ("a", "the", prepositions) are not even on the histograms because the x-axis has been limited to the right. For "the" in Mansfield Park n/total = 0.0386751 which is larger that 0.0009 that is the threshold of the x-axis.

juliasilge commented 8 months ago

Wow @deyanyosifov this is a typo that has made it through 5 years of corrections from users, multiple rounds of copyediting, etc. Congratulations! 😆

I'll get this corrected and submitted to the errata.

juliasilge commented 8 months ago

Ah, this was already reported to the errata and I approved it! 🙈

https://www.oreilly.com/catalog/errata.csp?isbn=9781491981658

juliasilge commented 8 months ago

I fixed this in #112 and the new version is now deployed at https://www.tidytextmining.com/tfidf

Thanks again @deyanyosifov!