dgrtwo / tidy-text-mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson
http://tidytextmining.com

Incorrect order in Figure 5.3 due to duplicate terms #47

Closed yuwen41200 closed 5 years ago

yuwen41200 commented 6 years ago

In Figure 5.3, the order of the term "-" is incorrect. I think this is because both documents, 1961-Kennedy and 2009-Obama, contain the term "-", so when reorder(term, tf_idf) is called, the function calculates the mean of the two tf-idf values and orders the term by that mean.
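To illustrate the behavior described above, here is a minimal base R sketch. The tf-idf values and the terms "sides" and "journey" are made up for illustration; only the shared "-" term and the two document names come from the issue. It is not the book's code, just a small reproduction of how reorder() pools duplicate terms.

```r
# Toy data: the numbers are invented; only the duplicated "-" term
# mirrors the situation described in the issue.
df <- data.frame(
  document = c("1961-Kennedy", "1961-Kennedy", "2009-Obama", "2009-Obama"),
  term     = c("-", "sides", "-", "journey"),
  tf_idf   = c(0.010, 0.004, 0.002, 0.005)
)

# reorder() applies FUN (mean by default) to all values that share a level,
# so the two "-" rows are pooled into a single score across both documents.
ordered_term <- reorder(df$term, df$tf_idf)

# One pooled score per distinct term.
print(attr(ordered_term, "scores"))

# Levels sorted by the pooled score, lowest first.
print(levels(ordered_term))
```

In this toy data "-" has the lowest tf-idf within 2009-Obama, yet its pooled mean is the highest of the three terms, so it becomes the last level and is drawn above "journey" in that document's panel.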

By the way, I have a few questions from reading the book:

  1. The caption of Figure 5.4 says "[...] for four selected terms", but there are six terms in the figure.
  2. In the first paragraph of Chapter 5.3.1, I guess the rightmost ) in "For instance, performing WebCorpus(GoogleFinanceSource("NASDAQ:MSFT"))) allows us [...]" is a typo?
  3. The captions of Figure 4.4 and Figure 4.5 say "Common bigrams in Pride and Prejudice," but we don't filter for the book Pride and Prejudice beforehand. I think these bigrams come from all of Austen's novels.

Thank you so much. I really like the book 😃

juliasilge commented 6 years ago

Thanks for reading, @yuwen41200! 🙌

The order of the terms isn't incorrect per se; it is as expected given the behavior of reorder(). This issue does seem to have confused more than just you, however, as it is also the subject of #45.

We have previously discussed adding reorder_within() to tidytext and there is a PR in juliasilge/tidytext#110; this would allow us to use that function in the book. Maybe we should go ahead and pull the trigger on that. @dgrtwo have you thought any more about this? I am now leaning toward including reorder_within() in tidytext. (I know I told you a while back I didn't think it was the right spot for it.)

juliasilge commented 5 years ago

The tidytext PR juliasilge/tidytext#110 has now been merged, so I'll make some updates to the book once the new version of tidytext is on CRAN.
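For readers landing here, a rough sketch of how per-facet ordering might look with reorder_within() and its companion scale_x_reordered(). This assumes an installed tidytext version that already includes both functions. The data are invented, and the column name term_in_doc is a hypothetical choice; this is not the code the book ended up using.

```r
library(ggplot2)
library(tidytext)  # assumes a release that ships reorder_within()

# Toy data with made-up tf-idf values; "-" appears in both documents.
toy <- data.frame(
  document = c("1961-Kennedy", "1961-Kennedy", "2009-Obama", "2009-Obama"),
  term     = c("-", "sides", "-", "journey"),
  tf_idf   = c(0.010, 0.004, 0.002, 0.005)
)

# reorder_within() combines each term with its document before reordering,
# so the two "-" rows become separate factor levels with separate scores.
toy$term_in_doc <- reorder_within(toy$term, toy$tf_idf, toy$document)

# scale_x_reordered() strips the document suffix from the axis labels.
p <- ggplot(toy, aes(term_in_doc, tf_idf)) +
  geom_col() +
  scale_x_reordered() +
  facet_wrap(~document, scales = "free") +
  coord_flip()

print(p)
```

Because each term is scored within its own document, "-" can sit at the top of one panel and the bottom of the other, which plain reorder() cannot express.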