dgrtwo / tidy-text-mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson
http://tidytextmining.com

Ch 7: some code seems inconsistent with text #48

Closed yuwen41200 closed 6 years ago

yuwen41200 commented 6 years ago

Some code seems inconsistent with its description:

word_ratios <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  count(word, person) %>%
  group_by(word) %>%
  filter(sum(n) >= 10) %>%
  ungroup() %>%
  spread(person, n, fill = 0) %>%
  mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
  # should it be: mutate_if(is.numeric, funs((. + 1) / (sum(.) + 1)))
  mutate(logratio = log(David / Julia)) %>%
  arrange(desc(logratio))

words_by_time <- tidy_tweets %>%
  filter(!str_detect(word, "^@")) %>%
  mutate(time_floor = floor_date(timestamp, unit = "1 month")) %>%
  count(time_floor, person, word) %>%
  group_by(person, time_floor) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  # should it be: group_by(person, word)
  mutate(word_total = sum(n)) %>%
  ungroup() %>%
  rename(count = n) %>%
  filter(word_total > 30)

totals <- tidy_tweets %>%
  group_by(person, id) %>%
  summarise(rts = sum(retweets)) %>%
  # should it be: summarise(rts = first(retweets))
  group_by(person) %>%
  summarise(total_rts = sum(rts))

totals <- tidy_tweets %>%
  group_by(person, id) %>%
  summarise(favs = sum(favorites)) %>%
  # should it be: summarise(favs = first(favorites))
  group_by(person) %>%
  summarise(total_favs = sum(favs))

Besides, while I just ran the same code, the output I got is different from the one on the website.

Thank you for any help you can provide.

juliasilge commented 6 years ago

Hello there, @yuwen41200! First off, we had some tangled code chunk names, and that is why the output you got was different from the output online. Thanks so much for your careful reading; this is now fixed. 🙌

Let's talk through the four code chunks you have here.

1)

Here we are writing code to implement the log odds ratio as written out in equation form here. The difference between what you have and what was originally in the book is negligible in terms of the numeric value of the log odds ratio that you get in the end, but what you have is more in line with the equation as printed. The 1s are there to take care of any zero and dividing-by-zero issues, FYI.
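To make the comparison in point 1 concrete, here is a minimal sketch in base R. The counts and the helper names `ratio_a` and `ratio_b` are invented for illustration and are not from the book. It computes the log ratio with both denominators and shows that the two versions differ by the same constant for every word, so the ranking of words is unchanged.

```r
# Toy counts, invented for illustration only (not the book's data)
counts <- data.frame(
  word  = c("data", "rstats", "code", "book"),
  David = c(40, 25, 0, 5),
  Julia = c(10, 30, 12, 0)
)

# Variant from the original chunk: denominator is sum(n + 1)
ratio_a <- function(n) (n + 1) / sum(n + 1)

# Variant proposed in this issue: denominator is sum(n) + 1
ratio_b <- function(n) (n + 1) / (sum(n) + 1)

logratio_a <- log(ratio_a(counts$David) / ratio_a(counts$Julia))
logratio_b <- log(ratio_b(counts$David) / ratio_b(counts$Julia))

# Side-by-side comparison; the offset column holds one repeated value
comparison <- data.frame(
  word = counts$word,
  logratio_a,
  logratio_b,
  offset = logratio_a - logratio_b
)
print(comparison)
```

Both versions share the per-word term `log((David + 1) / (Julia + 1))`; only the term built from the column totals changes, which is why the offset is identical on every row.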

2)

Here, word_total is counting up how many times both users have used any given word, together. Since we are modeling to find words that are changing in prevalence for either one of us, I'd argue it makes sense to filter for the overall prevalence. This is a data cleaning/prep step, and I could see that another analyst might make a different choice.

3) and 4)

These are straight up errors! Now fixed in commit d0194a7883213ad2ac07b4a88e3fa15b0ef29f12. Thanks so much for helping us improve the book.
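A small sketch of why `sum(retweets)` over-counts in chunks 3 and 4, assuming (as in the book's tidy format) that the data has one row per word per tweet, so each tweet's retweet count is repeated on every row for that tweet. The `toy_tweets` data is invented for illustration, and `first()` is the alternative suggested in this issue, not necessarily what the fixing commit used. The `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data, invented for illustration: one row per word per tweet,
# so the per-tweet retweet count repeats on each of that tweet's rows.
toy_tweets <- tibble(
  person   = c("David", "David", "David", "Julia", "Julia"),
  id       = c(1, 1, 2, 3, 3),
  word     = c("tidy", "text", "rstats", "book", "mining"),
  retweets = c(5, 5, 2, 7, 7)
)

# sum() adds the repeated value once per word, inflating the total
totals_sum <- toy_tweets %>%
  group_by(person, id) %>%
  summarise(rts = sum(retweets), .groups = "drop") %>%
  group_by(person) %>%
  summarise(total_rts = sum(rts))

# first() takes each tweet's retweet count exactly once
totals_first <- toy_tweets %>%
  group_by(person, id) %>%
  summarise(rts = first(retweets), .groups = "drop") %>%
  group_by(person) %>%
  summarise(total_rts = sum(rts))

print(totals_sum)
print(totals_first)
```

In this toy data David's tweets have 5 and 2 retweets, so his true total is 7; the `sum()` version counts the two-word tweet twice.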

yuwen41200 commented 6 years ago

Thanks for your reply! Regarding the 2nd chunk, the text before the chunk says "After that, we add columns to the data frame for [...] and the total number of times each word was used by each person." and the text after it says "Each row in this data frame corresponds to one person using one word in a given time bin. [...] the word_total column tells us how many times that person used that word over the whole year." Since the code computes word_total across both people, this may confuse readers.

juliasilge commented 6 years ago

Ah, you know what? This is because of a change in how dplyr handles grouping, when you group several times in a pipe. I have adjusted the code and text to match now. Thanks again for all your contributions!
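For readers following along, here is a minimal sketch of how the second `group_by()` in chunk 2 behaves in current dplyr, where a new `group_by()` call replaces any existing grouping by default. The `toy_counts` data is invented for illustration; the sketch only demonstrates present-day behaviour and makes no claim about what earlier dplyr versions did.

```r
library(dplyr)

# Toy counts, invented for illustration: one word, two people, two months
toy_counts <- tibble(
  time_floor = c("2016-01", "2016-01", "2016-02", "2016-02"),
  person     = c("David", "Julia", "David", "Julia"),
  word       = "rstats",
  n          = c(3, 1, 2, 5)
)

# group_by(word) replaces the (person, time_floor) grouping,
# so word_total is summed over both people
by_word <- toy_counts %>%
  group_by(person, time_floor) %>%
  mutate(time_total = sum(n)) %>%
  group_by(word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup()

# group_by(person, word) gives a per-person total for each word,
# matching the description quoted from the book's text
by_person_word <- toy_counts %>%
  group_by(person, time_floor) %>%
  mutate(time_total = sum(n)) %>%
  group_by(person, word) %>%
  mutate(word_total = sum(n)) %>%
  ungroup()

print(by_word)
print(by_person_word)
```

In dplyr 1.0.0 and later, `group_by(word, .add = TRUE)` would instead add `word` to the existing groups rather than replacing them.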