dgrtwo / tidy-text-mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson
http://tidytextmining.com

Comparing word frequencies 3 ways #79

Closed by GabriellaS-K 3 years ago

GabriellaS-K commented 3 years ago

Hi,

I was wondering if there is a way to compare word frequencies three ways. Section 1.5 of your book walks through calculating the frequency of each word in the works of Jane Austen, the Brontë sisters, and H.G. Wells by binding the data frames together. After the final gather step, the plot compares Austen to the Brontës and Austen to Wells. Is there a way to also compare the Brontës to Wells, so that the result is three plots together rather than two?

Pasted your code from the book below. Thanks for such an amazing resource!


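library(dplyr)
library(stringr)
library(tidyr)

# tidy_bronte, tidy_hgwells, and tidy_books are the one-word-per-row
# data frames built earlier in the chapter
frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
                       mutate(tidy_hgwells, author = "H.G. Wells"),
                       mutate(tidy_books, author = "Jane Austen")) %>%
  # keep only the letters and apostrophes of each token
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  # share of each author's words accounted for by this word
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  # one column per author, then stack the two non-Austen columns again
  spread(author, proportion) %>%
  gather(author, proportion, `Brontë Sisters`:`H.G. Wells`)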

Now let’s plot (Figure 1.3).


library(ggplot2)
library(scales)

# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = `Jane Austen`, color = abs(`Jane Austen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "Jane Austen", x = NULL)

juliasilge commented 3 years ago

My favorite way to do this now is using the tidylo package. You can check out the vignette, and also this blog post. For example, check out this plot comparing four groups:

[Image: plot comparing four groups]

This does look at log odds, rather than word frequencies, but it is pretty nice.
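A minimal sketch of what the tidylo approach could look like, assuming a table with one row per author and word. The `word_counts` data below are made-up toy numbers, not counts from the book; `bind_log_odds()` is the tidylo function that adds a `log_odds_weighted` column for each word within each group.

```r
library(dplyr)
library(tibble)
library(tidylo)

# Hypothetical toy counts (illustration only, not data from the book)
word_counts <- tribble(
  ~author,          ~word,     ~n,
  "Jane Austen",    "miss",    40,
  "Jane Austen",    "time",    30,
  "Jane Austen",    "door",     5,
  "Brontë Sisters", "miss",    20,
  "Brontë Sisters", "time",    35,
  "Brontë Sisters", "door",    15,
  "H.G. Wells",     "miss",     2,
  "H.G. Wells",     "time",    45,
  "H.G. Wells",     "door",    20
)

# Weighted log odds of each word within each author,
# measured against the other authors taken together
log_odds <- word_counts %>%
  bind_log_odds(set = author, feature = word, n = n)

# The most characteristic word per author
top_words <- log_odds %>%
  group_by(author) %>%
  slice_max(log_odds_weighted, n = 1) %>%
  ungroup()

top_words
```

Because every author is scored against all the others at once, this scales to any number of groups without building pairwise plots.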

GabriellaS-K commented 3 years ago

Oh, wonderful. I will look at this! Thank you so much.
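For the frequency-based comparison originally asked about, one possible sketch is to build every pair of authors explicitly and facet on the pair. This is not the book's method. The `proportions` table below uses made-up toy values in place of the per-author proportions computed before the spread and gather steps, and the names `wide`, `author_pairs`, `pairwise`, and `comparison` are illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)

# Hypothetical toy proportions (illustration only, not data from the book)
proportions <- tibble(
  author = rep(c("Brontë Sisters", "H.G. Wells", "Jane Austen"), each = 3),
  word = rep(c("time", "miss", "door"), times = 3),
  proportion = c(0.004, 0.002, 0.001,
                 0.005, 0.0002, 0.002,
                 0.003, 0.006, 0.0008)
)

# One column per author, one row per word
wide <- proportions %>%
  pivot_wider(names_from = author, values_from = proportion)

# All unordered author pairs: three authors give three pairs
author_pairs <- combn(sort(unique(proportions$author)), 2, simplify = FALSE)

# Stack one block of rows per pair, labelled by the comparison
pairwise <- bind_rows(lapply(author_pairs, function(p) {
  tibble(
    comparison = paste(p[1], "vs.", p[2]),
    word = wide$word,
    x = wide[[p[1]]],
    y = wide[[p[2]]]
  )
}))

comparison_plot <- ggplot(pairwise, aes(x = x, y = y)) +
  geom_abline(color = "gray40", lty = 2) +
  geom_point(alpha = 0.3, size = 2.5) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  facet_wrap(~comparison, ncol = 3) +
  labs(x = "First author in panel title", y = "Second author in panel title")

comparison_plot
```

With the real data, words missing for one author in a pair would have `NA` proportions and be dropped from that panel with a warning, as in the original plot.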