dgrtwo / tidy-text-mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson
http://tidytextmining.com

different output for cor.test() in 1.5 #109

Closed nataliegnelson closed 5 months ago

nataliegnelson commented 1 year ago

I believe the output for cor.test() in section 1.5 might need to be updated. When I run:

cor.test(data = frequency[frequency$author == "Brontë Sisters",],
         ~ proportion + `Jane Austen`)

I get the following output:

Pearson's product-moment correlation

data:  proportion and Jane Austen
t = 111.06, df = 10346, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7285370 0.7461189
sample estimates:
      cor 
0.7374529

juliasilge commented 1 year ago

Are you using the datasets saved in this repo, or downloading the Brontë texts live from Project Gutenberg? I believe that if you use the datasets we downloaded and stored at publication time, you will get the same results as in the book. The formatting of these books on Project Gutenberg occasionally changes a bit, which can shift the word counts.

juliasilge commented 1 year ago

If you clone the repo, you can do this:

library(tidyverse)
library(tidytext)
library(janeaustenr)

data(stop_words)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(
           text, 
           regex("^chapter [\\divxlc]", ignore_case = TRUE))
         )) %>%
  ungroup()

original_books
#> # A tibble: 73,422 × 4
#>    text                    book                linenumber chapter
#>    <chr>                   <fct>                    <int>   <int>
#>  1 "SENSE AND SENSIBILITY" Sense & Sensibility          1       0
#>  2 ""                      Sense & Sensibility          2       0
#>  3 "by Jane Austen"        Sense & Sensibility          3       0
#>  4 ""                      Sense & Sensibility          4       0
#>  5 "(1811)"                Sense & Sensibility          5       0
#>  6 ""                      Sense & Sensibility          6       0
#>  7 ""                      Sense & Sensibility          7       0
#>  8 ""                      Sense & Sensibility          8       0
#>  9 ""                      Sense & Sensibility          9       0
#> 10 "CHAPTER 1"             Sense & Sensibility         10       1
#> # … with 73,412 more rows

tidy_books <- original_books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
#> Joining, by = "word"

## from root of repo directory:
load("data/bronte.rda")

tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
#> Joining, by = "word"

frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
                       mutate(tidy_books, author = "Jane Austen")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(`Brontë Sisters`,
               names_to = "author", values_to = "proportion")

cor.test(data = frequency[frequency$author == "Brontë Sisters",],
         ~ proportion + `Jane Austen`)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  proportion and Jane Austen
#> t = 119.64, df = 10404, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.7527837 0.7689611
#> sample estimates:
#>       cor 
#> 0.7609907
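As a rough sanity check on the two cor.test() outputs in this thread, the sketch below recomputes each t statistic from its printed correlation and degrees of freedom, then compares the number of word pairs behind each test. It relies only on the standard relations for Pearson's correlation (t = r * sqrt(df) / sqrt(1 - r^2) and df = n - 2). The names `live`, `stored`, and `t_from_r` are made up for illustration, and labelling the first output as the live download is an assumption based on the question above, not something confirmed in the thread.

```r
# Values copied from the two cor.test() outputs in this thread.
# Assumption: the first output came from a live Project Gutenberg download.
live   <- list(r = 0.7374529, df = 10346, t = 111.06)
stored <- list(r = 0.7609907, df = 10404, t = 119.64)  # from data/bronte.rda

# For Pearson's correlation: t = r * sqrt(df) / sqrt(1 - r^2), with df = n - 2.
t_from_r <- function(r, df) r * sqrt(df) / sqrt(1 - r^2)

t_live   <- t_from_r(live$r, live$df)
t_stored <- t_from_r(stored$r, stored$df)

# Each printed t statistic agrees with its own r and df (up to rounding),
# so both outputs are internally consistent.
stopifnot(abs(t_live - live$t) < 0.01,
          abs(t_stored - stored$t) < 0.01)

# Number of words with a proportion for both authors in each run.
n_live   <- live$df + 2
n_stored <- stored$df + 2
n_stored - n_live  # 58
```

Both outputs are internally consistent, and the stored data yields 58 more shared words than the other run. That points to the input texts differing between the two runs rather than to a problem in the cor.test() call itself.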

juliasilge commented 5 months ago

Let us know if you have further questions! 🙌