different output for cor.test() in 1.5

nataliegnelson commented 2 years ago

I believe the output for cor.test() in section 1.5 might need to be updated. When I run:

cor.test(data = frequency[frequency$author == "Brontë Sisters",],
         ~ proportion + `Jane Austen`)

I get the following output:

Pearson's product-moment correlation

data:  proportion and Jane Austen
t = 111.06, df = 10346, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7285370 0.7461189
sample estimates:
      cor 
0.7374529

juliasilge commented 2 years ago

Are you using the datasets that are saved in this repo, or downloading the Brontë texts live from Project Gutenberg? I believe that if you use the datasets we downloaded and stored at publication time, you will get the same results as in the book. The way these books are formatted on Project Gutenberg occasionally changes a bit, which changes the word counts.
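
If you want to see how a live download differs from the stored copy, one option is to compare the distinct words from each source. This is only a rough base R sketch on made-up toy vectors: `compare_vocab()`, `stored_words`, and `live_words` are hypothetical names for illustration, not anything from the book or this repo.

```r
# Hypothetical helper: report which distinct words appear in only one source.
compare_vocab <- function(stored_words, live_words) {
  stored <- unique(stored_words)
  live <- unique(live_words)
  list(
    only_stored = setdiff(stored, live),          # words missing from the live copy
    only_live   = setdiff(live, stored),          # words new in the live copy
    n_shared    = length(intersect(stored, live)) # distinct words in both
  )
}

# Toy data standing in for the `word` column from each source (not real output).
stored_words <- c("moor", "heath", "governess", "moor")
live_words   <- c("moor", "heath", "gutenberg")

diffs <- compare_vocab(stored_words, live_words)
diffs$only_stored
diffs$only_live
diffs$n_shared
```

With real data you would pass in the `word` column of the tokenized data frame from each source instead of the toy vectors.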

juliasilge commented 2 years ago

If you clone the repo, you can do this:

library(tidyverse)
library(tidytext)
library(janeaustenr)

data(stop_words)

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(
           text, 
           regex("^chapter [\\divxlc]", ignore_case = TRUE))
         )) %>%
  ungroup()

original_books
#> # A tibble: 73,422 × 4
#>    text                    book                linenumber chapter
#>    <chr>                   <fct>                    <int>   <int>
#>  1 "SENSE AND SENSIBILITY" Sense & Sensibility          1       0
#>  2 ""                      Sense & Sensibility          2       0
#>  3 "by Jane Austen"        Sense & Sensibility          3       0
#>  4 ""                      Sense & Sensibility          4       0
#>  5 "(1811)"                Sense & Sensibility          5       0
#>  6 ""                      Sense & Sensibility          6       0
#>  7 ""                      Sense & Sensibility          7       0
#>  8 ""                      Sense & Sensibility          8       0
#>  9 ""                      Sense & Sensibility          9       0
#> 10 "CHAPTER 1"             Sense & Sensibility         10       1
#> # … with 73,412 more rows

tidy_books <- original_books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
#> Joining, by = "word"

## from root of repo directory:
load("data/bronte.rda")

tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
#> Joining, by = "word"

frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
                       mutate(tidy_books, author = "Jane Austen")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(`Brontë Sisters`,
               names_to = "author", values_to = "proportion")

cor.test(data = frequency[frequency$author == "Brontë Sisters",],
         ~ proportion + `Jane Austen`)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  proportion and Jane Austen
#> t = 119.64, df = 10404, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.7527837 0.7689611
#> sample estimates:
#>       cor 
#> 0.7609907
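
As a sanity check, the two outputs in this thread can be compared from the printed numbers alone. For a Pearson correlation test, df = n - 2 and t = r * sqrt(df / (1 - r^2)), so a different df means a different number of words went into the correlation. A small base R sketch, where `t_from_r()` is a made-up helper name and the inputs are copied from the two outputs above:

```r
# Hypothetical helper: recover the t statistic from r and the degrees of freedom.
t_from_r <- function(r, df) r * sqrt(df / (1 - r^2))

live   <- list(r = 0.7374529, df = 10346)  # output reported in the first comment
stored <- list(r = 0.7609907, df = 10404)  # output from the stored data/bronte.rda

round(t_from_r(live$r, live$df), 2)
#> [1] 111.06
round(t_from_r(stored$r, stored$df), 2)
#> [1] 119.64

# n = df + 2 word pairs in each test; the stored data has this many more
(stored$df + 2) - (live$df + 2)
#> [1] 58
```

So both outputs are internally consistent, and the gap comes from the underlying text: the stored data yields 58 more words that enter the test.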

juliasilge commented 10 months ago

Let us know if you have further questions! 🙌

dgrtwo / tidy-text-mining #109