Closed nataliegnelson closed 10 months ago
Are you using the datasets saved in this repo, or downloading the Brontë texts live from Project Gutenberg? I believe that if you use the datasets we downloaded and stored at publication time, you will get the same results. The formatting of these books on Project Gutenberg occasionally changes a bit.
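To illustrate why a small formatting change matters, here is a minimal base R sketch on toy text (not the real Gutenberg files). The helper `word_proportions()` and the extra "Produced by volunteers" line are assumptions made up for this example. The tokenizer is a rough stand-in for `unnest_tokens()`, not the same algorithm.

```r
# Hypothetical helper: split lines into lowercase words and return each
# word's share of the total word count.
word_proportions <- function(lines) {
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[nzchar(words)]  # drop empty strings left by leading separators
  counts <- table(words)
  setNames(as.vector(counts) / sum(counts), names(counts))
}

# Toy "stored" text, and a "live" copy with one invented front-matter line.
stored <- c("Chapter 1", "There was no possibility of taking a walk that day.")
live <- c("Produced by volunteers", stored)

p_stored <- word_proportions(stored)
p_live <- word_proportions(live)

# The same word gets a different proportion once extra lines are tokenized:
# 1 of 11 words in the stored copy, 1 of 14 in the live copy.
p_stored[["walk"]]
p_live[["walk"]]
```

Every proportion shifts when the denominator changes, so the correlation computed from those proportions can shift too.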
If you clone the repo, you can do this:
library(tidyverse)
library(tidytext)
library(janeaustenr)
data(stop_words)
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(
           text,
           regex("^chapter [\\divxlc]", ignore_case = TRUE))
         )) %>%
  ungroup()
original_books
#> # A tibble: 73,422 × 4
#>    text                    book                linenumber chapter
#>    <chr>                   <fct>                    <int>   <int>
#>  1 "SENSE AND SENSIBILITY" Sense & Sensibility          1       0
#>  2 ""                      Sense & Sensibility          2       0
#>  3 "by Jane Austen"        Sense & Sensibility          3       0
#>  4 ""                      Sense & Sensibility          4       0
#>  5 "(1811)"                Sense & Sensibility          5       0
#>  6 ""                      Sense & Sensibility          6       0
#>  7 ""                      Sense & Sensibility          7       0
#>  8 ""                      Sense & Sensibility          8       0
#>  9 ""                      Sense & Sensibility          9       0
#> 10 "CHAPTER 1"             Sense & Sensibility         10       1
#> # … with 73,412 more rows
tidy_books <- original_books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
#> Joining, by = "word"
## from root of repo directory:
load("data/bronte.rda")
tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
#> Joining, by = "word"
frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
                       mutate(tidy_books, author = "Jane Austen")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  pivot_wider(names_from = author, values_from = proportion) %>%
  pivot_longer(`Brontë Sisters`,
               names_to = "author", values_to = "proportion")
cor.test(data = frequency[frequency$author == "Brontë Sisters",],
         ~ proportion + `Jane Austen`)
#>
#>  Pearson's product-moment correlation
#>
#> data:  proportion and Jane Austen
#> t = 119.64, df = 10404, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.7527837 0.7689611
#> sample estimates:
#>       cor
#> 0.7609907
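One thing worth comparing against your own output is the `df` value. After `pivot_wider()`, a word used by only one author has `NA` for the other, and the formula interface of `cor.test()` drops those rows under R's default `na.action`. Pearson's test has n - 2 degrees of freedom, so `df = 10404` above means 10,406 words had a proportion for both authors. A different `df` on your side would suggest a different vocabulary, which points to different source text. A minimal sketch on made-up numbers (the `toy` data frame and its values are invented for illustration):

```r
# Toy word proportions; NA marks a word absent from one author's texts.
toy <- data.frame(
  word = c("time", "miss", "day", "heart", "house", "moor", "ball"),
  proportion = c(0.010, 0.008, 0.006, 0.005, 0.004, 0.002, NA),
  austen = c(0.009, 0.012, 0.005, 0.003, 0.006, NA, 0.004)
)

# Same formula interface as above; rows with an NA are dropped
# (assuming the default na.action, na.omit).
result <- cor.test(~ proportion + austen, data = toy)

# Count the rows that have both proportions.
n_complete <- sum(complete.cases(toy[, c("proportion", "austen")]))

# Degrees of freedom equal complete pairs minus 2 (here 5 - 2 = 3).
unname(result$parameter) == n_complete - 2
```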
Let us know if you have further questions! 🙌
I believe the output for cor.test() in section 1.5 might need to be updated. When I run:
I get the following output: