dgrtwo / tidy-text-mining

Manuscript of the book "Tidy Text Mining with R" by Julia Silge and David Robinson
http://tidytextmining.com

Chapter 1 missing introduction on getting self-generated texts into R #75

Open phish108 opened 4 years ago

phish108 commented 4 years ago

Hi David and Julia

Thank you for this nice resource. This year I am using it with students for the first time. I encountered one blind spot in chapter 1: importing self-generated text data into R. I know this is trivial if you know how to think in code and R, but for the book's audience this might not be a valid assumption. Therefore, I am missing a section on organizing and reading self-generated text data into the working environment, in addition to working with Jane Austen's books and the Gutenberg dataset.

I suggest that my students organize their texts (e.g. from interviews) into separate text files in a sub-directory before loading them into R. So I would really like to find something along the lines of the following boilerplate in the book.

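library(tidyverse)  # loads tibble, dplyr, stringr and readr, plus the %>% pipe

textDirectory <- "my_own_texts"

# List all .txt files, then read each file into its own row
list.files(textDirectory, "\\.txt$") %>%
    tibble(textfile = . ) %>%
    mutate(textid = row_number()) %>%
    group_by(textfile) %>%    # one group per file, so read_file() gets a single path
    mutate(
        text = str_c(textDirectory, textfile, sep = "/") %>% read_file()
    ) %>%
    ungroup() -> text_df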
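
A self-contained variant for trying this out without any files of your own might look like the sketch below. It is only an illustration under a few assumptions: the two sample files and their names (`interview_01.txt`, `interview_02.txt`) are made up, they are written to a temporary directory, and `purrr::map_chr()` is swapped in for the `group_by()` step to read one file per row. The final step uses `unnest_tokens()` from tidytext to get the one-word-per-row format that chapter 1 works with.

```r
library(dplyr)
library(purrr)
library(readr)
library(tibble)
library(tidytext)

# Hypothetical sample data: two short "interview" files in a temporary directory
text_directory <- file.path(tempdir(), "my_own_texts")
dir.create(text_directory, showWarnings = FALSE)
write_file("The first interview was short.",
           file.path(text_directory, "interview_01.txt"))
write_file("The second interview ran longer than the first.",
           file.path(text_directory, "interview_02.txt"))

# One row per file: file name, running id, and the full file contents
text_df <- tibble(textfile = list.files(text_directory, pattern = "\\.txt$")) %>%
    mutate(
        textid = row_number(),
        text = map_chr(file.path(text_directory, textfile), read_file)
    )

# One row per word, keeping textfile and textid so words can be traced to their source
tidy_texts <- text_df %>%
    unnest_tokens(word, text)
```

Replacing `text_directory` with the path to a real folder of `.txt` files (and dropping the two `write_file()` calls) should be all that is needed to apply the same steps to one's own texts.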

I think that a brief section on this little topic would make a great addition to chapter 1. It would offer readers with beginner's knowledge of working with self-generated unstructured data in R a nice way to put the concepts into practice.

juliasilge commented 4 years ago

I wonder if we could find a set of files to demonstrate this that could also substitute for the no-longer-functional data problem in #62.

phish108 commented 4 years ago

What about keeping a copy of a few Project Gutenberg books in a separate repo so that they stay stable?

For example, books 5682 to 5684 are sufficiently hard to read that one is better off analysing them.