preprocessing omission in sample code 6.2 for "The War of the Worlds"

XueWenSYan commented 2 years ago

Hi, perhaps the [gutenbergr] sources have changed since the code in Chapter 6.2 was posted. Chapters in 'The War of the Worlds' aren't introduced by a heading starting with something like 'Chapter'; they are marked with Roman numerals (I., II., etc.). As a result, the code in the book doesn't raise an error, but it only identifies chapters for the other three books in the example.

I tried something like the code below to identify the chapters. One general issue here is that we should inspect the data before preprocessing and before proceeding with the analysis. The book is great at showing applications of the available packages, but the examples assume some prior knowledge of the structure of the text data (e.g., knowing that some lines start with 'chapter/Chapter/ Chapter/ CHAPTER' etc. and can be used to separate the chapters; small details, like whether there is a space before the word Chapter, also matter). In practice, it is usually this kind of nitty-gritty contextual knowledge that separates successful from erroneous text processing. The book does an excellent job of dealing with preprocessing in more depth in the case studies towards the end. It might be even more helpful to also have some content in the early chapters on the importance of getting to know your data (either a few lines of warnings or comments, or a dedicated section).

books %>%
  filter(title == 'The War of the Worlds') %>%
  # heading = a line holding only a Roman numeral plus a literal period (hence \\.)
  mutate(chapter = cumsum(str_detect(text, regex('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\\.$'))))
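
One way to catch this kind of silent miss is to count, per book, how many lines the heading pattern actually matches before relying on it. A minimal sketch, assuming a `books` data frame with `title` and `text` columns as in the snippet above; the toy rows and the `^chapter ` pattern are illustrative stand-ins, not the real texts:

```r
library(dplyr)
library(stringr)

# Toy stand-in for `books` (hypothetical lines, not the real Gutenberg texts)
books <- data.frame(
  title = c(rep("Book A", 4), rep("Book B", 4)),
  text  = c("CHAPTER I", "Some text.", "CHAPTER II", "More text.",
            "I.", "Some text.", "II.", "More text."),
  stringsAsFactors = FALSE
)

# Count the lines that a "chapter"-style pattern recognises in each book
chapter_counts <- books %>%
  group_by(title) %>%
  summarise(
    n_headings = sum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))
  )

chapter_counts
```

A book that reports zero headings, like "Book B" here, is a sign that its chapters are marked differently and the pattern needs adjusting.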

aliaamiri commented 2 years ago

I totally agree. But a simpler chunk of code worked for me to solve this problem:

books %>%
  filter(title == "The War of the Worlds") %>%
  mutate(chapter = cumsum(str_detect(text, "^[IVX]+\\.$")))
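
To see what this shorter pattern accepts and rejects, it can be tried on a few hand-written lines first. The sample strings below are made up for illustration and are not taken from the Gutenberg text:

```r
library(stringr)

# Hypothetical candidate lines: two bare numerals, a numeral followed by
# a title, an ordinary sentence, and a numeral without a period
candidates <- c("I.", "XIV.", "I. THE EVE OF THE WAR", "It was.", "IV")

# TRUE only when the whole line is made of I/V/X characters plus a period
matches <- str_detect(candidates, "^[IVX]+\\.$")

matches
# TRUE TRUE FALSE FALSE FALSE
```

Because the character class only contains I, V and X, numerals that need L or C (40 and above) would be missed, and malformed strings such as `IIII.` are accepted. That trade-off seems reasonable when each run of chapter numbers stays below 40.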

juliasilge commented 2 years ago

This is related to #85

For the book itself, we use a version of these texts that we downloaded at a certain point in time and saved. We did that because there are often changes like this in resources from the internet.

If you would like to step through the code exactly as in the book, I suggest cloning the repo locally and using the data files we saved: https://github.com/dgrtwo/tidy-text-mining/tree/master/data
If you would like to use updated texts from Project Gutenberg, then yep, you'll need to adjust the regex.
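
For the second route, one possible adjustment is to accept either heading style when numbering chapters. This is only a sketch: `is_heading()` is a hypothetical helper, the toy data stands in for the downloaded books, and the heading formats in the current Project Gutenberg files are an assumption that should be checked against the actual text.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: TRUE for "Chapter ..." lines (any case) or for
# lines that contain only an upper-case Roman numeral and a period
is_heading <- function(text) {
  str_detect(text, regex("^chapter ", ignore_case = TRUE)) |
    str_detect(text, "^[IVXLC]+\\.$")
}

# Toy stand-in for `books` (made-up lines, not the real texts)
books <- data.frame(
  title = c(rep("Book A", 4), rep("Book B", 4)),
  text  = c("CHAPTER I", "Some text.", "CHAPTER II", "More text.",
            "I.", "Some text.", "II.", "More text."),
  stringsAsFactors = FALSE
)

# Number chapters within each book using the combined test
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(is_heading(text))) %>%
  ungroup()

by_chapter
```

The Roman-numeral test is kept case-sensitive on purpose, so that an ordinary one-word line such as "Civil." is not counted as a heading.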

aliaamiri commented 2 years ago

Thank you for your enlightening comment 🙌.

juliasilge commented 2 years ago

Let us know if you have further questions!

dgrtwo / tidy-text-mining

preprocessing omission in sample code 6.2 for "The War of the Worlds" #102