STAT325-S24 / HistoryAmherstCollege

Text and analysis related to William S. Tyler's "History of Amherst College" (1873)
MIT License

add error-checking to dehyphenator (was: deal with hyphens in wrangled text) #2

nicholasjhorton closed 7 months ago

nicholasjhorton commented 8 months ago

This issue will be closed when the workflow supports wrangling text to remove hyphens, e.g.:

on the subject is narrated at the opening of the chapter touch- 
ing the Jubilee, and may be found at page 595. The failure 

becomes

on the subject is narrated at the opening of the chapter touching
the Jubilee, and may be found at page 595. The failure 
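
A minimal sketch of the transformation shown above, in base R. The name `dehyphenate_sketch` is hypothetical and this is not the code in data-raw/dehyphenate.R; it assumes every trailing hyphen marks a word split across lines, so a genuine compound that happens to break at its hyphen would be joined incorrectly.

```r
# Hypothetical helper (not the repository's dehyphenate function):
# move the first word of the next line up to complete a word that
# ends in a hyphen at the end of the current line.
dehyphenate_sketch <- function(lines) {
  lines <- trimws(lines)
  i <- 1
  while (i < length(lines)) {
    if (grepl("-$", lines[i])) {
      # First whitespace-delimited word of the next line, and what follows it.
      first_word <- sub("^(\\S+).*$", "\\1", lines[i + 1])
      remainder <- sub("^\\S+\\s*", "", lines[i + 1])
      lines[i] <- paste0(sub("-$", "", lines[i]), first_word)
      lines[i + 1] <- remainder
    }
    i <- i + 1
  }
  lines
}

example <- c(
  "on the subject is narrated at the opening of the chapter touch- ",
  "ing the Jubilee, and may be found at page 595. The failure "
)
dehyphenate_sketch(example)
```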

arogers24 commented 7 months ago

The function in data-raw/dehyphenate.R mostly works. The one caveat is that it trims the whitespace from each line read from the PDF. I think this is fine, since those spaces are unnecessary. If we want to paste the lines together later, we can add a " " separator between them.
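
As a small illustration of the trimming and the " " separator, here is a base R sketch. The input strings are made up for the example and are not taken from the repository's files.

```r
# Lines as they might come out of the PDF reader (illustrative values).
raw_lines <- c("  touching the Jubilee, and may  ", "be found at page 595. ")

trimmed <- trimws(raw_lines)               # drop leading and trailing whitespace
joined <- paste(trimmed, collapse = " ")   # rejoin the lines with a single space
joined
```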

nicholasjhorton commented 7 months ago

That's great news.

Let's start to explore (in this issue) some of the test cases that we will want this to be able to support. (It may be helpful to use the testthat package to formalize this process.)
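
One way such test cases might be written with testthat is sketched below. `dehyphenate_stub` is a hypothetical stand-in so that the block runs on its own; in practice the tests would call the real function from data-raw/dehyphenate.R, whose interface may differ.

```r
library(testthat)

# Hypothetical stand-in for the function under test.
dehyphenate_stub <- function(lines) {
  lines <- trimws(lines)
  for (i in seq_len(max(length(lines) - 1, 0))) {
    if (grepl("-$", lines[i])) {
      # Complete the split word, then drop that word from the next line.
      lines[i] <- paste0(sub("-$", "", lines[i]),
                         sub("^(\\S+).*$", "\\1", lines[i + 1]))
      lines[i + 1] <- sub("^\\S+\\s*", "", lines[i + 1])
    }
  }
  lines
}

test_that("a word split across two lines is rejoined", {
  out <- dehyphenate_stub(c("the chapter touch- ", "ing the Jubilee"))
  expect_equal(out, c("the chapter touching", "the Jubilee"))
})

test_that("lines without a trailing hyphen are only trimmed", {
  out <- dehyphenate_stub(c("no hyphen here ", "next line"))
  expect_equal(out, c("no hyphen here", "next line"))
})

test_that("a hyphen on the final line is left alone", {
  expect_equal(dehyphenate_stub("ends with touch-"), "ends with touch-")
})
```

Further cases worth adding might include a hyphen at the end of a page and blank lines between the hyphen and its continuation.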

nicholasjhorton commented 7 months ago

See https://github.com/STAT325-S24/HistoryAmherstCollege/commit/a1b6dab7268bb6424707a657d48fc35adf2104b2 and https://github.com/STAT325-S24/HistoryAmherstCollege/tree/main/data-raw-dehyphenate for the proposed folder for the de-hyphenated text.

@Casey308 might you be willing to stub out a Quarto file (and associated PDF) which reads from https://github.com/STAT325-S24/HistoryAmherstCollege/tree/main/data-raw-dehyphenate and creates a tibble of the text in data? If you could start by opening up a new issue that would be most helpful.
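
A rough sketch of what the reading step in such a file could look like. The function name `read_chapters`, the `.txt` pattern, and the column names are assumptions for illustration; the demo writes two small files to a temporary folder rather than reading the real data-raw-dehyphenate directory.

```r
library(tibble)

# Hypothetical helper: one row per line of text, across all .txt files in dir.
read_chapters <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  lines_by_file <- lapply(files, readLines, warn = FALSE)
  n <- lengths(lines_by_file)
  tibble(
    chapter = rep(basename(files), times = n),   # source file for each line
    line_number = sequence(n),                   # line position within its file
    text = as.character(unlist(lines_by_file))
  )
}

# Demo on a temporary folder standing in for data-raw-dehyphenate.
demo_dir <- file.path(tempdir(), "dehyphenate-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "chapter01.txt"))
writeLines("only line", file.path(demo_dir, "chapter02.txt"))

chapters <- read_chapters(demo_dir)
chapters
```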

Casey308 commented 7 months ago

See #24 for further work on this quarto file.

nicholasjhorton commented 7 months ago

I look forward to reviewing the workflow in class today (see also #24). My hope is that it could be run in real time based on some instructions (to be documented?) as a way for us to confirm that all is sussed.

arogers24 commented 7 months ago

I committed a bunch of dehyphenated chapters to the data-raw-dehyphenate folder. There are still some bugs from the depaginate function in chapters 16, 17, and 18, so I'm waiting on those fixes to write the rest.

nicholasjhorton commented 7 months ago

I tried to move this into the workflow (see https://github.com/STAT325-S24/HistoryAmherstCollege/commit/a07879b51ad5c7bec6c8159bb021e4aaf6c2a61c) but it generated the following error message:

6/7 [unnamed-chunk-3]

Quitting from lines 38-53 [unnamed-chunk-3] (02_dehyphenate.qmd)
Error in `if (str_detect(chapter_lines[i], "-$")) ...`:
! missing value where TRUE/FALSE needed
Backtrace:
 1. global fix_up_lines(chapter_name)
Execution halted

Is this the error that @arogers24 was mentioning?

See https://github.com/STAT325-S24/HistoryAmherstCollege/blob/main/data-raw/02_dehyphenate.qmd

arogers24 commented 7 months ago

@nicholasjhorton Yes, that's the error I was seeing. It's usually been due to pages ending with a hyphen. Something in the depagination wasn't working (> 5 lines I think?), so there is an `NA` in the line and the function fails.

nicholasjhorton commented 7 months ago

Sorry for the complications.

Can you point out a specific example (and add a screenshot of either the processed text or ideally the original pdf)?

arogers24 commented 7 months ago

@nicholasjhorton Here is an example from chapter 16:

Here is the original text in chapter06.txt:

[screenshot: ch16]

Here is the depaginated text in `chapter06_cleaned.txt`:

[screenshot: ch16_cleaned]

There are two lines between the hyphen and the next text. We have only conditioned for one line between paragraphs.

nicholasjhorton commented 7 months ago

Very helpful, thanks!

@tknightly24 I see that there were five lines before the page subtitle.

Is there a list of other offending formatting of this sort that we'll want to extirpate?

nicholasjhorton commented 7 months ago

@tknightly24 @arogers24 @FranciscoJFM02 might one or more of you be able to update this issue with the current status of this work? I suspect that this is the main remaining block to finalize the text.

FranciscoJFM02 commented 7 months ago

I think a simple fix would be a while loop that skips over any of these blank lines. We will have to make sure that, in this case, those lines are "" and not " ".
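
A sketch of that while loop in base R. `next_text_line` is a hypothetical helper, not something in the repository; it uses `trimws()` so that "" and " " are both treated as blank, and it also skips `NA` entries.

```r
# Hypothetical helper: index of the next non-blank line after position i,
# or NA if no such line remains.
next_text_line <- function(lines, i) {
  j <- i + 1
  while (j <= length(lines) && (is.na(lines[j]) || trimws(lines[j]) == "")) {
    j <- j + 1   # skip "", whitespace-only lines, and NA
  }
  if (j > length(lines)) NA_integer_ else as.integer(j)
}

lines <- c("the chapter touch-", "", " ", "ing the Jubilee")
next_text_line(lines, 1)   # continuation is on line 4
next_text_line(lines, 4)   # nothing left after the last line
```

The dehyphenation step could then take its continuation from the returned index instead of always assuming `i + 1`.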

arogers24 commented 7 months ago

After parsing through each chapter, we've edited the original text to fit the expectations of the depaginate function, which in turn fits the expectations of the dehyphenate function. The latest commits are here: https://github.com/STAT325-S24/HistoryAmherstCollege/commit/1c85d184923017389b106f2bded4907244c5c15f

nicholasjhorton commented 7 months ago

To help address #49 can you please add some error checking to this routine? It's generating the following error message when I try to run it:

==> quarto preview 02_dehyphenate.qmd --to pdf --no-watch-inputs --no-browse

processing file: 02_dehyphenate.qmd
  |.............................................       |  86% [unnamed-chunk-3]
Quitting from lines 38-53 [unnamed-chunk-3] (02_dehyphenate.qmd)
Error in `if (str_detect(chapter_lines[i], "-$")) ...`:
! missing value where TRUE/FALSE needed
Backtrace:
 1. global fix_up_lines(chapter_name)

Execution halted
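
The message "missing value where TRUE/FALSE needed" is what R reports when the condition of an `if` is `NA`, and `str_detect()` returns `NA` for an `NA` input. One common way an `NA` line arises is indexing past the end of a vector. Below is a sketch of a check that could run before the loop; `check_chapter_lines` is a hypothetical name, not the repository's code.

```r
# Hypothetical validation step: fail early with a message that names the
# chapter and the offending line positions.
check_chapter_lines <- function(chapter_lines, chapter_name) {
  if (!is.character(chapter_lines)) {
    stop("Expected a character vector for ", chapter_name, call. = FALSE)
  }
  bad <- which(is.na(chapter_lines))
  if (length(bad) > 0) {
    stop(
      "NA found in ", chapter_name, " at line(s) ", paste(bad, collapse = ", "),
      "; check depagination around these lines.",
      call. = FALSE
    )
  }
  invisible(chapter_lines)
}

# Demo: capture the error message instead of halting.
result <- tryCatch(
  check_chapter_lines(c("touch-", NA, "ing the Jubilee"), "chapter16"),
  error = function(e) conditionMessage(e)
)
result
```

Wrapping the condition itself in `isTRUE()` would also stop the `if` from failing, but it would hide the problem rather than report where it is.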

arogers24 commented 7 months ago

@nicholasjhorton There seems to be a spot in Chapter 16 where the page number and title were not removed. I'll clear this up with Justin and let you know when we get this to work.

nicholasjhorton commented 7 months ago

Done: see #49