nicholasjhorton closed this issue 7 months ago
The function in data-raw/dehyphenate.R mostly works. The only caveat is that it trims the whitespace from each line read from the pdf. I think this is fine, since those spaces are unnecessary. If we want to paste the lines together, we can add a " " separator between lines.
That's great news.
Let's start to explore (in this issue) some of the test cases that we will want this to be able to support.
(It may be helpful to use the testthat package to formalize this process.)
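As a starting point for those test cases, here is a minimal testthat sketch. The `dehyphenate()` function below is a hypothetical stand-in written only for illustration; the real function in data-raw/dehyphenate.R may have a different name, signature, and behavior.

```r
library(testthat)

# Hypothetical stand-in for the function in data-raw/dehyphenate.R.
# It joins a word split as "frag-" / "ment" across two consecutive lines.
dehyphenate <- function(lines) {
  text <- paste(trimws(lines), collapse = "\n")
  # Remove a trailing hyphen together with the line break that follows it
  text <- gsub("-\n", "", text, fixed = TRUE)
  strsplit(text, "\n", fixed = TRUE)[[1]]
}

test_that("a word split across two lines is rejoined", {
  expect_equal(
    dehyphenate(c("the col-", "lege opened")),
    "the college opened"
  )
})

test_that("lines without a trailing hyphen are left alone", {
  expect_equal(
    dehyphenate(c("one line", "another line")),
    c("one line", "another line")
  )
})
```

Each new edge case we find in the chapters (for example, a hyphen at the end of a page) could become one more `test_that()` block.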
See https://github.com/STAT325-S24/HistoryAmherstCollege/commit/a1b6dab7268bb6424707a657d48fc35adf2104b2 and https://github.com/STAT325-S24/HistoryAmherstCollege/tree/main/data-raw-dehyphenate for the proposed folder for the de-hyphenated text.
@Casey308 might you be willing to stub out a quarto file (and associated pdf) which reads from https://github.com/STAT325-S24/HistoryAmherstCollege/tree/main/data-raw-dehyphenate and creates a tibble of the text in data? If you could start by opening up a new issue that would be most helpful.
See #24 for further work on this quarto file.
I look forward to reviewing the workflow in class today (see also #24). My hope is that it could be run in real time based on some instructions (to be documented?) as a way for us to confirm that all is sussed.
I committed a bunch of dehyphenated chapters to the data-raw-dehyphenate folder. There are still some bugs from the depaginate function in chapters 16, 17, and 18, so I'm waiting on those fixes to write the rest.
I tried to move this into the workflow (see https://github.com/STAT325-S24/HistoryAmherstCollege/commit/a07879b51ad5c7bec6c8159bb021e4aaf6c2a61c) but it generated the following error message:
6/7 [unnamed-chunk-3]
Quitting from lines 38-53 [unnamed-chunk-3] (02_dehyphenate.qmd)
Error in `if (str_detect(chapter_lines[i], "-$")) ...`:
! missing value where TRUE/FALSE needed
Backtrace:
1. global fix_up_lines(chapter_name)
Execution halted
Is this the error that @arogers24 was mentioning?
See https://github.com/STAT325-S24/HistoryAmherstCollege/blob/main/data-raw/02_dehyphenate.qmd
@nicholasjhorton Yes, that's the error I was seeing. It's usually been due to pages ending with a hyphen. Something in the depagination wasn't working (> 5 lines I think?), so there is an 'NA' in the line and the function fails
Sorry for the complications.
Can you point out a specific example (and add a screenshot of either the processed text or ideally the original pdf)?
@nicholasjhorton Here is an example from chapter 16:
Here is the original text in `chapter06.txt`:
Here is the depaginated text in `chapter06_cleaned.txt`:
There are two lines between the hyphen and the next text. We have only accounted for one line between paragraphs.
Very helpful, thanks!
@tknightly24 I see that there were five lines before the page subtitle.
Is there a list of other offending formatting of this sort that we'll want to extirpate?
@tknightly24 @arogers24 @FranciscoJFM02 might one or more of you be able to update this issue with the current status of this work? I suspect that this is the main remaining block to finalize the text.
I'm thinking a simple fix to this could be using a while loop to pass over any of these blank spaces. We will have to make sure that in this case, those lines are "" and not " ".
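A rough sketch of that while-loop idea, in base R. The name `dehyphenate_lines()` is hypothetical, and two choices here are assumptions rather than what the current function does: the whole next non-blank line is appended to the hyphenated line, and `NA` lines are treated as blank. Calling `trimws()` first turns " " into "", which covers the concern about the two kinds of blank line.

```r
# Hypothetical helper: rejoin words hyphenated across a line break,
# passing over any run of blank lines that follows the hyphen.
dehyphenate_lines <- function(lines) {
  lines <- trimws(lines)       # " " becomes "", so all blank lines look alike
  lines[is.na(lines)] <- ""    # assumption: treat missing lines as blank
  out <- character(0)
  n <- length(lines)
  i <- 1
  while (i <= n) {
    current <- lines[i]
    while (grepl("-$", current)) {
      j <- i + 1
      # Pass over any blank lines after the hyphen
      while (j <= n && lines[j] == "") j <- j + 1
      if (j > n) break         # hyphen at the very end: nothing to join
      current <- paste0(sub("-$", "", current), lines[j])
      i <- j
    }
    out <- c(out, current)
    i <- i + 1
  }
  out
}

dehyphenate_lines(c("The col-", "", " ", "lege was founded", "in 1821."))
```

Note that this drops the blank lines it skips, so it would need adjusting if any of them mark a real paragraph break.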
After parsing through each chapter, we've edited the original text to fit the expectations of the depaginate function, which fits the expectations of the dehyphenate function. Last commits are here (https://github.com/STAT325-S24/HistoryAmherstCollege/commit/1c85d184923017389b106f2bded4907244c5c15f)
To help address #49 can you please add some error checking to this routine? It's generating the following error message when I try to run it:
==> quarto preview 02_dehyphenate.qmd --to pdf --no-watch-inputs --no-browse
processing file: 02_dehyphenate.qmd
|............................................. | 86% [unnamed-chunk-3]
Quitting from lines 38-53 [unnamed-chunk-3] (02_dehyphenate.qmd)
Error in `if (str_detect(chapter_lines[i], "-$")) ...`:
! missing value where TRUE/FALSE needed
Backtrace:
1. global fix_up_lines(chapter_name)
Execution halted
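The "missing value where TRUE/FALSE needed" message means the `if` condition evaluated to `NA`. One possible form of error checking is to validate the lines before the loop runs and report where the missing values are. This is only a sketch: `check_chapter_lines()` is a hypothetical helper, and how it would be wired into `fix_up_lines()` is an assumption.

```r
# Hypothetical validation step to run before the hyphen check.
# Stops with an informative message if any line is missing (NA).
check_chapter_lines <- function(chapter_lines, chapter_name) {
  bad <- which(is.na(chapter_lines))
  if (length(bad) > 0) {
    stop(
      "Chapter '", chapter_name, "' has missing (NA) lines at positions: ",
      paste(bad, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(chapter_lines)
}

# Example: capture the message rather than halting the render
msg <- tryCatch(
  check_chapter_lines(c("first line", NA, "third line"), "chapter16"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

With a check like this, the render would still stop, but the message would name the chapter and the offending line positions instead of only the chunk.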
@nicholasjhorton There seems to be a spot in Chapter 16 where the page number and title were not removed. I'll clear this up with Justin and let you know when we get this to work.
Done: see #49
This issue will be closed when the workflow supports wrangling text to remove hyphens, e.g.:
becomes