nicholasjhorton closed this issue 5 months ago
One idea is to merge the page tables with the history_text file using str_match or something similar. One issue would be handling the NA values in the Chapter00 page table.
I believe that this is the main remaining hurdle to overcome for the data package: might you two be willing to share some ideas prior to Tuesday's standup as a comment on this issue about how we might proceed?
@tknightly24 as we discussed today, I've added the first_line variable back into the history_subtitles dataset.
Everyone will need to pull/fetch and reinstall the package to see these changes.
library(HistoryAmherstCollege)
dplyr::glimpse(history_subtitles)
#> Rows: 649
#> Columns: 4
#> $ page_number <int> NA, NA, NA, NA, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23…
#> $ page_header <chr> "PREFACE. v", "vi PREFACE. ", "PREFACE. vii ", "viii PRE…
#> $ first_line <chr> "THIS History was a part of the plan for the Semi-Centenni…
#> $ chapter <chr> "00", "00", "00", "00", "01", "01", "01", "01", "01", "01"…
Created on 2024-04-11 with reprex v2.1.0
As a reminder, the text is full of first lines that end with the first part of a hyphenated word. For example, page 25:
See https://github.com/STAT325-S24/HistoryAmherstCollege/commit/c7dc2b6ec95a90d0bef9e0c57f4bf0e3816b413e for my commits.
Here's what the lines look like in chapter 2:
In the first place, the first associated action, and, so far
as appears, the first impulse and movement towards the establishment
of a College in Amherst, was not in Amherst nor even
Care will be needed to find matches between history_text and history_subtitles.
We also need to be careful when the first word on a page is the second half of a word that was hyphenated at the end of the previous page.
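A minimal sketch of how a split word might be detected and rejoined. The two strings are hypothetical stand-ins (adapted from the chapter 2 lines above, with a hyphen added for illustration), not actual rows from history_text, and the variable names are made up for this example:

```r
library(stringr)

# Hypothetical stand-ins: the last line of one page and the first line of the next.
last_line_prev_page  <- "as appears, the first impulse and movement towards the establish-"
first_line_next_page <- "ment of a College in Amherst, was not in Amherst nor even"

# TRUE when the previous page ends in the middle of a hyphenated word.
continues_word <- str_detect(last_line_prev_page, "-$")

# Rejoin the word: take the trailing fragment (dropping the hyphen)
# and paste it onto the first word of the next page.
fragment_start <- str_remove(str_extract(last_line_prev_page, "[[:alpha:]]+-$"), "-$")
fragment_end   <- word(first_line_next_page, 1)
joined <- str_c(fragment_start, fragment_end)
joined
```

This only flags and rejoins the fragments; whether the rejoined word should be credited to the earlier or the later page is a separate decision.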
This may be helpful:
x <- c("this is a test", "this is not the line you are looking for", "this is")
result <- stringr::str_locate(x, "line")
result
#> start end
#> [1,] NA NA
#> [2,] 17 20
#> [3,] NA NA
sum(!is.na(result[,1])) # check for number of matches
#> [1] 1
which(!is.na(result[,1]))
#> [1] 2
Created on 2024-04-11 with reprex v2.1.0
Here's what happens if more than one element of the vector matches:
x <- c("this is a test", "this is not the line you are looking for", "this is")
result <- stringr::str_locate(x, "line")
result
#> start end
#> [1,] NA NA
#> [2,] 17 20
#> [3,] NA NA
sum(!is.na(result[,1])) # check for number of matches
#> [1] 1
which(!is.na(result[,1]))
#> [1] 2
result <- stringr::str_locate(x, "is")
result
#> start end
#> [1,] 3 4
#> [2,] 3 4
#> [3,] 3 4
sum(!is.na(result[,1])) # check for number of matches
#> [1] 3
which(!is.na(result[,1]))
#> [1] 1 2 3
Created on 2024-04-11 with reprex v2.1.0
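One caveat on the str_locate() calls above: str_locate() reports only the first match within each string, so the counts are of matching elements, not of total occurrences. If repeated matches inside a single line matter, something like the following sketch with str_count() and str_locate_all() might help (same toy vector as above):

```r
library(stringr)

x <- c("this is a test", "this is not the line you are looking for", "this is")

# Number of occurrences of "is" within each string ("this" and "is" both count).
n_hits <- str_count(x, "is")
n_hits
#> [1] 2 2 2

# All start/end positions for each string, returned as a list of matrices.
all_locs <- str_locate_all(x, "is")
all_locs[[3]]
```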
See code in https://github.com/STAT325-S24/HistoryAmherstCollege/blob/main/data-raw/data.R and a number of places where I edited the source text to avoid regular expressions on the first line.
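As a possible alternative to editing the source text, wrapping the pattern in stringr::fixed() makes it match literally, so characters such as "." or "(" in a first line are not treated as regular expression syntax. A small sketch with made-up strings (only "PREFACE. v" appears in the real data):

```r
library(stringr)

# Made-up examples; only the first string appears in history_subtitles.
lines <- c("PREFACE. v", "PREFACE! v", "Chapter (I.)")

# As a regular expression, "." matches any character, so "PREFACE! v" also matches.
regex_hits <- str_detect(lines, "PREFACE. v")
regex_hits
#> [1]  TRUE  TRUE FALSE

# With fixed(), the pattern is compared literally.
fixed_hits <- str_detect(lines, fixed("PREFACE. v"))
fixed_hits
#> [1]  TRUE FALSE FALSE

# Parentheses do not need escaping either.
paren_hits <- str_detect(lines, fixed("(I.)"))
paren_hits
#> [1] FALSE FALSE  TRUE
```

I haven't checked this against data-raw/data.R, so it may not cover every case that was edited by hand.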
This will be needed for #38
Can someone start to think about this in advance of our standup on Thursday?