add page numbers to history_text

STAT325-S24 / HistoryAmherstCollege

Text and analysis related to William S. Tyler's "History of Amherst College" (1873)

MIT License

add page numbers to history_text #41

Closed: nicholasjhorton closed this issue 5 months ago

nicholasjhorton commented 5 months ago

This will be needed for #38

Can someone start to think about this in advance of our standup on Thursday?

FranciscoJFM02 commented 5 months ago

One idea is to merge the page tables with the history_text file using str_match or something similar. One issue would be handling the NA values in the Chapter00 page table.
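
A minimal sketch of that merge idea, under stated assumptions: `text_lines` and `page_table` below are toy stand-ins (the real history_text and page tables may be structured differently), and `str_detect()` with `fixed()` is used here in place of `str_match()` so that each first line is treated as a literal string. Pages with an NA page number simply pass that NA on to their lines.

```r
library(stringr)

# Toy stand-ins (hypothetical): not the package's actual data.
text_lines <- c(
  "PREFACE.",
  "THIS History was a part of the plan",
  "for the Semi-Centennial",
  "In the first place, the first associated action, and, so far",
  "as appears, the first impulse and movement"
)
page_table <- data.frame(
  page_number = c(NA, 13L),
  first_line = c(
    "THIS History was a part of the plan",
    "In the first place, the first associated action, and, so far"
  ),
  stringsAsFactors = FALSE
)

# For each page, find the single text line containing its first line.
# Zero or multiple hits are recorded as NA so they can be reviewed by hand.
page_table$line_index <- vapply(
  page_table$first_line,
  function(p) {
    hits <- which(str_detect(text_lines, fixed(p)))
    if (length(hits) == 1L) hits else NA_integer_
  },
  integer(1),
  USE.NAMES = FALSE
)

# Keep only pages that were located (assumed to be in increasing line order).
matched <- page_table[!is.na(page_table$line_index), ]

# Give each text line the page whose first line most recently preceded it.
idx <- findInterval(seq_along(text_lines), matched$line_index)
idx[idx == 0] <- NA # lines before the first located page get no page
page_of_line <- matched$page_number[idx]
page_of_line
```

Lines that come before the first located page, and lines on pages whose page number is NA, end up with NA, so the Chapter00 pages need no special case in this sketch.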

nicholasjhorton commented 5 months ago

I believe this is the main remaining hurdle for the data package. Might you two be willing to share some ideas about how we might proceed, as a comment on this issue, before Tuesday's standup?

nicholasjhorton commented 5 months ago

@tknightly24 as we discussed today, I've added the first_line variable back into the history_subtitles dataset.

Everyone will need to pull/fetch and reinstall the package to see these changes.

library(HistoryAmherstCollege)
dplyr::glimpse(history_subtitles)
#> Rows: 649
#> Columns: 4
#> $ page_number <int> NA, NA, NA, NA, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23…
#> $ page_header <chr> "PREFACE. v", "vi PREFACE.  ", "PREFACE. vii  ", "viii PRE…
#> $ first_line  <chr> "THIS History was a part of the plan for the Semi-Centenni…
#> $ chapter     <chr> "00", "00", "00", "00", "01", "01", "01", "01", "01", "01"…

Created on 2024-04-11 with reprex v2.1.0

As a reminder, the text is full of first lines that end with the first part of a hyphenated word. For example, page 25:

[Screenshot of page 25, 2024-04-11]

See https://github.com/STAT325-S24/HistoryAmherstCollege/commit/c7dc2b6ec95a90d0bef9e0c57f4bf0e3816b413e for my commits.
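
One possible way to cope with first lines that end mid-word is to drop the trailing fragment before matching. This is only a sketch: the `first_line` values are hypothetical examples, and `match_key` is an illustrative name, not a variable in the package.

```r
library(stringr)

# Hypothetical first lines; the second ends partway through a hyphenated word.
first_line <- c(
  "as appears, the first impulse and movement",
  "of a College in Amherst, was not in Am-"
)

# Remove a final token that ends in a hyphen, then tidy the whitespace.
# Lines that do not end in a hyphen are left unchanged.
match_key <- str_squish(str_remove(first_line, "\\S+-$"))
match_key
```

The shortened key can then be compared against the start of each text line instead of the whole line.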

nicholasjhorton commented 5 months ago

Here's what the lines look like in chapter 2:

In the first place, the first associated action, and, so far
as appears, the first impulse and movement towards the establishment
of a College in Amherst, was not in Amherst nor even

Care will be needed to find matches between history_text and history_subtitles.
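
As one illustration of why care is needed, here is a sketch in which stray whitespace alone defeats an exact comparison. The `first_line` value is a hypothetical variant of the chapter 2 line above; whether the real datasets differ in this way is an assumption.

```r
library(stringr)

text_lines <- c(
  "In the first place, the first associated action, and, so far",
  "as appears, the first impulse and movement towards the establishment",
  "of a College in Amherst, was not in Amherst nor even"
)

# Hypothetical first_line value with extra spaces from a different extraction step.
first_line <- "  In the first place,  the first associated action, and, so far "

# Exact comparison finds nothing because the whitespace differs.
exact <- match(first_line, text_lines)

# Collapsing runs of whitespace on both sides lets the comparison succeed.
normalized <- match(str_squish(first_line), str_squish(text_lines))

c(exact = exact, normalized = normalized)
```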

nicholasjhorton commented 5 months ago

We also need to be careful when the first word on a page is the second half of a word that was hyphenated at the end of the previous page.
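
A small sketch of one way to sidestep a leading fragment: drop the first token of the first line before matching. The example line is hypothetical (it imagines the previous page ending in "estab-"), and `key_without_first_word` is an illustrative name.

```r
library(stringr)

# Hypothetical first line that begins with the tail of a word split across pages.
first_line <- "lishment of a College in Amherst, was not in Amherst nor even"

# Drop the first whitespace-delimited token, which may be only a word fragment.
key_without_first_word <- str_remove(first_line, "^\\S+\\s+")
key_without_first_word
```

Dropping the first token shortens the key, so it may be worth checking afterwards that the key still identifies a single line.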

nicholasjhorton commented 5 months ago

This may be helpful:

x <- c("this is a test", "this is not the line you are looking for", "this is")
result <- stringr::str_locate(x, "line")
result
#>      start end
#> [1,]    NA  NA
#> [2,]    17  20
#> [3,]    NA  NA
sum(!is.na(result[,1])) # check for number of matches
#> [1] 1
which(!is.na(result[,1]))
#> [1] 2

Created on 2024-04-11 with reprex v2.1.0

nicholasjhorton commented 5 months ago

Here's what happens if more than one element matches:

x <- c("this is a test", "this is not the line you are looking for", "this is")
result <- stringr::str_locate(x, "line")
result
#>      start end
#> [1,]    NA  NA
#> [2,]    17  20
#> [3,]    NA  NA
sum(!is.na(result[,1])) # check for number of matches
#> [1] 1
which(!is.na(result[,1]))
#> [1] 2
result <- stringr::str_locate(x, "is")
result
#>      start end
#> [1,]     3   4
#> [2,]     3   4
#> [3,]     3   4
sum(!is.na(result[,1])) # check for number of matches
#> [1] 3
which(!is.na(result[,1]))
#> [1] 1 2 3

Created on 2024-04-11 with reprex v2.1.0
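
The two checks above could be wrapped into a small helper along these lines. `match_status()` is a hypothetical function, not part of the package, and it uses `fixed()` so the pattern is treated literally.

```r
library(stringr)

# Hypothetical helper: report how many elements of `x` contain `pattern`.
match_status <- function(x, pattern) {
  hits <- which(!is.na(str_locate(x, fixed(pattern))[, 1]))
  status <- if (length(hits) == 0L) {
    "none"
  } else if (length(hits) == 1L) {
    "unique"
  } else {
    "ambiguous"
  }
  list(n = length(hits), index = hits, status = status)
}

x <- c("this is a test", "this is not the line you are looking for", "this is")
match_status(x, "line")$status # exactly one element contains "line"
match_status(x, "is")$index    # every element contains "is"
```

Only a "unique" result gives a safe place to attach a page number; "none" and "ambiguous" cases would need a longer or edited key.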

nicholasjhorton commented 5 months ago

See the code in https://github.com/STAT325-S24/HistoryAmherstCollege/blob/main/data-raw/data.R, as well as a number of places where I edited the source text to avoid regular expressions on the first line.
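
For anyone curious why regular expressions can get in the way here, a short sketch with a made-up line (the strings below are hypothetical, not from the book): characters such as "." and "$" have special meaning in a regex, while `stringr::fixed()` compares the text literally.

```r
library(stringr)

# Hypothetical lines containing characters that are special in regular expressions.
x <- c("the cost was $10.00 (est.)", "the cost was $10X00 est")

# As a regex, "." matches any character, so both lines match.
as_regex <- str_detect(x, "10.00")

# As a literal string, only the exact text matches.
as_fixed <- str_detect(x, fixed("$10.00 (est.)"))

as_regex
as_fixed
```

Matching with `fixed()` may be an alternative to editing the source text, although this has not been tried against the actual data.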