STAT325-S24 / HistoryAmherstCollege

Text and analysis related to Williams S. Tyler's "History of Amherst College" (1873)
MIT License
0 stars 1 forks source link

History of Amherst College data package

STAT325 class at Amherst College (Nicholas Horton) 2024-04-14

Text and analysis related to Williams S. Tyler’s “History of Amherst College” (1873)

This repo provides the curated text from the book entitled History of Amherst College during the First Half Century.

See https://archive.org/details/historyofamherst00tyleiala for the book in a variety of formats.

Our mission was to take the text file for this historic text and make it accessible more broadly.

This file describes the HistoryAmherstCollege package.

Website

A website can be found at https://stat325-s24.github.io/HistoryAmherstCollege/

Installation instructions

devtools::install_github("STAT325-S24/HistoryAmherstCollege")
library(HistoryAmherstCollege)

Example analyses

The following is a glimpse of the packages dataframes along with a sample analysis using Named Entity Recognition.

packageVersion("HistoryAmherstCollege")
[1] '1.0'
glimpse(history_text)
Rows: 25,537
Columns: 4
$ text      <chr> "Amherst College DURING ITS FIRST HALF CENTURY.  1821-1871."…
$ chapter   <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", …
$ paragraph <dbl> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, …
$ page_num  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
mosaic::tally(~ chapter, data = history_text)
chapter
  00   01   02   03   04   05   06   07   08   09   10   11   12   13   14   15 
 157  422  394  207  398  402  402  693  536  900 1336 1284  958 1109 1192  288 
  16   17   18   19   20   21   22   23   24   25   26   27   28   29 
1313 1231  439 1298 2081 1238 1077  674  907 1486  666  308 1638  503 
glimpse(history_anno)
List of 3
 $ token   : tibble [281,272 × 10] (S3: tbl_df/tbl/data.frame)
  ..$ doc_id       : int [1:281272] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ sid          : int [1:281272] 1 1 1 1 1 1 1 1 1 2 ...
  ..$ tid          : int [1:281272] 1 2 3 4 5 6 7 8 9 1 ...
  ..$ token        : chr [1:281272] "Amherst" "College" "DURING" "ITS" ...
  ..$ token_with_ws: chr [1:281272] "Amherst " "College " "DURING " "ITS " ...
  ..$ lemma        : chr [1:281272] "Amherst" "College" "during" "its" ...
  ..$ upos         : chr [1:281272] "PROPN" "PROPN" "ADP" "PRON" ...
  ..$ xpos         : chr [1:281272] "NNP" "NNP" "IN" "PRP$" ...
  ..$ tid_source   : int [1:281272] 2 0 2 7 7 7 3 2 8 3 ...
  ..$ relation     : chr [1:281272] "compound" "root" "prep" "poss" ...
 $ entity  : tibble [19,065 × 6] (S3: tbl_df/tbl/data.frame)
  ..$ doc_id     : int [1:19065] 1 1 1 2 3 4 6 6 7 8 ...
  ..$ sid        : int [1:19065] 1 1 2 1 1 1 1 1 1 1 ...
  ..$ tid        : int [1:19065] 1 4 1 2 5 4 1 3 1 1 ...
  ..$ tid_end    : int [1:19065] 2 7 3 4 5 8 1 3 3 1 ...
  ..$ entity_type: chr [1:19065] "ORG" "DATE" "DATE" "PERSON" ...
  ..$ entity     : chr [1:19065] "Amherst College" "ITS FIRST HALF CENTURY" "1821-1871" "W. S. TYLER" ...
 $ document: tibble [25,537 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ chapter  : chr [1:25537] "00" "00" "00" "00" ...
  ..$ paragraph: num [1:25537] 1 1 1 1 2 2 2 2 3 3 ...
  ..$ page_num : num [1:25537] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ doc_id   : int [1:25537] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "class")= chr [1:2] "cnlp_annotation" "list"
mosaic::tally(~ entity$entity_type, data = history_anno)
entity$entity_type
   CARDINAL        DATE       EVENT         FAC         GPE    LANGUAGE 
       1921        4021          24         169        2164          56 
        LAW         LOC       MONEY        NORP     ORDINAL         ORG 
         38         343         343         845         627        3826 
    PERCENT      PERSON     PRODUCT    QUANTITY        TIME WORK_OF_ART 
          5        4127          89          44         269         154 
glimpse(history_subtitles)
Rows: 656
Columns: 4
$ page_number <int> 5, 6, 7, 8, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24…
$ page_header <chr> "PREFACE. 5 (v)", "6 (vi) PREFACE.  ", "PREFACE. 7 (vii)",…
$ first_line  <chr> "THIS History was a part of the plan for the Semi-Centenni…
$ chapter     <chr> "00", "00", "00", "00", "01", "01", "01", "01", "01", "01"…
glimpse(history_tables)
Rows: 15
Columns: 3
$ pdf_pg  <dbl> 15, 335, 412, 493, 506, 649, 655, 657, 668, 676, 680, 688, 693…
$ book_pg <dbl> 9, 321, 392, 465, 478, 607, 613, 615, 626, 632, 636, 644, 649,…
$ title   <chr> "Contents", "Misc_Donations", "Principal_Donations", "Religiou…
glimpse(history_figures)
Rows: 20
Columns: 2
$ pdf_pg  <dbl> 6, 40, 71, 84, 293, 341, 372, 409, 421, 425, 429, 453, 590, 59…
$ caption <chr> "view_from_ne", "view_from_mt_pleasant", "ac_in_1821", "zeph_s…
history_anno$entity |>
  filter(entity_type == "PERSON") |>
  group_by(entity) |>
  count() |>
  arrange(desc(n)) |>
  head(6)
# A tibble: 6 × 2
# Groups:   entity [6]
  entity        n
  <chr>     <int>
1 Humphrey    165
2 Hitchcock   160
3 Moore        67
4 Fiske        60
5 Stearns      58
6 Sabbath      49

The above table shows us the 6 most common names in the book. Some familiar ones immediately pop out, such as President Hitchcock, the eponym of the “Hitchcock Residence Hall” and a founding member of the American Statistical Association.

sabbath is listed incorrectly as a person.

history_anno$token |>
  filter(upos == "NOUN") |>
  group_by(token) |>
  count() |>
  arrange(desc(n)) |>
  head(6)
# A tibble: 6 × 2
# Groups:   token [6]
  token        n
  <chr>    <int>
1 years      703
2 time       614
3 students   517
4 year       390
5 dollars    379
6 class      329

The above table displays the 6 most common nouns in the text.

Notes