make a proposal for what the format of the data should be

STAT325-S24 / HistoryAmherstCollege

Text and analysis related to William S. Tyler's "History of Amherst College" (1873)

MIT License

make a proposal for what the format of the data should be #5

Closed by nicholasjhorton 7 months ago

nicholasjhorton commented 7 months ago

This issue will be closed when there is a proposal for how the text data should be organized for the package. Chapter? Paragraph?

Casey308 commented 7 months ago

I think we decided to pull the page number from each even page (which should be straightforward, because those pages carry the "history of amherst college" string) together with the following line into a table. That would let us locate the even pages, and from there we could pull out the paragraphs, sentences, etc. from the work.
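A rough sketch of that idea in Python, in case it helps the discussion. The header format (a page number followed by the title string on its own line) is an assumption about the source text, and `even_page_headers`, `HEADER`, and the sample lines are hypothetical names and made-up data, not anything from the repository. It also takes "the following line" to mean the next non-blank line.

```python
import re

# Assumption: an even page begins with a running header such as
# "12 HISTORY OF AMHERST COLLEGE." (page number, then the title string).
HEADER = re.compile(r"^\s*(\d+)\s+HISTORY OF AMHERST COLLEGE\.?\s*$", re.IGNORECASE)


def even_page_headers(lines):
    """Return one row per matched header: the page number and the next non-blank line."""
    rows = []
    for i, line in enumerate(lines):
        match = HEADER.match(line)
        if not match:
            continue
        # First non-blank line after the header, or "" if there is none.
        following = next((l.strip() for l in lines[i + 1:] if l.strip()), "")
        rows.append({"page_num": int(match.group(1)), "next_line": following})
    return rows


# Made-up sample text, only to show the shape of the result.
sample = [
    "2 HISTORY OF AMHERST COLLEGE.",
    "",
    "made-up first line of page two",
    "more made-up text",
    "4 HISTORY OF AMHERST COLLEGE.",
    "made-up first line of page four",
]
table = even_page_headers(sample)
```

If the real headers look different (for example, the page number after the title), only the regular expression should need to change.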

Casey308 commented 7 months ago

We could also do something with a cumulative sum to get the chapter number.
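One way the cumulative-sum idea could look, as a minimal sketch: flag each line that looks like a chapter heading, then take a running total of the flags. The heading pattern ("CHAPTER" followed by a Roman numeral) is an assumption about the text, and `chapter_numbers` and the sample lines are hypothetical.

```python
import re
from itertools import accumulate

# Assumption: chapter headings look like "CHAPTER I.", "CHAPTER II.", ...
CHAPTER = re.compile(r"^\s*CHAPTER\s+[IVXLC]+\b")


def chapter_numbers(lines):
    """Assign each line a chapter number via a cumulative sum of heading flags."""
    flags = [1 if CHAPTER.match(line) else 0 for line in lines]
    return list(accumulate(flags))


# Made-up sample: front matter, then two chapters.
sample = ["preface text", "CHAPTER I.", "text", "text", "CHAPTER II.", "text"]
chap_num = chapter_numbers(sample)  # [0, 1, 1, 1, 2, 2]
```

Lines before the first heading get 0, which would line up with treating front matter as chapter 0.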

nicholasjhorton commented 7 months ago

Here's what I'm thinking is the working proposal:

chap_num	chap_title	page_num	para_num	text
0	Preface	i	1	This is the preface
1	Introduction	1	5	This is the first paragraph of the intro

Note that generating page_num will require some magic (and separate processing).

Since we have page_num we could later add subtitle_text to the main data table.
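To illustrate that last point, here is a small sketch of attaching subtitle_text by page_num. The two rows are the example rows from the proposal above; the `subtitles` lookup, its contents, and `add_subtitles` are hypothetical. page_num is kept as a string here (an assumption) so that Roman-numeral front-matter pages like "i" and Arabic pages can share one column.

```python
# The two example rows from the working proposal.
rows = [
    {"chap_num": 0, "chap_title": "Preface", "page_num": "i",
     "para_num": 1, "text": "This is the preface"},
    {"chap_num": 1, "chap_title": "Introduction", "page_num": "1",
     "para_num": 5, "text": "This is the first paragraph of the intro"},
]

# Hypothetical lookup produced by separate processing: page_num -> subtitle text.
subtitles = {"1": "hypothetical subtitle"}


def add_subtitles(rows, subtitles):
    """Return new rows with a subtitle_text column (None when a page has no subtitle)."""
    return [{**row, "subtitle_text": subtitles.get(row["page_num"])} for row in rows]


merged = add_subtitles(rows, subtitles)
```

Because the join key is page_num, the subtitle column can be added later without touching the other columns.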

Comments and suggestions welcomed.

nicholasjhorton commented 7 months ago

Closed in favor of #41