Closed nicholasjhorton closed 7 months ago
I think we decided to pull the page number from each even page (which should be straightforward, since those pages carry the "history of amherst college" string), along with the line that follows it, into a table. That gives us the even pages, and from there we can pull out the paragraphs, sentences, etc. from the work.
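A rough sketch of that extraction step, in Python for illustration only. The header layout (page number, then the running title), the `HEADER_RE` pattern, the `even_page_headers` helper, and the sample lines are all assumptions; the real scanned text may be laid out differently.

```python
import re

# Assumed header layout: "<page number> HISTORY OF AMHERST COLLEGE."
# The actual text may put the number elsewhere or omit the period.
HEADER_RE = re.compile(r"^\s*(\d+)\s+history of amherst college\b", re.IGNORECASE)


def even_page_headers(lines):
    """Return (page_num, following_line) pairs for each matching header line."""
    found = []
    for i, line in enumerate(lines):
        match = HEADER_RE.match(line)
        if match:
            # The line right after the header, or "" if the header is the last line.
            following = lines[i + 1] if i + 1 < len(lines) else ""
            found.append((int(match.group(1)), following))
    return found


# Made-up sample input, not taken from the book.
sample = [
    "12 HISTORY OF AMHERST COLLEGE.",
    "First line after the header.",
    "Body text.",
    "14 HISTORY OF AMHERST COLLEGE.",
    "Another following line.",
]
headers = even_page_headers(sample)
```

Each pair could then become one row of the page-level table.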
We could also do something with a cumulative sum to get the chapter number.
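A minimal sketch of the cumulative-sum idea, in Python for illustration. The sample lines and the assumption that chapter headings start with "CHAPTER" are hypothetical; the pattern would need to match whatever the real text uses.

```python
import re
from itertools import accumulate

# Made-up sample lines; the real chapter headings may look different.
lines = [
    "PREFACE",
    "This is the preface.",
    "CHAPTER I.",
    "This is the first paragraph of the intro.",
    "A second paragraph.",
    "CHAPTER II.",
    "Another chapter begins.",
]

# Assumed heading pattern: a line that starts with the word "CHAPTER".
CHAPTER_RE = re.compile(r"^CHAPTER\b")

# Flag each heading line with 1, everything else with 0 ...
is_heading = [1 if CHAPTER_RE.match(line) else 0 for line in lines]

# ... then the running total of the flags is the chapter number of each line.
chap_num = list(accumulate(is_heading))
```

Lines before the first heading get chapter 0, which lines up with the Preface row in the proposal below.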
Here's what I'm thinking of as the working proposal:
chap_num | chap_title | page_num | para_num | text |
---|---|---|---|---|
0 | Preface | i | 1 | This is the preface |
1 | Introduction | 1 | 5 | This is the first paragraph of the intro |
Note that `page_num` will require some magic (and separate processing) to generate the page numbers.
Since we have `page_num`, we could later add `subtitle_text` to the main data table.
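To illustrate how `page_num` could act as the key for adding `subtitle_text` later, here is a small Python sketch. The two rows come from the proposal table; the `subtitles` lookup and its contents are invented placeholders, and `page_num` is kept as a string on the assumption that front matter uses roman numerals.

```python
# Rows from the proposed table; page_num is a string so "i" and "1" can coexist.
rows = [
    {"chap_num": 0, "chap_title": "Preface", "page_num": "i",
     "para_num": 1, "text": "This is the preface"},
    {"chap_num": 1, "chap_title": "Introduction", "page_num": "1",
     "para_num": 5, "text": "This is the first paragraph of the intro"},
]

# Hypothetical page-level lookup built by the separate page processing step.
subtitles = {"1": "A made-up subtitle"}

# Left-join style merge: pages without a subtitle get None.
with_subtitles = [
    {**row, "subtitle_text": subtitles.get(row["page_num"])}
    for row in rows
]
```

Keeping the page-level data in its own lookup means the main table does not have to change shape until the subtitles are actually available.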
Comments and suggestions welcomed.
Closed in favor of #41
This issue will be closed when there is a proposal for how the text data should be organized for the package. Chapter? Paragraph?