Princeton-CDH / ppa-nlp

Discovering patterns in poetry’s data with machine learning; software for use with Princeton Prosody Archive (PPA) full-text corpus

Refactor generate page set script to (1) support nonconsecutive works and (2) simplify page selection logic #101

Open laurejt opened 2 days ago

laurejt commented 2 days ago

[ ] Support nonconsecutive PPA works
[ ] Simplify page selection

rlskoeser commented 2 days ago

To improve handling of PPA excerpt page ranges, I recommend we use intspan. In ppa-django we use the intspan package to parse page ranges, and anything in the data exports is guaranteed parsable by intspan.

Here's where we use it in the ppa-django DigitizedWork model: https://github.com/Princeton-CDH/ppa-django/blob/main/ppa/archive/models.py#L1047-L1053

Parsing a page range with intspan returns an object that can be treated as an iterable over all page numbers included in the span.
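As a rough illustration of how this could help with nonconsecutive page ranges, here is a minimal sketch. It assumes the `intspan` package is installed and falls back to a small stand-in parser if it is not. The helper names (`parse_page_range`, `select_pages`) and the `order` field on page records are hypothetical, not taken from ppa-nlp or ppa-django.

```python
try:
    from intspan import intspan  # third-party package: pip install intspan

    def parse_page_range(span: str) -> list[int]:
        """Expand a page-range string such as '10-12,45' into page numbers."""
        return sorted(intspan(span))

except ImportError:

    def parse_page_range(span: str) -> list[int]:
        """Stand-in for intspan; handles only simple 'a-b,c' style spans."""
        pages: set[int] = set()
        for part in span.split(","):
            part = part.strip()
            if not part:
                continue
            start, _, end = part.partition("-")
            # A single page like '45' has no end, so reuse the start value.
            pages.update(range(int(start), int(end or start) + 1))
        return sorted(pages)


def select_pages(pages: list[dict], span: str) -> list[dict]:
    """Keep only the page records whose (hypothetical) 'order' field
    falls inside the given page range, consecutive or not."""
    wanted = set(parse_page_range(span))
    return [page for page in pages if page["order"] in wanted]


# Hypothetical page records for one work, numbered 1 through 50.
work_pages = [{"order": n, "text": f"page {n}"} for n in range(1, 51)]
excerpt = select_pages(work_pages, "10-12,45")
excerpt_orders = [page["order"] for page in excerpt]
```

Because the parsed span is just a collection of integers, the same membership check covers both consecutive and nonconsecutive excerpts, which may let the page selection logic avoid special cases.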