Princeton-CDH / ppa-django

Princeton Prosody Archive v3.x - Python/Django web application
http://prosody.princeton.edu
Apache License 2.0

Page indexing refactor #654

Closed rlskoeser closed 3 months ago

rlskoeser commented 3 months ago

This refactors the page indexing code to set up work for the EEBO import.

Previously there were separate methods for Gale and HathiTrust page index data, with substantial overlap in common logic. This refactor implements source-specific page data generators, which are then consumed and updated with the common logic (like excerpt handling) and fields needed for all page records.
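
For readers unfamiliar with the pattern, here is a minimal sketch of the approach described above. It is not the actual ppa-django code: the function names (`hathi_page_data`, `gale_page_data`, `page_index_data`), the input shapes, and the field names are all hypothetical, chosen only to illustrate source-specific generators feeding one shared step.

```python
from typing import Dict, Iterable, Iterator, List


def hathi_page_data(pages: List[str]) -> Iterator[Dict]:
    """Hypothetical HathiTrust generator: yields only source-specific fields."""
    for order, text in enumerate(pages, start=1):
        yield {"order": order, "label": str(order), "content": text}


def gale_page_data(pages: List[Dict]) -> Iterator[Dict]:
    """Hypothetical Gale generator: maps an assumed API-style dict to the same shape."""
    for page in pages:
        yield {
            "order": page["number"],
            "label": page.get("label", str(page["number"])),
            "content": page["text"],
        }


def page_index_data(work_id: str, source_pages: Iterable[Dict]) -> Iterator[Dict]:
    """Shared step: add the fields every page record needs, whatever the source."""
    for page in source_pages:
        page.update(
            {
                "id": f"{work_id}.{page['order']}",  # assumed id scheme
                "source_id": work_id,
                "item_type": "page",
            }
        )
        yield page


hathi_records = list(page_index_data("h1", hathi_page_data(["first", "second"])))
gale_records = list(
    page_index_data("g1", gale_page_data([{"number": 7, "text": "seventh"}]))
)
```

With this shape, adding a new source such as EEBO would only require a new generator; the shared fields stay in one place.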

Also adds an efficiency improvement: when indexing pages for an excerpt, we don't need to keep iterating once we've got the data for the range we care about.
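
The early exit can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in this PR: `pages_in_range` and `numbered_pages` are hypothetical names, the excerpt is assumed to be a single contiguous range, and pages are assumed to arrive in ascending order.

```python
from typing import Dict, Iterable, Iterator, List


def pages_in_range(
    source_pages: Iterable[Dict], first: int, last: int
) -> Iterator[Dict]:
    """Yield pages whose order falls within first..last, inclusive.

    Assumes pages arrive in ascending order, so iteration can stop
    as soon as a page past the end of the range is seen.
    """
    for page in source_pages:
        if page["order"] < first:
            continue  # not in the excerpt yet
        if page["order"] > last:
            break  # past the excerpt; nothing later can match
        yield page


def numbered_pages(total: int, consumed: List[int]) -> Iterator[Dict]:
    """Demo source that records which pages were actually read."""
    for order in range(1, total + 1):
        consumed.append(order)
        yield {"order": order}


consumed: List[int] = []
excerpt = [p["order"] for p in pages_in_range(numbered_pages(100, consumed), 3, 5)]
```

In this demo only the first six of the hundred pages are read: the sixth is the one that signals the range has ended.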

I also updated one test that has caused problems before, where we had some hard-coded Solr page index data. Now it takes advantage of the common page field logic, so if the page indexing logic or fields change, the test will automatically pick up those updates.