Princeton-CDH / ppa-django

Princeton Prosody Archive v3.x - Python/Django web application
http://prosody.princeton.edu
Apache License 2.0

Code to generate plain text page content from EEBO-TCP XML #641 #648

Closed (rlskoeser closed 4 months ago)

rlskoeser commented 4 months ago

First step towards EEBO-TCP import #641

Code to read page content from a single EEBO-TCP P4 TEI XML file. Includes a sample file as a fixture and unit tests for the page logic only. If we're comfortable with the approach, I'll add a few more unit tests.
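For readers unfamiliar with the format, here is a minimal sketch of the general idea: splitting a TEI document into pages at page-break milestones. It is not the code in this PR. It assumes uppercase P4-style element names with page breaks marked as `PB` elements carrying an `N` attribute; the function name `pages_from_tei` and the inline sample are hypothetical, and it collapses whitespace, which the real implementation may not do.

```python
import xml.etree.ElementTree as ET


def pages_from_tei(xml_text, pb_tag="PB"):
    """Return a list of (page label, text) tuples, split at page-break elements.

    Hypothetical helper for illustration only; assumes page breaks are
    empty milestone elements named ``pb_tag`` with an ``N`` attribute.
    """
    root = ET.fromstring(xml_text)
    pages = []
    current = {"n": None, "chunks": []}

    def flush():
        # Join collected text and collapse runs of whitespace.
        text = " ".join("".join(current["chunks"]).split())
        # Skip the empty pseudo-page before the first page break.
        if text or current["n"] is not None:
            pages.append((current["n"], text))

    def walk(el):
        if el.tag == pb_tag:
            # A page break closes the previous page and starts a new one.
            flush()
            current["n"] = el.get("N")
            current["chunks"] = []
        elif el.text:
            current["chunks"].append(el.text)
        for child in el:
            walk(child)
            # Tail text follows the child in document order.
            if child.tail:
                current["chunks"].append(child.tail)

    walk(root)
    flush()
    return pages


# Made-up sample, not taken from the fixture in this PR.
SAMPLE = (
    '<TEXT><BODY><PB N="1"/><P>First page <HI>text</HI>.</P>'
    '<PB N="2"/><P>Second page.</P></BODY></TEXT>'
)
pages = pages_from_tei(SAMPLE)
```

Walking the tree in document order, rather than querying for `PB` elements directly, is what lets text be attributed to the page break that precedes it, since page breaks are milestones and do not wrap their content.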

CodeClimate is flagging the page content method as too complex. I revised it and apparently made it more complex! I think it's OK to ignore that check, but please comment if you have suggestions.

rlskoeser commented 4 months ago

Updated comments / descriptions of attributes and removed the Unicode divider from text content.