Princeton-CDH / ppa-django

Princeton Prosody Archive v3.x - Python/Django web application
http://prosody.princeton.edu
Apache License 2.0

check ppa subset of eebo-tcp xml contents (div types, inline elements, check for notes) #659

Closed rlskoeser closed 2 weeks ago

rlskoeser commented 1 month ago

rerun xqueries on ppa subset of eebo-tcp (div types; check for notes, empty tags, etc)

review instructions

Next steps will be to determine which elements we want to handle differently when converting from TEI to plain text.

rlskoeser commented 1 month ago

inline elements:

elements we're not handling:

rlskoeser commented 1 month ago

Here's a list of all the unique div types that occur in our subset of documents (have not restricted it to page ranges for excerpts):

eebo_tcp_divtypes.txt
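For reference, a tally like this could also be produced outside of XQuery. The following is a minimal Python sketch, not the query that was actually run for this issue. It assumes TEI P5-style markup where divisions are plain `div` elements with a `type` attribute; `count_attr_values`, `local_name`, and the sample document are hypothetical names and data made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip any XML namespace, e.g. '{http://www.tei-c.org/ns/1.0}div' -> 'div'."""
    return tag.rsplit("}", 1)[-1]


def count_attr_values(xml_text, element, attribute):
    """Count the values of `attribute` on every `element` in one document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for el in root.iter():
        if isinstance(el.tag, str) and local_name(el.tag) == element:
            # Elements without the attribute are tallied under "(none)".
            counts[el.get(attribute, "(none)")] += 1
    return counts


# Hypothetical miniature document, not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div type="poem"><p>one</p></div>
<div type="poem"><p>two</p></div>
<div type="dedication"><p>three</p></div>
<div><p>untyped</p></div>
</body></text></TEI>"""

div_types = count_attr_values(SAMPLE, "div", "type")
for value, count in sorted(div_types.items()):
    print(value, count)
```

The same helper could be pointed at note elements instead (for example `count_attr_values(xml, "note", "place")`), assuming the note type is recorded in an attribute of that name.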

rlskoeser commented 1 month ago

Here are all the note types that occur in our subset:

Here's an example marginal note:

Here's an example of an "inter" note (also displayed at end): https://quod.lib.umich.edu/e/eebo/A65112.0001.001/1:5?rgn=div1;view=fulltext#backDLPS2

Example of "parend" note, also displayed at end: see page 152 https://quod.lib.umich.edu/e/eebo/A29229.0001.001/1:8?rgn=div1;view=fulltext#backDLPS3

mnaydan commented 3 weeks ago

@rlskoeser thanks for these notes! This makes sense to me, and I appreciate the detail. A few initial thoughts:

rlskoeser commented 2 weeks ago

ok! I figured out a way to query Solr to see what the plain text looks like (I tried using the corpus export, but that was too slow/inefficient). I'll share how these look in plain text in the format we're generating now to help us decide what (if any) action to take.

tables

Here's what the beginning of the table content that I linked above looks like (example p.899):


The Names.
The Symboles.

The Queene.
1.
EVPHORIS.
1.
A golden tree, laden with fruit.

Co. of Bedford.
AGLAIA.

La. Herbert.
2.
DIAPHANE.
2.
The figure Isocaedron of crystall.

Co. of Derby.
EVCAMPSE.

If we care about how the table looks in plain text, we might want rows to appear on a single line - but I'm honestly not sure we care about this for either the webapp or the nlp work. There are libraries that will generate nicely formatted plain text tables but that seems like overkill for both of our use cases.
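If rows on a single line ever do turn out to matter, it would not need a table-formatting library. Here is a minimal sketch, assuming the TEI marks tables up as `table` / `row` / `cell` elements; `table_to_lines` and the separator are hypothetical, and the sample borrows cell text from the excerpt above, with the row grouping guessed.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def table_to_lines(table, separator=" | "):
    """Return one plain-text line per row, with cells joined by `separator`."""
    lines = []
    for row in table.iter():
        if local_name(row.tag) != "row":
            continue
        cells = []
        for cell in row:
            if local_name(cell.tag) == "cell":
                # Collapse internal whitespace so each cell stays on one line.
                cells.append(" ".join("".join(cell.itertext()).split()))
        lines.append(separator.join(cells))
    return lines


# Hypothetical markup; only the cell text comes from the excerpt above.
SAMPLE = """<table>
<row><cell>The Names.</cell><cell>The Symboles.</cell></row>
<row><cell>The Queene.</cell><cell>1.</cell><cell>EVPHORIS.</cell><cell>1.</cell><cell>A golden tree,
 laden with fruit.</cell></row>
</table>"""

lines = table_to_lines(ET.fromstring(SAMPLE))
print("\n".join(lines))
```

Keeping the cell text intact and only changing the line breaks means phrase searches should behave the same either way.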


Documenting for myself how I'm doing this in django shell in case we need to do this again:

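# Run inside the Django shell (python manage.py shell)
from ppa.archive.solr import PageSearchQuerySet

psq = PageSearchQuerySet()
# Find one page by source id and page label; also request the page's content field
result = psq.filter(source_id='A04632', label=899).also('content')[0]
print(result['content'])
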
rlskoeser commented 2 weeks ago

inline note

Here's an example of the text we're currently generating for an inline note (example p. 5). I'm putting the text of the note in bold.

another. So there are several words common to the Turks, Germans,**Boxhorn. Origin. Gallic. cap. 6. & 8.** Greeks, French, sometimes of the same, and sometimes of several significations; which is not sufficient to argue that all these were of the same Original.

We're currently including the text of the note inline with no spacing or indication that it's a note.

I think we should either omit the note entirely or move it to the end of the page text with its number. Omitting is easier; moving it to the end of the page is a small effort.
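To make the two options concrete, here is a rough sketch of both, assuming notes are `note` elements nested inside the page text. It is not the existing TEI-to-text conversion code: `page_text`, the bracketed markers, the fallback to sequential numbering when a note has no `n` attribute, and the sample markup (including `place="marg"`) are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def page_text(element, include_notes=True):
    """Return plain text with notes either omitted or moved to the end."""
    body, notes = [], []

    def walk(el):
        if el.text:
            body.append(el.text)
        for child in el:
            if local_name(child.tag) == "note":
                if include_notes:
                    # Use the note's own number if present, else count up.
                    label = child.get("n") or str(len(notes) + 1)
                    note_text = " ".join("".join(child.itertext()).split())
                    notes.append((label, note_text))
                    body.append(f"[{label}]")
            else:
                walk(child)
            # Text after a child's closing tag always belongs to the body.
            if child.tail:
                body.append(child.tail)

    walk(element)
    text = "".join(body).strip()
    if notes:
        text += "\n\n" + "\n".join(f"[{label}] {note}" for label, note in notes)
    return text


# Hypothetical markup; the wording comes from the example above.
SAMPLE = (
    "<p>So there are several words common to the Turks, Germans,"
    '<note place="marg">Boxhorn. Origin. Gallic. cap. 6. &amp; 8.</note>'
    " Greeks, French, sometimes of the same, and sometimes of several"
    " significations;</p>"
)
page = ET.fromstring(SAMPLE)
with_notes = page_text(page)
without_notes = page_text(page, include_notes=False)
print(with_notes)
```

Because the function does not look at the note's type, the same approach would apply to marginal, inline, and end notes alike.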

rlskoeser commented 2 weeks ago

Following on your side note: I reindexed all the eebo-tcp content and I'm getting page results for the ones I've tested, so hopefully that's no longer a problem.

mnaydan commented 2 weeks ago

@rlskoeser thanks, the reindexing seems to have solved the problem I was encountering earlier.

I agree that formatting the table any other way seems like overkill. Even with how it's formatted now, if, say, "A golden tree, laden with fruit" were a poetry excerpt, we'd still be able to find it as is, no?

I'm conflicted about the notes. I'd love to get EEBO pushed because I think it's ready, but I'm hesitant to omit the notes. For instance, footnote 3.46 on this volume has substantial prosodic discussion and even a poetry excerpt. I'd be fine with leaving it as is (appearing inline) for the web app, but the downstream effects on the full-text corpus have me wondering if it's worth the small effort to put them at the bottom of the page. What do you think?

rlskoeser commented 2 weeks ago

@mnaydan thanks for finding this. I wondered if there were any like this; I'd only seen the short notes, but I know you all have talked about poetry excerpts showing up in footnotes. I agree, this seems worth the effort to include the notes at the bottom of the page text. (This may be useful for other goals too.)

Can we use the same approach for all note types, including marginal notes?

mnaydan commented 2 weeks ago

@rlskoeser I don't see why not. I think transforming all note types to appear at the bottom of the page makes sense.

mnaydan commented 2 weeks ago

@rlskoeser do you want to close this issue and create a new one tracking the dev work needed for the change?

rlskoeser commented 2 weeks ago

@mnaydan you read my mind. Yes, that would be great.