check ppa subset of eebo-tcp xml contents (div types, inline elements, check for notes)

rlskoeser commented 3 months ago

rerun xqueries on ppa subset of eebo-tcp (div types; check for notes, empty tags, etc)

review instructions

[x] do these notes make sense and provide sufficient detail?

Next steps will be to determine which elements we want to handle differently when converting from TEI to plain text.

rlskoeser commented 3 months ago

inline elements:

- milestone
  - example xml: <MILESTONE N="5" UNIT="argument key no."/>
  - example display: "[argument key no. 5]" - https://quod.lib.umich.edu/e/eebo/A12229.0001.001/1:4.2?rgn=div2;view=fulltext
- figure; figure in a marginal note
- empty list item in a list of plays
- gap inside a stage tag (?) with place=marg (margin)
- lb tag (line beginning)
- empty paragraph tags
- empty table cell ->
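
To make the milestone example concrete, here is a minimal sketch of how a MILESTONE element could be rendered the way the EEBO-TCP site displays it. This is an illustration only, not what ppa-django currently does; the function name `milestone_text` is hypothetical, and it assumes the uppercase `N` and `UNIT` attributes shown in the example.

```python
import xml.etree.ElementTree as ET


def milestone_text(element: ET.Element) -> str:
    """Render a MILESTONE element as bracketed text (hypothetical helper)."""
    unit = element.get("UNIT", "").strip()
    number = element.get("N", "").strip()
    # Join only the attributes that are present, so a missing N or UNIT
    # does not leave stray whitespace inside the brackets.
    label = " ".join(part for part in (unit, number) if part)
    return f"[{label}]" if label else ""


milestone = ET.fromstring('<MILESTONE N="5" UNIT="argument key no."/>')
print(milestone_text(milestone))
```

Whether we want this bracketed text in the plain-text output at all is a separate decision.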

elements we're not handling:

table:
- example on page 899 https://quod.lib.umich.edu/e/eebo/A04632.0001.001/1:160?firstpubl1=1470;firstpubl2=1700;rgn=div1;singlegenre=All;sort=occur;subview=detail;type=simple;view=fulltext;q1=payre+of+naked+feet
- example search result https://test-prosody.cdh.princeton.edu/archive/A04632/?query=+payre+of+naked+feet

rlskoeser commented 3 months ago

Here's a list of all the unique div types that occur in our subset of documents (have not restricted it to page ranges for excerpts):

eebo_tcp_divtypes.txt

rlskoeser commented 3 months ago

Here are all the note types that occur in our subset:

marg (~ margin)
foot
foot1 (occurs once, in A52335 on <PB N="xxv" REF="163"/>; typo?)
inter
parend (~ paragraph end?)
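
For anyone who wants to repeat this kind of tally without XQuery, a rough Python equivalent could look like the sketch below. It is not the query that produced the list above; it uses the standard library `xml.etree.ElementTree`, the helper name `note_place_counts` is hypothetical, and the sample fragment is made up for illustration (only the first note comes from a real document).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up fragment for illustration; real documents would be parsed from file.
SAMPLE = """<TEXT><P>handling of that subject,<NOTE PLACE="marg">§. I.</NOTE>
I have here proposed <NOTE PLACE="foot">a footnote</NOTE> and
<NOTE PLACE="marg">another marginal note</NOTE></P></TEXT>"""


def note_place_counts(root: ET.Element) -> Counter:
    """Count NOTE elements by their PLACE attribute (hypothetical helper)."""
    # Notes without a PLACE attribute are grouped under "(none)".
    return Counter(note.get("PLACE", "(none)") for note in root.iter("NOTE"))


counts = note_place_counts(ET.fromstring(SAMPLE))
for place, total in counts.most_common():
    print(place, total)
```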

Here's an example marginal note:

xml: handling of that subject,<NOTE PLACE="marg">§. I.</NOTE> I have here proposed
eebo-tcp displays it as a linked footnote: https://quod.lib.umich.edu/e/eebo/A66045.0001.001/1:7.1?rgn=div2;view=fulltext
ppa currently includes note text inline: https://test-prosody.cdh.princeton.edu/archive/A66045/?query=%22handling+of+that+subject%22

Here's an example of an "inter" note (also displayed at end): https://quod.lib.umich.edu/e/eebo/A65112.0001.001/1:5?rgn=div1;view=fulltext#backDLPS2

Example of "parend" note, also displayed at end: see page 152 https://quod.lib.umich.edu/e/eebo/A29229.0001.001/1:8?rgn=div1;view=fulltext#backDLPS3

mnaydan commented 2 months ago

@rlskoeser thanks for these notes! I think it makes sense to me, and appreciate the detail. A few initial thoughts:

What does it mean that we're not handling tables? In the example from page 899 you provide, the PPA search result is kind of what I'd expect - string of text without the table formatting. Because the search still comes up, it doesn't seem like a problem to me, but maybe I'm missing something.
Are the "inter" notes and "parend" notes also displayed inline in PPA search? Would we force the text to appear somewhere else (end of page or end of text?), and would that be better? Because it appears in search, it doesn't seem like a problem to me because the user can click on the page image thumbnail for clarification. But I'm curious if you have an idea to make it better, and how easy/hard it would be to implement.
Side note that I tried to check how the "inter" note displays for A65112 and couldn't get any search results at all. Hopefully it's a simple issue like it needs to be reindexed? But thought I'd flag.

rlskoeser commented 2 months ago

ok! figured out a way to query Solr to see what the plain text looks like (I tried using the corpus export but that was too slow/inefficient). I'll share how these look in plain text in the format we're generating now to help us decide what action (if any) to take.

tables

Here's what the beginning of the table content that I linked above looks like (example p.899):


The Names.
The Symboles.

The Queene.
1.
EVPHORIS.
1.
A golden tree, laden with fruit.

Co. of Bedford.
AGLAIA.

La. Herbert.
2.
DIAPHANE.
2.
The figure Isocaedron of crystall.

Co. of Derby.
EVCAMPSE.

If we care about how the table looks in plain text, we might want rows to appear on a single line - but I'm honestly not sure we care about this for either the webapp or the nlp work. There are libraries that will generate nicely formatted plain text tables but that seems like overkill for both of our use cases.
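
If we did want one row per line, a minimal sketch might look like this. It assumes the table markup uses uppercase TABLE, ROW, and CELL elements, and the sample row layout is a guess reconstructed from the plain text above, not copied from the actual XML; `table_to_lines` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Guessed structure for illustration only; the real markup may differ.
SAMPLE_TABLE = """<TABLE>
<ROW><CELL>The Names.</CELL><CELL>The Symboles.</CELL></ROW>
<ROW><CELL>The Queene.</CELL><CELL>1.</CELL><CELL>EVPHORIS.</CELL>
<CELL>1.</CELL><CELL>A golden tree, laden with fruit.</CELL></ROW>
</TABLE>"""


def table_to_lines(table: ET.Element, separator: str = " | ") -> list[str]:
    """Return one line of text per table row (hypothetical helper)."""
    lines = []
    for row in table.findall("ROW"):
        # Collapse internal whitespace in each cell and skip empty cells.
        cells = [" ".join("".join(cell.itertext()).split()) for cell in row.findall("CELL")]
        lines.append(separator.join(cell for cell in cells if cell))
    return lines


for line in table_to_lines(ET.fromstring(SAMPLE_TABLE)):
    print(line)
```

This keeps each cell's text searchable exactly as it is now and only changes where the line breaks fall.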

Documenting for myself how I'm doing this in django shell in case we need to do this again:

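# query the Solr page index for the plain text of a single page
from ppa.archive.solr import PageSearchQuerySet
psq = PageSearchQuerySet()
# find page 899 of A04632 and also return the indexed content field
result = psq.filter(source_id='A04632', label=899).also('content')[0]
print(result['content'])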

rlskoeser commented 2 months ago

inline note

Here's an example of the text we're currently generating for an inline note (example p. 5). I'm putting the text of the note in bold.

another. So there are several words common to the Turks, Germans,**Boxhorn. Origin. Gallic. cap. 6. & 8.** Greeks, French, sometimes of the same, and sometimes of several significations; which is not sufficient to argue that all these were of the same Original.

We're currently including the text of the note inline with no spacing or indication that it's a note.

I think we should either omit the note entirely or put it at the end of the page with its number. Omitting is easier; putting it at the end of the page is a small effort.

rlskoeser commented 2 months ago

Following on your side note: I reindexed all the eebo-tcp content and I'm getting page results for the ones I've tested, so hopefully that's no longer a problem.

mnaydan commented 2 months ago

@rlskoeser thanks, the reindexing seems to have solved the problem I was encountering earlier.

I agree that formatting the table otherwise seems overkill. Even with how it's formatted now, say "A golden tree, laden with fruit" was a poetry excerpt, we'd be able to find it as is, no?

I'm conflicted about the notes. I'd love to get EEBO pushed because I think it's ready, but I'm hesitant to omit the notes. For instance, footnote 3.46 on this volume has substantial prosodic discussion and even a poetry excerpt. I'd be fine with leaving it as is (appearing inline) for the web app, but the downstream effects on the full-text corpus have me wondering if it's worth the small effort to put them at the bottom of the page. What do you think?

rlskoeser commented 2 months ago

@mnaydan thanks for finding this - I wondered if there were any like this; I'd only seen the short notes, but I know you all have talked about poetry excerpts showing up in footnotes. I agree, this seems worth the effort to include the notes at the bottom of the page text. (This may be useful for other goals too.)

Can we use the same approach for all note types, including marginal notes?

mnaydan commented 2 months ago

@rlskoeser I don't see why not. I think transforming all note types to appear at the bottom of the page makes sense.
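
As a starting point for that work, here is a rough sketch of moving every NOTE, whatever its PLACE, to the end of the page text. It is not the ppa-django implementation: the function name `text_with_notes_at_end` is hypothetical, the bracketed number left in the body and the "[n] note text" layout at the end are assumptions about the format we might want, and it assumes the XML for a single page has already been isolated.

```python
import xml.etree.ElementTree as ET


def text_with_notes_at_end(element: ET.Element) -> str:
    """Return page text with all notes moved to the end (hypothetical helper)."""
    body_parts: list[str] = []
    notes: list[str] = []

    def walk(node: ET.Element) -> None:
        if node.tag == "NOTE":
            # Collect the note text and leave only a numbered marker in the body.
            notes.append(" ".join("".join(node.itertext()).split()))
            body_parts.append(f"[{len(notes)}]")
            return
        if node.text:
            body_parts.append(node.text)
        for child in node:
            walk(child)
            # Text after a child element (including after a note) stays in the body.
            if child.tail:
                body_parts.append(child.tail)

    walk(element)
    body = "".join(body_parts).strip()
    if not notes:
        return body
    numbered = [f"[{i}] {note}" for i, note in enumerate(notes, start=1)]
    return body + "\n\n" + "\n".join(numbered)


page = ET.fromstring(
    '<P>common to the Turks, Germans,<NOTE PLACE="marg">Boxhorn. Origin. '
    "Gallic. cap. 6. &amp; 8.</NOTE> Greeks, French, sometimes of the same</P>"
)
print(text_with_notes_at_end(page))
```

Treating all note types the same way keeps the logic simple; if marginal notes later need different handling, the PLACE attribute is available on each NOTE element.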

mnaydan commented 2 months ago

@rlskoeser do you want to close this issue and create a new one tracking the dev work needed for the change?

rlskoeser commented 2 months ago

@mnaydan you read my mind. Yes, that would be great.

Princeton-CDH / ppa-django