Closed rlskoeser closed 2 months ago
inline elements:
<MILESTONE N="5" UNIT="argument key no."/>
stage
tag (?) with place=marg (margin)
lb tag (line beginning)
elements we're not handling:
Here's a list of all the unique div types that occur in our subset of documents (have not restricted it to page ranges for excerpts):
Here are all the note types that occur in our subset:
<PB N="xxv" REF="163"/>
; typo?)
Here's an example marginal note:
handling of that subject,<NOTE PLACE="marg">§. I.</NOTE> I have here proposed
Here's an example of an "inter" note (also displayed at end): https://quod.lib.umich.edu/e/eebo/A65112.0001.001/1:5?rgn=div1;view=fulltext#backDLPS2
Example of "parend" note, also displayed at end: see page 152 https://quod.lib.umich.edu/e/eebo/A29229.0001.001/1:8?rgn=div1;view=fulltext#backDLPS3
@rlskoeser thanks for these notes! I think it makes sense to me, and appreciate the detail. A few initial thoughts:
OK! I figured out a way to query Solr to see what the plain text looks like (I tried using the corpus export but that was too slow/inefficient). I'll share how these look in plain text in the format we're generating now to help us decide what action (if any) to take.
Here's what the beginning of the table content that I linked above looks like (example p.899):
The Names.
The Symboles.
The Queene.
1.
EVPHORIS.
1.
A golden tree, laden with fruit.
Co. of Bedford.
AGLAIA.
La. Herbert.
2.
DIAPHANE.
2.
The figure Isocaedron of crystall.
Co. of Derby.
EVCAMPSE.
If we care about how the table looks in plain text, we might want rows to appear on a single line - but I'm honestly not sure we care about this for either the webapp or the nlp work. There are libraries that will generate nicely formatted plain text tables but that seems like overkill for both of our use cases.
Documenting for myself how I'm doing this in django shell in case we need to do this again:
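# Run inside the Django shell (python manage.py shell)
from ppa.archive.solr import PageSearchQuerySet

psq = PageSearchQuerySet()
# Look up one page by source id and page label; also() requests the content field
result = psq.filter(source_id='A04632', label=899).also('content')[0]
print(result['content'])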
Here's an example of the text we're currently generating for inline note (example p. 5). I'm putting the text of the note in bold.
another. So there are several words common to the Turks, Germans,**Boxhorn. Origin. Gallic. cap. 6. & 8.** Greeks, French, sometimes of the same, and sometimes of several significations; which is not sufficient to argue that all these were of the same Original.
We're currently including the text of the note inline with no spacing or indication that it's a note.
I think we should either omit the note entirely or move it to the end of the page along with its number. Omitting is easier; putting it at the end of the page is a small effort.
Following on your side note: I reindexed all the eebo-tcp content and I'm getting page results for the ones I've tested, so hopefully that's no longer a problem.
@rlskoeser thanks, the reindexing seems to have solved the problem I was encountering earlier.
I agree that formatting the table otherwise seems overkill. Even with how it's formatted now, if "A golden tree, laden with fruit" were a poetry excerpt, we'd be able to find it as is, no?
I'm conflicted about the notes. I'd love to get EEBO pushed because I think it's ready, but I'm hesitant to omit the notes. For instance, footnote 3.46 in this volume has substantial prosodic discussion and even a poetry excerpt. I'd be fine with leaving it as is (appearing inline) for the web app, but the downstream effects on the full-text corpus have me wondering if it's worth the small effort to put them at the bottom of the page. What do you think?
@mnaydan thanks for finding this. I wondered if there were any like this; I'd only seen the short notes, but I know you all have talked about poetry excerpts showing up in footnotes. I agree, this seems worth the effort to include the notes at the bottom of the page text. (This may be useful for other goals too.)
Can we use the same approach for all note types, including marginal notes?
@rlskoeser I don't see why not. I think transforming all note types to appear at the bottom of the page makes sense.
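To make the proposal concrete, here is a minimal sketch of moving notes to the bottom of the page text. It is not the PPA implementation: it assumes the page is available as a well-formed XML fragment, uses Python's standard-library ElementTree, and numbers notes with a running counter rather than any number recorded in the TEI. SAMPLE_PAGE and page_text_with_endnotes are hypothetical names, and the sample wraps the marginal note example from earlier in this thread in a P element.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment: the marginal note example from this thread,
# wrapped in a <P> so that it parses as a single element.
SAMPLE_PAGE = (
    '<P>handling of that subject,<NOTE PLACE="marg">§. I.</NOTE> '
    "I have here proposed</P>"
)


def page_text_with_endnotes(page_xml):
    """Return page text with each NOTE replaced by [n] and note text appended at the end."""
    root = ET.fromstring(page_xml)
    notes = []

    def walk(element):
        # Build the text for one element, diverting NOTE content into `notes`.
        if element.tag == "NOTE":
            # Collapse internal whitespace in the note text.
            notes.append(" ".join("".join(element.itertext()).split()))
            return f"[{len(notes)}]"
        parts = [element.text or ""]
        for child in element:
            parts.append(walk(child))
            # The tail is the text that follows the child inside its parent.
            parts.append(child.tail or "")
        return "".join(parts)

    body = walk(root)
    endnotes = "\n".join(f"[{i}] {text}" for i, text in enumerate(notes, start=1))
    return body + ("\n\n" + endnotes if notes else "")


print(page_text_with_endnotes(SAMPLE_PAGE))
```

For the sample, this leaves a [1] marker where the note was and appends the note text after a blank line. Every NOTE is treated the same way regardless of its PLACE value, which matches the idea of using one approach for all note types.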
@rlskoeser do you want to close this issue and create a new one tracking the dev work needed for the change?
@mnaydan you read my mind. Yes, that would be great.
rerun xqueries on ppa subset of eebo-tcp (div types; check for notes, empty tags, etc)
review instructions
Next steps will be to determine which elements we want to handle differently when converting from TEI to plain text.
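As a rough cross-check on the xquery survey, the same tallies could be approximated in Python. This is a sketch under assumptions, not the xqueries referred to above: it assumes uppercase element names as in the examples in this issue, that div elements are named DIV1, DIV2, and so on with a TYPE attribute, and that the note "type" is the PLACE attribute. SAMPLE_DOC and survey are hypothetical names, and the sample data is made up.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Made-up document for illustration only; not taken from the corpus.
SAMPLE_DOC = """<TEXT>
<DIV1 TYPE="preface"><P>text<NOTE PLACE="marg">a</NOTE></P></DIV1>
<DIV1 TYPE="poem"><P>text<NOTE PLACE="inter">b</NOTE><NOTE PLACE="marg">c</NOTE></P></DIV1>
</TEXT>"""


def survey(doc_xml):
    """Count div TYPE values and note PLACE values in one document."""
    root = ET.fromstring(doc_xml)
    div_types = Counter()
    note_places = Counter()
    for element in root.iter():
        if element.tag.startswith("DIV"):
            # Assumption: DIV1, DIV2, ... carry a TYPE attribute.
            div_types[element.get("TYPE", "(none)")] += 1
        elif element.tag == "NOTE":
            # Assumption: the note type is recorded in PLACE.
            note_places[element.get("PLACE", "(none)")] += 1
    return div_types, note_places


div_types, note_places = survey(SAMPLE_DOC)
print(div_types.most_common())
print(note_places.most_common())
```

Summing the counters across files would give corpus-wide totals, and elements missing the attribute are counted under "(none)" so they can be reviewed separately.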