Write the mapping to read EEBO-TCP XML structure

mnaydan commented 5 months ago

updates after code review

[x] relabel pb as page beginning
[x] check TCP content for div2 types
[x] check if pages ever span div1 or div2 containers
[x] check for any other non-text elements like GAP
[x] remove divider character from text

rlskoeser commented 4 months ago

@mnaydan sharing some details based on my work with EEBO-TCP XML so far; we have some decisions to make about how we'll want to handle these.

Gaps. The xml includes markup where content is illegible or (weirdly) foreign; how do we want to handle these? Some examples:
```
... hey stand in Competit<GAP DESC="illegible" RESP="apex" EXTENT="1 letter" DISP="•"/>on with Truth ...
```
```
which are call'd <GAP DESC="foreign" DISP="〈 in non-Latin alphabet 〉"/>;
```
Right now I'm displaying the contents of the DISP attribute as is, but I'm not sure that's what we actually want. I can check what other kinds of gaps there are if that's helpful; my quick search showed these and then a "duplicate" for a repeated page.
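For discussion, a minimal sketch of the two obvious options: keep the `DISP` value as is, or swap in a uniform placeholder. It uses the standard library `xml.etree.ElementTree` rather than whatever the import code actually uses; the wrapping `<P>` element, the `text_with_gaps` helper, and the `[gap]` placeholder are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# First GAP example from above, wrapped in a hypothetical <P> element
SAMPLE = (
    '<P>Competit<GAP DESC="illegible" RESP="apex" '
    'EXTENT="1 letter" DISP="•"/>on with Truth</P>'
)


def text_with_gaps(element, gap_text=None):
    """Return the text of element, rendering each GAP as its DISP
    attribute, or as gap_text when one is given."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "GAP":
            parts.append(child.get("DISP", "") if gap_text is None else gap_text)
        else:
            # recurse so gaps nested in other elements are handled too
            parts.append(text_with_gaps(child, gap_text))
        # text that follows the child, before the next sibling
        parts.append(child.tail or "")
    return "".join(parts)


para = ET.fromstring(SAMPLE)
print(text_with_gaps(para))           # Competit•on with Truth
print(text_with_gaps(para, "[gap]"))  # Competit[gap]on with Truth
```

Passing an empty string as `gap_text` would drop gaps entirely, which might suit an NLP corpus better than either display option.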

I'm working with the P4 xml, which is what Paul Schaffner recommended - but we could see if the texts we're interested in are available in the P5 versions, which will be better quality text. I haven't searched exhaustively, but a quick check shows no GAP in the same set of content (the A2 files).

Divider. There are divider characters in the text to indicate line breaks in the original, or maybe they are only end-of-line hyphens and we don't have all the original line breaks. We'll need to decide how to handle them (convert back to hyphen and line break? strip out?). For example:

That being probable that for the most part and most usually happensto be; not simply, as some would have it to be; but as being that, which in those things that may be otherwise, has the same relation to Probable, as universal to parti∣cular. Of Signs there are some that have the same Relation one to another, as singular to U∣niversal; others, as something Universal to Par∣ticular.
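A rough sketch of the two options, assuming the divider is the single character copied from the sample paragraph above; `strip_divider` and `restore_line_breaks` are hypothetical helper names, not existing code.

```python
DIVIDER = "∣"  # copied from the sample paragraph; assumed to be the only divider


def strip_divider(text):
    """Option 1: remove dividers so split words are rejoined."""
    return text.replace(DIVIDER, "")


def restore_line_breaks(text):
    """Option 2: turn each divider back into a hyphen plus a line break."""
    return text.replace(DIVIDER, "-\n")


sample = "as universal to parti" + DIVIDER + "cular."
print(strip_divider(sample))        # as universal to particular.
print(restore_line_breaks(sample))
```

Stripping gives whole words for search and NLP; restoring hyphens is only faithful if the dividers really mark every original end-of-line hyphen, which is the open question.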

Here's the same paragraph in the P5 markup: (I added line breaks to make it more readable on github)

That being probable that for the moſt part and moſt uſually happensto be; not ſimply, as ſome 
would have it to be; but as being that, which in thoſe things that may be otherwiſe, has the
ſame relation to Probable, as univerſal to parti<g ref="char:EOLhyphen"/>cular. Of Signs 
there are ſome that have the ſame Relation one to another, as ſingular to U<g ref="char:EOLhyphen"/>niverſal; 
others, as ſomething Univerſal to Par<g ref="char:EOLhyphen"/>ticular. Of theſe ſome are neceſſary, 
which are call'd <gap reason="foreign">
                 <desc>〈 in non-Latin alphabet 〉</desc>
              </gap>; but ſuch, as not neceſſary, have no name according to this Diſtinction. I
call thoſe neceſſary, out of which a Syllogiſm is Compos'd; which is therefore call'd an Argument. 
For when they believe there can be no contradiction of the thing Propounded, then they think 
they have brought a <gap reason="foreign">

Section types: some divs in the xml have types like license, title page, coat of arms, dedication, preface. I was thinking to carry these through as page labels like we do with the HathiTrust tags, but I mention in case you want to filter some kinds of content out entirely.
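To make the label-or-filter choice concrete, here is a small sketch under some assumptions: sections are `DIV1` elements with a `TYPE` attribute (uppercase, as in the P4 `GAP` examples), and the sample document, the `labeled_sections` helper, and the excluded type are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; real files will have much more structure
SAMPLE = """<BODY>
<DIV1 TYPE="title page"><P>A Treatise</P></DIV1>
<DIV1 TYPE="dedication"><P>To the Reader</P></DIV1>
<DIV1 TYPE="license"><P>Imprimatur</P></DIV1>
</BODY>"""


def labeled_sections(root, exclude=frozenset()):
    """Yield (label, text) for each DIV1, skipping excluded types."""
    for div in root.iter("DIV1"):
        label = div.get("TYPE", "")
        if label in exclude:
            continue
        yield label, "".join(div.itertext()).strip()


root = ET.fromstring(SAMPLE)
for label, text in labeled_sections(root, exclude={"license"}):
    print(label, "|", text)
```

With an empty `exclude` set every type is carried through as a label, which matches the current plan; filtering would be a one-line change later.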

rlskoeser commented 4 months ago

Ran some xqueries across all the xml files in the phase 1 A0 set to try to answer the questions Laure raised in code review.

Attaching two text files listing the div1 and div2 types used across this set of files.

eebo_tpc_div1_types.txt eebo_tpc_div2_types.txt
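The lists came from XQuery, but for anyone who wants to rerun the check without an XML database, a rough Python equivalent might look like the following. The `div_types` helper and the sample documents are hypothetical, and it assumes the types live in a `TYPE` attribute on `DIV1`/`DIV2` elements.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def div_types(roots, tag="DIV1"):
    """Count the TYPE values used on the given div tag across documents."""
    counts = Counter()
    for root in roots:
        for div in root.iter(tag):
            counts[div.get("TYPE", "(none)")] += 1
    return counts


# Two tiny made-up documents standing in for parsed TCP files
docs = [
    ET.fromstring('<BODY><DIV1 TYPE="preface"><DIV2 TYPE="chapter"/></DIV1></BODY>'),
    ET.fromstring('<BODY><DIV1 TYPE="preface"/><DIV1 TYPE="text"/></BODY>'),
]
print(div_types(docs))
print(div_types(docs, tag="DIV2"))
```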

Have confirmed that pages can span div1 boundaries.
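Since pages can cross div1 boundaries, page text can't be collected per container. A sketch of the alternative, walking the document in order and starting a new page at every `PB`, is below; the sample XML, the `N` attribute on `PB`, and the `pages` helper are assumptions for illustration, not the merged mapping.

```python
import xml.etree.ElementTree as ET

# Made-up sample where page 2 starts in one DIV1 and continues into the next
SAMPLE = (
    "<BODY>"
    '<DIV1 TYPE="preface"><PB N="1"/><P>first page</P>'
    '<PB N="2"/><P>end of preface</P></DIV1>'
    '<DIV1 TYPE="text"><P>start of text</P>'
    '<PB N="3"/><P>third page</P></DIV1>'
    "</BODY>"
)


def pages(root):
    """Group text by page in document order, ignoring div containers.
    Text before the first PB is dropped in this sketch."""
    result = []  # list of (page number, [text chunks])

    def walk(element):
        if element.tag == "PB":
            result.append((element.get("N"), []))
        elif element.text and result:
            result[-1][1].append(element.text)
        for child in element:
            walk(child)
            # tail text follows the child, so it belongs to the current page
            if child.tail and result:
                result[-1][1].append(child.tail)

    walk(root)
    return [
        (number, " ".join(c.strip() for c in chunks if c.strip()))
        for number, chunks in result
    ]


for number, text in pages(ET.fromstring(SAMPLE)):
    print(number, "|", text)
```

In the sample, page 2 ends up with text from both the preface and the main text, which is the case a per-div approach would get wrong.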

rlskoeser commented 4 months ago

There are other tags like GAP that can occur but do not contain text. I don't think any of them impact how we pull text from the xml. The ones I found:

milestone
figure

My query for empty tags also turned up gaps inside note, seg, and abbr tags. At least one of the notes seems to be a marginal note (place='marg') with mathematical notation (indicated by the gap type).
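A rough Python version of the empty-tag check, for rerunning on the subset later. It treats an element as empty when it has no children and no non-whitespace text, and reports each one with its parent tag so gaps inside notes stand out. The `empty_elements` helper, the sample markup, and the attribute names in it are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Made-up sample combining the kinds of empty tags listed above
SAMPLE = (
    "<P>some text<MILESTONE/>"
    '<NOTE PLACE="marg"><GAP DESC="math"/></NOTE>'
    "<FIGURE/></P>"
)


def empty_elements(root):
    """Count (parent tag, child tag) pairs for elements with no text
    and no children."""
    counts = Counter()
    for parent in root.iter():
        for child in parent:
            if len(child) == 0 and not (child.text or "").strip():
                counts[(parent.tag, child.tag)] += 1
    return counts


for (parent, tag), count in sorted(empty_elements(ET.fromstring(SAMPLE)).items()):
    print(f"{tag} in {parent}: {count}")
```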

I'll run these queries again once I have a subset of only the EEBO-TCP volumes we care about.

rlskoeser commented 4 months ago

Mary and I discussed and decided that the other refinements can wait; the additional page types don't affect the webapp and we don't know for sure that we'll use them in the NLP corpus. We can always circle back to refine them when we know if we need them.

We'll need to decide if we want to do anything different with gap text, but we can make that decision and refine once we've looked at how it shows up in our subset of content.

rlskoeser commented 4 months ago

This functionality isn't directly user testable, only as part of the import; since the mapping is merged into develop I'm going to close this.

Princeton-CDH / ppa-django

Write the mapping to read EEBO-TCP XML structure #641
