earlyprint / earlyprint.github.io

Homepage for the EarlyPrint Project: Curating and Exploring Early Printed English
https://earlyprint.org/

Rejoining Anthologized Drama Texts #14

Open jrladd opened 4 years ago

jrladd commented 4 years ago

A small number of drama texts were split from their original containing anthologies (folios, etc.) as part of the SHC project. While it made sense for that project to have individual files for dramas, the EarlyPrint files should probably follow the printed book (and the EEBO-TCP ids). We need to develop a plan for how to rejoin these files.

However, there are many scholars who are looking for specific texts rather than whole books. How do we make it easy for people to find and work with Macbeth (first published in the 1623 folio) or Lycidas (first published in Justa Edouardo King)? We should be thinking about these possibilities alongside the work of rejoining.

martinmueller39 commented 4 years ago

I’m going to make an argument in favour of leaving plays in their current state. I’m not passionate about it, but either decision involves weighing costs and benefits.

The argument in favour of letting things stand where they are now is that plays are different. They exist in the minds of readers as individual “works” (FRBR). In the case of English texts there is also the fact that there is a century of scholarly labour (Greg, Bentley, Harbage, Wiggins, the DEEP folks) that has clustered around individual plays with their own numbers and bibliographical histories. If we ever add Shakespeare to EarlyPrint we’d probably look for a modern transcription (e.g. ISE) rather than work up the TCP text.

For people who are mainly interested in drama, leaving things as they are may be an advantage. There isn’t much of a cost for the folks who work mainly outside of drama. The one exception is the corpus of quasi-plays by Margaret Cavendish. There the original decision to treat these ‘scenes’ as full-length plays was a toss-up in the first place.

There are some edge cases. Combining the half dozen two-part plays would make sense. Gascoigne’s Jocasta and Supposes and Daniel’s Cleopatra and Philotas come from volumes that also have prose or poetry. William Alexander’s Four Monarchical Tragedies were never published as separate texts. And the 1581 edition of Seneca’s tragedies was clearly published as an oeuvre. The 1647 folio of Beaumont and Fletcher is definitely a Big Book with lots of paratext. Its content is a motley assembly of plays, some of them by neither Beaumont nor Fletcher.

From a technical perspective things are quite simple. Most of these TCP texts used the element. Splitting them was a simple task: it was just a matter of turning each into a separate TEI file. The element became a text of its own with a “_00” extension. The page numbers and xmlids of the separate texts follow the image numbers of the EEBO scans.

I think that the benefits of this exception outweigh the costs, but I see the arguments on the other side.

[edited to remove copy of prior email]

jfloewen commented 4 years ago

We’ve talked about this a lot in St. Louis. The table of contents function on ProQuest works pretty well for plays; it plainly pulls divs from the XML. I strongly urge that we build the same into the Reader, and contrive other ways to provide a general author and title index that will bring readers seeking individual works to the relevant works.

I beg the group to protect the TCP file and id as the primary unit of access. Please, let’s stick with the plan to rejoin things.

[edited to remove copy of prior email]

dknoxwu commented 4 years ago

A strong argument for undoing the splitting is that it will be the best way to ensure consistent treatment of a corpus of early modern English.

The TCP transcribed printed volumes, and TCP and ESTC metadata relate transcriptions to a properly bibliographic rather than a textual history. Even scholars of drama who are interested in EarlyPrint linguistic annotation may want to be able to explore lemmas and orthographic forms in a way that builds on, rather than frustrates, all they know about print history and the organization of the TCP.

We have heard of scholars who are interested in how dedications illuminate social networks (hi, John). Transcriptions of paratexts are important in reading a single volume but may be no less important for researchers looking for linguistic patterns that happen to be attested in paratexts. Scholars of sermons will already have that information at hand because we aren't tampering with it. Why should that kind of research be at a disadvantage solely in the case of drama?

Maintaining the integrity of the volume also can prevent critical information from disappearing. Right now, the Phase I repository doesn't do full justice to the text of many TCP volumes that included SHC-curated dramas. Gascoigne's A Hundreth Sundrie Flowres (A01513) is now absent except for two plays. Fulke Greville's works published in A02226 are represented by just one play, "Alaham," while treatises and letters are missing, along with a second play, "Mustapha." Researchers interested in verse may be as interested in verse in the treatises as in the plays, and should be able to count on a consistently curated, MorphAdorned organization of the whole.

For the moment a list of split files (identified by underscores in the filename) is available here, drawing on the Bitbucket repository: https://ada.artsci.wustl.edu/register/inventorysplit . Scanning that for gaps in sequence is one way to get an impressionistic sense that non-dramatic segments of Phase I TCP texts have gotten lost somewhere (e.g. A03241_19.xml has no siblings with suffixes below 19).
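
The gap scan described above could be automated along these lines. This is only a minimal sketch: the filename pattern (TCP id, underscore, numeric suffix, `.xml`) is inferred from the one example given, the assumption that suffixes start at 00 comes from the earlier comment about the "_00" extension, and the `Z99999` entries are hypothetical names invented for illustration, not real files.

```python
import re
from collections import defaultdict

# Illustrative input only. A03241_19.xml is the example cited in the thread;
# the Z99999 names are hypothetical and do not refer to real TCP files.
SAMPLE_FILENAMES = [
    "A03241_19.xml",
    "Z99999_01.xml",
    "Z99999_02.xml",
    "Z99999_04.xml",
]

# Assumed naming scheme: <TCP id>_<numeric suffix>.xml
SPLIT_NAME = re.compile(r"^(?P<tcp_id>[A-Z]\d+)_(?P<part>\d+)\.xml$")


def find_sequence_gaps(filenames, first_part=0):
    """Map each TCP id to the suffixes missing between first_part and its highest suffix."""
    parts_by_id = defaultdict(set)
    for name in filenames:
        match = SPLIT_NAME.match(name)
        if match:  # names that do not fit the assumed pattern are ignored
            parts_by_id[match.group("tcp_id")].add(int(match.group("part")))

    gaps = {}
    for tcp_id, parts in parts_by_id.items():
        missing = [n for n in range(first_part, max(parts) + 1) if n not in parts]
        if missing:
            gaps[tcp_id] = missing
    return gaps


gaps = find_sequence_gaps(SAMPLE_FILENAMES)
for tcp_id, missing in sorted(gaps.items()):
    print(tcp_id, "is missing suffixes", missing)
```

A gap found this way is only a hint that a non-dramatic segment may be missing; it cannot tell whether the segment was lost or simply never split out under that suffix.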

craigberry commented 4 years ago

I think we would all like to have a corpus that has what the relational database theorists call data independence, i.e., how we store it is not necessarily how we view it.  We would like to satisfy folks interested in the publishing event and also folks interested in the individual work independently from where the document boundaries lie.  This is a worthy goal and I fully support it, but I don't think implementing it will be easy.  The current text filter and the xenoData sections upon which it is based will need a significant redesign to handle sub-documents as well as documents, treating them as equivalent for one use case, but making one fade into the background for a different use case.  This has implications for sharing the text filter among applications and the related desire to save a bookbag, which is something we've discussed doing (are we sharing a list of document identifiers, or work identifiers, or what?).

I also think we would want to preserve the extra curation work that has been done for the split texts.  Taking a random example from Doug's list of split texts, I looked up Richard Brome's Five Plays in the original TCP version:

https://raw.githubusercontent.com/textcreationpartnership/A77565/master/A77565.xml

The header lists 5 different STC numbers and there are 5 different text elements in the document, but nothing except the order in which they appear links a particular STC number with the relevant play, making it difficult to query anything by STC number.  The split texts have this level of metadata inherently disambiguated since there is only one STC number per document.  This is where we would need to redesign the xenoData such that it can refer to the document as a whole or to some part of the document (at least text element level but perhaps other parts as well).
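
To make the problem concrete, here is a rough sketch of what "linked only by order" means in practice. The XML below is a hypothetical, heavily simplified stand-in (no namespace, invented identifiers and element nesting), not the actual structure of A77565.xml, and `pair_by_order` is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: two identifiers in the header, two nested
# text elements, and nothing that ties one to the other except position.
SAMPLE = """
<TEI>
  <teiHeader>
    <idno type="STC">HYPO-1</idno>
    <idno type="STC">HYPO-2</idno>
  </teiHeader>
  <text>
    <group>
      <text xml:id="play-a"><body><head>First play</head></body></text>
      <text xml:id="play-b"><body><head>Second play</head></body></text>
    </group>
  </text>
</TEI>
"""

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def pair_by_order(root):
    """Pair header identifiers with nested text elements purely by position."""
    stc_numbers = [el.text for el in root.iter("idno") if el.get("type") == "STC"]
    text_ids = [el.get(XML_ID) for el in root.findall("./text/group/text")]
    if len(stc_numbers) != len(text_ids):
        # Without equal counts, positional pairing is not even a plausible guess.
        raise ValueError("identifier count does not match text count")
    return list(zip(stc_numbers, text_ids))


paired = pair_by_order(ET.fromstring(SAMPLE))
print(paired)
```

The pairing works here only because the counts happen to match and the order happens to be right, which is exactly the fragility that an explicit pointer from metadata to text element would remove.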

Kate and one or more of her summer crews put a great deal of work into genre identification for the drama. These genres are work-specific and don't mean much at the book level. Combining the split texts willy-nilly without a way to record a genre for part of a document would erase this work. I don't think we want to do that.

There may be other things that have been done for the split texts that I'm not thinking of right now.

So I think the way forward would look something like this:

1.) Redesign the epHeader xenoData to accommodate more than one text in a document.  It might be as simple as storing multiple epHeader sections, one for each text, with some kind of pointer to the xml:id of the relevant text. But there would surely be some metadata that refers to the book as a whole, so we would need to allow for that as well.

2.) Build the new metadata for the documents that have not previously been split.  This will involve work to identify titles, STC numbers, and possibly other things for the relevant sub-documents.

3.) Combine the currently split texts (restoring any missing chunks), flagging the metadata so it's clear which sub-document each set refers to since it will no longer refer to an entire document.

4.) Rewrite the text filter and text browse features to accommodate the new data design.  It might start with a "Filter By" option where the user would have to choose between "work" and "book."  But of course we'll never agree on what the default should be so we'll need sticky user preferences :-).
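
As a very rough sketch of steps 1 and 4, the data design might look something like the following. Every name here (`WorkRecord`, `BookRecord`, `browse`, the `Z00001` and `Z00002` ids, the placeholder titles and genres) is hypothetical and chosen for illustration; none of it reflects the existing epHeader schema or text filter code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkRecord:
    """Metadata for one text inside a document (hypothetical shape)."""
    target_id: str   # xml:id of the text element this record describes
    title: str
    genre: str = ""  # work-level genre, meaningless at the book level
    stc: str = ""


@dataclass
class BookRecord:
    """Metadata that applies to the printed book as a whole."""
    tcp_id: str
    title: str
    works: List[WorkRecord] = field(default_factory=list)


def browse(books, unit="book"):
    """Return (identifier, title) rows at the chosen granularity."""
    if unit == "book":
        return [(b.tcp_id, b.title) for b in books]
    if unit == "work":
        rows = []
        for b in books:
            if b.works:
                rows.extend((f"{b.tcp_id}#{w.target_id}", w.title) for w in b.works)
            else:
                # A book with no sub-documents counts as a single work.
                rows.append((b.tcp_id, b.title))
        return rows
    raise ValueError(f"unknown unit: {unit!r}")


books = [
    BookRecord("Z00001", "A collection of plays", works=[
        WorkRecord("play-a", "First play", genre="comedy"),
        WorkRecord("play-b", "Second play", genre="tragedy"),
    ]),
    BookRecord("Z00002", "A single sermon"),
]

print(browse(books, unit="book"))
print(browse(books, unit="work"))
```

The `unit` argument stands in for the "Filter By" choice; a saved bookbag would then need to record which kind of identifier each entry holds.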

pibburns commented 4 years ago

Craig A. Berry wrote on 10/15/2019 4:35 PM:

There may be other things that have been done for the split texts that I'm not thinking of right now.

Plays are a little world unto themselves in many ways.  After all, they were intended to be performed, not just read.

One of the student crews started (but I don't believe completed) gathering the URLs for the DEEP project entries for each of the plays. The intent was to add the DEEP URL to each play's epHeader. That is worth completing, and presents a useful project for motivated students.

http://deep.sas.upenn.edu/

Personally I am rather interested in pointing to freely available performances of the incidental music for the plays for which the music survives. Perhaps we might be able to interest folks in the music school to produce such performances. A good example would be Henry Purcell's music for the 1695 revival of Aphra Behn's Abdelazar.

Beyond that, I concur with the rest of Craig's comments.   Fixing up the texts at token and structural markup levels has been (and continues to be) an enormous task.    We should not forget that there remains a large amount of work to do at the metadata level.   And of course, Phase II looms and a lot of work we did for Phase I needs repeating in kind for the Phase II texts.

-- Philip R. "Pib" Burns     Research Computing Services     Northwestern University, Evanston, IL.  USA     pib@northwestern.edu