FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

Original HTML Frankenstein Texts? #3

Closed by ebeshero 7 years ago

ebeshero commented 7 years ago

@scottbot @Rikkm @raffazizzi @daverett @fraistat Hi Raff, Dave, and Neil: When we spoke on Skype just before Halloween, we discussed how we probably ought to be working from the original HTML texts, which hold variant markup and annotations, etc. It's this edition, right? http://knarf.english.upenn.edu/

I could download the pages from the website, but I imagine you have them in a tidy bundle over there: Can you all over at MITH push the HTML files to this repo in the next week or so, so we can get started? Or send them to us any way that's convenient?

fraistat commented 7 years ago

Raff--Would you please follow up on this?--Thanks, Neil

raffazizzi commented 7 years ago

@daverett do you have these files?

daverett commented 7 years ago

I've just put together a package exporter for Frankenstein HTML directly from the current Romantic Circles database. I think this should probably be the fileset to work from instead of the original UPenn edition (though that is the right one, @ebeshero).

The dump contains all the chapters, letters, prefaces, etc. that comprise both the 1818 and 1831 editions of the novel. It is in a custom XML format with the HTML body wrapped in CDATA. I could work on an XSLT to split them apart into separate files if you like, but I thought this might be something you could work with just as well. Additionally, I'd like to keep the "nid" element present as a unique identifier of sorts, so that if and when we are able to create XML files that can take the place of the current HTML-only Frankenstein files, I'll have a solid way of overwriting existing content. Hope that makes sense.

Finally, the document is in a simple raw order starting with the title page of 1818 and ending with the last chapter of 1831.
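
As an alternative to the XSLT mentioned above, here is a minimal Python sketch of splitting a dump like this into one HTML file per section. Only the "nid" element name comes from this thread; the `nodes`, `node`, `title`, and `body` element names, the sample nid values, and the placeholder HTML are assumptions for illustration, so the real dump will need the names adjusted.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical sample. Only <nid> is named in the thread; every other
# element name and all of the content below are invented placeholders.
SAMPLE = """<nodes>
  <node>
    <nid>101</nid>
    <title>Placeholder title page</title>
    <body><![CDATA[<h1>Frankenstein</h1><p>Volume I</p>]]></body>
  </node>
  <node>
    <nid>102</nid>
    <title>Placeholder preface</title>
    <body><![CDATA[<p>Placeholder preface text.</p>]]></body>
  </node>
</nodes>"""


def split_dump(xml_text):
    """Return a list of (nid, html) pairs in document order."""
    root = ET.fromstring(xml_text)
    sections = []
    for node in root.iter("node"):
        nid = node.findtext("nid", default="").strip()
        # The parser exposes CDATA content as ordinary element text,
        # so the wrapped HTML comes back as a plain string.
        html = node.findtext("body", default="")
        sections.append((nid, html))
    return sections


def write_sections(sections, out_dir):
    """Write each section to <out_dir>/<nid>.html and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for nid, html in sections:
        path = out / f"{nid}.html"
        path.write_text(html, encoding="utf-8")
        paths.append(path)
    return paths
```

Naming each output file after its nid keeps the identifier attached to the content, which is the property the overwrite workflow described above would rely on.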

daverett commented 7 years ago

RC-Frankenstein.xml.zip

ebeshero commented 7 years ago

@daverett Thanks, Dave! I am finally looking at your file and have some questions. I'm curious about the requirements of your database, and whether/how your database could work compatibly with TEI P5 app crit markup, which we're planning to use in collating the 1818 with the 1831 edition. In collating the two editions, we'd be weaving them together into a single versioned text that represents where 1831 diverges from 1818. Basically the structure of the XML wouldn't be organized like this: it would be like shuffling the two editions together into a giant genetic edition. We were planning to add stand-off markup to point to relevant locations in the manuscript draft notebooks at the Shelley-Godwin Archive, and I suppose we could do something similar if we had to keep the 1831 text separate, but I'm not sure that's the best plan here. (Here's a sample of how the markup would look: https://github.com/ebeshero/Pittsburgh_Frankenstein/issues/1 )

So, if we're collating this way, I'm not sure how it would make sense to preserve the content/function of the <nid> elements that mark off sections of the 1818 and 1831 texts. Would it be okay to preserve them in the 1818 text, as we fold the pieces of the 1831 text into that document? If we do that, I could see changing the <nid> into TEI <idno> elements, or values of @xml:ids set on structural pieces of the 1818 text.
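
A minimal Python sketch of the two options just mentioned, for comparison. The `rc-nid-` prefix and the `type="nid"` attribute value are hypothetical choices, not project conventions. The prefix is there because an xml:id value has to be a valid XML name, and a name cannot begin with a digit, so a bare numeric nid would not be usable on its own.

```python
import re
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
TEI_NS = "http://www.tei-c.org/ns/1.0"


def nid_to_xml_id(nid, prefix="rc-nid-"):
    """Turn a numeric nid into a string usable as an xml:id value.

    The prefix is a hypothetical choice; any letter-initial prefix works.
    """
    nid = str(nid).strip()
    if not re.fullmatch(r"\d+", nid):
        raise ValueError(f"expected a numeric nid, got {nid!r}")
    return f"{prefix}{nid}"


def tei_div_with_id(nid):
    """Option 1: carry the nid as an @xml:id on a structural TEI element."""
    div = ET.Element(f"{{{TEI_NS}}}div")
    div.set(f"{{{XML_NS}}}id", nid_to_xml_id(nid))
    return div


def tei_idno(nid):
    """Option 2: carry the nid unchanged as the content of a TEI <idno>."""
    # The "nid" value for @type is an assumption for illustration.
    idno = ET.Element(f"{{{TEI_NS}}}idno", {"type": "nid"})
    idno.text = str(nid).strip()
    return idno
```

The `<idno>` route keeps the original number intact, while the @xml:id route makes each section directly addressable by pointers, which may matter more once stand-off markup is involved.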

One thing I'm curious about is how the collation and alignment is signalled on the site with the old 1990s frames. So, here's a sample collation view: http://knarf.english.upenn.edu/Colv1/fprfc1.html Is the way to build that original collation coded or signalled inside the database file you've sent here? Is there some code in here signalling relationships across the <nid> elements to create the collation? Or something else going on?

We began processing the 1818 and 1831 texts with collateX to generate as best we can the precise points of deviation and alignment between the two editions--so we were expecting to be totally redoing the original collation. But I am wondering now if it might make sense to work with the original edition's collation frames as discrete "chunks" to feed into collateX, since it probably works better over smaller pieces at a time. If you can tell us more about how the old frame-collation was constructed or generated that might be helpful!
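
To illustrate the chunking idea, here is a minimal Python sketch that compares the two editions one pre-aligned chunk at a time. It uses the standard library's `difflib` as a stand-in and does not call collateX at all; the chunk labels and whitespace tokenization are assumptions, and a real run would hand each pair of chunks to collateX instead.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Split on whitespace; a real collation would normalize more carefully."""
    return text.split()


def diff_chunk(text_1818, text_1831):
    """Return the non-matching token spans between two versions of one chunk.

    Each result is (operation, 1818 tokens, 1831 tokens), where operation is
    one of "replace", "delete", or "insert".
    """
    a, b = tokenize(text_1818), tokenize(text_1831)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            variants.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return variants


def collate_by_chunk(chunks_1818, chunks_1831):
    """Compare chunks pairwise; both dicts are keyed by a shared chunk label.

    Chunks present in only one edition are skipped here and would need
    separate handling.
    """
    shared = [label for label in chunks_1818 if label in chunks_1831]
    return {
        label: diff_chunk(chunks_1818[label], chunks_1831[label])
        for label in shared
    }
```

Because each comparison is confined to one chunk, a misalignment cannot spread beyond that chunk's boundaries, which is the main reason to expect smaller pieces to behave better.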

daverett commented 7 years ago

Hi @ebeshero, I don't wish to overcomplicate things with the preservation of the ids, especially given your explanation. I was under the impression that one final iteration of the TEI work would be one-file-per-chapter reading text versions for mounting directly at RC, effectively replacing the HTML-only files we already have. In that case, preserving the nids would be useful to overwrite the existing files, but it is merely for convenience. It would be trivial to add those back in later, should we need to.

As for the frames on the old site, I'm not sure I can give you a good answer. Digging around our old filesystem, I can't find those anywhere, probably because we didn't reproduce these collations in the RC edition. My strong hunch, though, is that all the collations were marked up entirely manually. I don't think there's any reason to expect that those original 90s frames have any kind of signalling relationships, nor do our files. Just raw HTML of the text of the novel.

I can get together chapter-by-chapter HTML files for each edition, if that would help. I'm operating under the assumption, by the way, that the Frankenstein edition as produced in RC may have some corrections that the original does not, which is why I'm suggesting using the HTML from our database rather than the old.

daverett commented 7 years ago

Hi again @ebeshero, just responding quickly here with an archive of HTML files I generated from the XML dump. Maybe this will be of use.

A quick note about the filenames: As you can probably glean, each is a series of integers, with the first standing for the edition, the second the volume, and the third a raw order no., i.e. from the title page in 1818 to the last chapter in 1831.
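
A minimal Python sketch of reading that naming scheme back out of the filenames. The example name `1818-1-003.html`, the hyphen separators, and the use of the year as the edition number are assumptions for illustration; the sketch only relies on each name containing exactly three runs of digits.

```python
import re
from pathlib import Path
from typing import NamedTuple


class SectionFile(NamedTuple):
    edition: int  # first integer in the filename
    volume: int   # second integer
    order: int    # third integer: raw order across both editions
    name: str     # the original filename


def parse_name(filename):
    """Extract (edition, volume, order) from a filename's three integers."""
    numbers = re.findall(r"\d+", Path(filename).stem)
    if len(numbers) != 3:
        raise ValueError(f"expected three integers in {filename!r}")
    edition, volume, order = (int(n) for n in numbers)
    return SectionFile(edition, volume, order, filename)


def sort_by_order(filenames):
    """Return parsed files sorted by the raw order number."""
    return sorted((parse_name(f) for f in filenames), key=lambda s: s.order)
```

Sorting on the parsed integer avoids the usual trap of string sorting, where `10` would land before `9` if the numbers are not zero-padded.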

RC-Frankenstein-HTML.zip

ebeshero commented 7 years ago

@daverett Thanks for digging up these files! I take your point about the old HTML and the likelihood of its lacking corrections, so we'll work with what you've sent.