FrankensteinVariorum / fv-collation

First-stage collation processing in the Frankenstein Variorum Project. For post-processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

Frankenstein Source Texts: Checking and Versioning #1

Closed ebeshero closed 7 years ago

ebeshero commented 7 years ago

@Rikkm @raffazizzi @scottbot

ebeshero commented 7 years ago
<p>I am by birth a
<app>
    <rdg wit="#c56"><ptr target="http://url.shelley-godwin"/></rdg>
    <rdg wit="#p1818">Genevese</rdg>
    <rdg wit="#p1823">Scotsman</rdg>
    <rdg wit="#p1831">Martian</rdg>
</app>.
</p>
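
A rough sketch of how an apparatus entry like the one above might be generated, using Python's standard `xml.etree.ElementTree`. The helper name `build_app` and its arguments are assumptions made for illustration, and the readings are the placeholder values from the sample, not real variants.

```python
import xml.etree.ElementTree as ET


def build_app(readings, pointers=None):
    """Return an <app> element holding one <rdg> per witness.

    readings: maps a witness id (e.g. "p1818") to its reading text.
    pointers: maps a witness id to a URL that a <ptr> should target
              (hypothetical shape, for witnesses we only point into).
    """
    app = ET.Element("app")
    for wit, target in (pointers or {}).items():
        rdg = ET.SubElement(app, "rdg", {"wit": f"#{wit}"})
        ET.SubElement(rdg, "ptr", {"target": target})
    for wit, text in readings.items():
        rdg = ET.SubElement(app, "rdg", {"wit": f"#{wit}"})
        rdg.text = text
    return app


# Rebuild the sample paragraph: text, then the <app>, then the closing period.
p = ET.Element("p")
p.text = "I am by birth a "
app = build_app(
    {"p1818": "Genevese", "p1823": "Scotsman", "p1831": "Martian"},
    pointers={"c56": "http://url.shelley-godwin"},
)
app.tail = "."
p.append(app)

print(ET.tostring(p, encoding="unicode"))
```

The period after `</app>` is stored as the element's `tail`, which is how ElementTree represents text that follows a child element inside mixed content.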
ebeshero commented 7 years ago

@Rikkm @raffazizzi @wendellpiez
Here's a quick summary of the matter of source texts for the Bicentennial Frankenstein edition:

Our goal:

I believe we want to produce a diplomatic edition in which our digital text documents each stage in the transformation of Frankenstein from 1818, to 1823, to 1831. (The old Pennsylvania Electronic Edition nearly did this, and we have their old hand-done collation and notes on the 1823 text to compare.)

Since we have digital texts of 1818 and 1831 (which might need corrections but are probably transcribed mostly accurately), all I'd like us to do here is "eyeball" carefully those texts and enter corrections. And I actually think we should do that with plain text versions that we've extracted. That's because I'd like to build the XML from a collation we'll try to make with CollateX. So I'd like us to start by checking each plain text transcription separately, and then when we're satisfied we have good transcriptions, we'll do a phase of automated collation.

What to do about 1823? From what I'm reading and what I've learned in conversation with people who work with CollateX, it seems best to do collation of all the texts all together, instead of trying to weave a version in later. (This has to do with establishing a basis for comparison and the algorithmic process of comparing each text against the others.) I'd like to prepare a digital text of the 1823 Frankenstein, and @Rikkm I wonder what you think of training OCR to produce a text, and then you and I working on fine-tuning and correcting it (and/or optimizing the training software)? We could reach out for help on this, I think. The question is whether it's better (as in, more efficient) to key in the 1823 ourselves or to scan and correct it in the same kind of work we're doing on the other two texts. Discuss? My two cents: let's try OCR training for a little bit and see if it's not too awful.

I'll push the photo facsimile files here to our GitHub to start us out. If you all agree this is a good way to proceed, I'd like to start a schedule for perhaps me and Rikk, working together, to a) cross-check the texts and hunt for errors in 1818 and 1831, and b) make a digital text of 1823 in whatever way seems most efficient.

Questions/comments? Please record on this GitHub issue rather than by e-mail (easier to track and preserve).

Rikkm commented 7 years ago

I've started investigating the local options we might have for OCR at CMU. Our Archives and digitization people suggest using ABBYY FineReader if we can; I have emailed about the possibility of an open license for version 11 but will not get an answer until after January 2. Alternatives have been offered including Tesseract OCR (OA and available on GitHub and SourceForge).

I've been told that if we can gain access to ABBYY FineReader, training it on 20 pages selected to cover all the font variants in the text would be the greatest time investment (a few hours?), and that once trained the software can process the text very quickly and with a very high degree of accuracy (in the high 90% range).

Rikkm commented 7 years ago

The 1823 edition, as Elisa explained in the earlier GitHub issue from 5 days ago "Goals"

ebeshero commented 7 years ago

@mjlavin80 It's in the message(s) I sent over the weekend, but since we started meeting we have been discussing the versioning of three editions of Frankenstein instead of the usual 1818 and 1831. We need to proof the 1818 and 1831 editions from RC anyway because of the questionable way they were ported from their digital source in the HTML Pennsylvania Edition, which glanced at and studied the 1823 later as an afterthought but lacked a way back then (in the 1990s) to incorporate it in its old framed collation. Rikk and I have been hunting for 1823 texts and I just found a good clean digital image that I think we should try to OCR. If we are going to improve on our foundation and use CollateX most efficiently, it will help to do the collation of the three editions all together. We are also curious to see which text the 1823 is closest to--the early or the later--and whether any of the extreme alterations to 1831 are traceable in the 1823 edition, in which MWS's father, William Godwin, left his mark. There's been a long-running discussion of this novel's complex authorship--reflecting collaboration by Percy Shelley in 1818 and now alterations by William Godwin in 1823, though the Shelley-Godwin Archive's detailed edition of the draft notebooks helps demonstrate exactly how much of a "hand" Percy literally had. Adding 1823 into the collation helps chart the fluid morphing of the novel between 1818 and 1831, and it should be interesting to see the results.
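
One rough way to get a first impression of which edition a text sits closest to, before any real collation, is a token-level similarity ratio. This is a minimal sketch using Python's standard `difflib`, not CollateX, and not a substitute for it. The function names are assumptions, and the three sample strings are made-up placeholders rather than readings from any edition.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def similarity(a, b):
    """Return a 0..1 ratio of shared, in-order tokens between two texts."""
    matcher = SequenceMatcher(None, tokenize(a), tokenize(b), autojunk=False)
    return matcher.ratio()


def closest_edition(target, candidates):
    """Return (name of the most similar candidate, all scores by name)."""
    scores = {name: similarity(target, text) for name, text in candidates.items()}
    return max(scores, key=scores.get), scores


# Placeholder strings standing in for full plain-text transcriptions.
sample_1818 = "alpha beta gamma delta epsilon"
sample_1823 = "alpha beta gamma delta zeta"
sample_1831 = "alpha omega gamma theta zeta"

best, scores = closest_edition(
    sample_1823, {"1818": sample_1818, "1831": sample_1831}
)
print(best, scores)
```

A single ratio hides where the differences fall, so it only answers the "early or later" question in the coarsest sense; the collation itself is what locates the variants.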

mjlavin80 commented 7 years ago

I just wonder about the advisability of using an OCR approach for the 1823 edition. Could it not conceivably be faster and more accurate to hand correct an already-digitized edition? I know the quantity of textual variance between editions is substantial, but perhaps there's a way to assess that before considering OCR? Has this already been considered? I don't see anything about it on this thread.

ebeshero commented 7 years ago

@mjlavin80 We've been over this. If you can find a digital text of 1823 that would be lovely, but Rikk and I have not located one. What we have (here in the repo) is a facsimile image.

mjlavin80 commented 7 years ago

Not what I'm saying. I'm saying start with digitized 1818 or 1831 (whichever is closer) and hand correct to transform into a representation of the 1823 edition.

ebeshero commented 7 years ago

@mjlavin80 Not a good idea because it will be a lot more trouble to collate the editions separately due to the way collation algorithms work. You missed our conversations planning the edition with MITH and @raffazizzi back in October, too.

mjlavin80 commented 7 years ago

Roger. This should all be summarized in an .md file so it ports with the repo.

ebeshero commented 7 years ago

@mjlavin80 Roger the need for markdown of it in the repo. I feel like I'm writing about it constantly because we've been emailing a LOT about it with Wendell. Rikk and I are also working out a method of proof-correcting our texts, facilitated by two people cross-checking each other. Short of keying in the 1823, I wanted to try OCR training followed by our regime of checking--our text here in the repo should be good for it.
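
The two-person cross-check could be supported by a small script that lists only the lines on which the two proofreaders' copies disagree. This is a minimal sketch with Python's standard `difflib`; the function name and the idea of each reader keeping a separate corrected copy are assumptions, not a description of the method actually adopted.

```python
from difflib import SequenceMatcher


def proof_disagreements(reader_a, reader_b):
    """List the places where two corrected copies of a text differ.

    Returns tuples of (1-based line number in reader A's copy,
    reader A's lines, reader B's lines) for every non-matching run.
    """
    a_lines = reader_a.splitlines()
    b_lines = reader_b.splitlines()
    matcher = SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    report = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            report.append((a1 + 1, a_lines[a1:a2], b_lines[b1:b2]))
    return report


# Made-up example copies; only the second line differs.
copy_a = "line one\nline two\nline three"
copy_b = "line one\nline 2\nline three"

for line_no, ours, theirs in proof_disagreements(copy_a, copy_b):
    print(line_no, ours, theirs)
```

Each reported disagreement then gets settled against the photo facsimile rather than by preferring either reader.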

mjlavin80 commented 7 years ago

Thanks. Please do understand that I am not asserting an opinion here. I was wondering if others had discussed this and considered alternatives. I'm a latecomer to the project, so I totally understand that some of these discussions happened before I came on board. I'm also happy to defer to the experts on collation methodologies.

ebeshero commented 7 years ago

@mjlavin80 No worries--the edition prep has been in its own thread since we started and the discussion has happened in meetings and over email, so it is high time I prepared a summary file in .md that explains it all! Some of what we are doing is really experimental: we are curious to see how CollateX will work after we tested it on current texts from RC in October (and saw thereby in bold relief the questionable state of those texts).

ebeshero commented 7 years ago

@Rikkm I'm thinking we should give ABBYY a try--what do you think? Is this something we might each try locally? Let me see if Pitt has a license...

Rikkm commented 7 years ago

As I understand it, ABBYY is expensive. I may have access because we have a project that used it and that license may be open, but I won't know until January. I am assuming very local: I may need to go in and use it on a specific machine in our off-site facility near Bakery Square.

ebeshero commented 7 years ago

@Rikkm I don't think Pitt has ABBYY either. Should we just give Tesseract a try, then? I've not used either of these before--so it's all new. I know we're both hitting the road shortly, so I'm looking for an OCR option to play with while traveling--and yes, there is an element of play here: I'm curious to see how well this works and how training of the software works.

This may sound silly, but I used Adobe Acrobat's built-in OCR years ago when I was helping one of our English majors, who used a braille browser, access photocopied articles. I remember the corrections, even 6 years ago, weren't awful to prepare.

Suggestion: Try Tesseract and/or Adobe first to see what we think. If these seem like they're going to make too much work for us to correct, go to ABBYY in January.
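
To judge whether an OCR tool is leaving too much to correct, one option is to hand-correct a sample page and measure how much of it the raw OCR output already got right. This is a minimal sketch with Python's standard `difflib`; the function name is an assumption, and the measure is a simple matched-character share rather than a standard edit-distance error rate.

```python
from difflib import SequenceMatcher


def character_accuracy(ocr_text, corrected_text):
    """Share of the corrected text's characters that the OCR output
    reproduced in the same order (1.0 means nothing needed fixing)."""
    if not corrected_text:
        return 1.0
    matcher = SequenceMatcher(None, corrected_text, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(corrected_text)


# Made-up strings: one wrong character out of four.
print(character_accuracy("abxd", "abcd"))
```

Running this on the same sample page for each tool gives a like-for-like comparison before committing to one of them.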

(I don't mind the proof-correction process, since we need to do it anyway over the other texts, and we're working out a method for that.) I'm going to work on a Markdown file containing our process of edition and collation prep for this repo, synthesizing our discussion so far.

Rikkm commented 7 years ago

I want to follow up on ABBYY regardless in case we develop a proposal to have our fledgling digital center work with our Archives on a project down the road. For now, Dan Evans suggested we look at these: Aletheia / tesseract videos http://emop.tamu.edu/tess-training-demo-vids

ebeshero commented 7 years ago

It's funny that their companion software for Aletheia is called Franken+ . ;-) With a green Frankenstein Creature head.

ebeshero commented 7 years ago

@Rikkm Here's a Simple Task List for Winter Break--and feel free to update/modify it:

ebeshero commented 7 years ago

@wendellpiez @Rikkm @mjlavin80 @scottbot I have some questions about annotations for the whole Bicentennial team as we're starting work on refurbishing the edition. In prepping to pull plain texts out of 1818 and 1831, I'm studying the Romantic Circles and the older Pennsylvania Electronic Edition (basically the same edition in its first phase of existence) and reading the Editorial Principles. (Really, we can refer to them both as Stuart Curran's edition, preserved on Romantic Circles.) They report that the Curran edition was "transcribed from reliable standard editions" though they don't identify which, and they also indicate that they've made a list of silent corrections to printing errors in the 1818 edition. This tells me these texts were carefully prepared and aren't likely to pose much problem for correction. Their edition is also annotated, and we never really decided if we want to preserve those annotations in the new edition.

We're preparing a refurbished new edition to give back to MITH/Romantic Circles, and when we met in October, we decided we'd like this to collate the editions and hold markup pointing into the Shelley-Godwin Archive's MS notebooks. That means we're basically changing the whole architecture of this edition, and a question we never really answered was, what happens to the annotations in Stuart Curran's work? And to what extent can we use them for our own annotation efforts?

I think I'd like to find a way to preserve these, since the annotations characterize the original edition. As I'm working on extracting text, I'm going back to the first HTML edition (which seemed prudent to us--Raff, Rikk, and me--as we examined the documents in October). I wonder if mapping those 1990s notes back into the text together with the hypothes.is work we're now launching is going to be an interesting challenge, or to what extent reviewing the annotations in the Curran edition could save us some effort in the annotation stage?

ebeshero commented 7 years ago

@Rikkm Holiday Update: I've pulled and cleaned up HTML and raw text from the PA Edition of 1818 and 1831, but it's all in little bite-sized pieces (literally). Tomorrow I'll carefully merge those into a single file for each, and we should be ready to start proof correcting against our photo facsimiles.
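
Merging the pieces can be scripted so the order is reproducible. This is a minimal sketch with Python's `pathlib`; the function name, the `*.txt` pattern, and the assumption that the piece file names sort into reading order (for example, zero-padded numbers) are all hypothetical.

```python
from pathlib import Path


def merge_pieces(piece_dir, output_path, pattern="*.txt"):
    """Concatenate every matching file in piece_dir, in file-name order,
    into output_path. Returns the piece names in the order used.

    Keep output_path outside piece_dir (or off the pattern) so a rerun
    does not fold the merged file back into itself.
    """
    pieces = sorted(Path(piece_dir).glob(pattern))
    merged = "\n".join(
        piece.read_text(encoding="utf-8").strip("\n") for piece in pieces
    )
    Path(output_path).write_text(merged + "\n", encoding="utf-8")
    return [piece.name for piece in pieces]
```

Returning the list of names makes it easy to eyeball that the pieces went in the intended order before proofing starts.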

ebeshero commented 7 years ago

@Rikkm @scottbot @mjlavin80 Just a heads up--you may want to update your repos: I've completed preparation of some handy plain text files holding the 1818 and 1831 editions. Rikk and I should now check these against photo facsimiles and add corrections as needed, but the texts are in a semi-final form. I produced them thinking these plain texts might be useful on their own as a product of the Bicentennial work, an aid to future projects, so I've hammered out a draft header for each file, identifying them as "the Pittsburgh Bicentennial Edition"...Also, I gave it a Creative Commons License--at the moment it's noncommercial share-alike citation-please, but we can change that easily (I was wondering if we're okay with a "Free Culture License"--allowing commercial dev from our work.) Take a look in the Plain_Texts directory and let me know if you have suggestions: https://github.com/ebeshero/Pittsburgh_Frankenstein/tree/master/Plain_Texts

Also there are a couple of new markdown files in the repo documenting decisions I've made and steps I've taken in prepping this. (I've had lots of helpful input from @wendellpiez and David Birnbaum as we're prepping this for collation and to provide a basis for a new LMNL edition Wendell may prepare from our work.)

Next steps: Rikk and I check these against photo facsimiles, and try preparing a full text of the 1823 with ABBYY FineReader. Then, on to collation processing!