FrankensteinVariorum / fv-collation

first-stage collation processing in the Frankenstein Variorum Project. For post processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

our embedded xml tags in the plain-text edition #8

Closed ebeshero closed 7 years ago

ebeshero commented 7 years ago

@Rikkm I'm doing a round of work on 1831 that caused me to check how we'd prepped the same chunk of text in the 1818 and 1823. And I suddenly discovered something we marked differently in the 1818 than in the other two texts. I bet we talked about it, because we're doing things right in the 1823 and 1831, but we'll need to correct the 1818 for the following issue:

Here's a sample of the 1818, with hard returns around the <pb/> element:

text text text he embarks 

<pb xml:id="1818_v1_016" n="v1_004"/>

in a little boat, text text text 

This is problematic because hard returns actually signal new paragraphs in our pseudo-plain text edition (that's the only signal we have in the plain text file for a new paragraph). So what we want, as we're already doing in the other two texts, is this:

text text text he embarks<pb xml:id="1818_v1_016" n="v1_004"/> in a little boat, text text text 

I would be tempted to do a find and replace and remove all the returns around <pb> elements, except that some of them maybe (probably, almost certainly) do come at paragraph endings. I know that I can find all the <pb/>'s that come between words that start with lower-case letters, or that come after or before commas, semicolons, colons (etc), and that will catch most of these in an auto-correction, but where a page-break comes after an end-stopped punctuation mark, we need to check it against the 1818 photofacsimile.
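A minimal sketch of the partial auto-correction described above, assuming the 1818 file marks each stray page break as a `<pb .../>` alone between blank lines. The function name and the exact "safe" test are illustrative assumptions, not the project's script: a break is joined inline only when it touches a comma, semicolon, or colon, or sits between a letter and a lower-case letter. Everything else, including breaks after end-stopped punctuation, is left untouched and listed for checking against the photofacsimile.

```python
import re

# A <pb .../> standing alone between hard returns (blank lines), as in the 1818 sample.
PB_BLOCK = re.compile(r"[ \t]*\n\s*\n[ \t]*(<pb\b[^>]*/>)[ \t]*\n\s*\n[ \t]*")

# Punctuation that cannot end a paragraph, so the text must continue across the page.
SOFT_PUNCTUATION = {",", ";", ":"}


def join_safe_page_breaks(text):
    """Pull <pb/> inline where the context shows the paragraph continues.

    Returns (new_text, needs_checking); needs_checking holds the <pb/> tags
    that were left alone and should be proofed by hand.
    """
    needs_checking = []

    def fix(match):
        before = text[:match.start()].rstrip()[-1:]   # last character before the break
        after = text[match.end():].lstrip()[:1]       # first character after the break
        continues = (
            before in SOFT_PUNCTUATION
            or after in SOFT_PUNCTUATION
            or (before.isalpha() and after.islower())
        )
        if continues:
            # Same shape as the 1823 and 1831 files: word<pb/> next word
            return match.group(1) + " "
        needs_checking.append(match.group(1))
        return match.group(0)

    return PB_BLOCK.sub(fix, text), needs_checking
```

Running this over the 1818 sample above would produce the single-line form shown for the other two texts, while a break that follows a full stop stays as it is and lands in the list for manual review.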

I wonder if I should stop and try to partially correct these right now, or save it for a pass of final proofing next week? What think you?

ebeshero commented 7 years ago

Also, one thing I'm now explicitly altering across all three texts is the way we deal with footnotes to the poetry embedded in the novel. This is following the same principle of not disrupting the signaling of semantic paragraphs: so we don't want to have a hard return in the middle of a paragraph, followed by a footnote and then a page break and a continuation of the same paragraph.

How to deal with this? I'm actually introducing a <note> element and positioning the notes right after what they annotate, rather than attempting to position them according to their page layout. This is a convention of the TEI for semantic encoding rather than prioritizing the layout of the page. (We can signal in an attribute that it is the author's footnote to the passage in question, and eventually render it where we choose on the digital edition.)

Basically, we need to make sure we're writing markup in a consistent way across the three texts so as to accomplish a clean and unambiguous collated document...which I'm getting eager to see...:-)

ebeshero commented 7 years ago

And...yikes! I've had to enter three whole lines of text that were missing in the PAEE 1831 edition! Good thing we're correcting these...

mjlavin80 commented 7 years ago

https://cdn.meme.am/instances/66079643.jpg

From: Rikk Mulligan [mailto:notifications@github.com] Sent: Thursday, April 27, 2017 9:16 PM To: ebeshero/Pittsburgh_Frankenstein Pittsburgh_Frankenstein@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: Re: [ebeshero/Pittsburgh_Frankenstein] our embedded xml tags in the plain-text edition (#8)

push done. status up to date.

On Thu, Apr 27, 2017 at 9:15 PM, Elisa Beshero-Bondar < notifications@github.com> wrote:

YES. git push please. Let's get your changes into the remote branch.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ebeshero/Pittsburgh_Frankenstein/issues/8#issuecomment-297881265, or mute the thread https://github.com/notifications/unsubscribe-auth/ASoxxna99Xqgmpx_y4MOad5RuDGsYTuHks5r0T2wgaJpZM4NK6V8 .


ebeshero commented 7 years ago

@Rikkm I've pulled in the changes on Text_Processing, but I'm confused by your first commit. It looks like you moved <pb/> elements outside of the words, when page-breaks fell inside a word.

As I've been marking these, all along, I've placed the <pb/> elements exactly where they fall, even when they fall inside a word. They're a distinct feature of each edition. But we've silently been deleting the hyphenation that falls around the page-breaks. Since we are preparing a collated edition, we're prioritizing semantic breaks rather than page layout: the page-breaks are incidental, but we keep them eventually to pair up facsimile images with the text. I hope we can put those <pb/> elements back exactly where they fall, if we're going to be using them. We'd be using them to anchor the photofacsimiles to locations in the text, and I think the <pb/> elements should silently and accurately reflect where words break across pages (though we don't need to mimic the way the typesetters broke the words).

Really, it probably doesn't matter, but it bothers me that we've been inconsistent about this. Priority 1 for the collation is to make sure the paragraph units are clearly signalled (and that we don't have too many of them). These are signalled by two hard returns--and that's what we need to repair in the 1818 edition.
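To make the paragraph problem concrete, here is a small sketch that counts paragraph units the way the plain-text convention implies, splitting on two or more hard returns. The function name is hypothetical, and the real collation pipeline may segment the text differently.

```python
import re


def paragraphs(text):
    """Split pseudo-plain text into paragraph units on two or more hard returns."""
    return [unit.strip() for unit in re.split(r"\n\s*\n", text) if unit.strip()]


# The 1818 sample, with hard returns around the <pb/>:
marked_1818 = 'text he embarks \n\n<pb xml:id="1818_v1_016" n="v1_004"/>\n\nin a little boat, text'

# The same passage as marked in the 1823 and 1831 files:
marked_inline = 'text he embarks<pb xml:id="1818_v1_016" n="v1_004"/> in a little boat, text'

print(len(paragraphs(marked_1818)))    # three units: the page break splits one paragraph
print(len(paragraphs(marked_inline)))  # one unit
```

Comparing paragraph counts across the three editions in this way could be a quick check for stray returns before collation.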

Rikkm commented 7 years ago

Elisa -- umm -- we made a decision back in January to move the <pb/> to the end of the word and attached punctuation. This is what we have listed in the EditingFrankenstein.md

5) Formerly hyphenated words that cross page breaks will be joined on the previous page before the <pb> is added to the text file. Any immediate punctuation will be retained, most typically commas and periods.

So, yes, I've been doing this purposely, and would need to go back and speed check the ends of pages to correct if I misunderstood our policy.


ebeshero commented 7 years ago

@Rikkm That's not how I read point 5:

5) Formerly hyphenated words that cross page breaks will be joined on the previous page 
before the <pb> is added to the text file. Any immediate punctuation will be retained, 
most typically commas and periods.

As I understood this, we are joining the hyphenated words, but still positioning the <pb/> inside them, because the <pb/> as a milestone element isn't disrupting anything, and we aren't using it for layout. We are only changing the punctuation, not changing where we indicate a page-break falls. That has been my understanding since we implemented that policy. And it's consistent with my practice on other TEI projects...

I'm sorry this turned out to be an ambiguous policy! What can we do to repair it? I think we need to do a round of proofing of those <pb/> elements anyway, across the board.
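One way such a proofing round might be supported is a listing of every `<pb/>` with a little surrounding context and a note on whether it falls inside a word. This is only a sketch; the function name, the context width, and the two labels are assumptions made for illustration.

```python
import re

PB = re.compile(r"<pb\b[^>]*/>")


def pb_report(text, width=15):
    """List each <pb/> with its context and whether it sits inside a word."""
    rows = []
    for match in PB.finditer(text):
        before = text[max(0, match.start() - width):match.start()]
        after = text[match.end():match.end() + width]
        # Inside a word: a letter directly on both sides of the tag.
        inside_word = before[-1:].isalpha() and after[:1].isalpha()
        position = "inside word" if inside_word else "between words"
        rows.append((match.group(0), position, before, after))
    return rows


sample = 'St. Peters<pb xml:id="a"/>burg, and then <pb xml:id="b"/>onward'
for tag, position, before, after in pb_report(sample):
    print(position, "|", before, tag, after)
```

A page break that was moved to the end of a word shows up as "between words", so the report narrows down which page ends still need to be compared with the facsimile, though it cannot say where the break really fell.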

Rikkm commented 7 years ago

so I'm just making sure:

all the edits I just corrected in 1823, again, like:

page 1: ...Peters page 2: burg...

I've been doing: Petersburg,<pb/>

But you're saying these should be: Peters<pb/>burg

in which case I need to fix 1818 and the entire first volume of 1823.


ebeshero commented 7 years ago

@Rikkm Yes, that's it. Here's what happens:

Peters<pb xml:id="1818_v1_025"/>burg is literally the same word as Petersburg <pb xml:id="1823_v1_018"/>

and we instruct collateX to read around those tags, but we still want to keep them where they fall, if preserving page-break information from the three editions is important to us. Since we've gone to the trouble of positioning the <pb/> elements, and we've been planning to pair images to text, we can use that information ultimately when we're building the web edition--and even if it looks weird, it's still indicating exactly what the hyphen at the end of a word on the page does. That hyphen is what we'd call "pseudomarkup" (or even markup) made by a typesetter to indicate exactly what our angle-bracket markup is doing.
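A hedged sketch of what "reading around the tags" can look like: each token keeps its original form, `<pb/>` included, alongside a normalized form with the tag removed, and only the normalized form is compared. The `t` and `n` keys follow CollateX's pre-tokenized JSON input as I understand it; the tokenizer itself and the function name are assumptions, not the project's actual code.

```python
import re

PB = re.compile(r"<pb\b[^>]*/>")
# A token is a run of non-space characters; a tag counts as one unit even
# though it contains spaces between its attributes.
TOKEN = re.compile(r"(?:<[^>]*>|[^\s<])+")


def tokens_for_collation(text):
    """Build tokens with the original text ("t") and a tag-free form ("n")."""
    tokens = []
    for raw in TOKEN.findall(text):
        normalized = PB.sub("", raw)
        if normalized:
            tokens.append({"t": raw, "n": normalized})
        elif tokens:
            # A free-standing <pb/> carries no words: attach it to the previous token.
            tokens[-1]["t"] += " " + raw
        else:
            tokens.append({"t": raw, "n": ""})
    return tokens


w1818 = tokens_for_collation('Peters<pb xml:id="1818_v1_025"/>burg')
w1823 = tokens_for_collation('Petersburg <pb xml:id="1823_v1_018"/>')

# Both witnesses normalize to the same word, while "t" keeps each page break in place.
print([tok["n"] for tok in w1818] == [tok["n"] for tok in w1823])
```

The point of keeping both forms is that the collation sees identical words while each edition's page-break position survives in the original token for later pairing with facsimile images.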

Rikkm commented 7 years ago

wonderful.

ok. I'll get to fixing them. It won't be tonight.


ebeshero commented 7 years ago

I think we should just take note at this point (right now), that we have interpreted <pb/> positioning differently, and table that until we get through correcting the texts. I'll volunteer to take on fixing the 1818 pagebreaks if you can do the first vol. of 1823. I'm sorry we didn't notice this discrepancy in our methods sooner!

Meanwhile, just getting corrected texts is a lot more important right now.

Rikkm commented 7 years ago

fair enough.

I can fix the 1823 as I complete it. I had hoped to finish it this weekend, BUT -- we have a major meeting on May 8. I'm part of a group giving a report on materials that have not yet been completed by others in our group. So I need to split time because our deadline for consolidating the slidedeck is next Thursday.

Sorry that I misunderstood -- I was thinking the <pb/> would be an actual pagebreak in the browser...


ebeshero commented 7 years ago

@Rikkm I know--it's a reasonable idea, and you're thinking about the web design. I guess I'm trained and honed to think of TEI as informational and descriptive--though at our NEH Institute in July, Hugh Cayless will almost certainly be showing us how we can preserve all that information we're going to be accumulating on the web--without oversimplifying the TEI by converting it to simpler forms in standard HTML elements. (For a sneak preview, see his CETEIcean project: https://github.com/TEIC/CETEIcean)

Rikkm commented 7 years ago

and thus I have A LOT to learn at the workshop. I'm learning a lot by doing this work with you -- if things go one way I'm hoping we might take on digital editions through our hoped-for library press imprint at CMU. In which case our workflows and editorial processes will benefit from our (yours and mine) miscommunications.

I still need to get back to Python by mid-May or I may be screwed in July.


ebeshero commented 7 years ago

@Rikkm Agreed--and what's good about all this is even when we have an occasional slightly stressful mixup, we get it sorted out! :-) And we all have a lot to learn from that July institute...it's very exciting. :-)

Nearer-term, I'm balancing this around Pitt's end-of-semester grading, which is due next Wed. I'm purposely doing a binge of Frankenstein work now to help get us "over the hump". If we can actually finish text correction of all three by (say) Thurs. night next week, I think I can try getting the collateX machine cranking on our edition and have some collations to show at our meeting on Sat. 6 May. That is the goal...but even if we can't, we need to get our texts collated and packaged up by the last week of May when I'll be visiting with Raff and Wendell in Maryland...I hope we can hand off our collated edition to Raff so he can work on pointing it into the Shelley-Godwin Frankenstein notebook edition of the first manuscript.