FrankensteinVariorum / fv-collation

first-stage collation processing in the Frankenstein Variorum Project. For post processing and Variorum development, see our GitHub organization: https://github.com/FrankensteinVariorum
https://frankensteinvariorum.github.io/fv-collation/
GNU Affero General Public License v3.0

CollateX #2

Closed ebeshero closed 6 years ago

ebeshero commented 7 years ago

@raffazizzi Tutorial for working with CollateX locally with Python http://collatex.obdurodon.org/

ebeshero commented 7 years ago

How to run chunks of Frankenstein through CollateX for TEI app-crit output:

1) Produce JSON input file (using XSLT or XQuery), holding base text of the witnesses

2) Download CollateX java jar locally from here: http://collatex.net/download/

3) At command line, in the CollateX directory, run:

java -jar -Xmx6g collatex-tools-1.7.1.jar ../GitHub/Pittsburgh_Frankenstein/CollateX/FrankenCollatV1ch1.json -f tei  -o ../GitHub/Pittsburgh_Frankenstein/CollateX/collatedFrankV1Ch1.xml

This takes input JSON and outputs TEI.
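For anyone who would rather build the JSON input in Python than in XSLT or XQuery, here is a minimal sketch. It assumes the plain-text form of CollateX's JSON input (a `witnesses` array of objects with `id` and `content` keys). The function names, the sigla, and the output file name are hypothetical, not part of our repo.

```python
import json


def build_collatex_input(witnesses):
    """Wrap {siglum: text} pairs in the structure CollateX reads as JSON input."""
    return {
        "witnesses": [
            {"id": siglum, "content": text}
            for siglum, text in witnesses.items()
        ]
    }


def write_collatex_input(witnesses, path):
    """Serialize the witnesses to a UTF-8 JSON file for the CollateX jar."""
    with open(path, "w", encoding="utf-8") as out:
        json.dump(build_collatex_input(witnesses), out, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Placeholder sigla and text; real input would hold the base text of each witness.
    sample = {"w1": "placeholder text of witness one",
              "w2": "placeholder text of witness two"}
    write_collatex_input(sample, "sample_input.json")
```

The resulting file can then be passed to the jar in place of the JSON path in the command above.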

ebeshero commented 7 years ago

@Rikkm An update on collation: @djbpitt and I worked on normalization and collation of a portion of all three texts yesterday, and we've got a Python script worked out for this on my branch that we'll keep refining. I can merge this with our Text_Processing branch, but I also wanted to check in with you as we're prepping for the Saturday meeting. We noticed a single-letter typo that I want to correct in Ch. 1 of the 1823 text (my error, I bet, b/c I was editing that part)--but I know you're working on the file. Is it safe for me to go in and push a tiny correction on the 1823 text over to Text_Processing, together with our new collation stuff?

Rikkm commented 7 years ago

I added some edits last night, and they've already been pushed up, so you should be fine. I'll make sure to do a pull before I begin working on it again this evening.



ebeshero commented 7 years ago

@Rikkm Good--yes, I just pulled those in. I'll make the change and then merge my branch. I'm going to blitz through as much of 1831 as I can after that. Let me know how close you're getting to the finish of 1823!

Okay, just to get us ready for the next stage: We're going to need to do a little prep to help align the three texts for collation:

ebeshero commented 7 years ago

@Rikkm See preliminary output here: https://github.com/ebeshero/Pittsburgh_Frankenstein/blob/ebeshero_Exp/collateXPrep/outputEBB.txt

Rikkm commented 7 years ago

I've just gotten home, so I'm eating and then will be working on 1823 from 8 to 9 pm; then I have a call, and I'll probably work again from 10 to 11 or 12. I don't think I'll be able to finish 1823 tonight, but I'll try to finish it tomorrow morning or before noon.

I then have a big thing going on Monday, for which I am coordinating the group, so I don't think I'll have any more bandwidth Friday or Saturday after our meeting, but I'll update you Saturday.

On Thu, May 4, 2017 at 7:17 PM, Elisa Beshero-Bondar < notifications@github.com> wrote:

@Rikkm Good--yes, I just pulled those in. I'll make the change and then merge my branch. I'm going to blitz through as much of 1831 as I can after that. Let me know how close you're getting to the finish of 1823!

Okay, just to get us ready for the next stage: We're going to need to do a little prep to help align the three texts for collation:

  • One thing we worked out is that I'll do an up-conversion of all three documents to very simple XML before doing the collation (it's just easier to use Python's XML parser than to try to use its regular expression matching on our pseudo-markup).
  • And we'll be processing the texts in segments that we need to define: The texts are too long to process with collateX all at once (that would introduce alignment errors), so we'll want to inspect the texts and identify some locations where they line up, where we can drop markers--this doesn't need to be paragraph-by-paragraph; more like chapter-by-chapter, taking note of how much material tends to get moved around between chapters. It's best if we do this in small-ish units, but (for example), if we notice that some text moves from Chapter 15 in 1823 to Chapter 16 in 1831, we'll want to take units that hold the same material (and maybe make that a two-chapter unit). We'll use the markers as signals for the Python script to "chunk" the three texts into units of comparison.
  • Let's check in fairly frequently as we're getting close to the finish on checking, and as soon as we're ready, I can get started on the XML prep and looking for good alignment points.
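To make the chunking idea concrete, here is a rough sketch of how a Python script might split simple XML at markers and pair up the units across witnesses. It is not the script on my branch. The `<milestone n="..."/>` marker element, the function names, and the sigla are all hypothetical, and it uses only the standard-library XML parser.

```python
import xml.etree.ElementTree as ET


def walk(elem):
    """Yield ("marker", id) and ("text", string) events in document order."""
    if elem.tag == "milestone":  # hypothetical marker element
        yield ("marker", elem.get("n"))
    elif elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from walk(child)
        if child.tail:  # text that follows a child element
            yield ("text", child.tail)


def chunk_by_markers(xml_string):
    """Return {marker_id: normalized text} for one witness."""
    chunks, current = {}, None
    for kind, value in walk(ET.fromstring(xml_string)):
        if kind == "marker":
            current = value
            chunks[current] = []
        elif current is not None:  # text before the first marker is skipped
            chunks[current].append(value)
    # Collapse runs of whitespace so line breaks do not register as variants.
    return {n: " ".join("".join(parts).split()) for n, parts in chunks.items()}


def align_units(witnesses):
    """Map each marker shared by all witnesses to {siglum: text}."""
    chunked = {s: chunk_by_markers(x) for s, x in witnesses.items()}
    shared = set.intersection(*(set(c) for c in chunked.values()))
    return {n: {s: chunked[s][n] for s in chunked} for n in sorted(shared)}
```

Each entry returned by `align_units` holds the same stretch of material from every witness, so it could be collated on its own as one unit of comparison. A unit may span two chapters if the markers are placed that way.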


ebeshero commented 7 years ago

@Rikkm Understood. I've just pushed the small corrections (after checking against the photofacsimiles). Everything's up to date (and all three of our GitHub branches have all of our commits), so you should be ready to go to keep working on 1823.

I could meet with you sometime before the full meeting on Saturday if you like--just let me know.

Rikkm commented 7 years ago

Hi Elisa,

On the one hand, I have the conference room from 12 to 3:30 on Saturday; on the other, the third report we are presenting to most of the library personnel on Monday morning is not yet complete. I'll know more by tonight, though my personal best for a working day tends to hit the wall at 12 hours, and I think I'll need the additional time tomorrow morning and afternoon to continue this work.

I do appreciate your flexibility and the offer, a lot.

Rikk

