CopticScriptorium / corpora

Public repository for Coptic SCRIPTORIUM Corpora Releases

Publication thread for Fall 2018 #19

Closed ctschroeder closed 5 years ago

ctschroeder commented 6 years ago

- More Johannes canons? (@eplatte)
- Some Kinds of People Sift Dirt (@cluckmarq)
- God Says Through Those Who Are His (@bkrawiec)

For automated corpora we will add info about fully automated tokenization, annotations in metadata.

Possible:

- AP (if Marina has new AP)
- Besa (@amir-zeldes @somiyagawa) (Besa from two main codices) MOVED to #22
  - needs permission from Heike Behlmer for translation
  - needs to be broken into documents
  - needs to align translation (scraping the translation text will take a few days; need to manually align translation)
  - needs metadata

Before publication when checking metadata:

amir-zeldes commented 6 years ago

RE Besa's letters - there are way more than two - all of MONB.BA and BB, so almost all of the existing letters. We've only processed two so far though, and I'm not sure if we'll run into problems with later ones.

eplatte commented 6 years ago

Yes, I plan to do some more Johannes. I may also have something from Budge from my Coptic reading group at Reed.

ctschroeder commented 5 years ago

I'm looking at items in Gitdox for publication. Looks like in addition to treebanked material in Mark, 1 Cor, Victor, A22, and AOF we also have Eagerness docs. Is that correct, @amir-zeldes ? Are these newly treebanked Eagerness docs?

We also have a doc from Not Because a Fox barks. Are we republishing this corpus?

Last, there look to be some validation issues. I will go through them and let you know if I have any problems or questions.

(I am skipping the AP, since we will have new AP to publish in the Winter.)

amir-zeldes commented 5 years ago

Eagerness has no treebanked documents, so any edits are presumably sporadically noticed errors (probably no more than a handful). If there are no new documents, maybe we should hold off on Eagerness until there is new material - I think more documents were still expected in the future, right?

Similarly, NBFB may have some tiny corrections, but otherwise nothing new really. I may hold off on it until we are closer to 'one click publication'. The rest have considerable changes due to treebanking and should be re-imported; they are much better quality now.

bkrawiec commented 5 years ago

I do not think there are more documents for Eagerness. I have been done with it for a while and have moved on to Those.

Becky


amir-zeldes commented 5 years ago

That's good to know thanks! We could try to squeeze them in, but as I wrote above, the changes are probably minimal, so maybe we should wait until we treebank some of Eagerness.

Another question about Mark/1 Cor - I see failed validations due to 'p' missing. Do we want to require p? If so, in what units? p mainly serves to segment the normalized view for convenience, but for Bible chapters the verses already do a good job of that, so maybe we can remove this requirement for the Bible?

ctschroeder commented 5 years ago

Hi. I was making a list of things to go over as I was reviewing the corpora for publication, and "p" was on the list. It relates to our decision in DC to minimize the number of visualizations & viz names, as well. I think we can change the validation to p | vid_n. I am adding vid_n (the cts urns at the verse level) to all corpora as they are re-published. The visualizations break the text at p or at v, right? v is the verse number written as a number, and vid_n is the urn for the verse (same span as v). I would prefer the validation to be p | vid_n to remind us to add those cts urns.

ctschroeder commented 5 years ago

I'm making a list of things that are coming up, that I'll post when I'm done. But two big ones:

amir-zeldes commented 5 years ago

Only Mark 1-6 is treebanked at the moment, same as 1Cor. The metadata should show everything as gold for the treebanked chapters, and pos/seg 'checked' and parsing 'auto' for the rest. The tag/seg/parse metadata is document-wise, so mixed corpora should not be a problem.

amir-zeldes commented 5 years ago

RE p-annotations: I went ahead and made the p-check corpus-dependent. There is no current way to make one validation check for either/or, so we'd need a separate rule to require vid_n in corpora where that's relevant (all corpora?).
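
[Editorial note: since the built-in validations apparently cannot express an either/or rule, a stopgap could be an external check over exported documents. The sketch below is hypothetical: only the span names p and vid_n come from this thread; the export format (plain SGML/XML text files per document) and the invocation are assumptions, not GitDox's actual validation machinery.]

```python
# Hypothetical stand-in for an either/or check that GitDox validations can't express:
# a document passes if its export contains either a <p ...> span or a vid_n annotation.
# Only the names "p" and "vid_n" come from the discussion above; everything else is assumed.
import re
import sys

def has_annotation(text: str, name: str) -> bool:
    """True if an opening tag <name ...> or an attribute name="..." occurs in the export."""
    return re.search(rf"<{name}[\s>]|\b{name}=", text) is not None

def check_file(path: str) -> bool:
    with open(path, encoding="utf8") as f:
        text = f.read()
    ok = has_annotation(text, "p") or has_annotation(text, "vid_n")
    if not ok:
        print(f"{path}: neither p nor vid_n spans found")
    return ok

if __name__ == "__main__":
    # e.g. python check_p_or_vid.py exports/*.xml
    bad = [p for p in sys.argv[1:] if not check_file(p)]
    sys.exit(1 if bad else 0)
```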

ctschroeder commented 5 years ago

Can you go in and edit the parsing/tagging/segmenting metadata? There are a number of docs marked “review” or “to publish”, and there is currently no way for me to tell which ones in each corpus are treebanked (since the parsing data is elsewhere). I realize a mixed corpus is technically fine with respect to treebanking, but from an annotating/curating point of view, having me make those edits to the metadata is asking for trouble, because some corpora have both treebanked and non-treebanked docs up for review or publication. I think you need to go in and make the changes, since there are mixed treebanked corpora.

ctschroeder commented 5 years ago

I will for sure check the rest of the metadata and add corpus metadata to the corpora (like 1Cor) that don't have it.

amir-zeldes commented 5 years ago

I can do the automation metadata for Sahidica. It's pretty well documented, though; the list of treebanked documents is in the table here:

http://copticscriptorium.org/treebank.html

ctschroeder commented 5 years ago

Generally the process is most effective and accurate when each annotator adds/edits metadata as they annotate. Otherwise there is a lot of back and forth with the person doing review, or something gets missed. There's no effective way for the person conducting the final editorial review to keep in their head which metadata might change and which might not for each publication thread. The person doing the review (not always me) needs to be able to look at the metadata for obvious errors, like typos or missing fields, but other than the version number/date they aren't expected to go through each existing field and ask whether the data needs to be changed. I will go ahead and reassign docs back to you for checking the parsing/tagging/segmentation metadata before publication. Thanks!

amir-zeldes commented 5 years ago

ⲞⲔ, 1Cor and Mark should be good to go from the NLP metadata perspective. I also corrected any validation errors that are automatically caught, so they're all green, but I'm not sure if there's something we wanted but haven't added a validation for yet.

amir-zeldes commented 5 years ago

? I'm not sure I understand the preceding comment - I have no metadata changes to make that I'm aware of. I'm happy to keep NLP metadata up to date as we treebank in the future, but these are fields that didn't exist when the treebanking happened. Sahidica is now up to date.

ctschroeder commented 5 years ago

If I have any questions about the other mixed corpora besides the Sahidica ones, I'll let you know.

ctschroeder commented 5 years ago

Thanks for editing the Sahidica ones!

ctschroeder commented 5 years ago

@amir-zeldes can you tell me who has been treebanking (and then correcting tagging/segmentation) for the AOF, Victor, A22, Mark, 1 Cor texts? I will add their names to the corpus and document metadata. Thanks!

amir-zeldes commented 5 years ago

Mark + 1Cor new material is Mitchell. A22 is me and Liz. AOF is just me. Victor is me, Mitchell, and the four Israelis listed here: https://github.com/UniversalDependencies/UD_Coptic-Scriptorium/blob/dev/README.md (under acknowledgments).

ctschroeder commented 5 years ago

Thank you! 1Cor is ready (I also added corpus metadata to GitDox) except for these questions:

A couple of questions about Mark:

ctschroeder commented 5 years ago

Sorry -- the first two questions in the previous comment were about Mark! I've edited. @amir-zeldes Mark is ready except for those questions about 7 & 9. Thanks!

ctschroeder commented 5 years ago

Part 1 of Martyrdom of Victor is now ready; as with 1 Cor there were a few unmerged spans in the _group layers, which I merged. Letting you know in case it affects the treebanking data.

amir-zeldes commented 5 years ago

I'm guessing the status of Mark 7+9 indicates a sporadic edit when an error was noticed. They will all be republished, since the ANNIS data is corpus-wise, meaning it makes sense to publish the entire corpus in its current state. Thanks for pointing it out!

ctschroeder commented 5 years ago

Ok great. I will try to look at Mark 7 & 9. Please hold off on publishing until I post again. Thanks!

amir-zeldes commented 5 years ago

Not publishing yet, but the aim is to have more and more of what could conceivably go wrong flagged by validations. Luke has added a new export validation that checks the TEI resulting from conversion, which should help too - I'm still adapting the schema, as there are some corner cases that don't work out. I'll be in touch about this.

ctschroeder commented 5 years ago

Ok. Some of the Bible corpora docs have &apos; instead of apostrophes. Is there some way to do a global search and replace in the spreadsheet? Thanks

amir-zeldes commented 5 years ago

Apostrophes where? What field are we talking about?

ctschroeder commented 5 years ago

Sorry I didn't specify. Translation layer.

ctschroeder commented 5 years ago

@amir-zeldes Mark 7 & 9 are proving to be rather time-consuming, since the data in there is quite old (i.e., the tagging dates to before we had portmanteau tags!). Is there any way we can release the corpus using the old versions of Mark chapters 7 to the end? The catch is that there are corrections in Gitdox and committed to GitHub for Mark 7 & 9. (I'm reluctant to release it with a new version number and as "checked" when the data has obvious problems. And it's going to take a while to update.)

ctschroeder commented 5 years ago

So, reviewing some of this old data that's been treebanked is proving tricky. @amir-zeldes can you please remind me which layers cannot be modified in the treebanked docs? I know POS, norm, lemma, morph. What about the rest -- do any others matter? Including line breaks, tokenization changes to add things like hi@rend elements (that do not affect norm/pos/lemma but do change base tokenization), translation spans, etc. Thanks!

amir-zeldes commented 5 years ago

Sorry just seeing these:

  1. Apostrophes: are you concerned that different characters are used to represent apostrophes in the translations? Or do you mean ' vs. the escaped &apos;? I'm not sure this is standard across our corpora, probably not, but I haven't been too worried about unifying that (I imagine it would be a lot of work figuring out everything that's not consistent about them, so maybe it's best to just accept the variation).
  2. If there are changes to Mark 7+ and they are better than what was already online, why not release them all the same? Even if there are still errors, there are fewer than before, no? We could also try to re-auto-tag them, but of course that won't solve segmentation errors.
  3. The treebank must stay in sync across norm, pos and lemma (a rough way to check this is sketched below). Group information is dynamically fetched from GitDox every release and is kind of FYI anyway in the treebank, so not critical, and the same is actually true for morph - but usually if morph is changed, so is norm, and that is critical. Base tokenization changes are completely invisible to the treebank and therefore fine, as are linebreaks. Translation spans are implicitly assumed to be the same as sentence spans in the TB, but actually they do not appear anywhere in the TB data, so while I would advise trying to keep them in sync, I know for a fact that they are not synced in some subcorpora (but for the Bible, for example, it's 1:1).
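
[Editorial note: to illustrate the constraint in point 3, here is a rough, hypothetical way to verify that norm/pos/lemma in a document still line up with its treebanked version. It assumes the treebank side is CoNLL-U (norm corresponding to FORM, the Coptic tag to XPOS, lemma to LEMMA) and that the GitDox side can be exported as a tab-separated table with norm, pos and lemma columns; the export format and column names are assumptions, not the project's actual tooling.]

```python
# Rough sync check between a document's annotation export and its treebank version:
# per the thread, norm, pos and lemma must stay aligned. Column names of the
# hypothetical TSV export ("norm", "pos", "lemma") are assumptions; the CoNLL-U
# mapping used here is norm ~ FORM, pos ~ XPOS, lemma ~ LEMMA.
import csv

def read_conllu(path):
    rows = []
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue  # skip multiword and empty tokens
            rows.append((cols[1], cols[4], cols[2]))  # FORM, XPOS, LEMMA
    return rows

def read_gitdox_tsv(path):
    with open(path, encoding="utf8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [(r["norm"], r["pos"], r["lemma"]) for r in reader]

def report_mismatches(gitdox_rows, tb_rows):
    for i, (g, t) in enumerate(zip(gitdox_rows, tb_rows), start=1):
        if g != t:
            print(f"token {i}: gitdox {g} != treebank {t}")
    if len(gitdox_rows) != len(tb_rows):
        print(f"token count differs: {len(gitdox_rows)} vs {len(tb_rows)}")

if __name__ == "__main__":
    import sys
    report_mismatches(read_gitdox_tsv(sys.argv[1]), read_conllu(sys.argv[2]))
```
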
ctschroeder commented 5 years ago

Oh I just noticed that Markdown changed &apos; to an apostrophe in the comment above. Whoops.

ctschroeder commented 5 years ago

Ok we are posting at the same time. Thanks!!! A couple clarifications:

  1. Let me rephrase. In a lot of the translations, in the layer I see &apos; instead of '. a) Does this matter for search or for TEI/SGML/XML export, etc.? (That is, are these appearing as &apos; anywhere in the published data instead of as '?) b) If YES, then is it possible to do a global search and replace? (A possible sketch of that is below, after this list.)
  2. The problem isn't just that the data could be a little better -- it's that the current data doesn't reflect our existing tagging guidelines or other significant guidelines. It's not a matter of a few mistakes but of numerous changes that are quite time-consuming. It's not really 2.6.0 data. I would prefer to leave Mark 7 onward as is in the next publication and change those chapters as the treebanking is done, if at all possible.
  3. Ok, so what I hear you saying is: don't change norm/lemma/pos at all, preferably not morph, and preferably not the translation spans, but changing the translation text or a translation span is not a crisis. I ask because, again, these older corpora need updates in other layers. For example versification: it typically follows the translation, the translation wasn't always one sentence per span, etc. So if I change the translation spans, it's not a crisis?
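
[Editorial note: a minimal sketch of the global search-and-replace asked about in point 1: rewriting escaped &apos; entities as plain apostrophes in exported files. Whether the entity should actually be unescaped, and in which exports, is exactly the open question in the thread, so this is an illustration rather than a fix to run blindly; the file pattern is hypothetical.]

```python
# Replace the escaped entity &apos; with a literal apostrophe in exported files.
# The "*.tt" file pattern is a placeholder; point glob at whatever export the
# translation layer actually lives in before running anything like this.
import glob

def unescape_apostrophes(path: str) -> int:
    with open(path, encoding="utf8") as f:
        text = f.read()
    fixed = text.replace("&apos;", "'")
    if fixed != text:
        with open(path, "w", encoding="utf8") as f:
            f.write(fixed)
    return text.count("&apos;")

if __name__ == "__main__":
    for name in glob.glob("*.tt"):  # hypothetical export file pattern
        n = unescape_apostrophes(name)
        if n:
            print(f"{name}: replaced {n} occurrence(s) of &apos;")
```
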
ctschroeder commented 5 years ago

Victor and 1Cor are ready

ctschroeder commented 5 years ago

@amir-zeldes A22 is now ready; please double-check the segmentation/tagging/parsing metadata. Thanks! Abraham and Mark may need to wait until next week or after TG.

ctschroeder commented 5 years ago

Wait, hang on -- A22 needs Liz’s name added to the metadata for the corpus and for whatever docs are treebanked. I am not at my laptop now. Will comment again when it’s done.

amir-zeldes commented 5 years ago

Sounds good - this is great progress!

ctschroeder commented 5 years ago

A22 ready.

ctschroeder commented 5 years ago

(But do please check the segmentation/tagging/parsing metadata to make sure it is all correct. Thanks @amir-zeldes !)

amir-zeldes commented 5 years ago

OK, I fixed a couple of things; now A22 converts correctly and the machine-processing metadata is correct. This just leaves Mark, AOF, and maybe Besa, depending on whether or not we want to re-release it to include Vigilance in this round.

ctschroeder commented 5 years ago

I am almost done with AOF. More soon.

bkrawiec commented 5 years ago

Is there something I am supposed to be doing on AOF?


ctschroeder commented 5 years ago

Nothing to worry about! A couple sections were treebanked, so I am adding verses and chapters. It's not a big deal. Just updating those two files to current standards since we are republishing.


ctschroeder commented 5 years ago

AOF is done

ctschroeder commented 5 years ago

Here are the URNs that need addressing. We should either put something on the 404 page or somehow redirect. It's probably easier to list them on the 404 page for now; I think any redirect would be complex with this application. urn-table.xlsx
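
[Editorial note: one possible way to produce the 404 listing from the attached urn-table.xlsx is sketched below. The spreadsheet's column layout is an assumption (URNs in the first column, with a header row), as is the output file name; adjust to the real table before using.]

```python
# Read URNs from the attached spreadsheet (hypothetical layout: first column,
# header row) and emit an HTML list that the 404 page could include.
from openpyxl import load_workbook

def urns_from_table(path: str = "urn-table.xlsx"):
    ws = load_workbook(path).active
    for row in ws.iter_rows(min_row=2, values_only=True):  # skip header row
        if row and row[0]:
            yield str(row[0]).strip()

def write_404_list(out_path: str = "missing-urns.html"):
    items = "\n".join(f"  <li>{urn}</li>" for urn in urns_from_table())
    with open(out_path, "w", encoding="utf8") as f:
        f.write("<ul>\n" + items + "\n</ul>\n")

if __name__ == "__main__":
    write_404_list()
```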

ctschroeder commented 5 years ago

Also @amir-zeldes can you check the parsing/segmentation/tagging metadata for the files that were treebanked? Not sure I got them right.

ctschroeder commented 5 years ago

Mark is as ready as it will be. @amir-zeldes I think we're good to go.

amir-zeldes commented 5 years ago

OK, I went over the AOF exports; they all check out now. The way to see those line numbers is to do a TEI export from the editor, download the file, and find that line number in the file. Usually I look for some word or translation nearby and actually check the grid instead of trying to figure out the XML, since the error is usually apparent in the grid too.

XL still needs versification, it seems - did you say you wanted to just follow the translations, or do something else? Once that checks out, I think we really are good to go!

Oh, and one more thing, is Besa/Vigilance included?

ctschroeder commented 5 years ago

Hi,

I didn't add verses to the part of XL that is not AOF. XL is a florilegium codex. I'm not sure what work that piece of XL is from. I'd have to look it up. The AOF section has verses and verse ids. I did go through just now to make sure the spans coincide with each other. If empty spans are a problem then you can just put in some placeholder like "undetermined".

I have not had time to touch Besa. The other corpora ended up more complicated than I anticipated. Besa is next on my list. If you want to wait for that, it may take a week or more, because I need to check the metadata pretty closely and add all the cts urns, and I have to go over the final white paper comments from board members (Heike's final report is due Dec 15, and we want to be sure they are close).

Only two letters of Besa are marked for review in Gitdox. Can you be sure everything you want reviewed is marked for review? I don't want to miss anything. Thanks so much!

Best, Carrie
