CopticScriptorium / corpora

Public repository for Coptic SCRIPTORIUM Corpora Releases

Publication Thread for Spring 2019 #22

Closed ctschroeder closed 5 years ago

ctschroeder commented 5 years ago

March 7, 2019 due date. Corpora will be listed after Fall 2018 publications #19

ctschroeder commented 5 years ago

Decision: Will publish Johannes corpus right away, before editorial review, to help catch formatting, TEI, and segmentation issues.

ctschroeder commented 5 years ago

@eplatte @cluckmarq Hi it's March 7. I'm running behind because I'm sick AGAIN. I will be working on Johannes Canons this weekend and they will be ready Monday. So if you all need until Monday please take the time. Christie: will you have more Dirt? I have not gotten to any of those fragments but can squeeze some in this weekend (probably segmentation:checked, auto everything else). Please let me know!
@amir-zeldes Heike has had the flu and hasn't finished Besa 2-3 but is plugging away. LMK if you still want to publish 1. We could also probably publish 2-3 without translation if you want. Again LMK.

Please reply to this thread or shoot me an email if you have any other questions/concerns about this publication cycle.

cluckmarq commented 5 years ago

@ctschroeder Sorry you are sick! :/ I am hoping to have a hunk of Dirt (GF113-128) by the end of the day (at the latest before lunch tomorrow). Almost done checking tokenization, but am solely working on this today.

ctschroeder commented 5 years ago

that's awesome @cluckmarq! Thanks!

eplatte commented 5 years ago

I hope you're feeling better soon, Carrie! I'm just finishing up my sections of Johannes (making changes from openrefine, adding versification), but I'm not going to be done tonight. I'm hoping to get everything done tomorrow, but they'll definitely be ready by Monday!

amir-zeldes commented 5 years ago

Oh no, I'm sorry to hear you're sick too Carrie. Hope you feel better soon. For Besa 2, honestly it's so short we could include a translation of our own, I could ask one of the Coptic-speaking students here if they want to take it on too. I'd try not to mix translated+not in this release, it's not worth the headache of keeping track of the difference for 1-2 fragments IMO.

Thanks everyone!

cluckmarq commented 5 years ago

@ctschroeder just an fyi. I am still working on correcting Dirt in ether (about 15% of way through text). Orthographic variants in the manuscript have translated into NLP misinterpreting a ton of things. But I continue to work on it, and hope I'll have made it through by Monday.

ctschroeder commented 5 years ago

Ok!

cluckmarq commented 5 years ago

@ctschroeder about 500 lines left to review in ether. i will try to finish up tonight after i put kids to bed. so close....

cluckmarq commented 5 years ago

@ctschroeder done! i've assigned dirt (gf 113-128) to you for review. if it's not you reviewing, just let me know so i can assign properly. i am sure i've missed a few things. the main issue is that gf uses ϭ for ⲑ in several places. but, i think i've caught most of them. and there was one place i couldn't figure out what was going on grammatically: i'm not sure what the grouping currently in line 3261 is.

ctschroeder commented 5 years ago

Ok thanks. I’ll get to this at the end of the week or early next, Christie!

bkrawiec commented 5 years ago

@cluckmarq @ctschroeder Since I wasn't publishing I missed this discussion. It's not that GF "uses ϭ for ⲑ in several places." That's a known factor in the process of scraping the data--when Amir changes David's Word transcription into what we use, that letter consistently gets altered. I usually just search for the letter and change it. Sorry to be late on this!
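A minimal sketch of that search step, assuming the scraped transcription is available as plain text lines. `flag_shima` is a hypothetical helper, not part of the project's scraper. Since ϭ (U+03ED) is also a legitimate Coptic letter, the sketch only lists candidates for manual review and replaces nothing.

```python
SHIMA = "\u03ed"  # ϭ, the letter that turns up after scraping


def flag_shima(lines):
    """Return (line_number, column, line) for every ϭ found, both 1-indexed."""
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for col, char in enumerate(line, start=1):
            if char == SHIMA:
                hits.append((line_no, col, line))
    return hits
```

Each hit can then be checked against the manuscript transcription before changing the letter by hand.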

amir-zeldes commented 5 years ago

Huh, somehow this wasn't on my radar, but looking at the scraper script I was able to find the problem - if we ever have more of this kind of data, it shouldn't happen again. Sorry about that!

ctschroeder commented 5 years ago

Hi @amir-zeldes. There are a bunch of files from 1 Cor & Mark plus single files in a22, victor, abraham, and fox that are marked "review" in GitDox. Are these all treebanking files? I am assuming we are not publishing them in this go-around. Thanks so much!

amir-zeldes commented 5 years ago

We can if we want to, or we can wait for next time, but either way they do not need to be checked (even if there is a stray error somewhere, they should be much more error free than any of the other datasets we release)

ctschroeder commented 5 years ago

Ok yes. I will check the version # and dates for the texts in corpora we are publishing and leave the rest for another time. Can you do me a favor and check the annotation metadata to be sure the right people are credited? Thanks so much!!

ctschroeder commented 5 years ago

Hi @amir-zeldes. I am done looking over the AP and Besa docs! Could you please check the annotators for any of those that were tree-banked and then put them on the private ANNIS instance? Also FYI: I added chapter/verse versification so these docs keep up with our data model. HOWEVER for Besa, this means they don't all validate now, because the validation rule is translation=verse; Kuhn's verses are long, multisentence. So for the old Besa letters with short translation spans, this mismatch makes them invalid. We can either ignore, change validation rules, or move the translation around. Let me know what you think!

ctschroeder commented 5 years ago

@cluckmarq I'm almost done with your Dirt files! Looks good. I'm making a couple of lemmatization and normalization changes with some odd spellings, but I don't anticipate major questions for you. Thanks!

amir-zeldes commented 5 years ago

OK, Liz has been added to annotation of AP1-4, 27-36, since she treebanked them. Besa treebanking was all me, so no need to add.

Before putting the current versions in ANNIS, I'm noticing some of the AP have verse instead of verse_n, and I just discovered online that some corpora have verse (Victor), and some have verse_n (Pseudo-theophilus)... Which one do we want it to be? I should adjust the vis to look for what we decide on.
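To see how widespread the mismatch is before deciding, something like the following could group documents by the layer name they use. This is only a sketch: `audit_verse_layer` and the input shape (document name mapped to its layer names) are assumptions for illustration, not how GitDox or ANNIS expose this information.

```python
def audit_verse_layer(docs):
    """Group document names by which verse layer name they carry.

    `docs` maps a document name to a collection of its layer (column) names.
    """
    report = {"verse": [], "verse_n": [], "both": [], "neither": []}
    for name, layers in sorted(docs.items()):
        has_verse = "verse" in layers
        has_verse_n = "verse_n" in layers
        if has_verse and has_verse_n:
            report["both"].append(name)
        elif has_verse:
            report["verse"].append(name)
        elif has_verse_n:
            report["verse_n"].append(name)
        else:
            report["neither"].append(name)
    return report
```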

amir-zeldes commented 5 years ago

RE verse!=trans: it's OK as long as trans never covers multiple verses (opposite is OK, and already the case, compare: http://data.copticscriptorium.org/texts/besa_letters/to_aphthonia/norm)
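That constraint can be sketched as a containment check. The function name and the span representation (inclusive token offsets) are assumptions for illustration, not the actual GitDox validation rule.

```python
def trans_spans_crossing_verses(verse_spans, trans_spans):
    """Return the translation spans that do not fit inside a single verse.

    Spans are (start, end) token offsets, inclusive at both ends.
    """
    bad = []
    for t_start, t_end in trans_spans:
        inside_one_verse = any(
            v_start <= t_start and t_end <= v_end
            for v_start, v_end in verse_spans
        )
        if not inside_one_verse:
            bad.append((t_start, t_end))
    return bad
```

Several short translation spans inside one long verse pass this check; a translation span that straddles a verse boundary is reported.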

amir-zeldes commented 5 years ago

OK, Besa is converted and visible to developers as besa.letters_test in ANNIS (log in and toggle visible corpora from 'scriptorium' to 'all')

ctschroeder commented 5 years ago

Hi @amir-zeldes. Thanks for putting up Besa. I'll check it soon. In the meantime can you put Dirt on the private ANNIS? One doc has trouble validating the lang column; it keeps saying some empty cells don't conform. I've tried everything -- deleting contents, adding valid contents, hitting return, doing this for the whole doc, validating (it validates), and then deleting those contents. But in the end the empty cells still get flagged.

amir-zeldes commented 5 years ago

OK, I actually just re-uploaded Besa because the translation spans were very large and I wanted them sentence-wise for eventual treebanking.

I also got the dirt spreadsheets to validate - there were all sorts of weird hidden values under the existing merged spans, I'm not sure how they got there. One way to get rid of them seems to be to merge the cell above them into them, then unmerge.

The problem I have with dirt now is that GF113-128 is very large - about double NBFB. I know they are contiguous, but can we break the pages into two documents? I'd say GF122 could be a good spot - close to the middle and starts a new sentence. I would re-number the chapters then though, so we have a new chapter in GF122. Does that sound OK? If so I can make the partition myself, just let me know.

ctschroeder commented 5 years ago

Hi @amir-zeldes!
Re Besa: did you change the verse numbering? Those numbers are Kuhn's and we are trying to keep to canonical numbering. I did notice the long spans but didn't change them for that reason.

Thanks for fixing Dirt!

Re Dirt GF 113-128: please do not change the chapter divisions. Those are David's divisions; I realize versification is ultimately arbitrary or subjective, but I would like to keep the chapter/paragraph divisions of the donating editor. As to where to divide, I would suggest GF 121 to begin a new document, because that's a new folio. It's not a new sentence but it is a new bound group and a new word. I would like to ask @cluckmarq and @bkrawiec what they prefer. Divide at GF 122 (a verso page) because it begins a new sentence or GF 121 because it begins a new folio (recto).

Also, when you get a chance can you put Johannes canons (anything "to publish" OR "review" - should be 8 documents) on the private ANNIS? Not all the metadata is there and not all have vid's but the spreadsheets should be valid and we should be able to see really wonky things to edit. Thanks so much!!

ctschroeder commented 5 years ago

@amir-zeldes I talked to Christie about a couple things incl GF. She and I both prefer breaking at GF 121. I know you prefer GF 122 (a new sentence) bc of treebanking and entities. Do we really need to break it into two? Any possibility we can keep it one doc?

amir-zeldes commented 5 years ago

No, no problem breaking at 121 - it will make a weird sentence boundary, but it's negligible in the context of the treebank (we have some fragmentary sentences anyway). Would you like me to break it there?

I think a long document will be a hassle in all sorts of contexts in the future, so I prefer to have some limit to document lengths. For readers it may also be more convenient to be able to scroll to metadata etc. more quickly, and splitting into two seems like a very minor change.

amir-zeldes commented 5 years ago

RE Besa, Kuhn's divisions are indicated in p_n, so those stay unchanged. If you look at the 'verses' visualization you'll see it's fine. The only thing that changes is the extent of the highlighted region with floating translation when you hover over a part of a Kuhn paragraph. The analytic vis also looks much better this way, so I don't see a downside (plus I needed those spans for treebanking)

ctschroeder commented 5 years ago

Re GF great, so yes you can split at the beginning of 121. Can you mark them Review in GitDox so I don’t forget to check the metadata etc?

Re Besa let me take a look. I’m more worried about chapter_n, verse_n, and vid_n all keeping Kuhn’s numbers. P_n is easy peasy. I’ll get back when I check.

amir-zeldes commented 5 years ago

OK I just looked at GF, but 121 is not flush with the beginning of the chapter, so what do you want to do about the chapter span? Will GF 121 begin a new chapter (and the last chapter of GF120 is just two groups), or do you want the same chapter number attested in two documents? Technically nothing prevents that, but it does seem a little confusing.

amir-zeldes commented 5 years ago

RE Besa - I didn't change chapter_n, vid_n etc., only the English translation. The rest lines up with Kuhn, and translated is nested within Kuhn spans.

ctschroeder commented 5 years ago

For GF 121/122: pls keep the chapter and verse numbers as they are. Break them across the docs. I may need to renumber -- I am in email convo with David about chapter numbering right now -- but the chapter spans will stay the same. It's fine if they break across docs. Happens all the time.
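A rough sketch of a split that leaves numbering alone, assuming each row of the document is a dict with hypothetical `page` and `chapter` keys (the real spreadsheets are laid out differently). The rows are cut at the first row of the given page and no chapter value is rewritten, so a chapter may appear in both halves.

```python
def split_at_page(rows, first_page_of_second_doc):
    """Split rows into two documents at the first row on the given page."""
    for i, row in enumerate(rows):
        if row["page"] == first_page_of_second_doc:
            return rows[:i], rows[i:]
    raise ValueError(f"page {first_page_of_second_doc!r} not found")
```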

amir-zeldes commented 5 years ago

OK, split documents are up, I updated the big one to a status to_delete, feel free to remove if the split looks good

ctschroeder commented 5 years ago

thanks @amir-zeldes. I will look at these all tonight or tomorrow. Had a big push reviewing Beth's johannes docs this weekend.

eplatte commented 5 years ago

OK I'm done reviewing Carrie's Johannes docs. Thanks for your patience! I went through all with Open Refine. There is one section of FA143-158 that I couldn't figure out, in line 445 and lines 453-457. I think these are parallel expressions with ⲥⲟⲩⲛ (ⲥⲟⲟⲩⲛ), but I'm not sure what the verbs might be. I also noticed that on the main corpus page the metadata value for license is showing as invalid, though it doesn't come up with the metadata validation on each document. I'm sure I've used the wrong quotation marks. @amir-zeldes is there a way we could fix all five documents at once?

amir-zeldes commented 5 years ago

I had a look, it's not just quotes, which should be single here, but also the angle brackets, which should be escaped. So instead of:

<a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0 Unported</a>

It should be:

&lt;a href='https://creativecommons.org/licenses/by-sa/3.0/'&gt;CC BY-SA 3.0 Unported&lt;/a&gt;

I fixed it in the database.
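For future documents, the same conversion can be sketched with Python's standard `html.escape`. `escape_license_link` is a hypothetical helper name, and this is not the fix that was applied in the database.

```python
import html


def escape_license_link(markup):
    """Switch double quotes to single quotes, then escape &, < and >."""
    single_quoted = markup.replace('"', "'")
    # quote=False leaves the single quotes as they are
    return html.escape(single_quoted, quote=False)
```

Applied to the first form above, this yields the second form.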

ctschroeder commented 5 years ago

@amir-zeldes I'm logged into ANNIS at https://corpling.uis.georgetown.edu/annis/scriptorium and I don't see any of the new documents. shenoute.dirt has only one doc, the ap corpus doesn't have any of the new sayings, etc.

amir-zeldes commented 5 years ago

You need to log in and toggle visible corpora from 'scriptorium' to 'all', since these are not in the white list of corpora to display in scriptorium. Besa and Dirt are in, but I haven't converted AP yet, I think it still has some validation errors in GitDox. Do all AP documents already have verses?

ctschroeder commented 5 years ago

No I only added them to new/modified ones. We can ignore it if you want. I have been trying to keep up with them as we publish. Re validation errors in AP I think I mentioned upthread some ones I couldn’t figure out.

Also Johannes Canons were ok’d for prepublication as well

ctschroeder commented 5 years ago

I will check on Besa and Dirt tonight or Sunday. Thx for the tip on finding them!

ctschroeder commented 5 years ago

@amir-zeldes some prepub notes:

amir-zeldes commented 5 years ago

OK, I added johannes.canons_test and I reset permissions on shenoute.dirt_test (those are the corpus names). Can you check again? It might have been a permissions issue. If you can see besa.letters_test you should be able to see those two as well.

I also had to rename pb_n to pb_xml_id in some Johannes documents, and remove the TEI span. The pb_n seems to follow a different format though, so unless that's intentional, they should probably be renamed to FA143 etc. (not just a number)

I'll take a look at Besa vis and AP next - do we want to release them without verse_n?

ctschroeder commented 5 years ago

@amir-zeldes re besa and ap: I don't have time to add verses to all the AP so yes, release w/o verse_n in all of them. Besa should already have verse_n in all docs, no?

amir-zeldes commented 5 years ago

Yes, I think Besa is good to go.

amir-zeldes commented 5 years ago

AP053 is fixed

ctschroeder commented 5 years ago

Released May 31