CopticScriptorium / corpora

Public repository for Coptic SCRIPTORIUM Corpora Releases

Publication thread for Spring 2018 #15

Closed: ctschroeder closed this 6 years ago

ctschroeder commented 6 years ago

Corpora scheduled for publication:

move to Summer/Fall 2018

Reminder of process:

ctschroeder commented 6 years ago

Reminder that the deadline is March 19. Version # will be 2.5.0. Version date TBA -- 23 April? 16 April? (FYI I will be traveling, in NYC April 16-21.)

ctschroeder commented 6 years ago

@eplatte pseudo-theophilus is ready for review. Corpus metadata and version date need to be added, but everything else is ready. There is one bright pink highlighted line in one of the docs, but otherwise it was straightforward. The first two docs were published before; the second two are new. Thanks!

amir-zeldes commented 6 years ago

Is March 19 the data freeze date?

ctschroeder commented 6 years ago

Yes.

amir-zeldes commented 6 years ago

Sounds good

eplatte commented 6 years ago

I was under the impression that March 19 was the submission deadline and April 1 would be the data freeze (i.e. all editing would be done by then). It doesn't really matter to me either way, although I have a conference from March 14 to March 18, so I'll need anything to review before then. If we want March 19 to be the data freeze, when do we want submissions? I'll need to email Christie and Marina to give them a new deadline.

ctschroeder commented 6 years ago

My mistake. @amir-zeldes, @eplatte is right: March 19 submission, April 1 data freeze. My apologies!!
What date should we put on the version? April 1? 5? 8? We usually make small edits after the data freeze, because something comes up.

amir-zeldes commented 6 years ago

That's all fine. The release date can be the freeze date if you like - it won't really matter because there won't be a different publication version available online until after the initial release with that date. As long as we move forward in time with the dates all should be well!

eplatte commented 6 years ago

April 1 for the release date sounds good to me!

ctschroeder commented 6 years ago

for @amir-zeldes re AP validation: the "publish" or "to publish" or "review" AP that are not validating right now are ones that have validation errors we don't/I can't currently solve.

  1. These have extra tokens for visualization purposes (for the previous or next column, for example):
    • AP.041.syncletica.09
    • AP.119.Sisoes.16
  2. These have blank lemmas (validation error is "Span break on line X in column norm but not lemma" and the lemma is blank, due to lacuna in manuscript)
    • AP.095.n282.charity
    • AP.100.n294.crocodiles
    • AP.094.n286.charity
    • AP.101.macarius.01

eplatte commented 6 years ago

@ctschroeder Apa Johannes FA29-30 ready for review. See email for what happened to the other documents; hopefully I can get them done for Tuesday.

amir-zeldes commented 6 years ago

@ctschroeder Issue 1 is known and you don't need to do anything about it at the moment. For lacunae, I'd like to suggest we lemmatize them using the string UNKNOWN, to match the corresponding POS tag. Does that make sense? Then anything that has a POS tag also has a lemma, which is nice for consistency and may prevent all sorts of problems.

amir-zeldes commented 6 years ago

Alternatively we can also lemmatize them with their norm string (lemma=norm), I'm open to either.

ctschroeder commented 6 years ago

lemma UNKNOWN sounds good to me. Thanks, @amir-zeldes. @eplatte: ok will check email.
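For reference, the agreed convention (a blank lemma on a lacuna gets the string UNKNOWN) could be applied in bulk along these lines. This is only a minimal sketch: the row layout, the `fill_blank_lemmas` helper, and the sample values are hypothetical and are not taken from GitDox or the corpus spreadsheets.

```python
def fill_blank_lemmas(rows, placeholder="UNKNOWN"):
    """Return copies of the rows with empty lemmas replaced by the placeholder.

    `rows` is assumed to be a list of dicts with "norm", "pos" and "lemma"
    keys; this layout is hypothetical, chosen only for illustration.
    """
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's data is not modified
        if not (new_row.get("lemma") or "").strip():
            new_row["lemma"] = placeholder
        filled.append(new_row)
    return filled


# Made-up sample rows: the second one stands for a token in a lacuna.
rows = [
    {"norm": "token1", "pos": "N", "lemma": "lemma1"},
    {"norm": "token2", "pos": "UNKNOWN", "lemma": ""},
]

for row in fill_blank_lemmas(rows):
    print(row["norm"], row["pos"], row["lemma"])
```

The alternative mentioned above (lemma=norm) would only change the assignment to `new_row["lemma"] = new_row["norm"]`.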

amir-zeldes commented 6 years ago

got it, thanks

ctschroeder commented 6 years ago

@amir-zeldes Mark 1-5 and 7 are marked for review in Gitdox. Is Mark 6 treebanked and ready, as well?

ctschroeder commented 6 years ago

Also: anyone know why all those docs in ISYE corpus are marked "review"? @amir-zeldes @eplatte @bkrawiec We are not planning to republish, are we?

eplatte commented 6 years ago

Those are documents with updated meta, but not for this publication. There may be other corpora like this, though most meta updates were in ISYE and AP I think. I still have some more to do as well.

amir-zeldes commented 6 years ago

No, C.6 isn't treebanked so far.

ctschroeder commented 6 years ago

@eplatte Re the empty cells not validating: I think this is an artifact of a merged cell or unmerged cell. I've seen it before, and I merge the non-validating cells with one below or above, unmerge, and it's fine.

ctschroeder commented 6 years ago

I edited the Apa Johannes FA 33-34 file in SGML and reimported it into the spreadsheet. Something broke: the line breaks didn't import. I even tried reimporting the version-controlled SGML that's stored in GitDox (without my edits), and that file import didn't preserve the line breaks either. @amir-zeldes I don't know if this is an issue with GitDox or with the layer name.

amir-zeldes commented 6 years ago

Can you point me to the exact file in github that is not re-importable? Or send me the culprit via e-mail?

ctschroeder commented 6 years ago

@amir-zeldes & @eplatte I think the AP are ready. Still working on the other corpora.

amir-zeldes commented 6 years ago

Fantastic, thanks!

ctschroeder commented 6 years ago

@amir-zeldes @eplatte pseudo-theophilus should be ready

ctschroeder commented 6 years ago

I just changed the lemmas for two cells in one of the docs. I hope you haven't started these yet. (sorry!!!)

ctschroeder commented 6 years ago

the pseudo-theophilus I mean.

amir-zeldes commented 6 years ago

I just validated TEI, it's almost good to go:

Body validates fine.

ctschroeder commented 6 years ago

Great. You are talking about pseudo-theophilus? I believe version@date and version@n are in each document.
I thought @eplatte was working on the corpus metadata so didn't even look at that. Thanks for the catch. One of us will get this. The double quotes were in the metadata the last time around so I didn't know it was a problem. Would it be possible for you to add a validation check to our validation schema to catch this in the future? Will fix these docs later today.

eplatte commented 6 years ago

I just checked ps-theophilus and version@date and version@n are there. Also, with the double quotes, will validation be a problem for linked metadata, since those have xml? I'll add the corpus metadata now!

ctschroeder commented 6 years ago

I have made the edits to the document metadata. @eplatte I think all that's left is the corpus metadata. I still don't see it in there.

ctschroeder commented 6 years ago

(I wonder if there is something wonky with corpus metadata?)

eplatte commented 6 years ago

OK I just re-added the corpus metadata for ps-theo and re-checked that it's showing up in a separate document, so I think we should be good to go.

ctschroeder commented 6 years ago

🙌🙌🙌


amir-zeldes commented 6 years ago

OK, sorry about the metadata, that was a very stupid bug (doc_id is NULL for corpus meta, doc_id must be unique or null, and corpus meta ID was being entered as String 'NULL' rather than actual NULL, so it overwrote the last corpus)

Should be fixed now. I added meta for AP based on the previous corpus, but then it occurred to me that maybe the list of annotation/translation has changed? @eplatte, could you check, and enter the Johannes meta again? Sorry about this!
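To illustrate the kind of bug described here, the sketch below reproduces it with an in-memory SQLite database. The `meta` table, its columns, and the `demo` helper are assumptions made up for this example; they are not GitDox's actual schema or code. The point is only that a UNIQUE column accepts many real NULLs, while the string 'NULL' is an ordinary value that collides with itself.

```python
import sqlite3


def demo(doc_id):
    """Store corpus-level metadata for two corpora under the given doc_id.

    Returns the corpus names that are still in the table afterwards.
    The table layout is hypothetical and only mimics the described constraint.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE meta (doc_id TEXT UNIQUE, corpus TEXT, value TEXT)")
    for corpus in ("AP", "johannes"):
        # INSERT OR REPLACE deletes any existing row that violates UNIQUE.
        conn.execute(
            "INSERT OR REPLACE INTO meta (doc_id, corpus, value) VALUES (?, ?, ?)",
            (doc_id, corpus, "corpus metadata for " + corpus),
        )
    names = [r[0] for r in conn.execute("SELECT corpus FROM meta ORDER BY corpus")]
    conn.close()
    return names


print(demo("NULL"))  # string 'NULL': the second insert replaces the first
print(demo(None))    # real NULL: both corpus rows survive
```

With the string `"NULL"` only the last corpus remains; with `None` (a real SQL NULL) both rows are kept, which matches the behaviour described above.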

ctschroeder commented 6 years ago

made the changes to pseudo-theophilus. That corpus is ready for redoing.

ctschroeder commented 6 years ago

@amir-zeldes @eplatte not sure if you saw my comment on Friday. Ps-th is ready. I'm trying to look over Apa Johannes tonight

ctschroeder commented 6 years ago

ok that was easy! Apa Johannes is ready.

ctschroeder commented 6 years ago

Also I don't know why it's telling me "Metadata for facsimile_graphic_url does not match pattern"

eplatte commented 6 years ago

I'm not sure about that, either. I didn't change the meta, and I'm pretty sure it was all validating before. I've just checked the TEI for all of Apa Johannes and re-checked ps-Theophilus just in case, and it's all valid!

amir-zeldes commented 6 years ago

It had double quotes in the href, instead of single quotes. I just changed it, it should all validate now.

ctschroeder commented 6 years ago

Oh thanks. The link functioned in gitdox so I couldn’t tell what was going on.

amir-zeldes commented 6 years ago

Yes, double quotes work fine in GitDox, but lead to &quot; in ANNIS, which breaks the link...
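A check for this could look roughly like the sketch below, which flags attributes written with double quotes inside a metadata value. The function name, the regular expression, and the example URLs are hypothetical; this is not the project's validation schema, just one way such a rule might be expressed.

```python
import re

# Matches an attribute name followed by = and an opening double quote,
# e.g. href=" ; single-quoted attributes are not matched.
DOUBLE_QUOTED_ATTR = re.compile(r'\b([\w:-]+)\s*=\s*"')


def find_double_quoted_attrs(value):
    """Return the names of attributes that use double quotes in `value`."""
    return DOUBLE_QUOTED_ATTR.findall(value)


# Made-up metadata values for illustration only.
bad = '<a href="https://example.org/image.jpg">image</a>'
good = "<a href='https://example.org/image.jpg'>image</a>"

print(find_double_quoted_attrs(bad))   # non-empty list: should be reported
print(find_double_quoted_attrs(good))  # empty list: passes the check
```

A validation rule could report any metadata field for which the returned list is non-empty.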

ctschroeder commented 6 years ago

Thanks for the clarification. We can't see that in GitDox, since it turns into a live link.

amir-zeldes commented 6 years ago

OK, pseudo.theophilus, AP and johannes are up for review in ANNIS. Talk in a bit...

eplatte commented 6 years ago

I've taken a look at all visualizations and all metadata for all new documents, plus corpus metadata, and it all looks good to me.