CopticScriptorium / corpora

Public repository for Coptic SCRIPTORIUM Corpora Releases

Make checklists for releasing and updating corpora #3

Closed ctschroeder closed 9 years ago

ctschroeder commented 9 years ago

Needed: a list/workflow to mark texts that have been edited and are awaiting re-publication when the corpus is next published. (And, from that, a list of corpora that should be published when we next publish a batch of material.)
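
One way to build such a list is to compare each document's current version number with the version last published. The following is only a minimal sketch: the function names, the dictionary layout, and the document names are hypothetical, and it assumes version numbers are plain dotted integers such as 1.0.1.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '1.0.1' into (1, 0, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def awaiting_republication(working: Dict[str, str],
                           published: Dict[str, str]) -> List[str]:
    """Return documents whose working version is newer than the published one.

    `working` maps document name -> version in the editing repository.
    `published` maps document name -> version currently published.
    Both mappings are hypothetical inputs assembled by hand or by script.
    Documents that have never been published are also listed.
    """
    pending = []
    for doc, version in sorted(working.items()):
        released = published.get(doc)
        if released is None or parse_version(version) > parse_version(released):
            pending.append(doc)
    return pending


if __name__ == "__main__":
    # Hypothetical document names and versions, for illustration only.
    working = {"doc_a": "1.0.1", "doc_b": "1.0.0", "doc_c": "0.1.0"}
    published = {"doc_a": "1.0.0", "doc_b": "1.0.0"}
    print(awaiting_republication(working, published))
```

With the sample data, doc_a (edited since release) and doc_c (never released) are reported, while doc_b is not.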

Releasing:

  1. New and revised docs should be reviewed by a senior editor. (Check questions from annotators in the document and/or pull request, read through the document, and use Google Refine to see whether errors in tokenization, pos-tagging, lang-tagging, morph annotation, or normalization pop up. Be sure layer names conform to standards (see layer annotation documentation). Use the Add-in if possible to confirm that the norm annotation covers the same span as orig, groups are the same size, etc.)
  2. Add/correct metadata on new documents: Confirm that all metadata conforms to the standards in the layer annotation documentation on the wiki. Pay close attention to names of annotators, version number, and version date for documents. Typically a newly published document will have v. 1.0.0. If you wish to publish a document and use the visualizations to help correct and proofread, you may release it as 0.1.0 (or another number lower than 1.0.0).
  3. Check the Issues list for each corpus to be released (whether new or revised versions of documents). Each corpus may have a list of errors noticed by users or team members. (E.g., https://github.com/CopticScriptorium/ap-dev/issues/35). Make corrections, and note on the issues list that the corrections have been made.
  4. Add/correct metadata on edited, previously published documents: Confirm that all metadata conforms to the standards in the layer annotation documentation on the wiki. Pay close attention to names of annotators, version number, and version date for documents. Versioning: +1.0.0 for a major change to data and/or structure (entirely new layer annotation, entirely new tokenization method applied, etc.); +0.1.0 for significant edits that are still structurally compatible with previous versions; +0.0.1 for minor edits (e.g., fixing reported errors in transcription or pos-tags). Note: an annotator may have made a minor change a while back and changed the version # and version date accordingly, even though the revised document has not yet been published. We do not republish a corpus every time we make a minor revision to one document. You may wish to check the document's version # against the number in ANNIS.
  5. Add/correct the corpus metadata. Corpus metadata appears on the first document in a corpus. Confirm that all metadata conforms to the standards in the layer annotation documentation on the wiki. Pay close attention to names of annotators: the names of all annotators of all documents in a corpus should be in the corpus metadata; if someone has edited one document, be sure that person's name appears in the corpus metadata. Version date should be the date of re-release. Version #: +1.0.0 for a major change to data and/or structure (entirely new layer annotation, entirely new tokenization method applied, etc.); +0.1.0 for significant edits that are still structurally compatible with previous versions; +0.0.1 for minor edits (e.g., fixing reported errors in transcription or pos-tags).
  6. Convert to TEI. Confirm that the document validates against the EpiDoc TEI schema: http://www.stoa.org/epidoc/schema/latest/tei-epidoc.rng (Edit the document if validation fails.)
  7. Convert to relANNIS and PAULA.
  8. Post TEI, relANNIS, and PAULA files to the GitHub public repository in their respective directories.
  9. Check ANNIS visualizations, etc.
  10. Create a new release of the GitHub corpora repository, posting information about the latest changes in the release.
  11. Run a new ingest at data.copticscriptorium.org to account for new data (create new corpora, etc., if necessary; see documentation in wiki for this application).
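
The versioning rule in steps 4 and 5 and the annotator check in step 5 can be sketched in a few lines. This is an illustration only, not project tooling: the function names and sample data are hypothetical. The checklist does not say whether lower components reset on a bump; resetting them (as in semantic versioning) is an assumption here and can be switched off.

```python
from typing import Dict, Iterable, List


def bump_version(version: str, change: str, reset_lower: bool = True) -> str:
    """Apply the checklist's +1.0.0 / +0.1.0 / +0.0.1 rule.

    change: 'major' (new layer, new tokenization method, etc.),
            'significant' (significant but structurally compatible edits),
            'minor' (e.g. fixing reported transcription or pos-tag errors).
    reset_lower: ASSUMPTION, not stated in the checklist; zero the
                 components to the right of the one being increased.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        major += 1
        if reset_lower:
            minor, patch = 0, 0
    elif change == "significant":
        minor += 1
        if reset_lower:
            patch = 0
    elif change == "minor":
        patch += 1
    else:
        raise ValueError(f"unknown change type: {change!r}")
    return f"{major}.{minor}.{patch}"


def missing_annotators(corpus_annotators: Iterable[str],
                       document_annotators: Dict[str, List[str]]) -> List[str]:
    """List annotators named on some document but absent from corpus metadata."""
    listed = set(corpus_annotators)
    found = {name for names in document_annotators.values() for name in names}
    return sorted(found - listed)


if __name__ == "__main__":
    print(bump_version("1.2.3", "minor"))
    print(bump_version("1.2.3", "significant"))
    print(bump_version("1.2.3", "major"))
    # Hypothetical names, for illustration only.
    print(missing_annotators(
        ["Annotator A"],
        {"doc_a": ["Annotator A"], "doc_b": ["Annotator A", "Annotator B"]},
    ))
```

Any name returned by missing_annotators should be added to the corpus metadata before release.
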
ctschroeder commented 9 years ago

@ctschroeder ready to add to wiki

ctschroeder commented 9 years ago

published on wiki