cdli-gh / mtaac_work

MTAAC work packages
https://cdli-gh.github.io/mtaac/

Converter from Brat standoff annotation format to CDLI-CoNLL format #37

Closed by epageperron 6 years ago

epageperron commented 6 years ago

Summary

As part of our gold corpus annotation pipeline, we need a converter from the Brat standoff annotation format to the CDLI-CoNLL format.

Other links or relevant information

The converter should first fetch the CDLI-CoNLL data from the database for the text being converted and reuse the ID, FORM, SEGM and XPOSTAG columns. (Since the database isn't set up yet with the CDLI-CoNLL field, this will have to be done from a file for now.)
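A minimal sketch of the "from a file for now" step, to make the expected input concrete. It assumes the file is tab-separated, that comment and header lines start with `#`, and that ID, FORM, SEGM and XPOSTAG are the first four columns in that order. The function name `read_base_columns` and the sample rows are made up for illustration and are not taken from real CDLI data.

```python
BASE_COLUMNS = ("ID", "FORM", "SEGM", "XPOSTAG")


def read_base_columns(lines):
    """Return one dict per token row, keeping only the four reused columns."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines (assumed to start with "#").
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < len(BASE_COLUMNS):
            raise ValueError(f"expected at least 4 columns, got: {line!r}")
        # zip() stops after the four base columns, so any extra columns are dropped.
        rows.append(dict(zip(BASE_COLUMNS, fields)))
    return rows


# Invented sample input, standing in for a CDLI-CoNLL file.
sample = [
    "# ID\tFORM\tSEGM\tXPOSTAG\n",
    "1\ttok-a\tseg-a\tN\n",
    "2\ttok-b\tseg-b\tV\n",
]
base_rows = read_base_columns(sample)
```

With a real file, the same function can take the open file object directly, for example `read_base_columns(open(path, encoding="utf-8"))`. Once the database field exists, only this reading step should need to change.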

It should then convert the syntax and semantic annotations from the Brat standoff format to the CoNLL-U format for the columns HEAD, DEPREL (not using DEPS) and MISC.
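A rough sketch of how the syntax part could work, under several assumptions that would need checking against our actual Brat files: each text-bound annotation (`T` line) corresponds to exactly one token, the annotations sorted by start offset line up one-to-one with the CDLI-CoNLL rows, and in each relation (`R` line) `Arg1` is the head and `Arg2` the dependent. The function names are hypothetical, the sample data is invented, and root handling and the MISC column are left out.

```python
def parse_brat(ann_text):
    """Collect text-bound spans and binary relations from Brat standoff text."""
    spans = {}      # e.g. "T1" -> (start, end)
    relations = []  # (label, head_key, dependent_key)
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        parts = line.split("\t")
        key = parts[0]
        if key.startswith("T"):
            # "T1<TAB>Type start end<TAB>text"; for discontinuous spans
            # ("0 5;16 23") only the first fragment is used here.
            _type, start, end = parts[1].split(";")[0].split()
            spans[key] = (int(start), int(end))
        elif key.startswith("R"):
            # "R1<TAB>label Arg1:T2 Arg2:T1"
            label, arg1, arg2 = parts[1].split()
            relations.append((label, arg1.split(":", 1)[1], arg2.split(":", 1)[1]))
    return spans, relations


def add_syntax(rows, ann_text):
    """Return copies of rows with HEAD and DEPREL filled in from the relations."""
    spans, relations = parse_brat(ann_text)
    ordered = sorted(spans, key=lambda k: spans[k])
    if len(ordered) != len(rows):
        raise ValueError("number of Brat spans does not match number of tokens")
    # Assumption: the i-th span by offset is the i-th token row.
    key_to_id = {key: row["ID"] for key, row in zip(ordered, rows)}
    out = [dict(row, HEAD="_", DEPREL="_") for row in rows]
    by_id = {row["ID"]: row for row in out}
    for label, head_key, dep_key in relations:
        dependent = by_id[key_to_id[dep_key]]
        dependent["HEAD"] = key_to_id[head_key]
        dependent["DEPREL"] = label
    return out


# Invented sample data for illustration only.
rows = [
    {"ID": "1", "FORM": "tok-a", "SEGM": "seg-a", "XPOSTAG": "N"},
    {"ID": "2", "FORM": "tok-b", "SEGM": "seg-b", "XPOSTAG": "V"},
]
ann = "T1\tN 0 5\ttok-a\nT2\tV 6 11\ttok-b\nR1\tnsubj Arg1:T2 Arg2:T1\n"
annotated = add_syntax(rows, ann)
```

Tokens without an incoming relation keep `_` in both columns here. Whether they should instead get HEAD `0` and DEPREL `root`, as in CoNLL-U, is a decision still to be made.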

Semantic annotations will go into custom columns that still have to be defined as part of #30, so that issue currently partly blocks this task.

For information about the CoNLL-U syntax field format (which will be exactly the same in CDLI-CoNLL, except for DEPS, which we will not be using), see the "Syntactic Annotation" section of http://universaldependencies.org/format.html

For information about the Brat format, see http://brat.nlplab.org/standoff.html

We have a working converter from CoNLL-U to Brat, which can also reverse the process, although we haven't tested that direction yet. Could the code be reused? See https://github.com/cdli-gh/conllu.py

For examples of CDLI-CoNLL files as they will appear in the database, see the MTAAC drive: MTAAC > Annotations > Annotation Test (morph) > Fully Annotated Files

Roadmap Data

🗓 Start Date: 2017-11-28

🗓 Expected Date: 2018-01-03

💪 Label: wp

📈 Progress (0-1): 0.1

See Gantt: http://cdli-dev.org/gantt/mtaac_work/