amir-zeldes / gum

Repository for the Georgetown University Multilayer Corpus (GUM)
https://gucorpling.org/gum/

Guide for where to correct what #8

Closed amir-zeldes closed 8 years ago

amir-zeldes commented 8 years ago

For public contributions to error correction, we need a wiki page explaining which format is primary for which annotation layer. For example, POS tags are corrected in the .xml files and propagate automatically to the syntax files and PAULA, but not the other way around.

reckart commented 8 years ago

Btw, is there documentation of the XML format somewhere? Or possibly even some open source code that reads/writes that format?

amir-zeldes commented 8 years ago

It's just the plain old Corpus Workbench vertical format, used by CQP for input corpora and by TreeTagger as an output. It's probably documented in the CWB manual somewhere. You can also see a brief description in the CQPweb paper - http://www.lancaster.ac.uk/staff/hardiea/cqpweb-paper.pdf . It's basically one item per line: either a token with tab-delimited token attributes, or an opening or closing tag on a line of its own.
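To make the "one token per line, or a tag" description concrete, here is a minimal Python sketch of a reader for this kind of vertical file. It is only an illustration, not the code GUM or the Pepper module uses. The function name `parse_vertical`, the event tuples, and the sample columns (word, POS, lemma) are assumptions made up for the example; the actual columns and tags depend on the corpus.

```python
import re

# An opening tag such as <s type="decl"> or a closing tag such as </s>,
# alone on its line. Anything else is treated as a token line.
TAG_RE = re.compile(r'^<(/?)([A-Za-z_][\w:.-]*)((?:\s+[\w:.-]+="[^"]*")*)\s*/?>$')
ATTR_RE = re.compile(r'([\w:.-]+)="([^"]*)"')


def parse_vertical(lines):
    """Yield (kind, value, attrs) events from vertical-format lines.

    kind is "open", "close" or "token". For tags, value is the tag name;
    for tokens, value is the list of tab-delimited fields.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        match = TAG_RE.match(line.strip())
        if match:
            closing, name, attrs = match.groups()
            if closing:
                yield ("close", name, {})
            else:
                yield ("open", name, dict(ATTR_RE.findall(attrs)))
        else:
            yield ("token", line.split("\t"), {})


# Hypothetical sample: the column layout here is assumed, not GUM's.
sample = [
    '<text id="example">',
    '<s type="decl">',
    "This\tDT\tthis",
    "works\tVBZ\twork",
    "</s>",
    "</text>",
]

for event in parse_vertical(sample):
    print(event)
```

The reader deliberately knows nothing about what the columns mean; it just hands back the fields in order, so the caller decides which column is the POS tag or the lemma.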

There is a SaltNPepper module that reads and writes this format in Java (SNP is open source). It's called the TreeTagger module(s), more or less for historical reasons:

https://github.com/korpling/pepperModules-TreetaggerModules

reckart commented 8 years ago

Ok, I know CQP. CQP is generic and doesn't define which columns, pseudo-XML tags and attributes are available, so I guess the GUM corpus data files themselves serve as the documentation. Since you say you have a build bot, I assume that the format/semantics of the columns/tags are not going to change a lot.
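If the data files are the documentation, one rough way to read it off is to survey a file for the tags, attributes and column counts it actually contains. The sketch below assumes the same one-item-per-line layout; `survey_vertical` and the sample lines are hypothetical names and data for illustration only.

```python
import re
from collections import Counter

# Opening tag alone on a line, with optional name="value" attributes.
OPEN_TAG_RE = re.compile(r'^<([A-Za-z_][\w:.-]*)((?:\s+[\w:.-]+="[^"]*")*)\s*/?>$')
CLOSE_TAG_RE = re.compile(r'^</[A-Za-z_][\w:.-]*\s*>$')
ATTR_RE = re.compile(r'([\w:.-]+)="[^"]*"')


def survey_vertical(lines):
    """Return (tag_attrs, column_counts) for a vertical-format file.

    tag_attrs maps each tag name to the set of attribute names seen on it;
    column_counts counts token lines by their number of tab-delimited fields.
    """
    tag_attrs = {}
    column_counts = Counter()
    for raw in lines:
        line = raw.rstrip("\r\n")
        stripped = line.strip()
        if not stripped or CLOSE_TAG_RE.match(stripped):
            continue  # blank lines and closing tags carry no new information
        match = OPEN_TAG_RE.match(stripped)
        if match:
            names = ATTR_RE.findall(match.group(2))
            tag_attrs.setdefault(match.group(1), set()).update(names)
        else:
            column_counts[len(line.split("\t"))] += 1
    return tag_attrs, column_counts


# Hypothetical sample lines; real files may use other tags and columns.
survey_sample = [
    '<text id="example">',
    '<s type="decl">',
    "This\tDT\tthis",
    "works\tVBZ\twork",
    "</s>",
    "</text>",
]

tags, columns = survey_vertical(survey_sample)
print(tags)
print(columns)
```

A file where more than one column count shows up is worth a second look, since it usually means a token line is missing a field.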

amir-zeldes commented 8 years ago

Implemented at: https://corpling.uis.georgetown.edu/gum/build.html