Closed amir-zeldes closed 8 years ago
Btw, is there documentation of the XML format somewhere? Or possibly even some open source code that reads/writes that format?
It's just the plain old Corpus Workbench vertical format, used by CQP for input corpora and by TreeTagger as output. It's probably documented in the CWB manual somewhere. You can also see a brief description in the CQPWeb paper - http://www.lancaster.ac.uk/staff/hardiea/cqpweb-paper.pdf . Basically, each line holds either one token with its tab-delimited attributes, or a single opening or closing tag.
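To make the line-based layout concrete, here is a minimal Python sketch of a reader for it. This is not the SaltNPepper or CWB implementation. The function name `parse_vertical`, the sample tags, and the column order (word, POS, lemma) are assumptions for illustration only; the actual GUM files may use different columns and tags.

```python
import re

# A line counts as a tag if it is a single <name ...> or </name> element.
TAG_RE = re.compile(r"^<(/?)([A-Za-z_][\w.-]*)([^<>]*?)(/?)>$")
ATTR_RE = re.compile(r'([\w:.-]+)="([^"]*)"')


def parse_vertical(text):
    """Turn vertical-format text into a flat list of events.

    Events are ("open", name, attrs), ("close", name) or ("token", columns).
    """
    events = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        match = TAG_RE.match(stripped)
        if match:
            closing, name, attr_text, _self_closing = match.groups()
            if closing:
                events.append(("close", name))
            else:
                events.append(("open", name, dict(ATTR_RE.findall(attr_text))))
        else:
            # Token line: tab-delimited attributes, first column is the word form.
            events.append(("token", line.split("\t")))
    return events


# Hypothetical sample; column order (word, POS, lemma) is assumed.
SAMPLE = (
    '<text id="doc1">\n'
    '<s type="decl">\n'
    "The\tDT\tthe\n"
    "cat\tNN\tcat\n"
    "sleeps\tVBZ\tsleep\n"
    "</s>\n"
    "</text>\n"
)

if __name__ == "__main__":
    for event in parse_vertical(SAMPLE):
        print(event)
```

The sketch keeps tags and tokens in one flat event list rather than building a tree, since the format itself does not guarantee that the pseudo-XML tags are properly nested.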
There is a SaltNPepper module that reads and writes this format in Java (SNP is open source). It's called the TreeTagger module(s), more or less for historical reasons:
OK, I know CQP. It is generic and doesn't define which columns, pseudo-XML tags, and attributes are available, so I guess the GUM corpus data files themselves serve as the documentation. Since you say you have a build bot, I assume that the format and semantics of the columns and tags are not likely to change much.
Implemented at: https://corpling.uis.georgetown.edu/gum/build.html
For public contributions to error correction, we need a wiki page explaining which format is primary for which annotations (e.g. POS tags are updated in the .xml files and automatically propagate to the syntax files and PAULA, but not the other way around).