Add support for .conllu format?

nschneid commented 8 years ago

Thanks for an excellent tool! I've started using it for annotating with Universal Dependencies, which normally use the CoNLL-U format. This is an enhancement to CoNLL10, the main difference being the ability to represent multiword tokens. I have been stripping out the multiword lines before uploading to Arborator, but it would be nice if they were preserved, and even better if they were displayed (e.g., by underlining groups of tokens). How hard would this be to add?

nschneid commented 8 years ago

Also: CoNLL-U allows comments (lines starting with #) to encode sentence-level metadata.

kimgerdes commented 8 years ago

hello, yes that's something that i've started to look into. there are two issues to address:

what will actually be the representation of these mwes? on http://universaldependencies.org/u/dep/all.html example 138 the graph simply ignores them, and the sentence simply shows the mwe. so this is readable only with knowledge of the language and doesn't seem ideal. but more complex representation possibly including some kind of grouping is non-trivial, especially on the annotation side. how could an annotator modify this grouping of words?
the second point is the graph structure. currently arborator already allows for a graph (multi-head) representation, but the syntax is different: the token's line in the conll file is simply repeated with the same number and given a different governor. i think that this is a more elegant solution than the conll-u format, which needs special syntax (_, : and |, which themselves need an escape mechanism if they are part of the function name...) to encode multiple governors. but hey, i guess it has become some kind of standard that arborator should be able to read.
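
To make the repeated-token convention concrete, here is a minimal Python sketch that collapses repeated token lines into one entry per token with a list of governors. The tuple layout, the function name `merge_repeated_tokens` and the sample sentence are assumptions made for illustration; this is not Arborator's actual code.

```python
from collections import OrderedDict


def merge_repeated_tokens(rows):
    """Collapse rows that repeat a token id into one entry per token.

    rows: iterable of (token_id, form, governor_id, function) tuples
    (a hypothetical layout), where a token with several governors
    appears once per governor.
    Returns an OrderedDict: token_id -> {"form": ..., "govs": [(gov, func), ...]}.
    """
    tokens = OrderedDict()
    for token_id, form, gov, func in rows:
        # First occurrence creates the entry; later ones only add governors.
        entry = tokens.setdefault(token_id, {"form": form, "govs": []})
        entry["govs"].append((gov, func))
    return tokens


# Invented sample: token 1 is listed twice, once per governor.
rows = [
    (1, "Mary", 2, "nsubj"),
    (1, "Mary", 4, "nsubj"),
    (2, "wants", 0, "root"),
    (3, "to", 4, "mark"),
    (4, "sing", 2, "xcomp"),
]
merged = merge_repeated_tokens(rows)
```

With this in hand, an exporter could decide per token which governor goes into the regular head column and which ones go into an extra-governor column.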

do you have any ideas on that? would you have some time to discuss the actual implementation or even to help modify the code?

some weeks ago, i already experimented with the UD representation tool Annodoc (which is utterly complex) in order to include it as an output format in arborator's quick page, because i needed it for a paper where i wanted the trees to look "universal". you can see the ramshackle implementation here: http://arborator.ilpga.fr/q.cgi - click on the "Show Annodoc graph" button. sometimes you have to reload to get the graph back: their javascript and mine are still interfering. it's usable but not yet "pushable" to github...

nschneid commented 8 years ago

As I understand it, the multiword tokens simply record orthographic words that were tokenized into multiple syntactic words (like could've → could 've, au → à le). The dependencies are only over the syntactic words. So unless Arborator provides a way to modify tokenization in general, I don't think there needs to be a way to modify the multiword token groups.
Regarding the enhanced dependencies with multiple heads: I think it would be fine to convert to another representation for Arborator's internal use, so long as import and export works with the UD standard.
Implementation: I may be able to help write code if you could point me in the right direction. Also, validate.py at https://github.com/UniversalDependencies/tools may be helpful.
Visualization: The Annodoc graph is just the brat renderer, right?

amir-zeldes commented 8 years ago

Hi Kim - if you can expose the textual format underlying the annodoc graph, that could be useful for preparing UD documentation, like this:

    sdparse
    Ivan is the best dancer
    nsubj(dancer-5, Ivan-1)
    cop(dancer-5, is-2)
    det(dancer-5, the-3)
    amod(dancer-5, best-4)

My examples for Coptic UD are all in conll10, since they're all annotated in Arborator, and it would be nice to move from a separate PDF to the UD online documentation system with automatic conversion of the examples.
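
A conversion of this kind could look roughly like the following Python sketch. It assumes tab-separated 10-column CoNLL input with the head in column 7 and the function in column 8; `conll_to_sdparse` is a hypothetical helper name, and the lemma and tag values in the sample are invented. The root is left out of the relation list, as in the example above.

```python
def conll_to_sdparse(conll_text):
    """Turn one tab-separated 10-column CoNLL sentence into sdparse-style text."""
    tokens = []
    for line in conll_text.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if not cols[0].isdigit() or not cols[6].isdigit():
            continue  # skip multiword ranges (1-2) and tokens without a head
        tokens.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))

    forms = {tid: form for tid, form, _, _ in tokens}
    out = [" ".join(form for _, form, _, _ in tokens)]
    for tid, form, head, rel in tokens:
        if head == 0:
            continue  # the root has no governor token to point to
        out.append("%s(%s-%d, %s-%d)" % (rel, forms[head], head, form, tid))
    return "\n".join(out)


# Sample sentence; only id, form, head and function matter here.
sample = "\n".join("\t".join(row) for row in [
    ["1", "Ivan", "Ivan", "PROPN", "_", "_", "5", "nsubj", "_", "_"],
    ["2", "is", "be", "AUX", "_", "_", "5", "cop", "_", "_"],
    ["3", "the", "the", "DET", "_", "_", "5", "det", "_", "_"],
    ["4", "best", "good", "ADJ", "_", "_", "5", "amod", "_", "_"],
    ["5", "dancer", "dancer", "NOUN", "_", "_", "0", "root", "_", "_"],
])
sdparse_text = conll_to_sdparse(sample)
```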

kimgerdes commented 8 years ago

hi guys,

The quick page (see http://arborator.ilpga.fr/q.cgi) now supports the CoNLL-U format (http://universaldependencies.org/format.html). this includes:

comments preceded by "#" are ignored in the representation and preserved when the tree is modified
words that span over more than one token:

    1-2    vámonos   _
    1      vamos     ir
    2      nos       nosotros

these special lines are used only to construct the sentence on top of the tree; they are not graphically modifiable. they are rewritten at the right position in the conll.

the "xpostags" and "features" (5th and 6th) columns are preserved but not modifiable.
multiple governors: the "deprel" (8th column) can now encode multiple governors (used only if more than one governor). Even idiosyncratic functions containing columns (example: nmod:poss) can be used (the first column separates the governor's number). However, the function name may not contain pipes ("|") as those are used to separate multiple governors in this column (used if more than 2 governors per word).
the last column is shown as a gloss beneath each word if it's not equal to "SpaceAfter=No". in the latter case, it is used to create the correct string representation of the whole sentence.
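
The handling of comments, range lines and "SpaceAfter=No" described above can be sketched in Python as follows. This is an illustration under assumptions, not the quick page's actual code (which runs in the browser): the function names `read_conllu_sentence` and `sentence_text`, the `# sent_id = demo` comment and the tag values in the sample are all invented, and input is assumed to be tab-separated with 10 columns.

```python
def read_conllu_sentence(block):
    """Split one CoNLL-U sentence into comments, multiword ranges and word rows."""
    comments, ranges, words = [], {}, []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            comments.append(line)  # kept verbatim so they can be written back
            continue
        cols = line.split("\t")
        if "-" in cols[0]:
            # range line such as "1-2": remember surface form and last column
            first, last = (int(n) for n in cols[0].split("-"))
            ranges[first] = (last, cols[1], cols[9])
        elif cols[0].isdigit():
            words.append(cols)
    return comments, ranges, words


def sentence_text(ranges, words):
    """Rebuild the sentence string from range forms and SpaceAfter=No."""
    pieces, skip_until = [], 0
    for cols in words:
        wid = int(cols[0])
        if wid <= skip_until:
            continue  # already covered by a multiword range
        if wid in ranges:
            skip_until, form, misc = ranges[wid]
        else:
            form, misc = cols[1], cols[9]
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


block = "\n".join([
    "# sent_id = demo",
    "\t".join(["1-2", "vámonos", "_", "_", "_", "_", "_", "_", "_", "SpaceAfter=No"]),
    "\t".join(["1", "vamos", "ir", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "nos", "nosotros", "PRON", "_", "_", "1", "obj", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"]),
])
comments, ranges, words = read_conllu_sentence(block)
text = sentence_text(ranges, words)
```

Keeping comments and range lines in separate structures means the tree itself only ever sees the plain word rows, while the other lines can be reinserted unchanged on export.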

One problem is the definition of extra dependencies beyond the tree: are these extra dependencies different? do they have to be encoded separately? This is not well-defined in the CoNLL-U format. I suppose that all governors are equal and the column representation just encodes some governors in the common position and others in the special column.

Arborator now writes the first governor (by order) into the normal spot; any additional governors, which appear later in the sentence, are written into the special column. This means that the order of multiple governors can change after a tree is modified, both between the usual governor columns and the extra governors' column and inside that special column.
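
As a rough Python illustration of this writing rule: the sketch below assumes that "first by order" means the governor with the lowest token number, and that the special column uses the CoNLL-U `gov:function` notation with "|" between entries. `parse_deps` and `write_governors` are hypothetical names, not functions from Arborator.

```python
def parse_deps(deps):
    """Split a string such as '2:nsubj|4:nmod:poss' into (governor, function) pairs."""
    if deps in ("", "_"):
        return []
    pairs = []
    for item in deps.split("|"):
        # Only the first colon separates the governor number from the
        # function, so function names like nmod:poss survive intact.
        gov, func = item.split(":", 1)
        pairs.append((int(gov), func))
    return pairs


def write_governors(govs):
    """Return (head, function, extra) column values for a list of (gov, func) pairs."""
    if not govs:
        return "_", "_", "_"
    ordered = sorted(govs)            # assumption: "first" = lowest governor number
    head, func = ordered[0]           # goes into the normal governor columns
    extra = "|".join("%d:%s" % (g, f) for g, f in ordered[1:]) or "_"
    return str(head), func, extra


parsed = parse_deps("2:nsubj|4:nmod:poss")
columns = write_governors([(4, "nmod:poss"), (2, "nsubj")])
```

Because the governors are sorted before writing, a file whose extra governors were listed in a different order comes back reordered after a round trip, which is the effect described above.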

You can try it out on the arborator page. The next step would be to integrate it into the python code of the database-based side of the arborator. This would have to be done in the tree2nodedic function of conll.py and possibly in the database.py file in order to store the additional information somewhere for re-exporting. i won't have much time for that any time soon. so if you can give it a try, i'd be grateful.

amir: concerning the textual format you mention, the annodoc also supports conllu directly. so why would you need this other (old stanford) format? you can also try the new button in the quick page to get ud-style graphs. if you could improve this (remove interfering js, possibly export to svg, ...) that would be great!

nschneid commented 8 years ago

multiple governors: the "deprel" (8th column) can now encode multiple governors (used only if more than one governor). Even idiosyncratic functions containing columns (example: nmod:poss) can be used (the first column separates the governor's number).

Do you mean, functions containing colons?

One problem is the definition of extra dependencies beyond the tree: are these extra dependencies different? do they have to be encoded separately? This is not well-defined in the CoNLLu format. I suppose that all governor's are equal and the column representation just encodes some governors in the common position and others in the special column.

It says "enhanced representations may require additional dependency relations", which is indeed vague. For English, the "enhanced" and "enhanced++" representations are described here. As I understand it, there are tools that heuristically add the enhanced edges given the basic tree. So in a sense, the enhanced edges are secondary, but I don't know if there will be much need to annotate them manually if they can be added automatically.

nschneid commented 8 years ago

The visualization looks great! One request that I hope would be easy to add: also displaying an orthographic word layer below the token layer, if they differ. E.g. for

    1-2  that's    _         _      _        _         _   _      _  _
    1    that      that      PRON   pro:dem  _         2   SUBJ   _  _
    2    ~be       ~be       VERB   cop      &3S       0   ROOT   _  _
    3    a         a         DET    art      _         2   PRED   _  _
    4    terribly  terrible  ADV    adv      &dadj-LY  5   JCT    _  _
    5    small     small     ADJ    adj      _         6   MOD    _  _
    6    horse     horse     NOUN   n        _         3   XMOD   _  _
    7    for       for       ADP    prep     _         6   NJCT   _  _
    8    you       you       PRON   pro      _         7   POBJ   _  _
    9    to        to        PART   inf      _         10  INF    _  _
    10   ride      ride      VERB   v        _         6   XCOMP  _  _
    11   .         .         PUNCT  .        _         2   PUNCT  _  _

The token layer is what is currently displayed ("that ~be a terribly..."), and the orthographic word layer would be "that's _ a terribly...", with "that's" spanning 2 tokens.
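
One way to compute such an orthographic layer, sketched in Python: each orthographic word becomes a span over one or more token ids, which a renderer could then draw under the tokens it covers. The data layout and the name `orthographic_layer` are assumptions for illustration only.

```python
def orthographic_layer(ranges, tokens):
    """Compute orthographic words as (form, first_id, last_id) spans.

    ranges: {first_id: (last_id, surface_form)} taken from range lines like "1-2"
    tokens: list of (token_id, form) in sentence order
    """
    spans, covered_until = [], 0
    for tid, form in tokens:
        if tid <= covered_until:
            continue  # this token sits inside a span that was already emitted
        if tid in ranges:
            last, surface = ranges[tid]
            spans.append((surface, tid, last))
            covered_until = last
        else:
            spans.append((form, tid, tid))  # word and token coincide
    return spans


# First tokens of the example above
tokens = [(1, "that"), (2, "~be"), (3, "a"), (4, "terribly")]
ranges = {1: (2, "that's")}
spans = orthographic_layer(ranges, tokens)
```

The layer only needs to be displayed when at least one span covers more than one token, which matches the "if they differ" condition in the request.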

kimgerdes commented 8 years ago

concerning the graph structure beyond the tree:

i didn't know that paper. astonishing how they can speak for pages about the representation of the syntax and semantics of light verbs in dependencies and not cite mel'cuk! so if i understand correctly, they distinguish the different types of dependencies, which they even put on different sides of the text. some of the syntactic functions are idiosyncratic (nsubj:xsubj) some are just plain functions (nsubj, amod, ...). this goes against what arborator does right now: arborator can handle graph structures but does not distinguish different types of layers of dependency structures. so after going through arborator, the distinction might be lost in the conll encoding (standard place for governor and conllu place for governor is not distinguished).

concerning the orthographic layer:

this is part of a bigger problem: the non-configurability of the quick page. if we add extra lines, they should be optional, so as not to make each tree too high (just like the tree height and other parameters). even the gloss we just added takes up space (and i reduced the distance between the lines). i think the orthographic word should definitely be displayable in the main database-based part of arborator, where everything can be configured. but the quick page is mainly for looking at conll files and modifying them slightly. but if you think it's necessary and you add it to the code, i'll of course accept the code!

nschneid commented 7 years ago

UD v2 includes a new CoNLL-U specification: http://universaldependencies.org/format.html

The changes from v1 are summarized here: http://universaldependencies.org/v2/conll-u.html

Arborator / arborator-server

Add support for .conllu format? #2
