UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0
v1->v2 conversion script? #376

Closed nschneid closed 7 years ago

nschneid commented 7 years ago

I'm involved in an annotation project that has produced some v1 data for English (POS + basic dependencies). We'd obviously like to automate as much of the conversion to v2 as possible. Does anyone have a converter tool? I didn't see any listed on http://universaldependencies.org/tools.html.

Also, we'd appreciate guidance on which aspects of the conversion CANNOT be automated. My impression from the summary of changes is that the new treatment of ellipsis cannot be fully automated, but most of the other changes are deterministic.

sebschu commented 7 years ago

I don't think we have anything like that so far but I'll definitely write some scripts to convert the English Treebank to UDv2 and I can share them with you once I have them ready (probably at some point in early/mid January).

And yes, most of the changes can be made automatically. As you say, elliptical constructions cannot be converted automatically, but they should be quite rare, and you can find them by looking for sentences with a remnant relation.
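A minimal sketch of that search, assuming plain 10-column CoNLL-U input and a `# sent_id = ...` comment per sentence (v1 files do not always have one, in which case the ID comes back as `None`). The function name and the sample sentence are made up for illustration:

```python
def sentences_with_deprel(conllu_text, deprel="remnant"):
    """Return [(sent_id, [token ids])] for sentences that contain `deprel`."""
    hits = []
    for block in conllu_text.strip().split("\n\n"):
        sent_id = None
        ids = []
        for line in block.splitlines():
            if line.startswith("#"):
                if line.startswith("# sent_id"):
                    sent_id = line.split("=", 1)[1].strip()
                continue
            cols = line.split("\t")
            # Skip anything that is not a regular token line
            # (e.g. multiword-token ranges such as "3-4").
            if len(cols) != 10 or not cols[0].isdigit():
                continue
            # Compare the universal part only, so "remnant:x" also matches.
            if cols[7].split(":")[0] == deprel:
                ids.append(int(cols[0]))
        if ids:
            hits.append((sent_id, ids))
    return hits


# Toy v1-style input: "Mary likes tea and John coffee" / "Mary sleeps"
SAMPLE = (
    "# sent_id = s1\n"
    "1\tMary\tMary\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tlikes\tlike\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\ttea\ttea\tNOUN\t_\t_\t2\tdobj\t_\t_\n"
    "4\tand\tand\tCONJ\t_\t_\t2\tcc\t_\t_\n"
    "5\tJohn\tJohn\tPROPN\t_\t_\t1\tremnant\t_\t_\n"
    "6\tcoffee\tcoffee\tNOUN\t_\t_\t3\tremnant\t_\t_\n"
    "\n"
    "# sent_id = s2\n"
    "1\tMary\tMary\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_\n"
)

REMNANT_HITS = sentences_with_deprel(SAMPLE)
```

The hits only tell you where to look; the actual re-annotation of each elliptical sentence still has to be done by hand.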

The second thing that cannot be fully automated is the split of nmod into nmod and obl. In most cases, a PP that depends on a predicate gets the relation obl. In copular constructions with a nominal head, however, the choice is ambiguous: it depends on whether the PP modifies the nominal or the entire clause.

For example:

The talk is in the Greenberg room at noon.

obl(room, noon)

But:

This is the key to the apartment.

nmod(key, apartment)

I think everything else can be done automatically (at least in English).
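One way the nmod/obl split could be sketched, not taken from any of the actual converters: relabel nmod as obl when the head is a verb, adjective or adverb, and only flag the ambiguous copular case for a human. The token dictionaries, the `tok` helper and the `ToDo=nmod-or-obl` marker are all hypothetical names chosen for this example.

```python
PREDICATE_UPOS = {"VERB", "ADJ", "ADV"}
NOMINAL_UPOS = {"NOUN", "PROPN", "PRON"}
FLAG = "ToDo=nmod-or-obl"  # hypothetical marker, not an official MISC attribute


def tok(id_, form, upos, head, deprel):
    """Build a toy token; a real script would read these from CoNLL-U."""
    return {"id": id_, "form": form, "upos": upos,
            "head": head, "deprel": deprel, "misc": "_"}


def split_nmod(tokens):
    """Relabel clear cases in place and return the ids flagged for review."""
    by_id = {t["id"]: t for t in tokens}
    heads_with_cop = {t["head"] for t in tokens if t["deprel"] == "cop"}
    flagged = []
    for t in tokens:
        if t["deprel"] != "nmod":
            continue
        head = by_id.get(t["head"])
        if head is None:
            continue
        if head["upos"] in PREDICATE_UPOS:
            t["deprel"] = "obl"
        elif head["upos"] in NOMINAL_UPOS and head["id"] in heads_with_cop:
            # Nominal predicate with a copula: could be nmod or obl.
            t["misc"] = FLAG
            flagged.append(t["id"])
        # Any other nominal head: keep nmod.
    return flagged


# "She sat on the chair": the PP depends on a verb.
sat = [tok(1, "She", "PRON", 2, "nsubj"), tok(2, "sat", "VERB", 0, "root"),
       tok(3, "on", "ADP", 5, "case"), tok(4, "the", "DET", 5, "det"),
       tok(5, "chair", "NOUN", 2, "nmod")]

# "This is the key to the apartment": nominal head with a copula.
key = [tok(1, "This", "PRON", 4, "nsubj"), tok(2, "is", "VERB", 4, "cop"),
       tok(3, "the", "DET", 4, "det"), tok(4, "key", "NOUN", 0, "root"),
       tok(5, "to", "ADP", 7, "case"), tok(6, "the", "DET", 7, "det"),
       tok(7, "apartment", "NOUN", 4, "nmod")]

sat_flags = split_nmod(sat)
key_flags = split_nmod(key)
```

Under this sketch both "at noon" and "to the apartment" from the examples above would be flagged rather than guessed, since each hangs off a nominal head that has a copula.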

jnivre commented 7 years ago

@sebschu Thanks for sharing your efforts. Everyone will have to do this, so to prevent a huge duplication of effort and give support to teams who are less familiar with the annotation scheme, it would be great if we could provide a script for the whole community that does automatic things directly and flags other cases for manual inspection. We have not promised such a script, because we didn't know if and when we could make it available. Do you think any of this is likely to happen soon?

martinpopel commented 7 years ago

I am also working on the script (using Udapi). My plan is to mark unclear cases with a comment in MISC. I will let you know once I have something.

Another issue is adding enhanced dependencies for Propagation of conjuncts. It cannot be determined automatically whether a dependent of the first conjunct is shared or private. Based on my survey of a few languages, a reasonable heuristic is that a dependent is shared if it comes before the first conjunct or after the last conjunct. Note that in most Prague-style treebanks this distinction is already annotated (so, e.g., for Czech, we will update the PDT-to-CoNLLU script instead).
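The position heuristic could be sketched roughly as follows. This is an illustration only, not Udapi code: tokens are plain `(id, form, head, deprel)` tuples, the function name is invented, and "before/after a conjunct" is simplified to comparing against the position of the conjunct's head word rather than its full span.

```python
SKIP = {"conj", "cc", "punct"}  # never propagated as shared dependents


def shared_dependents(tokens, first_conjunct):
    """Return ids of dependents of `first_conjunct` guessed to be shared."""
    conjuncts = [first_conjunct] + [
        i for i, _, head, rel in tokens
        if head == first_conjunct and rel == "conj"
    ]
    if len(conjuncts) < 2:
        return []  # no coordination, nothing to propagate
    first, last = min(conjuncts), max(conjuncts)
    return [
        i for i, _, head, rel in tokens
        if head == first_conjunct and rel not in SKIP
        and (i < first or i > last)
    ]


# "Mary sang loudly and danced yesterday ." (v1-style attachment)
SANG = [
    (1, "Mary", 2, "nsubj"),
    (2, "sang", 0, "root"),
    (3, "loudly", 2, "advmod"),
    (4, "and", 2, "cc"),
    (5, "danced", 2, "conj"),
    (6, "yesterday", 2, "nmod:tmod"),
    (7, ".", 2, "punct"),
]

SHARED = shared_dependents(SANG, 2)
```

Here "Mary" and "yesterday" come out as shared, while "loudly" (between the conjuncts) stays private to "sang". The guess for "yesterday" may well be wrong in a given sentence, which is why the output is a heuristic and should be marked for checking.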

sebschu commented 7 years ago

@jnivre I plan to get started with this next week and I'll try to have a first version of this before the holidays.

I agree that it would be useful to prevent duplicate work as much as possible. I am just a bit worried that what I have in mind might not produce correct results for all languages. For example, in English, neg will always be replaced with det or advmod but I'm not sure if this is true for all languages (e.g., is this even true for French?). If we provide the script to everyone, it should therefore come with a big disclaimer and individual treebank maintainers would have to check carefully if the output makes sense.
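For illustration, the English neg rule could be as small as the sketch below. It assumes the choice can be keyed on the UPOS column (DET gives det, anything else gives advmod); that assumption is mine and is exactly the kind of thing that may not hold for other languages. The function name is made up.

```python
def convert_neg_line(line):
    """Rewrite the DEPREL of one CoNLL-U token line if it is `neg`."""
    cols = line.split("\t")
    if len(cols) != 10 or cols[7] != "neg":
        return line  # comments, blank lines and other relations pass through
    # Assumption: negative determiners ("no") are tagged DET,
    # negative particles ("not", "n't") are tagged something else.
    cols[7] = "det" if cols[3] == "DET" else "advmod"
    return "\t".join(cols)


NOT_LINE = "3\tnot\tnot\tPART\t_\t_\t4\tneg\t_\t_"
NO_LINE = "1\tNo\tno\tDET\t_\t_\t2\tneg\t_\t_"

NOT_OUT = convert_neg_line(NOT_LINE)
NO_OUT = convert_neg_line(NO_LINE)
```

A treebank whose negation words carry other tags would need its own mapping here, so this part belongs in the language-specific section of any shared script.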

Also, a lot of the changes to the features won't apply to English, so I won't do anything about them.

But I'll try to make it as "universal" as possible and I'll try to make clear which parts are most likely language-specific and might have to be adapted for other languages and which parts should produce the correct results for all languages.

jnivre commented 7 years ago

@sebschu @martinpopel Thanks for doing this. I completely understand that you cannot take responsibility for producing something that will work for all languages, so disclaimers will be necessary. However, if you try to "think universally" whenever possible, I am sure it will be easier for people to adapt the script to new contexts.

fginter commented 7 years ago

I'll try to breathe some new life into a tool we have developed here in Turku for treebank conversions. Over the years it was used on numerous occasions, but also transformed itself into one huge hack. It reads a config file with rules that match arbitrary structures in the source and produce a single dependency in the target.

If I succeed in reviving the monster, I will post it here. :)

amir-zeldes commented 7 years ago

I was planning on using a DepEdit job to update the Coptic data, see: https://corpling.uis.georgetown.edu/depedit/

It's true not everything will be 100% automatable, and many things will be language specific, but it might be a good start for some people who don't want to delve into programming too much (it's just a 3-column configuration file specifying tokens to find, their subgraph relations, and what to do to them).

I'm happy to post a link to the job file here once I get around to doing this.

spyysalo commented 7 years ago

Piling on: I'm also interested in developing / contributing to the development of an automated conversion. I wouldn't mind working with @fginter 's monster, but would be happy to use any other reasonable framework.

fcbr commented 7 years ago

Still in early stages, but we are building a Common Lisp library to manipulate CoNLL-U files (https://github.com/own-pt/cl-conllu) and using it to do the automated parts of our conversion (e.g.: https://github.com/own-pt/bosque-UD/blob/master/scripts/fix-issue-108-nao-VERB.lisp)

fginter commented 7 years ago

Seems we have lots of options here. :) My archaeological excavation site is here: https://github.com/TurkuNLP/dep2dep/ and an example config file is here: https://github.com/TurkuNLP/dep2dep/blob/master/dtreebank/dep2dep/example_rules.lp2lp

I guess this is what Turku will use for our v1->v2 conversion. The primary advantage of this tool is that it can handle non-trees on input, i.e., it can also convert the existing extended layer and make use of our PropBank annotation. The PropBank annotation will help, e.g., in the obl vs nmod distinction.

jnivre commented 7 years ago

I started a table at http://universaldependencies.org/v1_to_v2.html for specifying the desired behavior of v1-to-v2 converters and (eventually) changes needed to the validation script. Please feel free to contribute. :)

arademaker commented 7 years ago

It is good to know about so many tools under development: Udapi, depedit and dep2dep.

As @fcbr mentioned above, we are working on a Common Lisp library for working with CoNLL-U files. I hope to add a rule processing engine for tree transformations. So far, we are trying to define how expressive that rule language needs to be.

To define our rules language we are trying to formalize the modifications that our linguists suggest. For example https://github.com/own-pt/bosque-UD/issues/131#issuecomment-269017142

Unfortunately, in the Portuguese corpus, we are not only dealing with the V1->V2 upgrade, but we are also still correcting wrong analyses.

In the definition of our language for rules, we are stealing ideas from SPARQL, corte e costura, the bioNLP query language, INESS etc.

Each group will probably prefer one particular programming language for the implementation and some specific architecture option, but maybe we can still share ideas about, for instance, a declarative language of rewriting rules.

fginter commented 7 years ago

In case this would be relevant to anyone, I got our dep2dep thing up and running, and even made it rehang punct and cc from head to the following conjunct. So I suppose it does something, and Turku will use that. I will try to keep the config file somewhat documented and modular (general UD vs Finnish specific parts). https://github.com/TurkuNLP/dep2dep/blob/master/example_rules.lp2lp
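As a rough picture of what the cc/punct rehanging does (written in Python for illustration, not in dep2dep's rule syntax): every cc or punct that hangs on the first conjunct and is followed by a conjunct of the same coordination gets reattached to the nearest such conjunct. The dictionaries and the function name are invented for this sketch.

```python
def rehang_cc_punct(tokens):
    """Move cc/punct from the first conjunct to the next conjunct, in place.

    Returns the ids of the tokens that were moved.
    """
    conj_by_head = {}
    for t in tokens:
        if t["deprel"] == "conj":
            conj_by_head.setdefault(t["head"], []).append(t["id"])
    moved = []
    for t in tokens:
        if t["deprel"] not in ("cc", "punct"):
            continue
        # Conjuncts of the same coordination that come after this token.
        later = [c for c in conj_by_head.get(t["head"], []) if c > t["id"]]
        if t["id"] > t["head"] and later:
            t["head"] = min(later)
            moved.append(t["id"])
    return moved


# "apples , pears and plums ." with v1-style attachment to "apples"
FRUIT = [
    {"id": 1, "form": "apples", "head": 0, "deprel": "root"},
    {"id": 2, "form": ",", "head": 1, "deprel": "punct"},
    {"id": 3, "form": "pears", "head": 1, "deprel": "conj"},
    {"id": 4, "form": "and", "head": 1, "deprel": "cc"},
    {"id": 5, "form": "plums", "head": 1, "deprel": "conj"},
    {"id": 6, "form": ".", "head": 1, "deprel": "punct"},
]

MOVED = rehang_cc_punct(FRUIT)
```

The comma moves to "pears" and "and" moves to "plums", while the final period stays where it is because no conjunct follows it. The sketch would also move punctuation that sits between conjuncts for other reasons (brackets, quotes), so real rules need to be more careful.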

sebschu commented 7 years ago

Sorry that this took longer than expected but I just pushed a first version of my conversion script to

https://github.com/UniversalDependencies/tools/tree/master/v2-conversion

(I put it in the tools repo, so that other people can also make edits/contribute. I hope that's okay.)

If you intend to use the script, make sure to read the README on the limitations. Also, I haven't run it on any language other than English, so make sure that you do thorough spot-checking if you run it on other treebanks.

Let me know if you have any questions or problems getting the script to run.

jnivre commented 7 years ago

Thanks! This is really useful. Why don't you also send a message to the UD list? I think many people are waiting for something like this.

martinpopel commented 7 years ago

Another implementation (based on @sebschu's, thanks) is here: https://github.com/udapi/udapi-python/tree/master/udapi/block/ud

I know it's too late, but it may still be useful for someone. It is implemented using the Udapi framework (I hope the code is more readable, maintainable and powerful this way) and it also supports some edits of FEATS. I plan to add enhanced dependencies and orphan/remnant. Contributions and questions are welcome.

martinpopel commented 7 years ago

I think we can close this issue. There are several converters in Lisp, Prolog and Python. Udapi's ud.Convert1to2 has been used for converting several treebanks. Udapi also has tools for adding SpaceAfter=No according to the raw text or according to heuristic rules, and ud.MarkBugs for syntax validation.