UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Improve cross-language consistency of adpositions (case/mark/aux) #203

Closed dan-zeman closed 6 years ago

dan-zeman commented 9 years ago

The datasets in UD 1.1 often differ in approaches to adpositions. Besides case, they are also often attached as mark or even aux. Although various languages may have their own reasons for doing things differently, there seems to be room for more harmonization.

I have put together a document with more details here:

Adpositions in Universal Dependencies.pdf

(I could copy & paste it directly to the edit window but hopefully it will be more useful in the PDF form, at least for the upcoming Uppsala meeting.)

Most of the examples in the document were collected using the PML-TQ search engine. If you want to try it yourself, go to

http://lindat.mff.cuni.cz/services/pmltq/

and select a treebank from the menu (look for "Universal Dependencies"), or construct the URL for one of the UD languages directly (the last two letters are the language code):

http://lindat.mff.cuni.cz/services/pmltq/ud_fi/

msimi commented 9 years ago

Hi Dan,

I read the [quite useful] document and tried to reduce the non-leaf ADPs present in the Italian treebank. We still have a few exceptions to the "rule" that ADPs should be leaves.

di_per_sé ed indipendentemente: "di_per_sé" is an adverbial locution and "indipendentemente" a conjoined adverb, which is a dependent of "di".

A number of cases such as "a il di_là del muro", where "di_là" is an adverbial locution and the complement ("muro" in this case) is a dependent of "di".

Case vs mark

I fixed a conversion error which was responsible for the "case" instead of "mark" in a number of instances (most of the 84 cases found with your query). Still, as you mention, there are 47 cases such as "da il negare l'esistenza" where an infinitive verb is used with an article. I tend to regard this as reasonable.

I am waiting for the output of the Uppsala meeting before:

--- Maria

dan-zeman commented 9 years ago

Thanks, Maria. I think your non-leaf examples are OK because they are multi-word expressions (mwe). (The query I constructed is just not sophisticated enough to eliminate multi-word expressions, but it could be improved.)

At the end in Uppsala I was not in the group that discussed adpositions and I have to admit that I do not remember all their conclusions. But I think that the relation of a preposition to a non-verbal predicate of a clause (due to copula inversion) should be treated the same way as if it is attached to a verbal predicate. So if they decided it should be mark (I think they did, but let's wait for the report), it should be mark here, too.

jnivre commented 8 years ago

I was leading that group, and we identified two clear bugs in your document:

  1. In the Spanish treebank, infinitive markers still have the relation "aux" instead of "mark". This should definitely be fixed. The question is whether the postag should also be changed to PART, to align with other languages, or whether there is enough evidence that these are in fact ADP in Spanish. @ryanmcd
  2. In the German treebank, verb particles have the relation "mark" instead of "compound:prt", which is probably due to an over-generating conversion script. @slavpetrov

So, these should clearly be fixed for the next release. The other cases look like they could be permissible cases of language-specific variation, so they will require further study.

dan-zeman commented 8 years ago

I believe the a in Spanish is clearly an ADP. It is actually not an infinitive marker in the English sense—the infinitive is marked morphologically in the first place (the -ar/-er/-ir suffix). So it should be ADP and mark. @miguelballesteros , @Elena-Pascual , @hectormartinez , do you agree?

miguelballesteros commented 8 years ago

Yes, the -ar, -er, -ir suffixes mark the infinitive. This should be enough.

I tried to find infinitive markers with the relation aux but I couldn't; I checked the whole dev set. Maybe if we could see an example? Also, I'm confused about what an infinitive marker is. If it is the "a" in "a comprar" or "a cambiar", this is a preposition and depends on the verb that precedes it, as in "voy a cambiar", but I'm not sure what you are referring to.

About mark, this is what you can find in the doc file https://github.com/UniversalDependencies/docs/blob/pages-source/_es-dep/mark.md , which is the original documentation of "mark".

jnivre commented 8 years ago

Well, separate infinitive markers are used also in languages that have an unambiguous morphological marking of the infinitive, like Swedish. However, not all languages use them. Here is a classic example:

Navigare necesse est. To sail is necessary. Att segla är nödvändigt.

Latin has a bare infinitive, but English and Swedish require a marker ("to" and "att", respectively). I guess Spanish is like Latin in this respect. However, they are also used in constructions like:

He began to laugh. Han började (att) skratta.

Here the marker is optional in Swedish but compulsory in English. This distinguishes full verbs with an infinitive complement from auxiliary verbs, which never take the infinitive marker:

He should laugh. Han borde skratta.

Here the marker is impossible in both English and Swedish. So note that it is the verb that determines whether the marker can/should be there, in English and Swedish as well as Spanish.

Do a search for words that have the postag ADP and the deprel aux in the Spanish treebank. This is what Dan originally noticed and I assumed that they were infinitive markers, but perhaps there is something else going on.
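The search described above can be sketched in a few lines of Python. This is only an illustration, assuming the treebank is in the 10-column, tab-separated CoNLL-U format (UPOS in column 4, DEPREL in column 8); the function name `find_adp_aux` and the sample are hypothetical, and the sample leaves lemmas and features as `_`.

```python
def find_adp_aux(conllu_text):
    """Return (sentence_index, token_id, form) for every ADP token attached as aux."""
    hits = []
    sent_index = 0
    for line in conllu_text.splitlines():
        if not line.strip():
            sent_index += 1  # a blank line ends a sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if cols[3] == "ADP" and cols[7] == "aux":
            hits.append((sent_index, cols[0], cols[1]))
    return hits


# Hypothetical sample modelled on the Spanish pattern discussed in this thread.
SAMPLE = "\n".join([
    "1\tElla\t_\tPRON\t_\t_\t4\tnsubj\t_\t_",
    "2\tcomenzó\t_\tAUX\t_\t_\t4\taux\t_\t_",
    "3\ta\t_\tADP\t_\t_\t4\taux\t_\t_",
    "4\tconversar\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "",
])

print(find_adp_aux(SAMPLE))
```

Matching on exact column values, rather than on substrings of the line, avoids false hits from relation subtypes or from features that happen to contain the same string.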

miguelballesteros commented 8 years ago

Yes, in Spanish you only need the "-ar", "-er", "-ir" morphemes, which indicate that the verb is in the infinitive, as Dan mentioned above.

And yes, there are words with pos=ADP and deprel=aux in the current version of the treebank:

1 Ella PRON Gender=Fem|Number=Sing|Person=3 4 nsubj
2 comenzó AUX Gender=Com|Number=Sing|Person=3|Mood=Ind|Tense=Past 4 aux
3 a ADP Gender=Com 4 aux
4 conversar VERB Gender=Com|VerbForm=Inf 0 root

31 nos PRON Gender=Com|Number=Plur|Person=1 34 iobj
32 inclina AUX Gender=Com|Number=Sing|Person=3|Mood=Ind|Tense=Pres 34 aux
33 a ADP Gender=Com 34 aux
34 pensar VERB Gender=Com|VerbForm=Inf 20 parataxis

So we have to change them to deprel=mark (?). Is this the correct decision? Is this the same as saying that this is an infinitive marker? Because if this is the case I think it is a bit wrong, although I don't know what they should be.

Let me explain a bit more: in Spanish, it works in a similar way as in English and Swedish.

He began to laugh. Han började (att) skratta. Él empezó a reír (you cannot say Él empezó reir)

He should laugh. Han borde skratta. Él debería reír.

However, the "a" is not needed to know that reír is an infinitive; you know it because "reír" is an infinitive by itself. This is why I think that "a" should not be a marker, but if we need that in order to be consistent across languages, or because there is no other alternative, let's do it.

(bear in mind that I'm not a linguist...)

jnivre commented 8 years ago

Yes, these are exactly the examples I was thinking of. To me they look like infinitive markers, and then they should be ADP/mark or PART/mark. If you insist that they are prepositions, then they should be ADP/case. The point is that they should never be ADP/aux. I think it would be worth checking how these are treated in French and Italian, since the Romance languages behave similarly in this respect.

dan-zeman commented 8 years ago

@miguelballesteros : mark does not necessarily say that it is an infinitive marker (although it is used for infinitive markers, too). It is often used with subordinating conjunctions. It is also used with prepositions (instead of case) when they are attached to clausal predicates instead of nominals. (Example: English before coming here, ...) But here we are less consistent across languages at the moment. In some languages the infinitive (in contrast to finite verb forms) seems to be quite close to nouns, which would make the case label acceptable. I agree with @jnivre that we should switch from aux to something else (I personally favor mark).
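The switch from aux to mark could be applied mechanically along the following lines. This is a minimal sketch, not the conversion that was actually run on the Spanish data: it assumes 10-column CoNLL-U input, and the function name `relabel_adp_aux` and the sample are made up for illustration. Passing `new_deprel="case"` would give the alternative analysis mentioned above.

```python
def relabel_adp_aux(conllu_text, new_deprel="mark"):
    """Rewrite DEPREL from aux to new_deprel on tokens whose UPOS is ADP."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        is_token = len(cols) == 10 and not line.startswith("#")
        if is_token and cols[3] == "ADP" and cols[7] == "aux":
            cols[7] = new_deprel
            line = "\t".join(cols)
        out.append(line)  # every other line is passed through unchanged
    return "\n".join(out)


# Hypothetical sample; lemmas and features are left as "_".
BEFORE = "\n".join([
    "2\tcomenzó\t_\tAUX\t_\t_\t4\taux\t_\t_",
    "3\ta\t_\tADP\t_\t_\t4\taux\t_\t_",
    "4\tconversar\t_\tVERB\t_\t_\t0\troot\t_\t_",
])

AFTER = relabel_adp_aux(BEFORE)
print(AFTER)
```

Only the ADP token is touched; the AUX token on the first line keeps its aux relation, since whether "comenzó" should be an auxiliary at all is a separate question.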

miguelballesteros commented 8 years ago

Understood. Let's do it.

dan-zeman commented 8 years ago

Note that https://github.com/UniversalDependencies/docs/issues/257 significantly overlaps with the topic of this issue, although 257 focuses specifically on Germanic languages.

dan-zeman commented 8 years ago

Spanish has been fixed in UD v 1.2. German has not (see above: zu should be mark but it is still aux). I am adding the Germanic label here too (the issue probably arises also in other languages, but since the German case has been specifically mentioned here, we should probably not close the issue before the data is fixed).

dan-zeman commented 8 years ago

Also, for reference: the Uppsala meeting had a discussion group on adpositions and the report is here.

simon-clematide commented 7 years ago

Is the following now decided for German UD 2?

dan-zeman commented 7 years ago

In general aux may be a particle, but yes, in German it should be verb. zu should be mark. I do not remember how um zu is done in the data but mark sounds reasonable to me; I could even imagine zu connected to um via mwe.

simon-clematide commented 7 years ago

Currently, the subordinating um in German is tagged as ADP (against common linguistic practice, IMHO) and connected via mark in order to distinguish it from the real preposition um. A UD 1.3 SETS query shows this. Italian per and French pour follow the ADP <mark _ scheme, Latin ut has SCONJ <mark _.

For me, it seems a bit strange to have a dedicated UPOS tag SCONJ available, but still use the more far-fetched ADP. The preposition and the subordinating conjunction clearly have a very different meaning and syntactic function.

mwe would definitely not be an option for me. It is a very transparent syntactic construction with a transparent semantic meaning (purpose) that can be combined with any verb (no fixedness whatsoever).

For German, the Latin scheme SCONJ <mark VERB seems most appropriate for this construction. Actually, there are 2 sentence-initial cases in UD-DE 1.3 annotated in this way.

jnivre commented 7 years ago

I think the use of ADP for "um" stems from the ambiguous use of IN for both prepositions and subordinating conjunctions in the English Penn Treebank. The German UD treebank was originally annotated (before UD existed) by a team whose instructions were to follow the guidelines for English Stanford Dependencies as closely as possible but modify the guidelines when needed. This seems to be a case where they should have modified the guidelines but didn't. The use of "aux" for the infinitive marker is an even more blatant case that has been known for a long time but unfortunately has never been fixed. Note that this has nothing to do with v1 and v2. It should have been fixed already in v1. This all goes back to the fact that nobody is actively maintaining the German UD treebank. Several people have expressed interest in doing this, but unfortunately nobody seems to really have the time to do it. :(

simon-clematide commented 7 years ago

I fully agree that the overloading of IN as a kind of "preposition" for verbs from the Penn Treebank shows through this annotation philosophy. We at the Department of Computational Linguistics in Zurich have decided to use Universal Dependencies in our courses and practical work with our students. We are also willing to push the consistency and quality of this resource further in order to show them a good example of modern computational linguistics annotation practice (and of course, NLP tools that can be derived from them, e.g. an interactive Maltparser demo for German built with the UD V1.3). The UD Treebank is also interesting for us because it departs from the typical orthographically well-edited newspaper content we have in most of the other available German Treebanks.

If there is already an agreement on some straightforward transformations (I would say that my initial comment in this thread fits #257), I would volunteer to apply them. It also costs me time to explain to our students why some things don't look the way a linguistically informed person would like them to look...

dan-zeman commented 7 years ago

@jnivre : I think the distinction between ADP and SCONJ is also related to our discussion about the extent to which we want the POS tag to be determined by syntactic function. Um can indeed function as a preposition in a nominal phrase but I agree that it is different from the um zu + infinitive construction. The STTS tagset used in Tiger is very functionally oriented, so it would be unthinkable there not to distinguish these two cases. (In HamleDT <= Tiger, we have 1310 occurrences as preposition, 558 as subordinating conjunction, 43 as separable verb prefix and 6 as adverb.)

@simon-clematide : When I said mwe I was not talking about the relation between um and the verb but about a possible relation between um and zu; the whole thing would then be attached as mark to the predicate of the subordinate clause.

I invested some time into improving the German UD back in August, and I have fixed the aux label of zu, which also bugged me a lot. But there definitely is a lot more room for improvement; in particular, I did not do anything with the right-headed names etc. because I thought it would be better to wait for the next version of the guidelines.

simon-clematide commented 7 years ago

@dan-zeman Ok, now I found the dev branch of the German UD with your recent commits, great:-) I will look into that.

Regarding mwe, ok, I understand. Still, um and zu can be in contact position sometimes, but there can also be an unrestricted number of constituents between them. Therefore mwe is still not applicable for me in this case.

Regarding name: Yes, we also saw that there is a bit of an inconsistency regarding the universal guidelines and the more or less non-existing language-specific German ones... The topic, however, is indeed not trivial in German if we take the concept of the syntactic head of a multiword name seriously. But maybe we should have a discussion about that in another thread.

jnivre commented 7 years ago

@simon-clematide @dan-zeman @slavpetrov @wbwseeker I think it's great if you are willing to put some effort into fixing some of the "bugs" in the German UD treebanks. I also have a German student here in Uppsala, who might be interested in contributing. She noticed, for example, that verb particles are annotated as "mark" instead of "compound:prt" as in English and the Scandinavian languages (unless Dan has already fixed this in the dev branch). I am also directing this to Slav Petrov, who has been responsible for German UD but hasn't had any time to work on it, and to Wolfgang Seeker, who has expressed interest in contributing as well. Hopefully, we can come up with some kind of coordinated effort.

dan-zeman commented 7 years ago

@simon-clematide : Regarding mwe, on a second thought I came to the same conclusion but you were faster to write it :-) Let's forget mwe here.

Regarding compound:prt, it was on my to-do list but today I realized that I did not fix it in August. Fixed (almost—I rely on a tagger's opinion on what's PTKVZ, so there will be a few errors) and committed a minute ago.
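A fix of this kind might look roughly like the sketch below. It is not the script that was committed; it only illustrates the idea of trusting the language-specific tag, assuming 10-column CoNLL-U with the STTS tag in the XPOS column (column 5). The function name `particles_to_compound_prt` and the toy sentence are invented for the example.

```python
def particles_to_compound_prt(conllu_text):
    """Relabel mark as compound:prt on tokens whose XPOS is PTKVZ."""
    fixed = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        is_token = len(cols) == 10 and not line.startswith("#")
        # The tagger's PTKVZ decision is taken at face value, so tagging
        # errors carry over into the relabelled output.
        if is_token and cols[4] == "PTKVZ" and cols[7] == "mark":
            cols[7] = "compound:prt"
            line = "\t".join(cols)
        fixed.append(line)
    return "\n".join(fixed)


# Invented toy sentence ("Er gibt auf"), not taken from the treebank.
TOY = "\n".join([
    "1\tEr\ter\tPRON\tPPER\t_\t2\tnsubj\t_\t_",
    "2\tgibt\tgeben\tVERB\tVVFIN\t_\t0\troot\t_\t_",
    "3\tauf\tauf\tADP\tPTKVZ\t_\t2\tmark\t_\t_",
])

FIXED = particles_to_compound_prt(TOY)
print(FIXED.split("\n")[2])
```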

simon-clematide commented 7 years ago

Oh, yes, that was another issue. Great it is fixed!

I was wondering whether there is a language-specific agreement regarding the UPOS tag of a verb particle (given that in the annotation schemes I know best, these words would have been treated as particles). The UD guidelines state:

Note that the PART tag does not cover so-called verbal particles in Germanic languages, as in give in or end up. These are adpositions or adverbs by origin and are tagged accordingly ADP or ADV. Separable verb prefixes in German are treated analogically.

As I interpret the current annotation, there is a global lexical preference relation which says:

Maybe we could state the above criterion for German without relying on a somewhat unclear reference to word origins.

I checked the current particles and compared them to the list of prepositions collected by the Institut für Deutsche Sprache.

The results are almost always in line with the output of:

ud-de-dev-branch $ grep compound:prt *.conllu |cut -f 2,4 |sort |uniq -c |sort -rn
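For readers who prefer a script to the shell pipeline, the same count can be approximated in Python. This is a sketch under the assumption of 10-column CoNLL-U input; `count_particles` and the sample lines are hypothetical. Unlike the grep, it compares the DEPREL column exactly instead of matching the string anywhere in the line.

```python
from collections import Counter


def count_particles(conllu_text, deprel="compound:prt"):
    """Count (FORM, UPOS) pairs of tokens with the given DEPREL, most frequent first."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) == 10 and not line.startswith("#") and cols[7] == deprel:
            counts[(cols[1], cols[3])] += 1
    return counts.most_common()


# Hypothetical token lines from different sentences, for illustration only.
LINES = "\n".join([
    "3\tauf\tauf\tADP\tPTKVZ\t_\t2\tcompound:prt\t_\t_",
    "5\tzu\tzu\tPART\tPTKVZ\t_\t4\tcompound:prt\t_\t_",
    "6\tauf\tauf\tADP\tPTKVZ\t_\t2\tcompound:prt\t_\t_",
    "2\tauf\tauf\tADP\tAPPR\t_\t3\tcase\t_\t_",
])

for (form, upos), n in count_particles(LINES):
    print(n, form, upos)
```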

There are a few cases which I spotted where there actually exists a German preposition (or some tagging errors show up). Do you agree to the following changes?

*   5 wider ADV => ADP
*   3 zu    PART => ADP
*   3 nahe  ADV => ADP
*   3 gegenüber    ADV => ADP
*   3 fern  ADV => ADP
?   2 frei  ADV  => ADP (although rarely used and only as a postposition)
*   2 entgegen  ADV => ADP
*   1 voran ADP => ADV
*   1 gleich    ADV => ADP
*   1 bevor SCONJ => ADV

Then, there are a few glitches where postpositions and other words still have a fine-grained STTS tag PTKVZ (should be APPO or APZR).

ud-de-dev-branch $ grep -nP "PTKVZ.+case" *.conllu
de-ud-dev.conllu:566:7  aus aus ADP PTKVZ   _   6   case    _   _
de-ud-dev.conllu:4932:7 mit mit ADP PTKVZ   _   10  case    _   _
de-ud-dev.conllu:9754:4 nach    nach    ADP PTKVZ   _   3   case    _   _
de-ud-test.conllu:2416:6    durch   durch   ADP PTKVZ   _   3   case    _   _
de-ud-test.conllu:17952:7   abwärts    abwärts    ADP PTKVZ   _   4   case    _   _
de-ud-train.conllu:8316:3   nach    nach    ADP PTKVZ   _   6   case    _   _
de-ud-train.conllu:9327:3   nach    nach    ADP PTKVZ   _   2   case    _   _
de-ud-train.conllu:14296:3  aus aus ADP PTKVZ   _   4   case    _   _
de-ud-train.conllu:44112:8  mit mit ADP PTKVZ   _   10  case    _   _
de-ud-train.conllu:65876:26 statt   statt   ADP PTKVZ   _   35  case    _   _
de-ud-train.conllu:75573:12 aus aus ADP PTKVZ   _   15  case    _   _
de-ud-train.conllu:81758:5  mit mit ADP PTKVZ   _   3   case    _   _
de-ud-train.conllu:103304:29    nach    nach    ADP PTKVZ   _   30  case    _   _
de-ud-train.conllu:108120:25    auf auf ADP PTKVZ   _   26  case    _   _
de-ud-train.conllu:156494:34    mit mit ADP PTKVZ   _   33  case    _   _
de-ud-train.conllu:187083:6 mit mit ADP PTKVZ   _   7   case    _   _
de-ud-train.conllu:209441:10    auf auf ADP PTKVZ   _   14  case    _   _
de-ud-train.conllu:209447:16    auf auf ADP PTKVZ   _   20  case    _   _
de-ud-train.conllu:215266:34    mit mit ADP PTKVZ   _   39  case    _   _
de-ud-train.conllu:220025:6 über   über   ADP PTKVZ   _   13  case    _   _
de-ud-train.conllu:238138:18    an  an  ADP PTKVZ   _   28  case    _   _
de-ud-train.conllu:257937:8 statt   statt   ADP PTKVZ   _   20  case    _   _
de-ud-train.conllu:269216:20    statt   statt   ADP PTKVZ   _   24  case    _   _
de-ud-train.conllu:278315:10    nach    nach    ADP PTKVZ   _   13  case    _   _
de-ud-train.conllu:282499:7 ab  ab  ADP PTKVZ   _   11  case    _   _
de-ud-train.conllu:283507:9 auf auf ADP PTKVZ   _   16  case    _   _
de-ud-train.conllu:288192:20    nach    nach    ADP PTKVZ   _   19  case    _   _

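The listing above can also be produced with a small script that additionally reports whether each hit precedes or follows its head, which may help when deciding between a prepositional and a postpositional or circumpositional tag. This is only a sketch: it assumes 10-column CoNLL-U, the function name `ptkvz_case_hits` is hypothetical, and the position relative to the head is a rough hint rather than a tagging rule. The two sample lines reuse the first two tokens from the grep output.

```python
def ptkvz_case_hits(conllu_text):
    """List (form, id, head, position) for tokens with XPOS PTKVZ and DEPREL case."""
    hits = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) != 10 or line.startswith("#"):
            continue
        if not (cols[0].isdigit() and cols[6].isdigit()):
            continue  # skip multiword-token ranges and empty nodes
        if cols[4] == "PTKVZ" and cols[7] == "case":
            tok_id, head = int(cols[0]), int(cols[6])
            position = "before head" if tok_id < head else "after head"
            hits.append((cols[1], tok_id, head, position))
    return hits


# Two token lines copied from the grep output (they come from different sentences).
GREP_SAMPLE = "\n".join([
    "7\taus\taus\tADP\tPTKVZ\t_\t6\tcase\t_\t_",
    "7\tmit\tmit\tADP\tPTKVZ\t_\t10\tcase\t_\t_",
])

for hit in ptkvz_case_hits(GREP_SAMPLE):
    print(hit)
```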
jnivre commented 7 years ago

I don't think there is a universal preference for ADP over ADV. This has to be settled on a language-specific basis. Verb particles can even be ADJ sometimes ("cut short"). The basic idea is that we regard "verb particle" as a syntactic function, not a part-of-speech category. Different languages allow different word classes to have this function (none at all in most languages).

gossebouma commented 7 years ago

In general, I think Dutch annotation should be very comparable to that for German. However, a quick check for verbal particles for the Dutch LassySmall corpus gives (we are not using compound for anything else I think):

grep compound *.conllu |cut -f 4 |sortnr

762 PART
178 ADP
176 ADV
134 NOUN
98 ADJ
14 VERB
2 DET

We use PART for the 'prepositional' particles, and apparently the original annotation is also quite liberal in admitting other constituents as part of a verb (there is a single relation for all dependents of (complex, multi-word) verbal expressions; that is the source of this). Interestingly, we also have compound,ADP cases, in complex verbal expressions like 'in gebruik nemen' (in use take, i.e. open, make available). There is a logic here, but I am happy to adjust the POS tags for the sake of consistency.

simon-clematide commented 7 years ago

Ok, then the universal guidelines are somewhat incomplete by not mentioning adjectives (schön|reden in German) or even nouns (staub|saugen in German). But we still need a guideline for German, I would say. If verb particles are not pos-tagged by a special tag as in all other German Treebanks, we have to resort to other POS tags. Origin or semantics are not very clear-cut to me, and verb particles can transport quite different meanings compared to their prepositional variants (statt|geben). A global lexical preference relation would at least clarify some of the particles that might appear in different functions.

jnivre commented 7 years ago

I completely agree. This is why we in general need language-specific guidelines. The universal guidelines lay down the general principles. The language-specific guidelines describe how these principles have been implemented in a specific language. Ideally, the language-specific implementation should be consistent across related languages (like German and Dutch), and making this explicit will make it easier to identify inconsistencies. I would suggest putting this information into the language-specific documentation of compound:prt: http://universaldependencies.org/de/dep/compound-prt.html.

dan-zeman commented 7 years ago

PTKVZ+case changed to APPR, APPO or APZR.

I did not change the UPOS anywhere. I am not sure whether there is a consensus but I also do not have more time now.

simon-clematide commented 7 years ago

@dan-zeman: Great! Things are going very fast now:-)

@jnivre: I could contribute to the language-specific documentation.

However, this documentation would be for Version 1.4 of the German UD and it would not reflect the situation of the current release 1.3. Will there be an official 1.4 release before the already announced Version 2?

Alternatively, we could stick to formulations like: Starting from Version 1.4, the subordinating conjunction um that introduces an infinitive sub-clause has the UPOS tag 'SCONJ' and is attached as mark to the head of the sub-clause.
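A formulation like this is easy to check against the data. The sketch below flags every um attached as mark whose UPOS is not SCONJ; it assumes 10-column CoNLL-U with lemmas in column 3, and the function name `um_mark_not_sconj` and the two sample lines are made up for illustration.

```python
def um_mark_not_sconj(conllu_text):
    """Return (id, form, upos) for tokens with lemma um and DEPREL mark but UPOS != SCONJ."""
    problems = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) != 10 or line.startswith("#"):
            continue
        if cols[2].lower() == "um" and cols[7] == "mark" and cols[3] != "SCONJ":
            problems.append((cols[0], cols[1], cols[3]))
    return problems


# Invented token lines: the first follows the older ADP scheme,
# the second follows the proposed SCONJ scheme.
UM_SAMPLE = "\n".join([
    "1\tum\tum\tADP\tKOUI\t_\t4\tmark\t_\t_",
    "1\tUm\tum\tSCONJ\tKOUI\t_\t3\tmark\t_\t_",
])

print(um_mark_not_sconj(UM_SAMPLE))
```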

jnivre commented 7 years ago

There will be a 1.4 release for those who wish to release new or improved data, but we recommend people to devote their efforts to v2. This has been announced on the UD mailing list. Aren't you a subscriber?

dan-zeman commented 7 years ago

@simon-clematide : We take documentation (guidelines) as describing the desired state. If the current data differs from the desired state because we were not able to make it comply with the guidelines, there should be a "Diff" section for the particular treebank on the doc page(s) to which the difference applies. However, we rarely refer to the previous release if we know that it is going to be fixed in the next one. It is not ideal that there is a temporary inconsistency between documentation and data, but it has the advantage that we do not have to remember to modify the documentation after the release.