UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Multiword prepositions in Russian #249

Closed ftyers closed 6 years ago

ftyers commented 8 years ago

In some Russian treebanks, multiword prepositions like "в течение" (during) are tokenised as a single unit:

1   в течение   В ТЕЧЕНИЕ   PR  PR  ZERO    4   обст    _   _
2   нескольких    НЕСКОЛЬКО  NUM     NUM     gen 3   количест    _   _
3   секунд    СЕКУНДА  S   S   plfgeninan  1   предл  _   _
4   раскладывается    РАСКЛАДЫВАТЬСЯ    V   V   ipfindicpraessg3p   0   ROOT    _   _
5   .   .   S   S   SENT    4   PUNC    _   _

In different treebanks we find different ways of dealing with this kind of thing:

1) In English, "in front of":

1   Put put VERB    Mood=Imp|VerbForm=Fin   0   root
2   a   a   DET Definite=Ind|PronType=Art   4   det
3   metal   metal   NOUN    Number=Sing 4   compound
4   detector    detector    NOUN    Number=Sing 1   dobj
5   in  in  ADP _   6   case
6   front   front   NOUN    Number=Sing 1   nmod
7   of  of  ADP _   10  case
8   every   every   DET _   10  det
9   train   train   NOUN    Number=Sing 10  compound
10  station station NOUN    Number=Sing 6   nmod

2) In Czech, "v průběhu":

13  v   v   ADP AdpType=Prep|Case=Loc   15  case
14  průběhu   průběh    NOUN    Animacy=Inan|Case=Loc|Gender=Masc|Negative=Pos|Number=Sing  13  mwe
15  evoluce evoluce NOUN    Case=Gen|Gender=Fem|Negative=Pos|Number=Sing    16  nmod
16  odvětvovaly    odvětvovat VERB    Animacy=Inan|Aspect=Imp|Gender=Fem,Masc|Negative=Pos|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act 6   ccomp

Are there any cross-lingual guidelines on which structure to prefer? In principle these constructions are the same: a preposition, followed by some noun-like element, and then a genitive (marked with 'of' in English).

Minor note: in the Czech documentation, "in contrast to" is shown marked with mwe, but in the English treebank it is annotated with the same scheme as in (1). Perhaps "because of" could be used instead (it is actually annotated with mwe in English), but then it wouldn't serve as an example of an "interruptible" multiword expression.
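For comparing treebanks, the difference between (1) and (2) can be checked mechanically. The following is only a rough sketch: the `Token` type and the `preposition_analysis` function are hypothetical names introduced here, and the token lists are fragments copied from the two examples above, not complete sentences.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Simplified token: a few of the columns from the examples above."""
    id: int
    form: str
    upos: str
    head: int
    deprel: str


def preposition_analysis(tokens, adp_id):
    """Return 'mwe' if some token attaches to the preposition as mwe,
    otherwise 'analytic' (the preposition is a plain case marker)."""
    for tok in tokens:
        if tok.head == adp_id and tok.deprel == "mwe":
            return "mwe"
    return "analytic"


# Fragment of example (1): "in front of ... station"
english = [
    Token(5, "in", "ADP", 6, "case"),
    Token(6, "front", "NOUN", 1, "nmod"),
    Token(7, "of", "ADP", 10, "case"),
    Token(10, "station", "NOUN", 6, "nmod"),
]

# Fragment of example (2): "v průběhu evoluce"
czech = [
    Token(13, "v", "ADP", 15, "case"),
    Token(14, "průběhu", "NOUN", 13, "mwe"),
    Token(15, "evoluce", "NOUN", 16, "nmod"),
]

print(preposition_analysis(english, 5))
print(preposition_analysis(czech, 13))
```

This only looks at direct mwe dependents of the preposition, so it says nothing about whether the expression ought to be treated as frozen.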

dan-zeman commented 8 years ago

UD guidelines are quite clear that we do not collapse MWEs into single nodes (although some datasets still do that), so the original Russian annotation of в течение as one node is not possible. Given that, I believe mwe is the most straightforward solution (it is really just a different means of saying that the expression is one unit).

I think both 1) and 2) are possible, and the difference is whether we believe that a particular expression is or is not a frozen multi-word preposition. That is clearly language-dependent, but I agree that we should try to get on the same page at least within groups of related languages.
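To make the "no collapsed nodes" point concrete, here is a minimal sketch of splitting a single-node entry such as в течение into separate words, with the later words attached to the first one as mwe. The function name `split_collapsed` and the simplified `(id, form, head, deprel)` rows are assumptions made for illustration. The sketch keeps the original relation labels and attachments of the source treebank and does not convert them to UD relations, which would be a separate step.

```python
def split_collapsed(rows, mwe_relation="mwe"):
    """Split rows whose form contains spaces into one row per word.

    rows: list of (id, form, head, deprel), ids 1-based, head 0 = root.
    Forms are assumed to be non-empty.
    """
    # First pass: map every old id to the new id of its first word.
    new_id = {0: 0}
    next_id = 1
    for old_id, form, _head, _deprel in rows:
        new_id[old_id] = next_id
        next_id += len(form.split())

    # Second pass: emit the rows with renumbered ids and heads.
    out = []
    for old_id, form, head, deprel in rows:
        words = form.split()
        first = new_id[old_id]
        # The first word keeps the original attachment and label.
        out.append((first, words[0], new_id[head], deprel))
        # Every later word attaches to the first word.
        for offset, word in enumerate(words[1:], start=1):
            out.append((first + offset, word, first, mwe_relation))
    return out


# The Russian example from the first comment, reduced to four columns.
russian = [
    (1, "в течение", 4, "обст"),
    (2, "нескольких", 3, "количест"),
    (3, "секунд", 1, "предл"),
    (4, "раскладывается", 0, "ROOT"),
    (5, ".", 4, "PUNC"),
]

result = split_collapsed(russian)
for row in result:
    print(*row, sep="\t")
```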

manning commented 8 years ago

Agree with @dan-zeman that under current guidelines, в течение as one node is not possible.

This leaves the choice between a multiword case marker and an analytic analysis as in the English example above, where "front" is still the syntactic head. On the one hand, from a more semantic viewpoint, viewing "in front of" as a multi-word preposition makes a lot of sense, and Sebastian and I have actually done some work on producing things like that as an "enhanced" representation for relating language and vision (upcoming LREC 2016 paper). I would be okay with that, but with MWEs it is tricky to decide where to stop. We tried to avoid making MWEs of things that are basically productive, and you can clearly form productive spatial relations (to the left of, to the side of, on the outside of, on the tip of), but it would be reasonable to regard the ones without "the" in English (in front of, in back of) as mwe.

Need some general policy with an eye towards applications ... I'll assign this one to Slav.

olesar commented 8 years ago

Under the current scheme, we assign "case"/"advmod"/etc. to the first word and "mwe" to all other words in Russian MWEs. Where to draw the line between MWEs and non-MWEs is a well-known problem. Different Russian resources provide different lists, as most MWEs are regular from the morphosyntactic viewpoint. At present we consider a merge of two lists (based on the RNC http://ruscorpora.ru/obgrams-PR.html and SynTagRus/ETAP) to be the best option.

However, it would be nice to compare the lists across (at least) Slavic languages and have a more consistent policy.
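A list-based scheme of this kind could be sketched roughly as follows. This is not the actual conversion code: the function `mark_mwes`, the shape of the `lexicon`, and its single entry (taken from the example at the top of the thread) are assumptions for illustration only, and the real merged lists are not reproduced here.

```python
def mark_mwes(forms, lexicon):
    """Label tokens that belong to a listed MWE.

    forms:   list of word forms in sentence order
    lexicon: dict mapping a tuple of lowercased words to the relation
             assigned to the first word (e.g. "case" or "advmod")

    Returns a list parallel to forms. Tokens outside any MWE get None.
    The first word of an MWE gets (relation, None), since its head lies
    outside the MWE; later words get ("mwe", index_of_first_word).
    """
    labels = [None] * len(forms)
    lowered = [f.lower() for f in forms]
    i = 0
    while i < len(forms):
        match = None
        for entry, relation in lexicon.items():
            n = len(entry)
            if n == 0:
                continue  # ignore empty entries
            # Prefer the longest entry that matches at position i.
            if tuple(lowered[i:i + n]) == entry and (
                match is None or n > len(match[0])
            ):
                match = (entry, relation)
        if match is None:
            i += 1
            continue
        entry, relation = match
        labels[i] = (relation, None)
        for j in range(i + 1, i + len(entry)):
            labels[j] = ("mwe", i)
        i += len(entry)
    return labels


# Hypothetical one-entry lexicon, for illustration only.
lexicon = {("в", "течение"): "case"}
sentence = ["в", "течение", "нескольких", "секунд", "раскладывается", "."]
labels = mark_mwes(sentence, lexicon)
print(labels)
```

Because the lookup is purely string-based, any difference between the lists used for different languages shows up directly as a difference in annotation.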

Olga


Olga Lyashevskaya

School of Linguistics, Faculty of Humanities Higher School of Economics, Moscow