UniversalDependencies / UD_Russian-SynTagRus

Russian data from the SynTagRus corpus.

converging UD_Russian and UD_Russian-SynTagRus annotation #10

Open olesar opened 7 years ago

olesar commented 7 years ago
  1. Compound numerals (incl. constructions with тысяча, миллион). Cases like "сорок пять" should be annotated as сорок >flat пять according to http://universaldependencies.org/u/dep/flat.html.
     In UD2.0 files:
     ru: сорок >compound пять, сорок <nummod пять
     ru-syntagrus: сорок <nummod:gov пять
     The "universal" approach is somewhat problematic, since in двадцать один, двадцать два, двадцать три, двадцать четыре the last numeral determines the case of the noun (cf. nummod:gov), so we will have different tags on the first numeral word depending on what its dependent is.
     ::::: 1--4: the rules seem to be all right, but some overgeneralization happens.
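To see how the treebanks currently diverge, something like the following could list the relation between adjacent numerals. It is only a rough sketch: the tuple layout `(id, form, upos, head, deprel)` and the function name `numeral_pair_relations` are assumptions for illustration, and the phrase "сорок пять лет" is an invented example, not taken from either treebank.

```python
# Each token is (id, form, upos, head, deprel): a hypothetical, minimal view of a CoNLL-U row.

def numeral_pair_relations(sentence):
    """Return (first_form, second_form, relation) for adjacent NUM NUM pairs.

    '>' means the first numeral governs the second, '<' means the second
    governs the first (the notation used in this thread). Pairs that are
    not directly linked are reported as 'unlinked'.
    """
    pairs = []
    for left, right in zip(sentence, sentence[1:]):
        if left[2] != "NUM" or right[2] != "NUM":
            continue
        if right[3] == left[0]:
            rel = ">" + right[4]
        elif left[3] == right[0]:
            rel = "<" + left[4]
        else:
            rel = "unlinked"
        pairs.append((left[1], right[1], rel))
    return pairs


# Invented phrase "сорок пять лет" in two of the annotation styles discussed above.
syntagrus_style = [
    (1, "сорок", "NUM", 2, "nummod:gov"),
    (2, "пять", "NUM", 3, "nummod:gov"),
    (3, "лет", "NOUN", 0, "root"),
]
flat_style = [
    (1, "сорок", "NUM", 3, "nummod:gov"),
    (2, "пять", "NUM", 1, "flat"),
    (3, "лет", "NOUN", 0, "root"),
]

print(numeral_pair_relations(syntagrus_style))
print(numeral_pair_relations(flat_style))
```

Counting the third element of each tuple over a whole treebank would show how often each style occurs before any conversion rule is written.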
olesar commented 7 years ago
  1. There are cases like два девяносто (standing for 'two (roubles) 90 (kopecks)') and три двадцать (standing for 'three (hours) 20 (min)'). These need attention.
olesar commented 7 years ago
  1. NUM + NUM.Gen: меньше пяти, больше пяти. In UD2.0 files:

ru:
6 более БОЛЕЕ ADV RBR Degree=Cmp 8 advmod
7 двух ДВА NUM CD Animacy=Inan|Case=Gen|Gender=Fem 8 compound
8 тысяч ТЫСЯЧА NOUN NN Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur 9 nummod:gov
9 человек ЧЕЛОВЕК NOUN NN Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur 5 nsubj _ SpaceAfter=No

ru-syntagrus:
5 более более ADV Degree=Cmp 7 nummod:gov 7:nummod:gov
6 пяти пять NUM Case=Gen 7 nummod 7:nummod
7 лет год NOUN _ Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur 4 obl 4:obl SpaceAfter=No

1 Более более ADV Degree=Cmp 4 nsubj 4:nsubj
2 двух два NUM Case=Gen 3 nummod 3:nummod
3 месяцев месяц NOUN Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur 1 nmod 1:nmod
4 прошло проходить VERB Aspect=Perf|Gender=Neut|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act 0 root 0:root

21 больше много NUM 23 nummod:gov 23:nummod:gov
22 300 300 NUM 23 nummod 23:nummod
23 заявок заявка NOUN Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur 20 obl 20:obl
(NB different pos)

olesar commented 7 years ago

более/больше/менее/меньше should be linked to the numeral head, cf. террористов там было не более двух.
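A check along these lines might help find the places that need fixing. It is a sketch under assumptions: it reads "linked to the numeral head" as "the comparative's head is the numeral that follows it", it matches on word forms because the lemma differs between examples (больше is lemmatized as много in one of them), and the dict layout and the name `misattached_comparatives` are made up for illustration.

```python
COMPARATIVES = {"более", "больше", "менее", "меньше"}


def misattached_comparatives(sentence):
    """Return (comparative_id, form, current_head, numeral_id) for every
    comparative that is directly followed by a NUM but not attached to it."""
    flagged = []
    for cmp_tok, next_tok in zip(sentence, sentence[1:]):
        if cmp_tok["form"].lower() not in COMPARATIVES:
            continue
        if next_tok["upos"] != "NUM":
            continue
        if cmp_tok["head"] != next_tok["id"]:
            flagged.append(
                (cmp_tok["id"], cmp_tok["form"], cmp_tok["head"], next_tok["id"])
            )
    return flagged


# Tokens 5-7 of the ru-syntagrus fragment "более пяти лет" quoted in this thread.
fragment = [
    {"id": 5, "form": "более", "upos": "ADV", "head": 7, "deprel": "nummod:gov"},
    {"id": 6, "form": "пяти", "upos": "NUM", "head": 7, "deprel": "nummod"},
    {"id": 7, "form": "лет", "upos": "NOUN", "head": 4, "deprel": "obl"},
]

print(misattached_comparatives(fragment))
```

The check only reports candidates; which relation label the comparative should carry once reattached is left open here.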

olesar commented 7 years ago
  1. Compound ordinal numerals like сорок пятый. These pose a problem as well, since the last word agrees with the noun head. In UD2.0 files:
     ru: NA
     ru-syntagrus:
     12 сорок сорок NUM Case=Nom 13 nummod:gov 13:nummod:gov
     13 второго второй ADJ Case=Gen|Degree=Pos|Gender=Masc|Number=Sing 14 amod 14:amod
     14 года год NOUN Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing 11 nmod 11:nmod
dan-zeman commented 7 years ago

I think that the ordinals are compounds:

compound(второго, сорок)
amod(года, второго)
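If that analysis is adopted, the SynTagRus side would mostly need a relabelling, since сорок already depends on второго there. A possible sketch follows; the condition "a NUM attached by nummod or nummod:gov to the ADJ immediately after it" is my own heuristic rather than an agreed rule, and the function name and dict layout are hypothetical.

```python
import copy


def relabel_ordinal_compounds(sentence):
    """Return a copy of the sentence in which a NUM attached (nummod or
    nummod:gov) to the ADJ right after it gets the relation 'compound'."""
    result = copy.deepcopy(sentence)
    for num_tok, adj_tok in zip(result, result[1:]):
        if (
            num_tok["upos"] == "NUM"
            and adj_tok["upos"] == "ADJ"
            and num_tok["head"] == adj_tok["id"]
            and num_tok["deprel"].startswith("nummod")
        ):
            num_tok["deprel"] = "compound"
    return result


# Tokens 12-14 of the ru-syntagrus fragment "сорок второго года" quoted in this thread.
fragment = [
    {"id": 12, "form": "сорок", "upos": "NUM", "head": 13, "deprel": "nummod:gov"},
    {"id": 13, "form": "второго", "upos": "ADJ", "head": 14, "deprel": "amod"},
    {"id": 14, "form": "года", "upos": "NOUN", "head": 11, "deprel": "nmod"},
]

converted = relabel_ordinal_compounds(fragment)
print([(tok["form"], tok["head"], tok["deprel"]) for tok in converted])
```

The heuristic does not verify that the ADJ is really an ordinal, so its output would still need to be checked by hand.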

dan-zeman commented 7 years ago

Related to numerals is https://github.com/UniversalDependencies/docs/issues/455.

olesar commented 7 years ago

nmod (dep?) depending on ADJ or ADV --> obl

martinpopel commented 7 years ago

If the ADJ or ADV is the head of a copula construction, then you are right: such an ADJ|ADV should not have nmod children, but obl. In the remaining cases we should be careful: the ADJ could be the head of a noun phrase with an elided noun, and then an nmod child is correct.

BTW: This is exactly the case when the nmod vs. obl distinction is needed because it cannot be reconstructed fully automatically (at least not easily).
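The safe part of this rule could be automated and the rest queued for manual review, roughly as below. This is a sketch, not a tested conversion script: the dict layout, the name `classify_nmod_under_adj_adv` and both example sentences are invented. Note that predicative adjectives without an overt copula (Russian has no overt copula in the present tense) would also end up in the review pile.

```python
def classify_nmod_under_adj_adv(sentence):
    """For each nmod whose head is an ADJ or ADV, return (id, form, decision).

    decision is 'obl' when the head has a cop child (safe to convert) and
    'review' otherwise (possible noun phrase with an elided noun).
    """
    by_id = {tok["id"]: tok for tok in sentence}
    heads_with_copula = {tok["head"] for tok in sentence if tok["deprel"] == "cop"}
    decisions = []
    for tok in sentence:
        if tok["deprel"] != "nmod":
            continue
        head = by_id.get(tok["head"])
        if head is None or head["upos"] not in {"ADJ", "ADV"}:
            continue
        decision = "obl" if head["id"] in heads_with_copula else "review"
        decisions.append((tok["id"], tok["form"], decision))
    return decisions


# Invented example with an overt copula: "Он был доволен результатом".
with_copula = [
    {"id": 1, "form": "Он", "upos": "PRON", "head": 3, "deprel": "nsubj"},
    {"id": 2, "form": "был", "upos": "AUX", "head": 3, "deprel": "cop"},
    {"id": 3, "form": "доволен", "upos": "ADJ", "head": 0, "deprel": "root"},
    {"id": 4, "form": "результатом", "upos": "NOUN", "head": 3, "deprel": "nmod"},
]
# Invented example without a copula: "богатые России".
without_copula = [
    {"id": 1, "form": "богатые", "upos": "ADJ", "head": 0, "deprel": "root"},
    {"id": 2, "form": "России", "upos": "PROPN", "head": 1, "deprel": "nmod"},
]

print(classify_nmod_under_adj_adv(with_copula))
print(classify_nmod_under_adj_adv(without_copula))
```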

olesar commented 7 years ago

acl with participles (single participles vs. participle groups), advcl vs. acl. These need attention.

olesar commented 7 years ago

discourse/parataxis is tagged differently in the two treebanks ==> Cross-Check Task, scheduled March 2018 ==> Olga makes two lists

olesar commented 7 years ago

vocative: check parataxis & NOUN & Animacy=Anim in ru-SynTagRus
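The check could be run over the CoNLL-U file with a small filter like this one. It assumes the standard ten tab-separated CoNLL-U columns; the function names and the sample sentence "Доктор, помогите" are invented for illustration.

```python
def parse_feats(feats):
    """Turn 'A=x|B=y' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def vocative_candidates(conllu_text):
    """Return (id, form) for tokens tagged parataxis + NOUN + Animacy=Anim."""
    hits = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        tok_id, form, upos, feats, deprel = cols[0], cols[1], cols[3], cols[5], cols[7]
        if (
            deprel == "parataxis"
            and upos == "NOUN"
            and parse_feats(feats).get("Animacy") == "Anim"
        ):
            hits.append((int(tok_id), form))
    return hits


# Invented sample sentence; columns follow the CoNLL-U order
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
rows = [
    ["1", "Доктор", "доктор", "NOUN", "_",
     "Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing", "3", "parataxis", "_", "SpaceAfter=No"],
    ["2", ",", ",", "PUNCT", "_", "_", "1", "punct", "_", "_"],
    ["3", "помогите", "помочь", "VERB", "_", "Mood=Imp|Number=Plur", "0", "root", "_", "_"],
]
sample = "# text = Доктор, помогите\n" + "\n".join("\t".join(r) for r in rows) + "\n"

print(vocative_candidates(sample))
```

The hits are only candidates for relabelling as vocative; an animate noun in parataxis is not necessarily an address.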