UniversalDependencies / UD_Russian-SynTagRus

Russian data from the SynTagRus corpus.

some errors and questions #23

Open lionalion opened 6 years ago

lionalion commented 6 years ago

1) error in

 sent_id = 2003Bez_epokhi.xml_121
 text = По правде говоря, внешне оно совсем некрасиво, но его ценность определяется не столько формой, сколько содержанием.
15  столько столько CCONJ   _   _   16  cc  16:cc   _

should be

15  столько столько ADV _   Degree=Pos  16  conj    16:conj

2) error in

 sent_id = 2011Nano.xml_59
 text = Естественно, что первые десять манипуляторов при этом изготовят 10 х 10 = 100 штук манипуляторов, уменьшенных, однако, уже в 16 раз…
11  х   х   CCONJ   _   _   10  fixed   10:fixed    _

I think it should be

11  х   х   SYM _   _   10  fixed   10:fixed    _

3) And a question

 sent_id = 2006Uroki.xml_146
 text = В коньках нам Бог помог взять Журовой первую за много лет золотую медаль.
10  много   много   ADV _   Degree=Pos  11  nummod:gov  11:nummod:gov   _

but

 sent_id = 2006Veter.xml_81
 text = Несколько лет назад на окраине Новосибирска людям выделили участки для индивидуальной застройки.
1   Несколько   несколько   NUM _   Animacy=Inan|Case=Acc   2   nummod:gov  2:nummod:gov    _

Why is the word "Несколько" marked as a numeral, and the word "много" as an adverb? There are a lot of such examples.
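To see how widespread the inconsistency is, one could tally the UPOS tags of every token attached with `nummod:gov`. The following is only a minimal sketch: it assumes tab-separated CoNLL-U input with ten columns, the function name `upos_by_lemma` is made up for illustration, and the sample rows are the two tokens quoted above with tabs restored.

```python
from collections import Counter, defaultdict


def upos_by_lemma(conllu_text, deprel="nummod:gov"):
    """Map each lemma attached with `deprel` to a Counter of its UPOS tags."""
    tally = defaultdict(Counter)
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip malformed rows, multiword ranges (1-2), empty nodes (1.1)
        lemma, upos, rel = cols[2], cols[3], cols[7]
        if rel == deprel:
            tally[lemma][upos] += 1
    return tally


# The two tokens from the sentences quoted above (tabs restored by assumption).
sample = "\n".join([
    "# sent_id = 2006Uroki.xml_146",
    "10\tмного\tмного\tADV\t_\tDegree=Pos\t11\tnummod:gov\t11:nummod:gov\t_",
    "",
    "# sent_id = 2006Veter.xml_81",
    "1\tНесколько\tнесколько\tNUM\t_\tAnimacy=Inan|Case=Acc\t2\tnummod:gov\t2:nummod:gov\t_",
])

for lemma, counts in sorted(upos_by_lemma(sample).items()):
    print(lemma, dict(counts))
```

Run over the whole treebank file instead of `sample`, any lemma that shows more than one UPOS tag under the same relation would be a candidate for harmonisation.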

And a second question: might it be better to mark

многие
многими
многим
многих
несколькими
нескольким
нескольких
нескольку
оба
обеими
обеим
обеих
обе
обоего
обоими
обоим
обоих

not NUM, but DET?

Thanks in advance!

ftyers commented 5 years ago

I've modified the formatting of your post to make it slightly easier to read; I hope you don't object!

Regarding (1), it looks like a specific construction, ... не столько ..., сколько ..., which is (sometimes?) marked as союз (conjunction) in dictionaries.

Regarding (2), I think you are right, this should be SYM, not CCONJ.

Regarding (3), the syntax looks the same to me here, so they should be the same (probably the NUM tag), but let's ask @olesar. I don't think it should be DET as the government is different.

dan-zeman commented 5 years ago

The guidelines allow NUM only for definite quantities but несколько “a few” is indefinite and cannot be NUM under the current guidelines. Indefinite quantifiers are supposed to be DET, which is somewhat strange for this particular word, given its syntactic behavior. (But as a matter of fact, Czech několik is tagged DET although it is as strange as in Russian.)

olesar commented 4 years ago

oba denotes a definite quantity, 'two, both'. However, the German-centric tradition suggests tagging it as DET. neskol'ko and mnogo behave exactly like numerals in Russian with respect to their paradigm structure, government properties, etc. (Zalizniak 1967). Their behaviour differs from that of the determiner-like group mnogij, nemnogij (cf. много людей vs многие люди). mnogo can have a comparative degree, which a DET is unlikely to have.

Taking this into account, and in order to keep the mapping between the RNC and UD schemes as consistent as possible, UD-SynTagRus does not follow the current guidelines (however, it follows the definition "A numeral is a word functioning most typically as a determiner, adjective or pronoun").

Btw, cross-linguistically, many numerals can denote both definite and indefinite quantities, so the guidelines won't apply to them either.

olesar commented 4 years ago

мало, немного, немало are tagged ADV in v2.5. Should we change upos to DET in this case? NB мало has a comparative form, cf. мало-меньше, много-больше.

dan-zeman commented 4 years ago

> мало, немного, немало are tagged ADV in v2.5. Should we change upos to DET in this case?

I don't know :-) This area does not fit easily into the usual assumptions about standard POS categories. Comparatives are associated with adverbs more than with numerals. I bet that the words can occasionally be used as adverbial modifiers (degree of action of the verb). But when they are used to quantify a noun phrase, it would seem appropriate to treat them the same way as несколько.
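One way to separate the adverbial uses from the quantifying ones would be to profile these lemmas by dependency relation. This is a rough sketch under stated assumptions: tab-separated CoNLL-U input, a hypothetical helper named `usage_profile`, and two invented rows that are not taken from the treebank.

```python
from collections import Counter

QUANTITY_LEMMAS = {"мало", "немного", "немало"}  # the words named above


def usage_profile(conllu_text, lemmas=QUANTITY_LEMMAS):
    """Count (lemma, UPOS, DEPREL) triples for the given lemmas."""
    profile = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skips comments, blank lines, ranges and empty nodes
        if cols[2] in lemmas:
            profile[(cols[2], cols[3], cols[7])] += 1
    return profile


# Invented rows for illustration only; they are NOT taken from the treebank.
sample = "\n".join([
    "2\tмало\tмало\tADV\t_\tDegree=Pos\t3\tnummod:gov\t3:nummod:gov\t_",
    "4\tнемного\tнемного\tADV\t_\tDegree=Pos\t5\tadvmod\t5:advmod\t_",
])

for key, count in sorted(usage_profile(sample).items()):
    print(key, count)
```

Triples with `nummod:gov` would be the noun-phrase quantifier uses that could be retagged in line with несколько, while `advmod` triples would stay ADV.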