UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0
272 stars 247 forks source link

PoS tagging of participial adjectives in Russian and other Slavic languages #348

Closed xsway closed 7 years ago

xsway commented 8 years ago

I have been looking at how participial adjectives are annotated in UD since I’d like to study them using treebanks across several Slavic languages and a uniform annotation would be very desirable for my purposes. I first looked in detail at Russian, since it's my native language.

According to the documentation http://universaldependencies.org/ru/pos/ADJ.html, deverbal adjectives (passive, present and past participial adjectives) should be tagged as ADJ. This is however not true for either Google Russian nor Syntagrus treebanks. In fact, participials are most often (but not always) tagged as VERB instead:

Examples from Syntagrus:

продолжающийся [спад] ПРОДОЛЖАТЬСЯ VERB недостающий [материал] НЕДОСТАВАТЬ VERB окружающий [мир] ОКРУЖАТЬ VERB пишущая [машинка] ПИСАТЬ VERB

[голос] доносившийся [из кабинета] ДОНОСИТЬСЯ VERB [догматизм] перерастающий [в ..] ПЕРЕРАСТАТЬ VERB [закон] регулирующий [сотрудничество] РЕГУЛИРОВАТЬ VERB

пронизывающий [декабрь] ПРОНИЗЫВАЮЩИЙ ADJ следующий [год] СЛЕДУЮЩИЙ ADJ всеобъемлющий [гуманизм] ВСЕОБЪЕМЛЮЩИЙ ADJ Обслуживающий [персонал] ОБСЛУЖИВАЮЩИЙ ADJ

That is, from what I can see, all postnominal participials are tagged as VERB. Prenominal participials are tagged sometimes as verbs, sometimes as adjectives. I don’t see what is the difference in such cases, for example for "недостающий [материал]” (VERB) vs “обслуживающий [персонал]” (ADJ).

By contrast, in Czech, the annotation seems to respect the documentation which also assigns participials to ADJ. For example, “napsané" in "na mluvnice napsané jezuity” is tagged as ADJ (with “LDeriv” field giving the corresponding verb).

I wonder what is happening in other Slavic languages? Is the PoS tag annotation choice for participials the same for other treebanks (Slovenian, Polish, ..) - both on the paper and in the implementation? If so, should Russian participials be converted to the same annotation standard?

Thanks for considering my questions and for doing impressive job at maintaining the UD project, Kristina

olesar commented 8 years ago

Hi Kristina,

just a couple of comments on your examples from SynTagRus. Actually, all postnominal uses you cite:

[голос] доносившийся [из кабинета] ДОНОСИТЬСЯ VERB [догматизм] перерастающий [в ..] ПЕРЕРАСТАТЬ VERB [закон] регулирующий [сотрудничество] РЕГУЛИРОВАТЬ VERB

are verbs rather than deverbal adjectives. Both meaning and valency patterns are preserved.

See more on diagnostic criteria of adjectivization (in Russian): http://rusgram.ru/%D0%9F%D1%80%D0%B8%D1%87%D0%B0%D1%81%D1%82%D0%B8%D0%B5#521

As for the prenominal uses, I would vote for ADJ in the case of пишущая [машинка](= a multi-word term) and perhaps недостающий [материал](one can see certain changes in valency in some cases, e.g. ?недостающий нам материал, but OK недостающий нам опыт) In more practical terms, if a deverbal adjective is not attested in the dictionary of the (dictionary-based) tagger, it is unlikely it will be tagged as ADJ in the corpus.

Best, Olga

xsway commented 8 years ago

Thanks for your reply, Olga!

The UD docs say however that the cases such as the one I cited should be tagged as ADJ: "сделанный “done” (passive participial adjective) делающий “doing” (present participial adjective, derived from present transgressive) сделавший “having done” (past participial adjective, derived from past transgressive)"

My question is whether the annotation of Russian corresponds to the documentation (it doesn't?), and how it is handled in other Slavic treebanks. The example I cite for Czech seems to be analogous to the examples I chose for Russian. Yet, the annotation is different (VERB vs ADJ).

The examples I cite and you elaborate on I think also raise the question whether some quite fine-grained distinctions (e.g. mentioned in the sections 5.2.1 and 5.2.2 of your source) should be annotated at the level of coarse universals tags.

Best, Kristina

dan-zeman commented 8 years ago

The more I think about it, the more I lean towards tagging (and lemmatizing) all Slavic participles except for the l-participle as deverbative adjectives. That does not prevent them from having verbal features such as VerbForm=Part, Tense=Pres|Past, Aspect=Imp|Perf, Voice=Act|Pass. Valency is also not a problem, deverbatives usually inherit valency, even if it is sometimes transformed to different case realizations. On the other hand, adjectives (like participles, except for the l-participle) inflect for Case, which is not true about verbs proper. Participles can be used both predicatively and attributively, just like adjectives. Treating them as verbs means we have a non-finite subordinate clause here, which sounds artificial to me in Slavic.

@xsway : regarding your question about other Slavic languages. I wrote an article about verbal forms in Slavic UD a few months ago. Most of the time it was a proposal how I would like to see the annotation (and I think my opinion on some details has shifted since I wrote the article) but there is also a brief overview of what the current (then-current) UD treebanks contained. Russian was not included in the overview, as it had not been released yet. The article is at http://ufal.mff.cuni.cz/pbml/105/art-zeman.pdf (the overview of current data is in Section 18, page 188).

olesar commented 8 years ago

@Dan: ok, got it. We will get more chaos at the morphological level (tagging ADJ with tense and voise), but I see your point. So, will we tag these participles acl or amod here:[голос] доносившийся [из кабинета] ДОНОСИТЬСЯ VERB[догматизм] перерастающий [в ..] ПЕРЕРАСТАТЬ VERB[закон] регулирующий [сотрудничество] РЕГУЛИРОВАТЬ VERB The only question left is the following:why do we still have VERBs for the attributive uses of participles in English and other non-Slavic languages?e.g.[en] really amazing the new and exciting plays done at this theatre !done.VERB[en] We've had about 5 repairs done on 3 different laptops.done.VERB[en] President Bush on Tuesday nominated two individuals to replace retiring jurists on federal courts in the Washington area.retiring.VERB[fr] ..dans des zones pensées jusqu' alors désertiques .pensées.VERB PS To be consistent, we should annotate gerundia (aka transgressives) as ADV then, right? I cannot get the point why the Czech adverbial participles aretagged VERBs.

dan-zeman commented 8 years ago

I do not believe that there is a universal rule for participles being either ADJ everywhere or VERB everywhere. We could define such a rule but I am afraid it would feel odd in some languages regardless what we do. (Maybe we need a separate tag PTCP :-)) That's why I am only aiming at Slavic-wide consistency.

Regarding converbs/transgressives, we could probably do what you suggest, although I guess I slightly prefer to leave them as VERB (but that's legacy bias from PDT). I just have not felt the urge to distinguish them from adverbial phrases because (1) there is not this case morphology; (2) there are probably no other deverbal adverbs, are there? so we do not have valency etc. with adverbs, and adverbial clauses exist anyway. But if we make converbs ADV, we will at least achieve consistency with South Slavic where they tag them so.

xsway commented 8 years ago

@dan-zeman Thank you for the link! Your paper clarifies a lot the situation and answers all my questions at the moment. And it's a very useful reference for this type of constructions in Slavic in general!

manning commented 7 years ago

I'm sure we can't aim for universal consistency on POS assignment for participles, but for the v2 docs, we should maybe try to at least have some suggestions in the universal docs as to when and why it is more appropriate to choose one POS vs. another.

dan-zeman commented 7 years ago

@manning : While I have an opinion on how to draw the line within the Slavic group, at present I do not dare to say anything more general.

West Slavic treebanks have now converged in their treatment of participles (https://github.com/UniversalDependencies/UD_Polish/commit/d7ef615c6eb4f49cb528cf3fff63b24b5d647300, https://github.com/UniversalDependencies/UD_Czech/commit/b417951cf7fe40b6adb51ceb7bdc946dafc71adf, https://github.com/UniversalDependencies/UD_Czech-CAC/commit/8dc04b2e722e7cffb3cae36882eb93e730066a45, https://github.com/UniversalDependencies/UD_Slovak/commit/408693dcdb52a85a852bbd5b4f5f8f3bfa857ce9).

Tentatively closing the issue. Feel free to reopen if something is left undecided.