UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Attributive modification as incorporation in Chukchi #704

Closed ftyers closed 4 years ago

ftyers commented 4 years ago

In Chukchi, there are two processes of adnominal modification: unincorporated and incorporated. Unincorporated means that the tokens are whitespace-separated; incorporated means that they are not (much like compounds in German or Swedish). Which process is used depends on the case of the head noun: if the noun is in the absolutive case, modifiers are not incorporated; if it is in any other case, they are. As Dunn (1999: p.291) writes,

Incorporated Adjectives. Adjective stems must be incorporated when functioning as modifiers of non-absolutive case nouns. They are also incorporated by absolutive nouns when referring to entities of low discourse salience. Incorporation of adjectives is discussed in §9.4 [ ... ] 9.4.1 Apart from attributive adjectives, Chukchi can also incorporate other NP elements such as demonstratives and pronominal possessors. These seem like syntactic phenomena, which is typologically very unexpected.² Any nominal with modifiers which is to act as a non-absolutive argument must use incorporation.
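
The distribution described above can be summarised as a small decision rule. This is only a sketch: the function name and the `low_salience` flag are illustrative assumptions, and the case labels are UD-style `Case` values such as `Abs`, `Ins` and `Loc`.

```python
def modifier_is_incorporated(head_case: str, low_salience: bool = False) -> bool:
    """Return True if a modifier is expected to be incorporated into its head noun.

    head_case    -- UD-style Case value of the head noun, e.g. "Abs", "Ins", "Loc"
    low_salience -- hypothetical flag for referents of low discourse salience
    """
    if head_case != "Abs":
        # Non-absolutive heads: incorporation is obligatory.
        return True
    # Absolutive heads incorporate only for low-salience referents.
    return low_salience


print(modifier_is_incorporated("Loc"))  # True
print(modifier_is_incorporated("Abs"))  # False
```

Discourse salience is not something that can be read off the surface form, so in practice the second argument would have to come from the annotator.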

In the corpus I am using, I have so far come across this with adjectives, possessives and numerals.

I'll take possessives first, from this story

In principle the possessives could be done with features, as in languages which have real possessive suffixes, e.g. Number[psor] and Person[psor]. However, we would then get possessives as features with non-absolutive case NPs and as separate words with absolutive case NPs, and I'm not sure that is internally consistent. My current thought is to annotate it as follows:

# text = Гымыкытчаяыʼм вэриву вэривуунъыкинэт эти гичининэт.
# text[phon] = ɣəməkətsajaʔəm weriwu weriwuunʔəkinet эти ɣisininet
# text[rus] = Моя тётя брусничные эти (листья) собрала.
# text[eng] = My aunt gathered lingonberry leaves.
1-3     Гымыкытчаяыʼм   _       _       _       _       _       _       _       Gloss=я-POSS.INC-тётя-INS-=EMPH
1       Гымык   гым     PRON    _       Animacy=Anim|Number=Sing|Person=1|Possessive=Yes|PronType=Pers  2       nmod:poss       _       Gloss=я-POSS.INC
2       ытчая   _       NOUN    _       Case=Ins|Number=Sing    7       nsubj   _       Gloss=тётя-INS
3       ыʼм     ъм      PART    _       _       2       discourse       _       Gloss=EMPH
4       вэриву  вэриву  X       _       _       5       reparandum      _       Gloss=FST
5       вэривуунъыкинэт _       NOUN    _       _       7       obj     _       Gloss=кислый-ягода-REL-PL
6       эти     _       X       _       _       7       discourse       _       Gloss=
7       гичининэт       гичик   VERB    _       Aspect=Perf|Number[agent]=3|Number[obj]=Plur|Person[agent]=3|Person[obj]=3|Tense=Aor|Valency=2|VerbForm=Fin     0       root    _       Gloss=2/3.S/A-собирать-3SG.A.3.O-3SG.O-PL
8       .       .       PUNCT   _       _       7       punct   _       _

A similar thing happens with adjectives, as in this example:

# text = Ынкы гым тъурэтгъэк аԓваярак.
# text[phon] = ənkə ɣəm tʔuretɣʔek aɬwajarak
# text[rus] = Там я родилась, в чужой яранге.
# text[eng] = I was born there in someone else's yaranga.
1   Ынкы    ынкы    PRON    _   Case=Loc    3   obl _   Gloss=тот-LOC
2   гым гым PRON    _   Number=Sing|Person=1|PronType=Pers  3   nsubj   _   Gloss=я
3   тъурэтгъэк  _   VERB    _   Number[subj]=Sing|Person[subj]=1    0   root    _   Gloss=1SG.S/A-родиться-TH-1SG.S
4-5 аԓваярак    _   _   _   _   _   _   _   Gloss=другой-яранга-LOC
4   аԓва    _   ADJ _   _   5   amod    _   Gloss=другой|SpaceAfter=No
5   ярак    _   NOUN    _   Case=Loc    3   obl _   Gloss=яранга-LOC|SpaceAfter=No
6   .   .   PUNCT   _   _   3   punct   _   _
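
For readers less familiar with how multiword-token ranges are stored, here is a minimal sketch of reading them back out of CoNLL-U using only the standard library. The function name `read_mwts` is made up for illustration, and the sample reuses the last rows of the sentence above with the columns re-joined by tabs, as the CoNLL-U format requires.

```python
def read_mwts(conllu: str) -> dict:
    """Map each multiword-token surface form to the forms of the words it spans."""
    ranges = []  # (start ID, end ID, surface form) for each range line such as "4-5"
    words = {}   # word ID -> word form
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            start, end = map(int, tok_id.split("-"))
            ranges.append((start, end, form))
        elif tok_id.isdigit():  # ordinary syntactic word (empty nodes like "1.1" are skipped)
            words[int(tok_id)] = form
    return {form: [words[i] for i in range(start, end + 1)]
            for start, end, form in ranges}


# Rows copied from the annotation above; whitespace is converted to tabs here.
ROWS = [
    "4-5 аԓваярак _ _ _ _ _ _ _ Gloss=другой-яранга-LOC",
    "4 аԓва _ ADJ _ _ 5 amod _ Gloss=другой|SpaceAfter=No",
    "5 ярак _ NOUN _ Case=Loc 3 obl _ Gloss=яранга-LOC|SpaceAfter=No",
    "6 . . PUNCT _ _ 3 punct _ _",
]
SAMPLE = "\n".join("\t".join(row.split()) for row in ROWS)

print(read_mwts(SAMPLE))  # {'аԓваярак': ['аԓва', 'ярак']}
```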

And this example has two adjectives "thin" and "rubber" modifying a single noun "boots", from this text:

1-5 Выԓгыкирзовыйчапокԓьынъыма  _   _   _   _   _   _   _   Gloss=тонкий-кирзовый-сапог-ATTR-NOM.SG-=EMPH-=PTCL
1   Выԓгы   выԓгыԓьын   ADJ _   _   3   amod    _   Gloss=тонкий
2   кирзовый    кирзовый    ADJ _   _   3   amod    _   Gloss=кирзовый
3   чапокԓьын   чапок   NOUN    _   Case=Abs|Number=Sing    0   root    _   Gloss=сапог-ATTR-NOM.SG
4   ъым ъм  PART    _   _   3   discourse   _   Gloss=EMPH
5   а   а   PART    _   _   3   discourse   _   Gloss=PTCL

Here is a more contrastive example with two numerals (from the same text):

Incorporated: Ԓюутэ Абрамович ӈыронвэрталёта вакъогъэ нэмыӄэй. "Suddenly, Abramovich also landed with three helicopters."

Unincorporated: Ынкъа ԓюут ӈыръа вэрталёттэ вакъогъатӈа. "Suddenly, four helicopters landed."

Given that this phenomenon is much more transparent than the ones mentioned in #701 and #703, and does not involve argument structure/valency, I would tend to treat them using the multiword tokens that we have available in the basic dependencies. Inflectional circumfixes would then go around the head of the combined unit, as in this example from Dunn's grammar (I have not found any examples of circumfix + adnominal incorporation in the corpus I'm using so far).

1-2 гаппыԓората _   _   _   _   _   _   _   Gloss=COM-little-house-COM
1   ппыԓо   ппыԓо   ADJ _   _   2   amod    _   Gloss=little
2   гарата  ра  NOUN    _   Case=Com    0   root    _   Gloss=COM-house-COM

(my transliteration) ... Note that this would require creating a wordform гарата "COM-house-COM", but creating noun forms is a lot more practical for an annotator than creating verb forms.
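
One practical consequence of this analysis is that the word forms no longer concatenate to the surface token, because the head's circumfix wraps around the incorporated modifier. The sketch below checks whether a surface form can be obtained by inserting the modifier(s) somewhere inside the head form; the function name `circumfix_split` is a hypothetical helper for illustration, not part of any UD tool.

```python
from typing import Optional, Sequence


def circumfix_split(surface: str, modifiers: Sequence[str], head: str) -> Optional[int]:
    """Return the position in `head` at which the concatenated modifiers must be
    inserted to yield `surface`, or None if no such position exists.

    A result of 0 means plain concatenation (modifiers + head); a result greater
    than 0 means that part of the head form precedes the modifiers, as with a
    circumfix.
    """
    inner = "".join(modifiers)
    for i in range(len(head) + 1):
        if head[:i] + inner + head[i:] == surface:
            return i
    return None


# Circumfixed head: га- ... -та wraps around the incorporated adjective.
print(circumfix_split("гаппыԓората", ["ппыԓо"], "гарата"))  # 2
# Plain incorporation: the modifier simply precedes the head.
print(circumfix_split("аԓваярак", ["аԓва"], "ярак"))  # 0
```

A check like this could help catch typos when the head word form has to be constructed by hand, since a result of None would mean the words cannot be reassembled into the token at all.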

Are there any objections to this approach, or does anyone have any thoughts? Do people consider that this is sufficiently different from noun-noun compounding in Swedish/German to merit a different approach?


  1. Michael Dunn (1999) Grammar of Chukchi. PhD Thesis

amir-zeldes commented 4 years ago

I have no knowledge of Chukchi, but looking at it as an outsider, if the main issue is the lack of a space, I think a multiword token is the best choice here. We have non-spaced possessors and other modifiers in Coptic, Arabic and Hebrew, and the solution in all three has been to use multiword tokens and not treat it as incorporation.

ftyers commented 4 years ago

I'd be inclined to do that, but it is complicated by the fact that there can be circumfixes around the incorporated/compounded element. I was also not sure if this should be treated more like space-less nominal compounding in the Germanic languages (there are also examples of noun-noun compounds, e.g. джинсыӄонагыԓьын) or pronominal clitics in Romance languages. So far I'm leaning to the latter.

dan-zeman commented 4 years ago

MWT analysis sounds like a good idea to me here. The fact that possessives would be annotated differently in absolutive and non-absolutive nominals itself would not bother me too much (a vague analogy could be degrees of comparison in English, which are sometimes morphological and sometimes periphrastic). But if adjectives and numerals are treated the same way then I don't think it can be captured by features. That is, if it has to be captured at all: the German-compound analogy is still an option. For what it's worth, Sanskrit also has a lot of compounds and they are often split (treated as MWT), sometimes because of more intricate internal syntax but sometimes perhaps just because it is customary in Sanskrit linguistics (see also #539).

Some other notes to the trees above:

ftyers commented 4 years ago

@dan-zeman (1) yes it should be obj not obl, typo, fixed. (2) I've been doing discourse, I don't think it should be case as it isn't really a case marking element, it's kind of like the question word or focus words -kin, -kaan etc. in Finnish, but the Finnish treebank annotates them with a morphological feature as opposed to as separate tokens. I'd be open to changing the annotation, but I'm not sure what to replace it with.

About the other points, actually I was doing the German analysis for the noun and adjective modification, I didn't really like it, but I thought "that's what UD does, if it's going to cause problems because of orthography then we just have to put up with it". But then I saw the numeral examples and reread the part in the grammar that talks about attributives and thought that it really doesn't make sense to have each combination of 1...∞ as a separate item in the lexicon!

ftyers commented 4 years ago

Note, #709 discusses the second point regarding discourse.