Closed ftyers closed 4 years ago
Thanks for bringing this interesting issue up for discussion. I think it may be instructive to think about what we currently do for (a) compounds written as single words (without space), and (b) pro-drop.
For (a), which is found in Swedish and German, for example, we currently do nothing. Maybe this is not the ideal solution, but it suggests that, as long as a lexical analysis is reasonable, maybe it is okay to simply let the two words become one. I understand that this may be applicable to nominal incorporation but not verbal incorporation.
For (b), which is found in many languages (including Chukchi, apparently), we also currently do nothing. As a result, the representation of the argument structure is in some sense incomplete for pro-drop sentences and therefore problematic for (some) downstream applications. There have been proposals that this is something that can be fixed in enhanced dependencies, but currently this is not part of the guidelines. Therefore, I think option 2 is worth considering. The syntactic structure is then annotated as an intransitive clause, which apparently it is, but the features preserve the information that there is an incorporated object. If instead we want to represent the transitive clause structure, then option 4 is my favorite, because I think it fits best into the current guidelines.
Just my two cents ...
Thanks @jnivre for the comment. Regarding the issue of compounding in Swedish and German (or Finnish too for that matter) I agree that the current situation works quite well. In Chukchi, lexical incorporation/compounding works a bit differently (essentially Chukchi incorporates/compounds attributive modifiers of non-absolutive case nouns, so [broadly] "my big stone house is here" but "I live inmybigstonehouse."), but I'll make that into a separate issue.
I definitely think we don't want to annotate the transitive clause structure and would prefer to keep the basic dependencies reflecting the surface representation. So then the current solution would be to go with 2, and then wait to see what happens with the discussion of argument structure in enhanced dependencies? (are there any links to the proposals?) This issue could potentially be folded in with that and we could end up with something like 5 if the guidelines are changed.
By the way, for completeness I will add another thought I had last night:
Introduce a new row type in the CoNLL-U format to allow for annotating morpheme structure. This would be controversial given what we say about tokenisation (in that we say explicitly "Hence, morphological features are encoded as properties of words and there is no attempt at segmenting words into morphemes").
Example:
# sent_id = Walk:6
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1 Ынӄо ынӄо ADV _ _ 3 advmod _ Gloss=потом
2 нэмэ нэмэ ADV _ _ 3 advmod _ Gloss=опять
3 мытӈэйыттэнмык ттэн VERB _ Incorporated[obj]=Yes|Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1 0 root _ Gloss=1PL.S/A-сопка-взбираться-1PL.S/O
3:1 мыт мыт _ _ Number[subj]=Plur|Person[subj]=1|Tense=Aor 3:3 infl _ Gloss=1PL.S/A
3:2 ӈэйы ӈэгны NOUN _ _ 3:3 obj _ Gloss=сопка
3:3 ттэн ттэн VERB _ Incorporated[obj]=Yes|Valency=1 0 root _ Gloss=взбираться
3:4 мык мык _ _ Number[subj]=Plur|Person[subj]=1|Tense=Aor 3:3 infl _ Gloss=1PL.S/O
4 . . PUNCT _ _ 3 punct _ _
I'm not sure I'd recommend it, although it has some interesting possibilities. It could be used to optionally annotate compounds that may be ambiguous (we do this for English, but not for the other Germanic languages for example, as a result of the orthography, e.g. computer disk drive enclosure), or multiple verbal derivations (e.g. in the case of Turkish multiple causatives, see #197), incorporation of postpositions but not their complements (as in Crow), or dealing with morpheme scope hierarchies (as mentioned by Arkhangelsky and Lander, 2019).
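To make the cost of such a row type concrete, here is a minimal sketch of what a reader would have to do with it. It assumes the hypothetical `word:morpheme` ID scheme from the example above, which is not part of CoNLL-U, so existing tools would reject these rows. The function name `group_morpheme_rows` is made up for illustration, and the rows are cut down to their first four columns.

```python
from collections import defaultdict

# Rows taken from the example above, truncated to ID, FORM, LEMMA, UPOS.
# The "3:n" IDs are the proposed (non-standard) morpheme rows.
SAMPLE = [
    ["3", "мытӈэйыттэнмык", "ттэн", "VERB"],
    ["3:1", "мыт", "мыт", "_"],
    ["3:2", "ӈэйы", "ӈэгны", "NOUN"],
    ["3:3", "ттэн", "ттэн", "VERB"],
    ["3:4", "мык", "мык", "_"],
    ["4", ".", ".", "PUNCT"],
]


def group_morpheme_rows(rows):
    """Return word forms by ID and morpheme forms grouped by parent word ID."""
    words = {}
    morphemes = defaultdict(list)
    for cols in rows:
        row_id, form = cols[0], cols[1]
        if ":" in row_id:
            # Hypothetical morpheme row: "<word id>:<morpheme index>"
            parent, _, index = row_id.partition(":")
            morphemes[int(parent)].append((int(index), form))
        else:
            words[int(row_id)] = form
    # Order the morphemes of each word by their index
    ordered = {
        parent: [form for _, form in sorted(items)]
        for parent, items in morphemes.items()
    }
    return words, ordered


words, morphemes = group_morpheme_rows(SAMPLE)
print(morphemes[3])
print("".join(morphemes[3]) == words[3])
```

In this example the morpheme forms happen to concatenate back to the surface word, but nothing in the proposal guarantees that in general.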
Hi and thanks for raising this issue! I don't have a strong intuition which option is optimal, except maybe to say "don't break working stuff and be careful of modifying the format", since some tools would stop working with some of the more innovative suggestions.
I did want to point out that we have a similar situation in Coptic, which also has frequent incorporation with reduced forms for the constituent morphemes, though there the corresponding 'nouns' are usually not referenceable (so similar to "breastfeed" or "force feed", where you can no longer refer to the "force" or "breast" as "it" later on). For the Coptic Treebank we went with a version of Option 1, where we:
You can see examples here:
The latter example is what happens when you nominalize such a verb. For highly lexicalized cases you can actually get an additional non-incorporated object, so valency is not 100% reduced by the presence of incorporated objects. I should point out some considerations and consequences that this involved:
Due to the last point, I like the idea of Incorporated[obj]=Yes, but I realize this would mean a substantial re-annotation effort for us. I think it's important to consider which standard is likely to be adopted by sufficiently many contributors. I would personally be willing to consider re-doing this for Coptic, but our TB is just 42K tokens. For others, this might be prohibitive.
So I guess for now my vote is for Option 1. I find all of the options interesting though, so thanks again for laying this out so clearly!
Hi! Thank you for bringing forth this fascinating issue. UD surely has to prove itself on structures that are still not so common among its treebanks but are linguistically quite widespread.
From what I have understood, my preference would go either to 4 or 6. The latter might be a little too bold, though, but the former, as you say, introduces a new problematic form, so I am skewed towards 6.
I might suggest just a small twist to 4 to achieve this: instead of introducing a new kind of row, let's introduce a new binary feature which might find use in such cases: Incorporated=Yes|No, and let's just highlight the incorporated element, kind of extrapolating it, keep the verb component as it appears in the text and not use Incorporated on the verb:
# sent_id = Walk:6:d
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1 Ынӄо ынӄо ADV _ _ 5 advmod _ Gloss=потом
2 нэмэ нэмэ ADV _ _ 5 advmod _ Gloss=опять
3-4 мытӈэйыттэнмык _ _ _ _ _ _ _ Gloss=1PL.S/A-сопка-взбираться-1PL.S/O
3 ӈэйы нэг NOUN _ Incorporated=Yes 4 obj _ Gloss=сопка|SpaceAfter=No
4 мытӈэйыттэнмык ттэн VERB _ Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1 0 root _ Gloss=взбираться-1PL.S/O|SpaceAfter=No
5 . . PUNCT _ _ 5 punct _ _
So this structure would tell us what is incorporated to what and it would deliver its "internal structure" in a pyramidal way, with a redundancy of the incorporated element which can be however singled out by means of the Incorporated feature. This is very similar to introducing a new kind of row, but using something we already have.
From a practical point of view, this way we can extract:
obj
; please tell me if this consideration is wrong);
I agree with @Stormur's proposition. I have only one remark: we don't really need to indicate the forms of nodes 3 and 4. In such cases, we have two lexemes which are combined in one form (3-4). We already had this kind of discussion concerning amalgams: Eng. wanna is one form which is the combination of two lexemes (want and to), but we don't need to give forms to these two lexemes. They have a common form, which is the amalgam wanna. It is even clearer with Fr. au /o/ (one phoneme), which is the amalgam of a preposition and an article.
@ftyers You need to decide how many forms you have in your example (3), how many lexemes (4), and to what lexemes the morphosyntactic features must be associated (to the VERB in case of an incorporation). The lexemes are your syntactic nodes. After that, you need a way to encode this. The traditional way in UD is @Stormur's proposition, with a split node 3-4. Another way has been proposed in #683, where textform (surface form) and wordform (lexemes) are distinguished.
Thanks for this long and elaborate breakdown of the possibilities! I am strongly opposed to changing the CoNLL-U format (options 6 and 7), although option 7 might make for an interesting extension of the format outside the UD proper (something that CoNLL-U-Plus does not have yet). Even option 5 would in fact require modification of the current guidelines, although in the area of enhanced dependencies, where I think some future modification is anticipated by many of us.
Anyways, my favorite is option 2, which more or less corresponds to the "do-nothing" strategy of the German and Swedish compounds (only with an additional feature that highlights incorporation), and which maintains the intransitivity. My second pick would be option 4 but slightly modified: instead of guessing the correct surface form of the transitive verb, I would simply take the actual form and replace the incorporated object by a hyphen:
# sent_id = Walk:6:d
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1 Ынӄо ынӄо ADV _ _ 5 advmod _ Gloss=потом
2 нэмэ нэмэ ADV _ _ 5 advmod _ Gloss=опять
3-4 мытӈэйыттэнмык _ _ _ _ _ _ _ Gloss=1PL.S/A-сопка-взбираться-1PL.S/O|SpaceAfter=No
3 ӈэйы нэг NOUN _ Incorporated=Yes 5 obj _ Gloss=сопка|SpaceAfter=No
4 мыт-ттэнмык ттэн VERB _ Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1 0 root _ Gloss=взбираться-1PL.S/O
5 . . PUNCT _ _ 5 punct _ _
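The hyphen replacement is mechanical once the incorporated stem's surface form is known, which is part of its appeal. A minimal sketch follows, assuming the stem appears as a contiguous substring of the word form. The helper name `mask_incorporated` is hypothetical, and this is only one way the rule could be applied.

```python
def mask_incorporated(surface, incorporated, placeholder="-"):
    """Replace the first occurrence of the incorporated stem with a placeholder."""
    if incorporated not in surface:
        # The stem may be altered by phonological processes; fail loudly then.
        raise ValueError(f"{incorporated!r} not found in {surface!r}")
    return surface.replace(incorporated, placeholder, 1)


verb_form = mask_incorporated("мытӈэйыттэнмык", "ӈэйы")
print(verb_form)
```

For this sentence the result matches the verb form given in row 4 above. Where vowel harmony or stem reduction changes the incorporated element, the substring check will fail and the annotator has to supply the segmentation by hand.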
Thanks everyone!
@Stormur I liked your suggestion, but for the moment I'm wary about having an intransitive verb with an obj dependent. Having a feature marking incorporation on the noun as well as the verb is a good idea.
@sylvainkahane I think that under the current guidelines, surface form is a mandatory column, so it would not work to leave it out. I think this is good for the basic dependencies, but I might disagree for the enhanced dependencies.
@amir-zeldes thanks for your comments, yes, one of the reasons I decided to post the issue is to get an idea of what other people are doing or thinking... I didn't expect that Coptic would have this issue, but great that it has. In terms of reannotation, could it not be done almost automatically if you have the information in the MISC field and the relation and POS tag?
I think that given the feedback, my current thinking is to:
I'm also going to note that some of the discussion in #589 seems relevant, although these aren't strictly "null" as they appear in the utterance, just not as separate "words".
@ftyers - that's what makes these issue discussions so valuable for me: I often see that people are facing similar problems, and hearing what other people are doing gives me ideas and increases convergence between the treebanks across languages. I think "almost automatically" is about right for how we could get Coptic into your proposed representation, but not 100% since:
obj, we can't automatically guess it (but most are obj)
For now though, what's stopping me from doing anything that creates a separate token (i.e. a conllu row with an ID) for the incorporated element is the incompatibility with the much larger tagged corpora: if we do this, we won't be able to use POS taggers trained on large corpora to feed parsers trained on UD Coptic. We do have a morphological analyzer, so we could add the incorporation analysis in a second step, but the resulting treebank would be out of sync with the tagged corpora's token definition, so that would be a substantial problem. However I'm happy to implement solutions that just alter key-value annotations on the incorporated token (i.e. the verb). Maybe there should be a standard for people who want to do that, in addition to a recommendation on how to do it for resources that sub-tokenize incorporation?
@amir-zeldes yes, I think having standards for both scenarios would be good.
I found another nice example today:
There are (appear to be?) three coordinated objects: the first is incorporated, the other two are not.
# text = Иниквъи ӄычавполпэрэгэм ынкъам пъомпъомъёчгынъым ынкъам эчг иԓгытэвкинэт яаёԓӄыԓтэ.
# text[phon] = inikwʔi qəsawpolpereɣeʔm ənkʔam pʔompʔomjosɣənʔəm ənkʔam esɣ iɬɣətewkinet jaajoɬqəɬte
# text[rus] = Она сказала: «Возьми мыло и корзинку для грибов и купальные принадлежности».
# text[eng] = She said: «Take a bar of soap and a basket for mushrooms as well as bathing accessories».
1 Иниквъи икын VERB _ _ 0 root _ Gloss=2/3.S/A-INV-сказать-TH-2/3SG.S
2-3 ӄычавполпэрэгэм _ _ _ _ _ _ _ _
2 ӄычавполпэрэгэ пири VERB _ Incorporated[obj]=Yes 1 parataxis _ Gloss=2.S/A.SUBJ-мыло-брать-IRR-2/3SG.S-=EMPH
2.1 чавпол чоп NOUN _ Incorporated=Yes _ _ 2:obj Gloss=мыло
3 м ъм PART _ _ 2 discourse _ _
4 ынкъам ынкъам CCONJ _ _ 5 cc _ Gloss=и
5-6 пъомпъомъёчгынъым _ _ _ _ _ _ _ _
5 пъомпъомъёчгын пъомпъомъёчгынъым NOUN _ Case=Abs|Number=Sing 2 orphan 2.1:conj Gloss=гриб-CONT-NOM.SG-=EMPH
6 ъым ъм PART _ _ 5 discourse _ _
7 ынкъам ынкъам CCONJ _ _ 10 cc _ Gloss=и
8 эчг эчг X _ _ 9 reparandum _ Gloss=FST
9 иԓгытэвкинэт _ VERB _ Case=Abs|Number=Plur 10 acl _ Gloss=мыться-REL-NOM.PL
10 яаёԓӄыԓтэ _ NOUN _ Case=Abs|Number=Plur 2 orphan 2.1:conj Gloss=использовать-PTCP.PASS-DEB-NOM.PL
11 . . PUNCT _ _ 2 punct _ _
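To show what the empty node buys us, here is a minimal sketch that recovers all three coordinated objects from the enhanced graph alone. It assumes the DEPS column is filled in as in the example above, with the incorporated noun as empty node 2.1. The function name `enhanced_objects` is made up for illustration, and the rows are reduced to ID, FORM and DEPS.

```python
# (ID, FORM, DEPS) triples taken from the example above; other columns omitted.
ROWS = [
    ("2", "ӄычавполпэрэгэ", "_"),
    ("2.1", "чавпол", "2:obj"),
    ("5", "пъомпъомъёчгын", "2.1:conj"),
    ("10", "яаёԓӄыԓтэ", "2.1:conj"),
]


def enhanced_objects(rows, verb_id):
    """Return forms of the verb's enhanced obj dependents and their conjuncts."""
    edges = []  # (dependent id, head id, relation, dependent form)
    for node_id, form, deps in rows:
        if deps == "_":
            continue
        for dep in deps.split("|"):
            # Split at the first colon only, so subtyped relations stay intact
            head, _, rel = dep.partition(":")
            edges.append((node_id, head, rel, form))
    objs = [(n, f) for n, h, r, f in edges if h == verb_id and r == "obj"]
    obj_ids = {n for n, _ in objs}
    conjs = [(n, f) for n, h, r, f in edges if h in obj_ids and r == "conj"]
    return [form for _, form in objs + conjs]


objects = enhanced_objects(ROWS, "2")
print(objects)
```

With only the basic dependencies, the two unincorporated nouns are orphan dependents of the verb and the link to the incorporated object is not recoverable this way.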
Yes, we get these weird bracketing paradoxes sometimes as well, where you might have some modifier of the incorporated object left floating elsewhere in the sentence. You can get postponed possessives of incorporated objects like "glory-give ... of God" (give God's glory), or find phrasal verbs retaining their particle which is postponed until after the incorporated object:
nej-daimonion ebol "to demon-throw out" = "cast out demons" from nouje ebol "cast out"
I think it's actually not that strange from an English perspective: common phrasal verb objects that come right after the verb can easily get 'sucked into' the verb, but the surrounding syntax can change or stay the same. Phonologically Jen Hay has shown that phrases like "gimme a hand" have reduced articulation even on the 'hand' element.
It's just that some languages seem more comfortable about doing this, and streamline what happens more: predictable, fixed reduced forms, loss or avoidance of reference to such objects, and development of alternative constructions when you want to keep the object separately modifiable and referenceable (in Coptic you get postponed mediated objects, marked by a prefixal object marker). It's cool to see how it plays out in each language and where the construction can be pushed to the limit!
I'm currently working on annotating a corpus of Chukchi, a polysynthetic language of Siberia. The raw data is from the Amguema dialect corpus and consists of around 1000 utterances of spoken Chukchi transcribed and glossed.
One issue I have come across is the treatment of incorporation -- when two or more lexical stems can be combined to make up a single phonological word. In Chukchi this can take two forms: either nominal incorporation (similar to noun-noun or adjective-noun compounding) or verbal incorporation. In this issue I will discuss the latter. This process is not infrequent, appearing in 5-10% of verb tokens I have annotated so far.
Verbs can incorporate a range of nominals and other verb stems, including:
In the case of incorporation of the direct object, the valency of the verb decreases. That is, a transitive verb becomes intransitive. Morphologically this entails a change of inflectional agreement paradigm. Note that in Chukchi, incorporated nouns can maintain reference, that is they refer to something specific in the discourse, and incorporation is used as a foregrounding/backgrounding strategy (Dunn, 1999: ch.12).
Here is one example from the corpus containing several different kinds of incorporation:
In имԓетгынтэвъыма [imɬ-et-ɣəntew-ʔə-ma], a noun имԓ [imɬ] "water" is verbalised and then incorporated with another verb гынтэв [ɣəntew] "flee" which is in a converb form with a clitic emphatic particle.
In the verb тэӈыръиԓеԓьэтӄинэ [teŋə-rʔiɬe-ɬʔet-qine] "quickly raced", the adjective тэӈ "good" is incorporated with the verb base ръиԓе [rʔiɬe] "to take part in a race" as an adverbial modifier.
Finally, гынтэвтыԓеӄинэт [ɣəntewtəɬeqinet] consists of two verb stems, the first one гынтэв [ɣəntew] "flee" acts as an adverbial clause modifier of the second тыԓе [təɬe] "go".
While I have been annotating the texts I have been marking verbs with incorporation and been considering how best to annotate them according to UD guidelines. There are no current guidelines for this, as opposed to for nominal compounding or for clitics.
There are two principal approaches to incorporation in the linguistic literature, the syntactic [generativist] approach (Baker, 1988) and the lexicalist approach (Anderson, 2000) i.a. In the syntactic approach, incorporation is seen as head movement, while in the lexicalist approach it is seen as a lexical operation combining two feature structures, with the incorporated noun saturating some part of the verb's argument structure. There are formal (dependency, HPSG) approaches to a similar phenomenon in Greenlandic (Bick, 2019; Malouf, 1999).
Dunn (1999) in his Grammar of Chukchi splits incorporation into two processes, syntactic incorporation and lexical incorporation (or compounding) and states,
In Chukchi, verbs with incorporated elements are single phonological words, and inflectional morphology goes outside the incorporated element; however, as Dunn points out, incorporation certainly plays a role in the syntax.
It is worth noting that if we consider these to be single syntactic words, this could cause issues for additional annotation:
And downstream tasks:
Essentially, the lexical/morphological component of UD does not appear well adapted to deal with this kind of language typology.
Manning's Law gives a number of criteria for what makes good annotation,
And additionally (7) Ginter's razor: "Complex changes should only be made when they substantially improve things".
Bearing these criteria in mind, I have considered a number of different options for annotating incorporation and come up with some advantages and disadvantages for each one. While it might be desirable and expedient to come up with an ad hoc solution for Chukchi, there are many other languages (most of them indigenous and low resource) that have this typological feature, and I would like to come up with a good general solution to avoid having to reinvent the wheel (or reannotate!) down the line.
The examples will be based on the following sentence, showing incorporation with a definite/referential noun, ӈэг "hill" from a short text (partial annotation). Note that in 1.5 the specific hill is introduced by incorporation, it is used incorporated again in 1.6, in 1.7 it is used unincorporated with oblique suffixes (Case=Abl, Case=Prol) and again in 1.11.
Options
Option 1:
Come up with an ad hoc representation where everything goes in the MISC column, do not do any specific annotation for incorporation.
Example:
Advantages:
Disadvantages:
MISC column would drastically complicate downstream tasks (6)
MISC column is not widely used by off-the-shelf parsers (4)
Option 2:
Do not do any further annotation other than annotate in the morphological features that the verb has incorporated elements. This could be done with a single feature, or with layered features. Any further processing of incorporation, e.g. for use in applications would need to be done separately, with separate data.
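Under this option, any consumer has to find incorporating verbs through the FEATS column alone. The following is a minimal sketch of that lookup. It assumes the proposed (not officially registered) layered feature `Incorporated[obj]=Yes` and takes rows as already split into the ten CoNLL-U columns. The function names are made up for illustration.

```python
# Rows from sentence Walk:6 with the proposed feature on the verb,
# already split into the ten CoNLL-U columns.
ROWS = [
    ["1", "Ынӄо", "ынӄо", "ADV", "_", "_", "3", "advmod", "_", "_"],
    ["2", "нэмэ", "нэмэ", "ADV", "_", "_", "3", "advmod", "_", "_"],
    ["3", "мытӈэйыттэнмык", "ттэн", "VERB", "_",
     "Incorporated[obj]=Yes|Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1",
     "0", "root", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]


def parse_feats(feats):
    """Turn a FEATS string into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def incorporating_verbs(rows):
    """Return forms of VERB rows carrying any Incorporated...=Yes feature."""
    hits = []
    for cols in rows:
        feats = parse_feats(cols[5])
        has_incorporation = any(
            key.startswith("Incorporated") and value == "Yes"
            for key, value in feats.items()
        )
        if cols[3] == "VERB" and has_incorporation:
            hits.append(cols[1])
    return hits


verbs = incorporating_verbs(ROWS)
print(verbs)
```

The feature tells us that something is incorporated and in which role, but not which stem it is; that would still have to come from separate data.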
Example:
Advantages:
Disadvantages:
Option 3:
Annotate as if annotating multiword tokens, maintaining the surface form as a concatenation of the tokens.
Example:
Advantages:
Disadvantages:
Option 4:
Annotate as if annotating multiword tokens, with the underlying tokens as the "analytic" equivalent (thanks to @dan-zeman for this one)
Example:
^ I have made this form up based on the grammar, it is likely to be incorrect
Advantages:
Disadvantages:
Valency would be 2; that is, without incorporation the verb should be transitive. (4)
Option 5:
Annotate the basic dependencies as they are, and introduce the "incorporated" nodes in the enhanced dependencies like we deal with ellipsis. The verb with incorporated noun gets the verb root as the lemma and features according to its surface form (e.g. Valency=1 for intransitive).
Example:
Advantages:
Disadvantages:
Option 6:
Introduce a new row type in the CoNLL-U format to allow for certain kinds of syntax-like word formation, with restrictions on how it can be applied. For example it could only apply to incorporation of non-bound morphemes and parts of the word that could enter into dependency relations with other parts. The delimiter could be : or + or some other symbol not used elsewhere.
Example:
Advantages:
Disadvantages:
Final thoughts
My preference is for option 5 or 6, although I would welcome comments and thoughts on each of the options I have outlined, and also ideas for options that I might have missed.