UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Treatment of syntactic incorporation #701

Closed ftyers closed 4 years ago

ftyers commented 4 years ago

I'm currently working on annotating a corpus of Chukchi, a polysynthetic language of Siberia. The raw data is from the Amguema dialect corpus and consists of around 1000 utterances of spoken Chukchi transcribed and glossed.

One issue I have come across is the treatment of incorporation -- when two or more lexical stems can be combined to make up a single phonological word. In Chukchi this can take two forms: either nominal incorporation (similar to noun-noun or adjective-noun compounding) or verbal incorporation. In this issue I will discuss the latter. This process is not infrequent, appearing in 5-10% of verb tokens I have annotated so far.

Verbs can incorporate a range of nominals and other verb stems, including:

In the case of incorporation of the direct object, the valency of the verb decreases; that is, a transitive verb becomes intransitive. Morphologically this entails a change of inflectional agreement paradigm. Note that in Chukchi, incorporated nouns can maintain reference, that is, they refer to something specific in the discourse, and incorporation is used as a foregrounding/backgrounding strategy (Dunn, 1999: ch. 12).

Here is one example from the corpus containing several different kinds of incorporation:

Имԓетгынтэвъыма         амноӈэты    рэмкына             тэӈыръиԓеԓьэтӄинэ         гынтэвтыԓеӄинэт.
imɬetɣəntewʔəma         amnoŋetə    remkəna             teŋərʔiɬeɬʔetqine         ɣəntewtəɬeqinet.
Имԓ-ет-гынтэв-ма=ъым    амноӈ-эты   рэмкы-н=а           тэӈы-ръиԓе-ԓьэт-ӄинэ       гынтэв-тыԓе-ӄинэ-т
water-VB-flee-SIM=EMPH  tundra-DAT  people-ABS.SG=PART  GOOD-race-PLAC-ST3SG=PART flee-go-ST3SG-PL
"Running away from the flood, people very quickly went to the tundra."

In имԓетгынтэвъыма [imɬ-et-ɣəntew-ʔə-ma], a noun имԓ [imɬ] "water" is verbalised and then incorporated with another verb гынтэв [ɣəntew] "flee" which is in a converb form with a clitic emphatic particle.

In the verb тэӈыръиԓеԓьэтӄинэ [teŋə-rʔiɬe-ɬʔet-qine] "quickly raced", the adjective тэӈ "good" is incorporated into the verb base ръиԓе [rʔiɬe] "to take part in a race" as an adverbial modifier.

Finally, гынтэвтыԓеӄинэт [ɣəntewtəɬeqinet] consists of two verb stems: the first, гынтэв [ɣəntew] "flee", acts as an adverbial clause modifier of the second, тыԓе [təɬe] "go".

While annotating the texts I have been marking verbs with incorporation and considering how best to annotate them according to UD guidelines. There are currently no guidelines for this, unlike for nominal compounding or clitics.

The universal dependency annotation is based on a lexicalist view of syntax, which means that dependency relations hold between words. Hence, morphological features are encoded as properties of words and there is no attempt at segmenting words into morphemes.

There are two principal approaches to incorporation in the linguistic literature: the syntactic [generativist] approach (Baker, 1988) and the lexicalist approach (Anderson, 2000, i.a.). In the syntactic approach, incorporation is seen as head movement, while in the lexicalist approach it is seen as a lexical operation combining two feature structures, with the incorporated noun saturating some part of the verb's argument structure. There are formal (dependency, HPSG) approaches to a similar phenomenon in Greenlandic (Bick, 2019; Malouf, 1999).

Dunn (1999) in his Grammar of Chukchi splits incorporation into two processes, syntactic incorporation and lexical incorporation (or compounding) and states,

From a syntactic point of view, incorporation occurs in Chukchi as a way of resolving tensions between the syntactic functions of discourse elements and their pragmatic statuses. The absolutive case role has a privileged position in the language as the way of presenting salient/topical information. Only in the absolutive can nominal constituents be represented by syntactic phrases (and thus have the greatest grammatical possibilities for combining with modifiers; §9), and absolutive case nominals have greater grammatical specification, marking more grammatical categories than other nominals. However, the underlying undergoer nominal (O) of a transitive verb stem often has low discourse salience; there is an anthropocentric bias towards human actors (syntactic A) as protagonists in narratives. This conflicts with the pragmatic function of the absolutive case (the case for O/S), which is to refer to arguments of high discourse salience, high animacy, specificity, etc. This tension can be resolved by incorporation of the O into the verb, thus changing the syntactic role of the A nominal to S (Dunn, 1999: §12.1.1)

In Chukchi, verbs with incorporated elements are single phonological words, and inflectional morphology goes outside the incorporated element. However, as Dunn points out, incorporation certainly plays a role in the syntax.

It is worth noting that if we consider these to be single syntactic words, then this could cause issues for additional annotation:

And downstream tasks:

Essentially, the lexical/morphological component of UD does not appear well adapted to deal with this kind of language typology.

Manning's Law gives a number of criteria for what makes good annotation,

  1. UD needs to be satisfactory for analysis of individual languages.
  2. UD needs to be good for linguistic typology.
  3. UD must be suitable for rapid, consistent annotation.
  4. UD must be suitable for computer parsing with high accuracy.
  5. UD must be easily comprehended and used by a non-linguist.
  6. UD must provide good support for downstream NLP tasks.

And additionally (7) Ginter's razor: "Complex changes should only be made when they substantially improve things".

Bearing these criteria in mind, I have considered a number of different options for annotating incorporation and come up with some advantages and disadvantages for each one. While it might be desirable and expedient to come up with an ad hoc solution for Chukchi, there are many other languages (most of them indigenous and low resource) that have this typological feature, and I would like to come up with a good general solution to avoid having to reinvent the wheel (or reannotate!) down the line.

The examples will be based on the following sentence, showing incorporation with a definite/referential noun, ӈэг "hill", from a short text (partial annotation). Note that in 1.5 the specific hill is introduced by incorporation and is used incorporated again in 1.6; in 1.7 it is used unincorporated with oblique suffixes (Case=Abl, Case=Prol), and again in 1.11.

Ынӄо   нэмэ      мытӈэйыттэнмык.
Ынӄо   нэмэ      мыт-ӈэйы-ттэн-мык.
Then   again     1PL.S/A-hill-climb.up-1PL.S/O
"Then we climbed the hill again."

Options

Option 1:

Come up with an ad hoc representation where everything goes in the MISC column, and do not do any specific annotation for incorporation.

Example: walk_a

# sent_id = Walk:6:a
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1       Ынӄо    ынӄо    ADV     _       _       3       advmod  _       Gloss=потом
2       нэмэ    нэмэ    ADV     _       _       3       advmod  _       Gloss=опять
3       мытӈэйыттэнмык  ӈэйыттэн        VERB    _       Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1      0       root    _       Gloss=1PL.S/A-сопка-взбираться-1PL.S/O|Incorporated[obj]=ӈэйы|VerbRoot=ттэн
4       .       .       PUNCT   _       _       3       punct   _       _
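
As a rough illustration of what a consumer of Option 1 would have to do, here is a minimal Python sketch that pulls the ad hoc keys back out of MISC. The helper name `parse_kv` is made up for this example and is not part of any UD tool; the key names are the ones used in the row above.

```python
def parse_kv(column):
    """Turn a FEATS or MISC column such as 'A=1|B=2' into a dict ('_' means empty)."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|"))


# Row 3 of the Option 1 example. Fields are split on whitespace because the
# example is pasted with spaces; real CoNLL-U files separate columns with tabs.
row = (
    "3 мытӈэйыттэнмык ӈэйыттэн VERB _ "
    "Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1 0 root _ "
    "Gloss=1PL.S/A-сопка-взбираться-1PL.S/O|Incorporated[obj]=ӈэйы|VerbRoot=ттэн"
)
fields = row.split()
misc = parse_kv(fields[9])  # MISC is the tenth column

# Keep only the ad hoc incorporation keys, e.g. Incorporated[obj].
incorporated = {k: v for k, v in misc.items() if k.startswith("Incorporated[")}
print(incorporated, misc["VerbRoot"])
```

Everything about the incorporated noun lives in free-form key-value pairs here, so any tool that wants it needs to know these particular key names in advance.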

Advantages:

Disadvantages:

Option 2:

Do not do any further annotation other than annotate in the morphological features that the verb has incorporated elements. This could be done with a single feature, or with layered features. Any further processing of incorporation, e.g. for use in applications would need to be done separately, with separate data.

Example: walk_b

# sent_id = Walk:6:b
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1       Ынӄо    ынӄо    ADV     _       _       3       advmod  _       Gloss=потом
2       нэмэ    нэмэ    ADV     _       _       3       advmod  _       Gloss=опять
3       мытӈэйыттэнмык  ттэн        VERB    _       Incorporated[obj]=Yes|Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1      0       root    _       Gloss=1PL.S/A-сопка-взбираться-1PL.S/O
4       .       .       PUNCT   _       _       3       punct   _       _
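
With Option 2, finding incorporating verbs only needs the FEATS column. The following is a small sketch, assuming the layered feature name `Incorporated[obj]` shown above (any key starting with `Incorporated` is accepted); the function name `incorporating_verbs` is hypothetical.

```python
def parse_feats(column):
    """Turn a FEATS column into a dict ('_' means no features)."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|"))


def incorporating_verbs(lines):
    """Return (verbs marked as incorporating, all verbs) for CoNLL-U token lines."""
    hits, verbs = 0, 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        fields = line.split()  # whitespace split; real files use tabs
        if len(fields) != 10 or fields[3] != "VERB":
            continue
        verbs += 1
        feats = parse_feats(fields[5])
        if any(k.startswith("Incorporated") and v == "Yes" for k, v in feats.items()):
            hits += 1
    return hits, verbs


sentence = [
    "# sent_id = Walk:6:b",
    "1 Ынӄо ынӄо ADV _ _ 3 advmod _ Gloss=потом",
    "2 нэмэ нэмэ ADV _ _ 3 advmod _ Gloss=опять",
    "3 мытӈэйыттэнмык ттэн VERB _ "
    "Incorporated[obj]=Yes|Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1 "
    "0 root _ Gloss=1PL.S/A-сопка-взбираться-1PL.S/O",
    "4 . . PUNCT _ _ 3 punct _ _",
]
print(incorporating_verbs(sentence))
```

Run over a whole treebank, the two counts would give the share of verb tokens with incorporation, but not which noun was incorporated.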

Advantages:

Disadvantages:

Option 3:

Annotate as if annotating multiword tokens, maintaining the surface form as a concatenation of the tokens.

Example: walk_c

# sent_id = Walk:6:c
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1       Ынӄо    ынӄо    ADV     _       _       5       advmod  _       Gloss=потом
2       нэмэ    нэмэ    ADV     _       _       5       advmod  _       Gloss=опять
3-5    мытӈэйыттэнмык    _    _    _    _    _    _    _    Gloss=1PL.S/A-сопка-взбираться-1PL.S/O
3       мыт    _    X    _    _    5    dep    _    Gloss=1PL.S/A|SpaceAfter=No
4       ӈэйы    ӈэг    NOUN    _    _    5    obj    _    Gloss=сопка|SpaceAfter=No
5       ттэнмык  ттэн        VERB    _      Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1      0       root    _       Gloss=взбираться-1PL.S/O|SpaceAfter=No
6       .       .       PUNCT   _       _       5       punct   _       _
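
Splitting a word into a multiword token range shifts every following ID and HEAD, which is easy to get wrong by hand. Below is a minimal consistency check, sketched under the assumption that columns can be split on whitespace; `check_ids_and_heads` is a hypothetical helper and not a replacement for the official UD validator.

```python
def check_ids_and_heads(lines):
    """Return a list of problems with word IDs, multiword ranges and HEAD values."""
    words, ranges, problems = [], [], []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        tok_id = fields[0]
        if "-" in tok_id:                       # multiword token range such as 3-5
            start, end = (int(part) for part in tok_id.split("-"))
            ranges.append((start, end))
        elif tok_id.isdigit():                  # ordinary syntactic word
            words.append((int(tok_id), fields[6]))
    ids = [tok_id for tok_id, _ in words]
    if ids != list(range(1, len(ids) + 1)):
        problems.append(f"word IDs are not 1..n in order: {ids}")
    for start, end in ranges:
        if start >= end or not set(range(start, end + 1)) <= set(ids):
            problems.append(f"range {start}-{end} does not cover existing words")
    for tok_id, head in words:
        if not head.isdigit() or int(head) > len(ids):
            problems.append(f"word {tok_id}: HEAD {head} is out of range")
        elif int(head) == tok_id:
            problems.append(f"word {tok_id}: HEAD points to itself")
    return problems


good = [
    "1 Ынӄо ынӄо ADV _ _ 5 advmod _ _",
    "2 нэмэ нэмэ ADV _ _ 5 advmod _ _",
    "3-5 мытӈэйыттэнмык _ _ _ _ _ _ _ _",
    "3 мыт _ X _ _ 5 dep _ _",
    "4 ӈэйы ӈэг NOUN _ _ 5 obj _ _",
    "5 ттэнмык ттэн VERB _ _ 0 root _ _",
    "6 . . PUNCT _ _ 5 punct _ _",
]
bad = good[:-1] + ["4 . . PUNCT _ _ 5 punct _ _"]  # punctuation ID not renumbered
print(check_ids_and_heads(good))
print(check_ids_and_heads(bad))
```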

Advantages:

Disadvantages:

Option 4:

Annotate as if annotating multiword tokens, with the underlying tokens as the "analytic" equivalent (thanks to @dan-zeman for this one)

Example: walk_d

# sent_id = Walk:6:d
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1       Ынӄо    ынӄо    ADV     _       _       4       advmod  _       Gloss=потом
2       нэмэ    нэмэ    ADV     _       _       4       advmod  _       Gloss=опять
3-4    мытӈэйыттэнмык    _    _    _    _    _    _    _    Gloss=1PL.S/A-сопка-взбираться-1PL.S/O
3       ӈэйы    ӈэг    NOUN    _    _    4    obj    _    Gloss=сопка|SpaceAfter=No
4       мытыттэнгъын^  ттэн        VERB    _      Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1      0       root    _       Gloss=взбираться-1PL.S/O|SpaceAfter=No
5       .       .       PUNCT   _       _       4       punct   _       _

^ I have made this form up based on the grammar, it is likely to be incorrect

Advantages:

Disadvantages:

Option 5:

Annotate the basic dependencies as they are, and introduce the "incorporated" nodes in the enhanced dependencies like we deal with ellipsis. The verb with incorporated noun gets the verb root as the lemma and features according to its surface form (e.g. Valency=1 for intransitive).

Example: walk_e

# sent_id = Walk:6
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1   Ынӄо    ынӄо    ADV _   _   3   advmod  _   Gloss=потом
2   нэмэ    нэмэ    ADV _   _   3   advmod  _   Gloss=опять
3   мытӈэйыттэнмык  ттэн    VERB    _   Incorporated[obj]=Yes|Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1  0   root    _   Gloss=1PL.S/A-сопка-взбираться-1PL.S/O
3.1 ӈэйы    ӈэг NOUN    _   _   _   _   3:obj   Gloss=сопка
4   .   .   PUNCT   _   _   3   punct   _   _
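
To show how the incorporated noun could be recovered under Option 5, here is a short sketch that reads the enhanced graph from the DEPS column (column 9). The function name `enhanced_edges` is an assumption of this example; the rows are the ones above, split on whitespace rather than tabs.

```python
def enhanced_edges(lines):
    """Collect (head_id, relation, dependent_id) triples from the DEPS column."""
    edges = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if fields[8] == "_":
            continue  # no enhanced dependencies on this row
        for item in fields[8].split("|"):
            head, relation = item.split(":", 1)  # e.g. "3:obj" -> ("3", "obj")
            edges.append((head, relation, fields[0]))
    return edges


rows = [
    "1 Ынӄо ынӄо ADV _ _ 3 advmod _ Gloss=потом",
    "2 нэмэ нэмэ ADV _ _ 3 advmod _ Gloss=опять",
    "3 мытӈэйыттэнмык ттэн VERB _ "
    "Incorporated[obj]=Yes|Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1 "
    "0 root _ Gloss=1PL.S/A-сопка-взбираться-1PL.S/O",
    "3.1 ӈэйы ӈэг NOUN _ _ _ _ 3:obj Gloss=сопка",
    "4 . . PUNCT _ _ 3 punct _ _",
]
edges = enhanced_edges(rows)
# Empty nodes are the rows whose ID contains a decimal point.
empty_nodes = [row.split()[0] for row in rows if "." in row.split()[0]]
print(edges, empty_nodes)
```

The basic tree stays intransitive, and only tools that read the enhanced layer see the `obj` edge to the empty node.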

Advantages:

Disadvantages:

Option 6:

Introduce a new row type in the CoNLL-U format to allow for certain kinds of syntax-like word formation, with restrictions on how it can be applied. For example it could only apply to incorporation of non-bound morphemes and parts of the word that could enter into dependency relations with other parts. The delimiter could be : or + or some other symbol not used elsewhere.

Example:

# sent_id = Walk:6
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1   Ынӄо    ынӄо    ADV _   _   3   advmod  _   Gloss=потом
2   нэмэ    нэмэ    ADV _   _   3   advmod  _   Gloss=опять
3   мытӈэйыттэнмык  ттэн    VERB    _   Incorporated[obj]=Yes|Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1  0   root    _   Gloss=1PL.S/A-сопка-взбираться-1PL.S/O
3:1 ӈэйы    ӈэг NOUN    _   _   3:2 obj _   Gloss=сопка
3:2 _   ттэн    VERB    _   _   0   root    _   Gloss=взбираться
4   .   .   PUNCT   _   _   3   punct   _   _
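
To make the format change concrete: current CoNLL-U has three kinds of row ID (word, multiword range, empty node), and Option 6 would add a fourth. The sketch below classifies IDs with regular expressions. The `subword_proposed` pattern uses the `:` delimiter from the example above and is purely hypothetical; existing readers would not recognise such rows.

```python
import re

# The first three patterns follow the ID shapes already used in CoNLL-U;
# the last one is the proposed, non-standard sub-word row from Option 6.
ROW_TYPES = [
    ("word", re.compile(r"^[1-9][0-9]*$")),
    ("multiword_range", re.compile(r"^[1-9][0-9]*-[1-9][0-9]*$")),
    ("empty_node", re.compile(r"^[0-9]+\.[1-9][0-9]*$")),
    ("subword_proposed", re.compile(r"^[1-9][0-9]*:[1-9][0-9]*$")),
]


def row_type(tok_id):
    """Return the name of the first pattern that matches the ID, else 'unknown'."""
    for name, pattern in ROW_TYPES:
        if pattern.match(tok_id):
            return name
    return "unknown"


for tok_id in ["3", "3-5", "3.1", "3:1"]:
    print(tok_id, row_type(tok_id))
```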

Advantages:

Disadvantages:

Final thoughts

My preference is for option 5 or 6, although I would welcome comments and thoughts on each of the options I have outlined, and also ideas for options that I might have missed.


  1. Michael Dunn (1999) Grammar of Chukchi. PhD Thesis
  2. Sara Rosen (1989) "Two Types of Noun Incorporation: A Lexical Analysis". Language
  3. Stephen Anderson (2000) "Some Lexicalist Remarks on Incorporation Phenomena." Studia Grammatica 45:123-142.
  4. Bick, E. (2019). "Dependency Trees for Greenlandic". Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019). 140-148
  5. Mark Baker (1988) The Polysynthesis Parameter.
  6. Robert Malouf (1999) "West Greenlandic noun incorporation in a monohierarchical theory of grammar." In Gert Webelhuth, Andreas Kathol, and Jean-Pierre Koenig (eds.), Lexical and Constructional Aspects of Linguistic Explanation. 47-62.
jnivre commented 4 years ago

Thanks for bringing this interesting issue up for discussion. I think it may be instructive to think about what we currently do for (a) compounds written as single words (without space), and (b) pro-drop.

For (a), which is found in Swedish and German, for example, we currently do nothing. Maybe this is not the ideal solution, but it suggests that, as long as a lexical analysis is reasonable, maybe it is okay to simply let the two words become one. I understand that this may be applicable to nominal incorporation but not verbal incorporation.

For (b), which is found in many languages (including Chukchi, apparently), we also currently do nothing. As a result, the representation of the argument structure is in some sense incomplete for pro-drop sentences and therefore problematic for (some) downstream applications. There have been proposals that this is something that can be fixed in enhanced dependencies, but currently this is not part of the guidelines. Therefore, I think option 2 is worth considering. The syntactic structure is then annotated as an intransitive clause, which apparently it is, but the features preserve the information that there is an incorporated object. If instead we want to represent the transitive clause structure, then option 4 is my favorite, because I think it fits best into the current guidelines.

Just my two cents ...

ftyers commented 4 years ago

Thanks @jnivre for the comment. Regarding the issue of compounding in Swedish and German (or Finnish too for that matter) I agree that the current situation works quite well. In Chukchi, lexical incorporation/compounding works a bit differently (essentially Chukchi incorporates/compounds attributive modifiers of non-absolutive case nouns, so [broadly] "my big stone house is here" but "I live inmybigstonehouse."), but I'll make that into a separate issue.

I definitely think we don't want to annotate the transitive clause structure and would prefer to keep the basic dependencies reflecting the surface representation. So then the current solution would be to go with 2, and then wait to see what happens with the discussion of argument structure in enhanced dependencies? (are there any links to the proposals?) This issue could potentially be folded in with that and we could end up with something like 5 if the guidelines are changed.

By the way, for completeness I will add another thought I had last night:

Option 7:

Introduce a new row type in the CoNLL-U format to allow for annotating morpheme structure. This would be controversial given what we say about tokenisation (in that we say explicitly "Hence, morphological features are encoded as properties of words and there is no attempt at segmenting words into morphemes").

Example:

# sent_id = Walk:6
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1   Ынӄо    ынӄо    ADV _   _   3   advmod  _   Gloss=потом
2   нэмэ    нэмэ    ADV _   _   3   advmod  _   Gloss=опять
3   мытӈэйыттэнмык  ттэн    VERB    _   Incorporated[obj]=Yes|Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1  0   root    _   Gloss=1PL.S/A-сопка-взбираться-1PL.S/O
3:1 мыт мыт _   _   Number[subj]=Plur|Person[subj]=1|Tense=Aor  3:3 infl    _   Gloss=1PL.S/A
3:2 ӈэйы    ӈэгны   NOUN    _   _   3:3 obj _   Gloss=сопка
3:3 ттэн    ттэн    VERB    _   Incorporated[obj]=Yes|Valency=1 0   root    _   Gloss=взбираться
3:4 мык мык _   _   Number[subj]=Plur|Person[subj]=1|Tense=Aor  3:3 infl    _   Gloss=1PL.S/O
4   .   .   PUNCT   _   _   3   punct   _   _
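
One property a morpheme-level row type would make checkable is that the morpheme forms add up to the surface word. Here is a minimal sketch, assuming the hypothetical `N:M` row IDs from the example above and purely concatenative segmentation (which will not hold wherever there is allomorphy or epenthesis); `morphemes_match_surface` is an invented helper name.

```python
def morphemes_match_surface(lines, word_id):
    """Check that the FORMs of rows word_id:1, word_id:2, ... concatenate to the word's FORM."""
    forms = {}
    for line in lines:
        fields = line.split()
        forms[fields[0]] = fields[1]  # only ID and FORM are needed here
    prefix = word_id + ":"
    parts = sorted(
        (tok_id for tok_id in forms if tok_id.startswith(prefix)),
        key=lambda tok_id: int(tok_id.split(":")[1]),
    )
    return "".join(forms[tok_id] for tok_id in parts) == forms[word_id]


# Only the leading columns of each row are shown; the rest are not used.
rows = [
    "3 мытӈэйыттэнмык ттэн VERB",
    "3:1 мыт мыт _",
    "3:2 ӈэйы ӈэгны NOUN",
    "3:3 ттэн ттэн VERB",
    "3:4 мык мык _",
]
print(morphemes_match_surface(rows, "3"))
```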

I'm not sure I'd recommend it, although it has some interesting possibilities. It could be used to optionally annotate compounds that may be ambiguous (we do this for English, but not for the other Germanic languages for example, as a result of the orthography, e.g. computer disk drive enclosure), or multiple verbal derivations (e.g. in the case of Turkish multiple causatives, see #197), incorporation of postpositions but not their complements (as in Crow), or dealing with morpheme scope hierarchies (as mentioned by Arkhangelsky and Lander, 2019).

amir-zeldes commented 4 years ago

Hi and thanks for raising this issue! I don't have a strong intuition which option is optimal, except maybe to say "don't break working stuff and be careful of modifying the format", since some tools would stop working with some of the more innovative suggestions.

I did want to point out that we have a similar situation in Coptic, which also has frequent incorporation with reduced forms for the constituent morphemes, though there the corresponding 'nouns' are usually not referenceable (so similar to "breastfeed" or "force feed", where you can no longer refer to the "force" or "breast" as "it" later on). For the Coptic Treebank we went with a version of Option 1, where we:

  1. Leave the incorporated verb as a single token, with a single part of speech (VERB)
  2. Put a fully segmented analysis in MISC, using a single key-value containing separators ('-')
  3. Do not 'sub-lemmatize' (so we retain the reduced forms in the lemma field)

You can see examples here:

The latter example is what happens when you nominalize such a verb. For highly lexicalized cases you can actually get an additional non-incorporated object, so valency is not 100% reduced by the presence of incorporated objects. I should point out some considerations and consequences that this involved:

  1. We had and have existing tagged material (much bigger than the TB) where these things get holistic POS, so existing taggers/corpora all treat these as one VERB
  2. We are already using the mechanism of multi-word tokens to deal with agglutination in the language, which comes on top of incorporated lexical items
  3. Our representation is not perfect since the lemmas of the reduced forms are missing (maybe a 'sublemmas' annotation could fix that) and the grammatical relation (most often incorporated 'obj') is not explicitly represented

Due to the last point, I like the idea of Incorporated[obj]=Yes, but I realize this would mean a substantial re-annotation effort for us. I think it's important to consider which standard is likely to be adopted by sufficiently many contributors. I would personally be willing to consider re-doing this for Coptic, but our TB is just 42K tokens. For others, this might be prohibitive.

So I guess for now my vote is for Option 1. I find all of the options interesting though, so thanks again for laying this out so clearly!

Stormur commented 4 years ago

Hi! Thank you for bringing forth this fascinating issue. UD surely has to prove itself on structures that are still not so common among its treebanks but are linguistically quite widespread.

From what I have understood, my preference would go either to 4 or 6. The latter might be a little too bold, though, but the former, as you say, introduces a new problematic form, so I am skewed towards 6.

I might suggest just a small twist to 4 to achieve this: instead of introducing a new kind of row, let's introduce a new binary feature which might find use in such cases: Incorporated=Yes|No, and let's just highlight the incorporated element, kind of extrapolating it, keep the verb component as it appears in the text and not use Incorporated on the verb:

# sent_id = Walk:6:d
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1       Ынӄо    ынӄо    ADV     _       _       4       advmod  _       Gloss=потом
2       нэмэ    нэмэ    ADV     _       _       4       advmod  _       Gloss=опять
3-4    мытӈэйыттэнмык    _    _    _    _    _    _    _    Gloss=1PL.S/A-сопка-взбираться-1PL.S/O
3       ӈэйы    ӈэг    NOUN    _    Incorporated=Yes    4    obj    _    Gloss=сопка|SpaceAfter=No
4       мытӈэйыттэнмык  ттэн        VERB    _      Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1      0       root    _       Gloss=взбираться-1PL.S/O|SpaceAfter=No
5       .       .       PUNCT   _       _       4       punct   _       _

So this structure would tell us what is incorporated to what and it would deliver its "internal structure" in a pyramidal way, with a redundancy of the incorporated element which can be however singled out by means of the Incorporated feature. This is very similar to introducing a new kind of row, but using something we already have.

From a practical point of view, this way we can extract:

sylvainkahane commented 4 years ago

I agree with @Stormur's proposition. I have only one remark: we don't really need to indicate the forms of nodes 3 and 4. In such cases, we have two lexemes which are combined in one form (3-4). We already had this kind of discussion concerning amalgams: Eng. wanna is one form which is the combination of two lexemes (want and to), but we don't need to give forms to these two lexemes. They have a common form, which is the amalgam wanna. It is even clearer with Fr. au /o/ (one phoneme), which is the amalgam of a preposition and an article.

@ftyers You need to decide how many forms you have in your example (3), how many lexemes (4), and to what lexemes the morphosyntactic features must be associated (to the VERB in case of an incorporation). The lexemes are your syntactic nodes. After that, you need a way to encode this. The traditional way in UD is @Stormur's proposition, with a split node 3-4. Another way has been proposed in #683, where textform (surface form) and wordform (lexemes) are distinguished.

dan-zeman commented 4 years ago

Thanks for this long and elaborate breakdown of the possibilities! I am strongly opposed to changing the CoNLL-U format (options 6 and 7), although option 7 might make for an interesting extension of the format outside the UD proper (something that CoNLL-U-Plus does not have yet). Even option 5 would in fact require modification of the current guidelines, although in the area of enhanced dependencies, where I think some future modification is anticipated by many of us.

Anyways, my favorite is option 2, which more or less corresponds to the "do-nothing" strategy of the German and Swedish compounds (only with an additional feature that highlights incorporation), and which maintains the intransitivity. My second pick would be option 4 but slightly modified: instead of guessing the correct surface form of the transitive verb, I would simply take the actual form and replace the incorporated object by a hyphen:

# sent_id = Walk:6:d
# text = Ынӄо нэмэ мытӈэйыттэнмык.
1       Ынӄо    ынӄо    ADV     _       _       4       advmod  _       Gloss=потом
2       нэмэ    нэмэ    ADV     _       _       4       advmod  _       Gloss=опять
3-4     мытӈэйыттэнмык    _    _    _    _    _    _    _    Gloss=1PL.S/A-сопка-взбираться-1PL.S/O|SpaceAfter=No
3       ӈэйы    ӈэг    NOUN    _    Incorporated=Yes    4    obj    _    Gloss=сопка|SpaceAfter=No
4       мыт-ттэнмык  ттэн        VERB    _      Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1      0       root    _       Gloss=взбираться-1PL.S/O
5       .       .       PUNCT   _       _       4       punct   _       _
ftyers commented 4 years ago

Thanks everyone!

@Stormur I liked your suggestion, but for the moment I'm wary about having an intransitive verb with an obj dependent. Having a feature marking incorporation on the noun as well as the verb is a good idea.

@sylvainkahane I think that under the current guidelines, surface form is a mandatory column, so it would not work to leave it out. I think this is good for the basic dependencies, but I might disagree for the enhanced dependencies.

@amir-zeldes thanks for your comments, yes, one of the reasons I decided to post the issue is to get an idea of what other people are doing or thinking... I didn't expect that Coptic would have this issue, but great that it has. In terms of reannotation, could it not be done almost automatically if you have the information in the MISC field and the relation and POS tag?

I think that given the feedback, my current thinking is to:

I'm also going to note that some of the discussion in #589 seems relevant, although these aren't strictly "null" as they appear in the utterance, just not as separate "words".

amir-zeldes commented 4 years ago

@ftyers - that's what makes these issue discussions so valuable for me: I often see that people are facing similar problems, and hearing what other people are doing gives me ideas and increases convergence between the treebanks across languages. I think "almost automatically" is about right for how we could get Coptic into your proposed representation, but not 100% since:

  1. If the relation is not obj, we can't automatically guess it (but most are obj)
  2. If the incorporation is inside a deverbal noun, the MISC field just gives us the constituent morphemes, and doesn't tell us which ones are the verb and object. We could probably guess it fairly well, and many nominalizers come from a fixed vocabulary, so we could have a set of rules for most cases, but not 100% automatic
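
For the easy case, where the relation is already recorded, the conversion can indeed be mechanical. The sketch below is only an illustration under strong assumptions: it uses the MISC layout from Option 1 earlier in this thread (`Incorporated[obj]=...` and `VerbRoot=...`), not the Coptic Treebank's actual MISC format, and the function names are made up. It promotes the MISC information to an Option 2 style feature and leaves MISC untouched.

```python
def parse_kv(column):
    """Turn a FEATS or MISC column into a dict ('_' means empty)."""
    if column == "_":
        return {}
    return dict(item.split("=", 1) for item in column.split("|"))


def format_kv(pairs):
    """Serialise a dict back into a column, sorted by key (case-insensitive)."""
    if not pairs:
        return "_"
    ordered = sorted(pairs.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{key}={value}" for key, value in ordered)


def promote_incorporation(line):
    """Copy Incorporated[rel] keys from MISC into FEATS as Incorporated[rel]=Yes."""
    fields = line.split()  # whitespace split; real files use tabs
    feats, misc = parse_kv(fields[5]), parse_kv(fields[9])
    for key in misc:
        if key.startswith("Incorporated["):
            feats[key] = "Yes"       # the relation name is taken from the MISC key
    if "VerbRoot" in misc:
        fields[2] = misc["VerbRoot"]  # lemma becomes the bare verb root
    fields[5] = format_kv(feats)
    return "\t".join(fields)


option1_row = (
    "3 мытӈэйыттэнмык ӈэйыттэн VERB _ "
    "Number[subj]=Plur|Person[subj]=1|Tense=Aor|Valency=1 0 root _ "
    "Gloss=1PL.S/A-сопка-взбираться-1PL.S/O|Incorporated[obj]=ӈэйы|VerbRoot=ттэн"
)
converted = promote_incorporation(option1_row).split("\t")
print(converted[2], converted[5])
```

When MISC holds only a flat list of morphemes with no relation and no indication of which one is the verb, as in the two cases listed above, this kind of rule has nothing to key on and a lookup table or manual pass would be needed.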

For now though, what's stopping me from doing anything that creates a separate token (i.e. a conllu row with an ID) for the incorporated element is the incompatibility with the much larger tagged corpora: if we do this, we won't be able to use POS taggers trained on large corpora to feed parsers trained on UD Coptic. We do have a morphological analyzer, so we could add the incorporation analysis in a second step, but the resulting treebank would be out of sync with the tagged corpora's token definition, so that would be a substantial problem. However I'm happy to implement solutions that just alter key-value annotations on the incorporated token (i.e. the verb). Maybe there should be a standard for people who want to do that, in addition to a recommendation on how to do it for resources that sub-tokenize incorporation?

ftyers commented 4 years ago

@amir-zeldes yes, I think having standards for both scenarios would be good.

I found another nice example today:

There are (appear to be?) three coordinated objects, the first is incorporated, the second two are not.

# text = Иниквъи ӄычавполпэрэгэм ынкъам пъомпъомъёчгынъым ынкъам эчг иԓгытэвкинэт яаёԓӄыԓтэ.
# text[phon] = inikwʔi qəsawpolpereɣeʔm ənkʔam pʔompʔomjosɣənʔəm ənkʔam esɣ iɬɣətewkinet jaajoɬqəɬte
# text[rus] = Она сказала: «Возьми мыло и корзинку для грибов и купальные принадлежности».
# text[eng] = She said: «Take a bar of soap and a basket for mushrooms as well as bathing accessories».
1       Иниквъи икын    VERB    _       _       0       root    _       Gloss=2/3.S/A-INV-сказать-TH-2/3SG.S
2-3     ӄычавполпэрэгэм _       _       _       _       _       _       _       _
2       ӄычавполпэрэгэ  пири    VERB    _       Incorporated[obj]=Yes   1       parataxis       _       Gloss=2.S/A.SUBJ-мыло-брать-IRR-2/3SG.S-=EMPH
2.1     чавпол  чоп     NOUN    _       Incorporated=Yes        _       _       2:obj   Gloss=мыло
3       м       ъм      PART    _       _       2       discourse       _       _
4       ынкъам  ынкъам  CCONJ   _       _       5       cc      _       Gloss=и
5-6     пъомпъомъёчгынъым       _       _       _       _       _       _       _       _
5       пъомпъомъёчгын  пъомпъомъёчгынъым       NOUN    _       Case=Abs|Number=Sing    2       orphan  2.1:conj        Gloss=гриб-CONT-NOM.SG-=EMPH
6       ъым     ъм      PART    _       _       5       discourse       _       _
7       ынкъам  ынкъам  CCONJ   _       _       10      cc      _       Gloss=и
8       эчг     эчг     X       _       _       9       reparandum      _       Gloss=FST
9       иԓгытэвкинэт    _       VERB    _       Case=Abs|Number=Plur    10      acl     _       Gloss=мыться-REL-NOM.PL
10      яаёԓӄыԓтэ       _       NOUN    _       Case=Abs|Number=Plur    2       orphan  2.1:conj        Gloss=использовать-PTCP.PASS-DEB-NOM.PL
11      .       .       PUNCT   _       _       2       punct   _       _


amir-zeldes commented 4 years ago

Yes, we get these weird bracketing paradoxes sometimes as well, where you might have some modifier of the incorporated object left floating elsewhere in the sentence. You can get postponed possessives of incorporated objects like "glory-give ... of God" (give God's glory), or find phrasal verbs retaining their particle which is postponed until after the incorporated object:

nej-daimonion ebol "to demon-throw out" = "cast out demons" from nouje ebol "cast out"

I think it's actually not that strange from an English perspective: common phrasal verb objects that come right after the verb can easily get 'sucked into' the verb, but the surrounding syntax can change or stay the same. Phonologically Jen Hay has shown that phrases like "gimme a hand" have reduced articulation even on the 'hand' element.

It's just that some languages seem more comfortable about doing this, and streamline what happens more: predictable, fixed reduced forms, loss or avoidance of reference to such objects, and development of alternative constructions when you want to keep the object separately modifiable and referenceable (in Coptic you get postponed mediated objects, marked by a prefixal object marker). It's cool to see how it plays out in each language and where the construction can be pushed to the limit!