Closed by ftyers 4 years ago
I have no knowledge of Chukchi, but looking at it as an outsider, if the main issue is the lack of a space, I think a multiword token is the best choice here. We have non-spaced possessors and other modifiers in Coptic, Arabic and Hebrew, and the solution in all three has been to use multiword tokens and not treat it as incorporation.
I'd be inclined toward that, but it is complicated by the fact that there can be circumpositions around the incorporated/compounded element. I was also not sure whether this should be treated more like space-less nominal compounding in the Germanic languages (there are also examples of noun-noun compounds, e.g. джинсыӄонагыԓьын) or like pronominal clitics in Romance languages. So far I'm leaning toward the latter.
MWT analysis sounds like a good idea to me here. The fact that possessives would be annotated differently in absolutive and non-absolutive nominals would not in itself bother me too much (a vague analogy could be degrees of comparison in English, which are sometimes morphological and sometimes periphrastic). But if adjectives and numerals are treated the same way then I don't think it can be captured by features. That is, if it has to be captured at all: the German-compound analogy is still an option. For what it's worth, Sanskrit also has a lot of compounds and they are often split (treated as MWT), sometimes because of more intricate internal syntax but sometimes perhaps just because it is customary in Sanskrit linguistics (see also #539).
Some other notes on the trees above:
(1) The `obl` relation in CoNLL-U between гичининэт and вэривуунъыкинэт is an error, right? (The tree diagram below the CoNLL-U actually says `obj`.)
(2) `discourse` relation? I think that `discourse` serves to attach interjections at clausal level. Shouldn't the emphatic particle be attached to ытчая via `case`?
@dan-zeman (1) Yes, it should be `obj`, not `obl`; typo, fixed. (2) I've been using `discourse`. I don't think it should be `case`, as it isn't really a case-marking element. It's kind of like the question word or focus words -kin, -kaan etc. in Finnish, but the Finnish treebank annotates them with a morphological feature as opposed to as separate tokens. I'd be open to changing the annotation, but I'm not sure what to replace it with.
About the other points: I was actually using the German analysis for the noun and adjective modification. I didn't really like it, but I thought "that's what UD does, if it's going to cause problems because of orthography then we just have to put up with it". But then I saw the numeral examples and reread the part in the grammar that talks about attributives, and thought that it really doesn't make sense to have each combination of 1...∞ as a separate item in the lexicon!
Note: #709 discusses the second point, regarding `discourse`.
In Chukchi, there are two processes of adnominal modification: unincorporated and incorporated. Unincorporated means that the tokens are whitespace-separated; incorporated means that the tokens are not whitespace-separated (much like compounds in German or Swedish). The choice between the two processes depends on the case of the head noun. If the noun is in the absolutive case, modifiers are not incorporated; if the noun is in any other case, the modifiers are incorporated. As Dunn (1999: p.291) writes,
In the corpus I am using, I have so far come across this with adjectives, possessives and numerals.
I'll take possessives first, from this story
In principle the possessives could be done with features, as in languages which have real possessive suffixes, e.g. `Number[psor]` and `Person[psor]`. However, given that, we would get possessives as features with non-absolutive case NPs and as separate words with absolutive case NPs. I'm not sure if that is internally consistent. My current thought is to annotate it as follows:
A similar thing happens with adjectives, as in this example:
And this example has two adjectives "thin" and "rubber" modifying a single noun "boots", from this text:
Here is a more contrastive example with two numerals (from the same text):
Incorporated: Ԓюутэ Абрамович ӈыронвэрталёта вакъогъэ нэмыӄэй. "Suddenly, Abramovich also landed with three helicopters."
Unincorporated: Ынкъа ԓюут ӈыръа вэрталёттэ вакъогъатӈа. "Suddenly, four helicopters landed."
Given that this phenomenon is much more transparent than the ones mentioned in #701 and #703, and does not involve argument structure/valency, I would tend to treat them using the multiword tokens that we have available in the basic dependencies. Inflectional circumfixes would then go around the head of the combined unit, as in this example from Dunn's grammar (I have not found any examples of circumfix + adnominal incorporation in the corpus I'm using so far).
(my transliteration) ... Note that this would require creating a wordform гарата "COM-house-COM", but creating noun forms is a lot more practical for an annotator than creating verb forms.
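To make the multiword-token proposal a bit more concrete, here is a rough Python sketch of how an incorporated numeral + noun could be written out as a CoNLL-U multiword token range followed by its syntactic words. The helper `mwt_lines` is hypothetical, and the segmentation of ӈыронвэрталёта into ӈырон + вэрталёта, the UPOS tags, the heads and the `nummod`/`obl` relations are all my assumptions for illustration only; they are not taken from the treebank or from Dunn's grammar. Lemmas and features are deliberately left as `_`.

```python
def mwt_lines(start_id, surface, words):
    """Return CoNLL-U lines for one multiword token (hypothetical helper).

    `words` is a list of dicts with keys: form, upos, head, deprel.
    """
    end_id = start_id + len(words) - 1
    # Range line: only ID and FORM are filled, the other 8 columns are "_".
    lines = ["\t".join([f"{start_id}-{end_id}", surface] + ["_"] * 8)]
    for offset, w in enumerate(words):
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        lines.append("\t".join([
            str(start_id + offset), w["form"], "_", w["upos"], "_",
            "_", str(w["head"]), w["deprel"], "_", "_",
        ]))
    return lines


# Assumed analysis of the incorporated example: the numeral attaches to the
# noun (word 4), and the noun attaches to the verb вакъогъэ (word 5).
lines = mwt_lines(3, "ӈыронвэрталёта", [
    {"form": "ӈырон", "upos": "NUM", "head": 4, "deprel": "nummod"},
    {"form": "вэрталёта", "upos": "NOUN", "head": 5, "deprel": "obl"},
])
print("\n".join(lines))
```

The point of the sketch is only that the surface token stays intact on the range line while the modifier and head get their own word lines, so the tree looks the same as in the unincorporated (absolutive) case.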
Are there any objections to this approach, or does anyone have any thoughts? Do people consider that this is sufficiently different from noun-noun compounding in Swedish/German to merit a different approach?