UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

NPs in head-marking languages #1036

Open LanguageStructure opened 1 month ago

LanguageStructure commented 1 month ago

In head-marking languages, the core arguments are expressed as bound indexes on the predicate. Grammatical theories offer various explanations for the semantic roles of the noun phrases (NPs) related to these indexes. In Dependency Grammar (DG), it seems more appropriate to consider these NPs as dependents of the predicate. What are your thoughts on this?

Here is an example:

[ John Mary (3.SG-3.SG-Verb)]

I am referring to the roles of 'John' and 'Mary.'

dan-zeman commented 1 month ago

I suppose by "roles" you do not mean semantic roles? Those would depend on the meaning of the verb anyway. But their grammatical relations are probably nsubj and obj, assuming that the 3.SG morphemes in the verb refer to them. For example, in Plains Cree:

Cāniy kī-wīcihēw Mērīwa.

Cāniy      kī-wīcih-ē         -w   Mērī-wa
Johnny.3SG PST-help-DIR.3SG.3'-3SG Mary-3'

The verb cross-references two 3rd person core arguments, Johnny and Mary. Mary is marked as obviative (by the suffix -wa) and the verb is in direct voice, hence Johnny will be the subject and Mary will be the object.

LanguageStructure commented 1 month ago

Thank you! Yes, I did not mean roles as in semantic roles.
In HM languages, the bound indexes are indeed the core arguments (subject and object). I thought about treating the NPs as nsubj:outer or simply as dep of the predicate.

RRG, for example, has a good explanation for this phenomenon, treating the NPs as pronominal anaphoras (see Van Valin's introduction to the Handbook of RRG). This is a good fit to dep, in my opinion.

dan-zeman commented 1 month ago

Well, since the bound morphemes are just agreement markers and not pronouns (i.e., they have no nodes in UD; they result in layered morphological features of the verb node), then I do not see reasons to annotate the nouns (when they are present) as "outer" subjects or anything else.

In the Plains Cree examples above, the verb would have the following features (plus perhaps others like Mood and VerbForm):

Number[subj]=Sing|Person[obj]=3|Person[subj]=3|Tense=Past|Voice=Dir

LanguageStructure commented 1 month ago

They are not agreement markers, they are the core arguments!

dan-zeman commented 1 month ago

Or maybe they are just a reflection of the core arguments on the verb. It may not be the same in every head-marking language, of course. But they would first have to be words to be nodes and core arguments. As bound morphs, they are either affixes or clitics. If they have a prescribed position close to the verbal root (possibly just with other verbal affixes in between), they are affixes and not clitics. If, on the other hand, their position is not fixed and other words can occur between them and the verb, then they are clitics and can be treated as separate syntactic nodes in UD.

Stormur commented 1 month ago

... and if they were clitics, then probably expl would be the best choice. dep should be avoided as it means giving up on annotation.

LanguageStructure commented 1 month ago

expl is not a good choice, because expletives occupy the position of core arguments. The NPs I refer to are not in core argument positions.

Stormur commented 1 month ago

I might not understand what you mean here with position: is it an exact, fixed position like "subject is before the verb"? Else expl does not necessarily entail that that element is a core argument, just that it apparently fills a slot which is already taken by a "full" argument, as it were. This seems to be the case here!

It is also quite similar to languages where one does not really speak of head marking, such as in Italian:

Otherwise, I could comment that there is no worse choice than dep, by definition!

ftyers commented 1 month ago

In other languages with polypersonal agreement, i.a. Nahuatl (nhi, azz), Chukchi (ckt), Basque (eus), K'iche' (quc) the agreement is marked with features Number[subj]=Sing|Person[subj]=2 and Number[obj]=Sing|Person[obj]=2, and the NP arguments, if they are realised get the same function, e.g. nsubj (or csubj) or obj, ccomp etc.
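As a rough illustration of how such layered features can be read off a verb node, here is a minimal Python sketch. The helper names `parse_feats` and `features_for_layer` are hypothetical, and the input is the Plains Cree feature string given earlier in this thread; it only assumes the usual `Name[layer]=Value` notation with `|` as separator.

```python
def parse_feats(feats):
    """Split a FEATS string into {(name, layer): value}.

    The layer is None for plain features such as Tense=Past.
    """
    parsed = {}
    if feats in ("", "_"):
        return parsed
    for pair in feats.split("|"):
        key, value = pair.split("=", 1)
        if key.endswith("]") and "[" in key:
            # Layered feature, e.g. Person[subj] -> ("Person", "subj")
            name, layer = key[:-1].split("[", 1)
        else:
            name, layer = key, None
        parsed[(name, layer)] = value
    return parsed


def features_for_layer(feats, layer):
    """Return only the features that belong to one layer (hypothetical helper)."""
    return {
        name: value
        for (name, lyr), value in parse_feats(feats).items()
        if lyr == layer
    }


cree = "Number[subj]=Sing|Person[obj]=3|Person[subj]=3|Tense=Past|Voice=Dir"
print(features_for_layer(cree, "subj"))  # {'Number': 'Sing', 'Person': '3'}
print(features_for_layer(cree, "obj"))   # {'Person': '3'}
print(features_for_layer(cree, None))    # {'Tense': 'Past', 'Voice': 'Dir'}
```

With this kind of lookup, the agreement information stays on the verb node, while any overt NP still gets its ordinary relation (nsubj, obj, ...).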

amir-zeldes commented 1 month ago

We have this situation very often in Coptic and we just give the pronominal elements their proper label and the nominal ones get dislocated (like we would do English "they're my favorites, bagels")

ftyers commented 1 month ago

We have this situation very often in Coptic and we just give the pronominal elements their proper label and the nominal ones get dislocated (like we would do English "they're my favorites, bagels")

But in this case the pronouns are tokenised off and have their own node, right?

amir-zeldes commented 1 month ago

Yes, in Coptic the pronominal elements are always their own tokens (we have nominal object incorporation at times, but pronouns are distinct). It's actually pretty similar to Chadic languages like Hausa, except that in Coptic you can still get a nominal subject inside the TAM position, though it is somewhat rare. For example for the past tense marker "a" with third person nominal subject "p-rōme" or "f" (pronoun):

In languages where the 2nd option doesn't exist (and in Coptic it gets increasingly rare with time), we could argue about whether that's agreement or a pronoun, but since we have option 2 and there are also other good reasons to think of them as pronouns (which they are historically), we analyze them as separate nodes.

Stormur commented 1 month ago

But why dislocation if it is so common? I have the impression this gives a strange picture of the language, topicalising subjects so often...

amir-zeldes commented 1 month ago

But why dislocation if it is so common? I have the impression this gives a strange picture of the language, topicalising subjects so often...

Yes, that's true - based on this interpretation, Coptic has a very high proportion of dislocation. The only reason is that we understood the UD guidelines to mean that this is the correct thing to do for a language that behaves that way. Using what I called option 2 above, Coptic can have sentences that look like English ones (with a lexical subject NP and no pronoun), but with time, the pronoun option became increasingly preferred, and a secondary realization of the lexical subject became popular.

In terms of word order, the position between auxiliary and verb is the canonical subject position, so it seems right to treat it as nsubj, and if there is another copy of the subject as a lexical NP somewhere, we call that dislocated. Incidentally, like in Hausa, it's possible for the dislocated subject to be a pronoun itself, then using the 'strong' pronoun form (again like in Hausa):

Since Hausa doesn't have what I referred to as 'option 2' with a lexical NP subject between the TAM and verb, it's more arguable whether we want to consider "shi" to be dislocated - it would depend on whether we subtokenize "ya" to contain a pronoun. But for Coptic, since the pronoun is not always there, we have to make it a token, and then they both need deprels. The most common option is just to have the pronoun, and it's also closest to the verb, so IMO it makes sense for it to be nsubj, and the other one is dislocated.

Stormur commented 1 month ago

It might be a redundant affix which can or cannot be there, why does it need to be a pronoun? From your examples it looks like a bound morph, always appearing in that position. If the construction with the "interposed" lexical subject is the rarer one, probably it is this one which needs to have a "deviant" annotation (in terms of dissociated nucleus?). The opposite really looks like a lectio difficilior to me. Only my 2 cents...

sylvainkahane commented 1 month ago

In some languages, it can be very difficult to decide whether NPs coreferring with a subject pronominal index are dislocated or not. In our paper at UDW 2020, we explore the case of French interrogatives (Marie est-elle là? 'Mary is she there?') and above all of Wolof, where most sentences contain a particle focusing one element in the clause. What is particularly complicated in Wolof is that some particles block the realization of a subject in the canonical position, and the question arises whether the NP in the pre-focus position is a dislocated element or a subject.

To solve the question (supposing that the question is relevant and we need to decide), we need annotated data. This is why I think that it is important, in such cases, to annotate the different positions where a "subject" is realized and to have relations such as nsubj:weak, nsubj:canonical, nsubj:dislocated (or dislocated:nsubj) to indicate that we have units in different positions that are potential subjects. It is also why we recommend avoiding a plain dislocated relation for a candidate for the subject position, because it will be difficult later to use the data to solve the question. We have this problem with the Wolof treebank, where we never know whether dislocated elements refer to potential subjects or to other syntactic roles.
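To make the querying argument concrete, here is a small Python sketch. The relation labels are the ones proposed in this comment, not established UD subtypes; the function names and the toy token list are hypothetical and do not come from any treebank.

```python
def split_deprel(deprel):
    """Split 'nsubj:weak' into ('nsubj', 'weak'); the subtype is None if absent."""
    base, _, subtype = deprel.partition(":")
    return base, subtype or None


def is_subject_candidate(deprel):
    """True for any nsubj variant and for the proposed dislocated:nsubj."""
    base, subtype = split_deprel(deprel)
    return base == "nsubj" or (base == "dislocated" and subtype == "nsubj")


# Toy (form, deprel) pairs with placeholder forms, for illustration only.
tokens = [
    ("NP1", "dislocated:nsubj"),
    ("PRON", "nsubj:weak"),
    ("VERB", "root"),
    ("NP2", "dislocated"),
]

candidates = [form for form, deprel in tokens if is_subject_candidate(deprel)]
print(candidates)  # ['NP1', 'PRON']
```

The plain dislocated token (NP2) cannot be retrieved as a potential subject without further knowledge, which is the problem described for the Wolof treebank.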

amir-zeldes commented 1 month ago

why does it need to be a pronoun? From your examples it looks like a bound morph, always appearing in that position

No, that's not the case - in the 'option 2' construction, there is no pronoun, just a lexical NP subject. If it were an inflectional morpheme, it should always be there IMO, even when the subject is lexical. What's more, the same forms appear as object pronouns, prepositional complements, etc. Cases 2 and 3 below correspond 1:1 to their Semitic equivalents, which are regularly regarded as pronouns:

  1. f-sōtm "he hears"
  2. nmma-f "with him" (cf. Arabic ma'a-hu)
  3. rat-f "his foot" (lit. foot-PRON, cf. Arabic qadamu-hu)

I don't doubt that Coptic was probably on its way to becoming a language like Hausa, where some version of the pronoun has to be there and we could argue whether at some point it becomes inflectional, but it never made it that far before it was overtaken by Arabic as the common language of Egypt. It seems very likely that Hausa must have gone through a similar process, and there are other similarities and cognates between the two languages, but earlier stages are not documented since we don't have Hausa texts before the 15th century.

I think that it is important, in such cases, to annotate the different positions where a "subject" is realized and to have relations such as nsubj:weak, nsubj:canonical, nsubj:dislocated (or dislocated:nsubj)

Agreed, I think that makes a lot of sense, especially if the list of possible dislocation types is limited and they are frequent in the language.

Stormur commented 1 month ago

in such cases, to annotate the different positions where a "subject" is realized and to have relations such as nsubj:weak, nsubj:canonical, nsubj:dislocated (or dislocated:nsubj) to indicate that we have units in different positions that are potential subjects. It is also why we recommend avoiding a plain dislocated relation for a candidate for the subject position, because it will be difficult later to use the data to solve the question.

"Transversal" and flexibly combinable subtypes are indeed something that would have their use. I proposed them here #955 but they keep coming up in similar discussions. "Layers" for deprel, in a sense!

No, that's not the case - in the 'option 2' construction, there is no pronoun, just a lexical NP subject. If it were an inflectional morpheme, it should always be there IMO, even when the subject is lexical. What's more, the same forms appear as object pronouns, prepositional complements, etc. Cases 2 and 3 below correspond 1:1 to their Semitic equivalents, which are regularly regarded as pronouns:

1. f-sōtm "he hears"

2. nmma-f "with him" (cf. Arabic ma'a-hu)

3. rat-f "his foot" (lit. foot-PRON, cf. Arabic qadamu-hu)

They continue looking very much like bound personal morphs to me (this is also what I recall from my little Arabic). Case 2 reminds me of "inflected" adpositions in languages like Irish, but also Hungarian; case 3 of many possessive affixes. Are there not "full", "strong" forms for pronouns?

If it were an inflectional morpheme, it should always be there IMO

This was a point of confusion which I probably did not elaborate. I do not think that inflectional affixes need to be always mandatory: there might be some redundancy and cases where they might or might not appear. This is common typologically for plural affixes, but it would not surprise me for person indexing.

The previous case number 2 looks very interesting and raises many questions (incorporation? nature of TAM element a?). But more generally, I meant that also methodological questions are raised if the commonest construction is annotated as the very marked, and at the same time underdefined, dislocated. As @sylvainkahane points out, this becomes a problem for queries and data/construction extraction, as the lexical subject will be very often "invisible" or require prior non-trivial knowledge to be identified. If an analysis as affix is not viable, I think that, for example, the use of expl could be a well grounded working compromise at least, here.

amir-zeldes commented 4 weeks ago

They continue looking very much like bound personal morphs to me (this is also what I recall from my little Arabic)

No, they are regarded as pronouns in Arabic, and this is handled the same in UD Coptic, Arabic and Hebrew. This example illustrates both the prepositional object (token 26) and the possessive enclitic pronoun (token 28). Notice both are tagged as PRON and treated as the head of the PP and a genitival modifier respectively:

24  عثرت    عَثَر   VERB    VP-A-3FS--  Aspect=Perf|Gender=Fem|Number=Sing|Person=3|Voice=Act   17  advcl   17:advcl:عِندَمَا   Vform=عَثَرَت|Gloss=discover,come_across,find|Root=` _t r|Translit=ʿaṯarat|LTranslit=ʿaṯar
25-26   عليه    _   _   _   _   _   _   _   _
25  علي عَلَى   ADP P---------  AdpType=Prep    26  case    26:case Gloss=on,above|LTranslit=ʿalā|Root=` l w|Translit=ʿalay|Vform=عَلَي
26  ه   هُوَ    PRON    SP---3MS2-  Case=Gen|Gender=Masc|Number=Sing|Person=3|PronType=Prs  24  obl:arg 24:obl:arg:عَلَى:gen    Gloss=he,she,it|LTranslit=huwa|Translit=hi|Vform=هِ
27-28   شقيقته  _   _   _   _   _   _   _   _
27  شقيقة   شَقِيقَة    NOUN    N------S1R  Case=Nom|Definite=Cons|Number=Sing  24  nsubj   24:nsubj    Gloss=sister|LTranslit=šaqīqat|Root=^s q q|Translit=šaqīqatu|Vform=شَقِيقَةُ
28  ه   هُوَ    PRON    SP---3MS2-  Case=Gen|Gender=Masc|Number=Sing|Person=3|PronType=Prs  27  nmod    27:nmod:gen Gloss=he,she,it|LTranslit=huwa|Translit=hu|Vform=هُ
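For readers who want to check this pattern programmatically, here is a minimal Python sketch that lists PRON words covered by a multiword-token range. It uses a simplified copy of the rows above: forms are the Translit values, the two range-line surface forms are plain concatenations of those transliterations, and most other columns are blanked to `_`. These simplifications and the function name `pronouns_inside_mwt` are assumptions made for illustration, not the treebank's actual content.

```python
ROWS = [
    # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    ["24", "ʿaṯarat", "_", "VERB", "_", "_", "17", "advcl", "_", "_"],
    ["25-26", "ʿalayhi", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["25", "ʿalay", "_", "ADP", "_", "AdpType=Prep", "26", "case", "_", "_"],
    ["26", "hi", "_", "PRON", "_", "PronType=Prs", "24", "obl:arg", "_", "_"],
    ["27-28", "šaqīqatuhu", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["27", "šaqīqatu", "_", "NOUN", "_", "_", "24", "nsubj", "_", "_"],
    ["28", "hu", "_", "PRON", "_", "PronType=Prs", "27", "nmod", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)


def pronouns_inside_mwt(conllu):
    """Return (id, form, deprel, head) for PRON words covered by a range line."""
    covered = set()  # word IDs that fall inside some multiword-token range
    words = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0]:
            start, end = map(int, cols[0].split("-"))
            covered.update(range(start, end + 1))
        elif cols[0].isdigit():  # skips empty nodes such as 5.1
            words.append(cols)
    return [
        (int(c[0]), c[1], c[7], int(c[6]))
        for c in words
        if c[3] == "PRON" and int(c[0]) in covered
    ]


for item in pronouns_inside_mwt(SAMPLE):
    print(item)
# (26, 'hi', 'obl:arg', 24)
# (28, 'hu', 'nmod', 27)
```

Both enclitic pronouns surface as ordinary syntactic words with their own deprel, which is the point of the example above.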

"inflected" adpositions in languages like Irish, but also Hungarian

I don't know Irish or Hungarian, but in my mind if a preposition inflects, that would mean that it agrees with something or expresses some categories to indicate a choice with semantic meaning in a paradigm. What we have here in the prepositional case is simply an allomorph of the preposition which is triggered in the environment of a pronoun as the object. This is common in Afro-Asiatic languages and works the same in Coptic, Egyptian, Arabic and Hebrew, to name a few. For example:

Notice that the preposition does not change its form based on person or number, or definiteness, or contact with an article, or anything else - it's an automatic allomorphic alternation based solely on whether the object is pronominal or nominal.

Are there not "full", "strong" forms for pronouns?

Yes, these are the clitic pronouns, and there are also independent ones, like the "ntof" above, which I loosely translated "as for him". It's used in more marked information-structural environments. The same thing happens in Indo-European (e.g. Polish dat. strong mnie/enclitic mi "me", tobie/ci "you"), but we don't say any of those are not pronouns just because some of them have to be post-tonic.

the commonest construction is annotated as the very marked

I wouldn't say it's the most common one - that would be just a pronominal subject, with no dislocation. And lexical NP subjects are not exactly rare, perhaps because the UD corpus is focused on classical literature. I just had a look, and for the past tense (which is admittedly only one environment), we get:

So yeah, lexical NPs are conspicuously rare in Coptic, but they're still 12% of the data, and treating all pronouns as inflection just because of that would suddenly mean that Coptic becomes a pro-drop language with 70% subjectless sentences. I'm sure this would surprise a lot of people working on the language, since there is no real discontinuity here with Ancient Egyptian. Those pronouns are standing exactly where the subject was standing in late Egyptian, where dislocations are much less common. In short, I think of Coptic as exhibiting a language change in progress, which never came to completion, but the results of which would have given us a language like Hausa.