UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Focus in Wolof #590

Closed chedio closed 5 years ago

chedio commented 5 years ago

Hi, I am currently developing a UD treebank for Wolof (Niger-Congo). One of the main features of this language is that focus is marked morphosyntactically by means of inflectional elements. Such elements convey grammatical specifications of the verb, including person, number, mood, but also the information structure of the sentence (focus).

Wolof has three focus constructions: subject focus, verb focus, and complement focus. These constructions vary morphosyntactically according to the syntactic function of the focused constituent: subject, verb, or complement (any constituent which is neither subject nor main verb). At the morphological level, the subject takes a different form depending on the focus type. At the syntactic level, we can identify two different word orders: subject and verb focus sentences are SVO, while complement focus sentences are OSV.

Here are some examples for the 3rd person singular subject (lekk "eat" and jën "fish"):

In the current UD guidelines, there seems to be no feature that covers this aspect. I was wondering if it would be possible to introduce a new feature Focus that would take (at least) one of the following values: Subj, Verb or Comp. These values would correspond to the focus markers moo, dafa and la, respectively.

sylvainkahane commented 5 years ago

moo combines the subject PRON mu and the AUX a. Do you plan to split it up?

1    moo
1.1  mu
1.2  a

If you do that, it will be easier to find the focus marker. Of course a feature such as Focus can be introduced but it becomes less necessary.
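For readers unfamiliar with how such a split is stored: in CoNLL-U, a multiword token gets a range ID (for example 1-2) on a line of its own, followed by one line per syntactic word. The sketch below prints that layout for moo split into mu + a. The helper name multiword_token_lines is hypothetical, and the tags and relations (PRON/nsubj, AUX/aux, both attached to lekk) only illustrate the proposal made in this comment; they are not an agreed Wolof guideline.

```python
def multiword_token_lines(start_id, surface, words):
    """Return CoNLL-U lines for one multiword token and its syntactic words.

    words: list of (form, lemma, upos, head, deprel) tuples.
    Columns that are not specified are filled with "_".
    """
    end_id = start_id + len(words) - 1
    # Range line: only ID and FORM are filled; the other 8 columns are "_".
    lines = ["\t".join([f"{start_id}-{end_id}", surface] + ["_"] * 8)]
    for offset, (form, lemma, upos, head, deprel) in enumerate(words):
        cols = [str(start_id + offset), form, lemma, upos, "_", "_",
                str(head), deprel, "_", "_"]
        lines.append("\t".join(cols))
    return lines


# Hypothetical analysis of "moo lekk": moo = mu (PRON, nsubj) + a (AUX, aux).
lines = multiword_token_lines(1, "moo", [
    ("mu", "mu", "PRON", 3, "nsubj"),
    ("a", "a", "AUX", 3, "aux"),
])
lines.append("\t".join(["3", "lekk", "lekk", "VERB", "_", "_", "0", "root", "_", "_"]))
print("\n".join(lines))
```

With this layout the surface form moo is preserved on the range line, while the focus marker a remains searchable as a word of its own.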

chedio commented 5 years ago

No, I am not planning to split moo for the simple reason that, unfortunately, the situation is not so easy. There are other focus markers that cannot be split in this way. For instance, la cannot be split at all. Splitting dafa into da and fa would not make sense, as fa has no meaning.

I think that Focus is really the feature that would make it possible to distinguish between these three types of constructions (and at the same time between the inflectional markers moo, la and dafa). I do believe that such a feature is necessary.

sylvainkahane commented 5 years ago

la does not combine with a PRON, so it is not problematic. Its lemma will be la.

But what will be the lemma of moo if you don't decompose it? What will you do when the focus marker a amalgamates with a NOUN or a PROPN:

Faatoo     lekk  jën
Fatu.EMPH  eat   fish

da combines with a PRON. For instance:

da-ñu   lekk  jën
da-3PL  eat   fish

dafa is an irregular form, but it doesn't matter, you can split it. The decomposition is syntactic, not morphological:

1    dafa
1.1  da    AUX   2  aux
1.2  mu    PRON  2  nsubj
2    lekk  VERB  0  root

What will be the relation between dafa and lekk if you don't split it?

That's the first question to solve. Do we want to treat the focus markers as tokens (AUX or PART) or as features? I think it is possible to treat them as tokens. Of course they tend to cliticise and amalgamate, but this is quite similar to English and not really worse (I'm, you're, we'll, they've, wanna, won't). If you treat them as tokens, your Wolof grammar would be much simpler and more regular, I think.

chedio commented 5 years ago

I think the problem is not a tokenisation issue, but rather a feature issue. Fundamentally, I would not be in favour of splitting some of these elements and not the others. It's true that da combines with ñu. But such a morpheme (ñu) also attaches to all verbs when these are inflected for person and number (for instance, in negation). In such a case, I do not attempt to split the verb into the verb stem and a hypothetical pronoun. Instead, I just annotate this (e.g. ñu) as the Person and Number features of the verb. The same applies to an inflectional element like da-ñu. I think it would be good to avoid overusing word segmentation (the question would then be where this will end). Another related issue is that such a segmentation will cause ambiguities. For instance, splitting da-nu "da-1PL" into da and nu would make nu ambiguous between a subject marker and an object marker.

Moo, dafa and la can be assigned the lemma mu without much trouble. The syntactic relation between dafa (or any of these elements) and lekk is aux in the case the lexical subject is present and nsubj in the case there is no apparent lexical subject.

chedio commented 5 years ago

To answer your first question: yes, of course, when the focus marker a amalgamates with a NOUN or a PROPN (e.g. Faatoo), then a split occurs. Faatoo becomes Faatu a.

sylvainkahane commented 5 years ago

If Faatoo is split into Faatu a, I don't understand why moo is not split into mu a.

And you said "splitting da-ñu "da-1PL" into da and ñu would make ñu ambiguous between a subject marker and an object marker." But of course not: if ñu is a token, it will receive a governor and a function. It will be clearly nsubj or obj. ñu is not cliticised in some other constructions and can appear alone, so you will sometimes have it as a token anyway.

The only irregular forms are with 3SG mu. If you treat them as amalgams, the grammar is simpler. I think that the lemmas of moo, dafa and la must not be mu, but mu+a, da+mu and la+mu. Or if you don't want to split them, they must be a, da and la with features for subject agreement.

chedio commented 5 years ago

You are right about the lemma. In the current state, I label them as moo, dafa and la (sorry for not being clear in my previous answer; I have not bothered much about the lemma so far).

But still, I would not be in favour of splitting moo into mu+a. The reason for splitting Faatoo into Faatu and a is that the whole word (i.e. Faatoo) can only be tagged as a noun. This is quite different from "moo", which can easily be treated as an "auxiliary". I don't really see what the problem is with treating moo, dafa and la as tokens and as aux. Even if I split them, I would still need a, da and la to indicate the respective focus feature, since it's the main information they are conveying in their respective constructions. On the other hand, if you have a sentence like Faatu moo lekk (It's Faatu who has eaten), it is easy to see that Faatu is nsubj, moo is aux and lekk the head. Otherwise, if you split moo into mu and a, both Faatu and mu should actually bear the nsubj function, unless you treat the first constituent as dislocated. I do think such an analysis would not really be motivated and would be more complicated than if I don't split.

sylvainkahane commented 5 years ago

But I don't think that Faatu is subject in Faatu moo lekk. This sentence is better translated by 'Fatu, it's her who has eaten', in contrast with Faatoo lekk 'It's Fatu who has eaten'. I think that Faatu is dislocated in Faatu moo lekk and the subject is mu. Especially because Faatu is not obligatory, which would be very unusual for a focused element.

chedio commented 5 years ago

I don't think Faatu moo lekk and Faatoo lekk differ in meaning. In both cases, I see Faatu as the subject. The fact that Faatu is optional just follows the pro-drop nature of the language which holds also for non-focused constructions.

sylvainkahane commented 5 years ago

But your argument is circular. You can only consider Wolof as a pro-drop language if you don't accept that moo is mu+a. Before, you said that moo has mu as lemma and that it was the subject of lekk in moo lekk, and now you say that moo is an AUX and lekk has no subject. I really think that you are making things more complex. Do you have a document or guidelines for your Wolof grammar? I don't see anything on the UD website.

chedio commented 5 years ago

The grammar is under construction and I will upload it when I have finished cleaning up some stuff. I think my argument was quite clear: moo is nsubj if the lexical subject Faatu is missing. Otherwise, if the lexical subject is present, moo is aux. This is quite natural behaviour for a pro-drop language. Anyway, my main point in raising this issue was in fact only to ask whether it would be possible to introduce a new feature (that could be called Focus or FocusType) that is relevant enough for Wolof to make the distinction between subject focus, verb focus and complement focus.

sylvainkahane commented 5 years ago

The discussion is related to your initial question, because if you introduce focus markers as lemmas, you don't necessarily need a feature Focus. Or at least you can add it automatically very easily (a is the subject focus marker, etc.).
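The automatic step mentioned here could look roughly like the following sketch. It assumes the split analysis with focus-marker lemmas a, da and la (as suggested earlier in the thread) and the feature name Focus with values Subj, Verb and Comp from the opening post. The dictionary token layout, the function name add_focus_feature, and the restriction to AUX tokens are assumptions made for illustration only.

```python
# Assumed mapping from focus-marker lemma to focus type, following the thread:
# a = subject focus (moo), da = verb focus (dafa), la = complement focus.
FOCUS_BY_LEMMA = {"a": "Subj", "da": "Verb", "la": "Comp"}


def add_focus_feature(token):
    """Return a copy of token with a Focus feature if it is a focus marker.

    token is a dict with at least "lemma" and "upos"; "feats" is optional.
    Only AUX tokens are touched, so homographs with another tag are left
    alone (an assumption, not something the thread settles).
    """
    focus = FOCUS_BY_LEMMA.get(token["lemma"])
    if focus is None or token["upos"] != "AUX":
        return token
    feats = dict(token.get("feats", {}))
    feats["Focus"] = focus
    return {**token, "feats": feats}


marker = add_focus_feature({"form": "a", "lemma": "a", "upos": "AUX"})
verb = add_focus_feature({"form": "lekk", "lemma": "lekk", "upos": "VERB"})
print(marker["feats"])
print("feats" in verb)
```

Under this approach the feature is fully redundant with the lemma, which is the point being made: it can be generated rather than annotated by hand.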

But you propose another analysis, which I find unusual and unnecessarily complex. Your analysis is not the typical analysis of a pro-drop language. In a pro-drop language, you have a verbal form with an optional subject, and the analysis of the verbal form doesn't change when the subject is missing. Here you propose two analyses of moo lekk. Not only do you propose two different functions for moo (nsubj vs aux), but I suppose that you also propose two different POS (PRON vs AUX) and two different lemmas (mu vs a). I don't see any reasonable argument for doing this. If you do that for 3SG you also need to do that for 3PL (ñoo lekk) and probably the other forms.

I think two analyses are possible.

First analysis: Wolof as a pro-drop language. In (Faatu) moo lekk, moo is analysed as an inflected form of the AUX a and Faatu as the subject. I find this analysis a bit strange because Moo lekk would be a sentence without a subject but with a focused subject, which is quite paradoxical.

Second analysis: Wolof as a non-pro-drop language. moo is analysed as the amalgam of a subject mu and the AUX a, and Faatu as a dislocated element. The analysis of moo lekk is then parallel to Faatoo lekk, which you agree to analyse as Faatu+a lekk.

chedio commented 5 years ago

Ok. To make it clear: I propose two functions for moo (nsubj vs. aux), but I do not propose two different POS nor two different lemmas. 1) First, the POS for moo is AUX (not PRON). 2) The lemma for moo and all subject focus markers is moo. Similarly, dafa and la are the lemmas for the verb and complement focus markers, respectively.

Now, concerning the first analysis you mention, I do not see any paradox. Because, moo lekk has a subject which is moo. It is just that the nominal moo is referring to is not specified and some contextual information is needed to determine the concrete referent. Let us just take another example with a non-focused sentence: lekk na (he has eaten). Here, we have exactly the same situation. This sentence has clearly a subject which is na (3SG). If the lexical subject appears (e.g. Faatu lekk na), then na takes the aux function. Otherwise, na takes the subject function. In both cases, na still has the AUX part of speech. I do not see why this analysis cannot be applied to moo as well (given the fact that this will provide a uniform analysis for all Wolof inflectional elements like moo, dafa, la, na, dina, etc.)

ftyers commented 5 years ago

@beemorris @KatyaAplonova is there stuff like this in Laiholh and Bambara? I seem to remember there being issues with working out which part is the subject (e.g. agreement marker/pronoun or full NP) in both those languages.

Let us just take another example with a non-focused sentence: lekk na (he has eaten). Here, we have exactly the same situation. This sentence has clearly a subject which is na (3SG). If the lexical subject appears (e.g. Faatu lekk na), then na takes the aux function. Otherwise, na takes the subject function. In both cases, na still has the AUX part of speech. I do not see why this analysis cannot be applied to moo as well (given the fact that this will provide a uniform analysis for all Wolof inflectional elements like moo, dafa, la, na, dina, etc.)

I don't know anything about Wolof, but I'm not sure whether, according to the guidelines, an AUX can receive the nsubj relation, which has (overwhelmingly, so far) been reserved for nominal elements.

chedio commented 5 years ago

Thank you for raising this issue. For me, it was not clear either whether the guidelines forbid AUX to receive the nsubj relation. I hope someone can help us clarify this. If it is the case that AUX cannot bear nsubj, then this would naturally raise an issue for inflectional elements like na as in lekk na (eat 3sg, meaning he/she has eaten). This is because na is clearly the subject and exactly fits the AUX category as defined in the guidelines:

An auxiliary is a function word that accompanies the lexical verb of a verb phrase and expresses grammatical distinctions not carried by the lexical verb, such as person, number, tense, mood, aspect, voice or evidentiality

In Wolof, na is not a PRON and would not really fit any other category than AUX.

ftyers commented 5 years ago

No problem! I'm curious, what are the criteria you are using for subjecthood here? Again, I don't know anything about Wolof, so it would be good to have both positive and negative examples for each of the criteria.

jnivre commented 5 years ago

It seems really strange to assign the relation "nsubj" to an element tagged as AUX, especially since that element occurs with the "aux" relation in other contexts. As far as I understand, Sylvain's proposed analysis is the one that fits both the data and the UD guidelines best. This includes both segmenting "moo" into "mu" and "a" and treating Faatu as "dislocated" when it cooccurs with "moo". The latter is exactly parallel to the situation in French, where full noun phrases are treated as subjects when they occur without clitic pronouns but as dislocated when both are present:

Jean vient
nsubj(vient, Jean)

Jean il vient
nsubj(vient, il)
dislocated(vient, Jean)

This is the recommended treatment for this kind of doubling in UD. See, for example, http://universaldependencies.org/fr/dep/dislocated.html
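As a minimal sketch of the rule illustrated by the French pair above: when a clitic pronoun doubles a full noun phrase, the pronoun is attached as nsubj and the noun phrase as dislocated; when only one of them is present, it is the nsubj. The function name analyse_subject_doubling is hypothetical, and the Wolof call simply transposes the French pattern to Faatu moo lekk under the assumption that moo is split into mu + a.

```python
def analyse_subject_doubling(verb, noun_phrase=None, clitic=None):
    """Return (relation, head, dependent) triples for the subject(s) of verb."""
    relations = []
    if clitic is not None:
        # The clitic pronoun is the subject whenever it is present.
        relations.append(("nsubj", verb, clitic))
        if noun_phrase is not None:
            # A co-occurring full noun phrase is then treated as dislocated.
            relations.append(("dislocated", verb, noun_phrase))
    elif noun_phrase is not None:
        relations.append(("nsubj", verb, noun_phrase))
    return relations


print(analyse_subject_doubling("vient", noun_phrase="Jean"))
print(analyse_subject_doubling("vient", noun_phrase="Jean", clitic="il"))
# Wolof parallel, assuming moo = mu + a:
print(analyse_subject_doubling("lekk", noun_phrase="Faatu", clitic="mu"))
```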

chedio commented 5 years ago

The main criterion for subjecthood I use here is subject-verb agreement. In Wolof, verbs bear agreement morphology, and the subject is the argument that agrees with the verb. In the sentence given above, it is na that has this function.

chedio commented 5 years ago

Well, if AUX cannot be assigned the nsubj relation, does this mean that na must be a PRON? I think this would not really reflect the linguistic reality of Wolof. I know we should try to comply as much as possible with the guidelines, but do we really have to analyze Wolof in the same way as French? Another issue is splitting moo into mu and a. Sure, such a segmentation can be done, but I still have the feeling that, sooner or later, a FocusType feature will be needed for languages that mark focus morphosyntactically in the way Wolof does, and this independently of tokenization, lemmatisation or pro-drop issues.

ftyers commented 5 years ago

The main criterion for subjecthood I use here is subject-verb agreement. In Wolof, verbs bear agreement morphology, and the subject is the argument that agrees with the verb. In the sentence given above, it is na that has this function.

So in the lekk na and Faatu lekk na examples, what is the agreement marker and what is the subject?

chedio commented 5 years ago

na is filling both functions: subject and agreement marker.

ftyers commented 5 years ago

na is filling both functions: subject and agreement marker.

So, your argument is: Wolof verbs must have both an agreement marker and a subject. If there is an overt NP, the NP is the subject and the agreement marker fulfills its own role; if there is no overt subject, then the agreement marker takes on the role of subject and there is no explicit agreement marker?

chedio commented 5 years ago

If there is an overt NP (e.g. Faatu lekk na), the NP (Faatu) is the subject and na fulfills its role as an agreement marker. Otherwise, if there is no overt NP (e.g. lekk na), then na has a double function as the subject and the agreement marker.

chedio commented 5 years ago

To me, it sounds strange to not assign na the nsubj relation in the second case.

ftyers commented 5 years ago

There can be a lot of strange things in linguistics :) How about these questions:

  1. Does na behave like other subjects (pronominal and full NPs)
  2. Does na cause agreement on the verb?
  3. Is the position of na fixed?
  4. Can there be intervening linguistic matter between na and the verb?
  5. What other words can fill the slot that na has?
  6. Can the na be dropped/elided?
  7. Can other subjects be dropped/elided?

According to your definition, it seems like there are two types of subjects in Wolof: those that are nominal/pronominal, which cause agreement on the verb, and those that are auxiliary, which replace agreement on the verb. Is that correct? If so, I'm fairly sure that the second kind does not fit the definition of nsubj in the universal guidelines.

chedio commented 5 years ago

  1. No, na does not behave like pronominal and full NPs. The latter always precede the verb, while na follows the verb. Wolof has a pronoun for the 3SG which is mu.
  2. Is already clear
  3. na always immediately follows the main verb, with the only exception that in rare cases the past tense particle woon may intervene between na and the verb.
  4. woon is the only element that may intervene between na and the verb.
  5. None

I spent my whole PhD writing an LFG grammar for Wolof. In the LFG model, na is analyzed as an INFL for inflection. There is no doubt that na is not a pronoun. Besides word order, there is a lot of evidence that distinguishes na from subject pronouns like mu in Wolof. Subject pronouns have a predictable distribution, i.e. exactly the same position as their corresponding full NP counterpart. They are specified for nominative case. Phonologically, subject pronouns occur in sentence-initial position, bear default initial stress (are weakly stressed) and remain unattached. In contrast, na is never stressed and does not exhibit any of these features that are found for subject pronouns.

sylvainkahane commented 5 years ago

@chedio In Moo lekk or Lekk na, the verb is lekk. The verb itself is invariable and we cannot consider that the verb agrees with the subject in Wolof. The question is whether the subject index on the auxiliary is a pronominal subject or an agreement morpheme. This question is debatable. But you cannot analyze this index as both the subject and the agreement with the subject.

chedio commented 5 years ago

Why can we not consider that the verb agrees with the subject in Wolof? This is only true for conjugations where the verb shows no inflection. But in other cases, like in the na conjugation and in negation, it is clear that the verb agrees with the subject.

KatyaAplonova commented 5 years ago

To my knowledge, in Wolof there are many predicative elements like ma. Maybe, you'll introduce a special relation for them? aux:subj?

On Mon, 19 Nov 2018 at 21:27, chedio notifications@github.com wrote:

Why can we not consider that the verb agrees with the subject in Wolof? This is only true for conjugations where the verb shows no inflection. But in other cases, like in the na conjugation and in negation, it is clear that the verb agrees with the subject.


chedio commented 5 years ago

Thank you. This seems to be a good suggestion. I am curious what the others will think about this.

jnivre commented 5 years ago

I think we are all willing to accept that "na" can be analysed as an agreement marker (although it is not clear to me that it must be analysed this way). What most people (including me) find strange is the claim that "na" also assumes the subject function when there is no noun phrase subject. Since an agreement marker agrees with the subject, this would imply that "na" agrees with itself, which does not make any sense to me. Hence, as pointed out by Sylvain, it seems it must either be analysed as a subject pronoun, in which case there is no subject-verb agreement in Wolof, or as an agreement marker, in which case Wolof is a pro-drop language. There are many languages of both these types, but I still haven't seen any evidence that Wolof is of a unique third type.

Let me also say that this discussion is (mostly) orthogonal to the discussion of whether a focus feature is needed.

chedio commented 5 years ago

Ok. To come back to the original issue, I know that many people are reluctant to add a new feature (Focus), since in most other languages information structure and syntax work at different levels, which is not the case in Wolof.

At this point, I can maybe cite Stéphane Robert (2000) in Le verbe wolof ou la grammaticalisation du focus. Louvain: Peeters, Coll. Afrique et Langage, 229-267. Pointing out the role of focus in Wolof, Robert said:

Ainsi, dans cette langue, la hiérarchie informationnelle, loin d'être secondairement surimposée à un noyau prédicatif stable, organise au contraire le système verbal et conditionne le choix-même de la conjugaison dans tout énoncé. On a donc à faire, avec le wolof, à un cas extrême de grammaticalisation du focus.

I can try to translate this as:

In Wolof, the information structure is not merely a secondary component superimposed on a core syntactic one, but it rather organizes the verbal system and even conditions the choice of the inflectional markers for each focus construction type.

Now, my question is: how could such a relevant aspect of the language be ignored and not be represented at the feature level?

jnivre commented 5 years ago

I think you misunderstood me. I was not saying that this feature should be ignored. On the contrary, if it is an aspect that is systematically encoded morpho-syntactically then it should in principle be part of the annotation. However, there is one important restriction. UD currently only allows features on single words. If a feature is realised in a construction involving multiple words, then it currently cannot be represented. This is something that affects many languages. Let me just take two examples to illustrate what I mean.

Interrogative mood (or sentence type) is often realised by using a different word order from declarative sentences, but since this is a global property involving several words, it cannot be represented.

Perfect tense is often realised by having an auxiliary in the present tense together with a participial form of the main verb. In such cases, the UD annotation will typically tell you that the auxiliary has present tense and that the participle is a past participle (or something like that). But nothing in the annotation will capture the notion of perfect.
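To make the limitation concrete, here is a small sketch of how a consumer of the data would have to recover "perfect" from two separate word-level annotations, since no single token carries it. The function name is_periphrastic_perfect is hypothetical, and the feature values used for English "has eaten" (Tense=Pres on the auxiliary, VerbForm=Part with Tense=Past on the participle) are one common pattern, not a rule that holds for every language or treebank.

```python
def is_periphrastic_perfect(aux_feats, verb_feats):
    """Guess whether an auxiliary + main verb pair forms a perfect.

    Both arguments are dicts of word-level features. The test combines
    information from two tokens, which is exactly what a single-word
    feature cannot express.
    """
    return (
        aux_feats.get("Tense") == "Pres"
        and verb_feats.get("VerbForm") == "Part"
        and verb_feats.get("Tense") == "Past"
    )


has = {"Mood": "Ind", "Tense": "Pres", "VerbForm": "Fin"}
eaten = {"Tense": "Past", "VerbForm": "Part"}
print(is_periphrastic_perfect(has, eaten))
```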

We considered adding features at the phrase level when introducing v2 of the guidelines, but the cases that seemed to require this were not pressing enough to motivate a considerable change in the annotation format and the CoNLL-U file format. Conceivably, this issue could be revisited for future versions, but right now this is a limitation we have to live with.

It is not clear to me whether the annotation of focus in Wolof requires features at the phrase level. If it does, then it may be difficult to accommodate this need, not because we want to ignore a fundamental property of the language but because of inherent limitations in the annotation scheme that also affect other languages.

I therefore suggest that you work out a proposal for the focus feature (taking the previous discussion of auxiliaries and subjects into account), including possible values as well as a proposal for which words should carry the features. Then we can see more clearly whether it is affected by the formal limitation or not.

chedio commented 5 years ago

Fine. Thanks. I can take some time and try to come up with such a proposal.

Just a short note regarding the issues you mentioned. The main idea behind raising this issue is that focus is systematically encoded morpho-syntactically in Wolof and that such information would be a feature on single words like moo, dafa and la, which respectively indicate that the constituent in focus is the subject, the verb or a complement. Complement means any constituent which is neither subject nor main verb.

Thus, the word marking focus (e.g. moo) would have a feature annotation like FocusType=Subj (since it conveys the information that the constituent in focus is the subject). Likewise, dafa would have the feature FocusType=Verb, since it indicates that the constituent in focus is the predicate. Finally, la would have the annotation FocusType=Compl to indicate that the constituent in focus is neither subject nor verb. Such a constituent can be e.g. object, complement clause, obl, etc.

Basically, such a feature would only take these three values (and I think this would generalize enough for any type of constituents that could potentially be in focus).
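The proposal above can be sketched as follows for the unsplit markers. The helper format_feats is hypothetical; it serialises a feature dictionary into a FEATS string with Name=Value pairs joined by "|" and sorted by feature name, as CoNLL-U expects. The FocusType values come from this comment; the Person and Number values are only an illustration for the 3SG forms discussed in the thread.

```python
def format_feats(feats):
    """Serialise a feature dict as a CoNLL-U FEATS string ("_" if empty)."""
    if not feats:
        return "_"
    # Feature names are sorted case-insensitively.
    pairs = sorted(feats.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)


# Proposed values: one FocusType per focus marker (3SG forms).
FOCUS_TYPE = {"moo": "Subj", "dafa": "Verb", "la": "Compl"}

for form, focus_type in FOCUS_TYPE.items():
    feats = {"Person": "3", "Number": "Sing", "FocusType": focus_type}
    print(form, format_feats(feats))
```

Because each value sits on a single word, this proposal stays within the current restriction that features attach to words rather than phrases.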

chedio commented 5 years ago

Maybe a better word for complement could be something like non-subject.

amir-zeldes commented 5 years ago

Hi all, I'm also for features to encode overtly marked information structure categories.

I would like to add that we also have a single-token focus marker in Coptic (xpos=CFOC), so we could use a focus feature, too, but the Coptic one does not mark which constituent is focused, so for us it would be more useful to have something like Focus=Yes.

dan-zeman commented 5 years ago

Oh this thread is growing rapidly. I am also for adding a feature for focus. Perhaps it should be FocusType, while other languages that only need a yes/no distinction would use Focus=Yes. I would preferably avoid one feature that can be Yes in some languages and multiple descriptive values in others.

As for the other issues discussed here, my impression (based only on this thread, with no previous knowledge of Wolof) is that Wolof is a pro-drop language and that na is an auxiliary that bears the verb-subject agreement features; it should be tagged AUX and attached as aux regardless of whether a full noun phrase is present or not. That would be in line with other pro-drop languages. The fact that agreement is marked only on the auxiliary and not on the lexical verb does not bother me; it is not unique to Wolof. But using AUX as nsubj in this context is against the guidelines, I'd say.
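A treebank maintainer who wants to find such cases could use a check along these lines. This is only a sketch: find_aux_subjects is a hypothetical helper, not part of the official UD validator, and the sample sentence encodes the disputed analysis of lekk na (with na as its own lemma, which is an assumption) purely to show what would be flagged.

```python
def find_aux_subjects(conllu_text):
    """Return (ID, FORM) pairs for words tagged AUX but attached as nsubj."""
    hits = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        upos, deprel = cols[3], cols[7]
        # Compare the universal part of the relation, ignoring subtypes.
        if upos == "AUX" and deprel.split(":")[0] == "nsubj":
            hits.append((cols[0], cols[1]))
    return hits


def row(*cols):
    return "\t".join(cols)


SAMPLE = "\n".join([
    "# text = lekk na",
    row("1", "lekk", "lekk", "VERB", "_", "_", "0", "root", "_", "_"),
    row("2", "na", "na", "AUX", "_", "_", "1", "nsubj", "_", "_"),
])
print(find_aux_subjects(SAMPLE))  # prints [('2', 'na')]
```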

coltekin commented 5 years ago

Just a note on marking focus: the issue also came up during the annotation of the Tagalog-TRG treebank (annotated by @stephsamson). Currently the Tagalog treebank marks focus in the MISC field with a Focus tag that takes a number of different values. I do not know the languages, but it may be a good idea to use a consistent feature/value set as much as possible.

dan-zeman commented 5 years ago

@coltekin, @stephsamson: I had Tagalog in mind when thinking about possible Topic/Focus=Yes, but I had a look at the data now and I think the way it is used there, it should actually be recoded as the Voice feature in the FEATS column.

I am writing it here because I believe that it is not necessary to consider these data points when proposing the feature for Wolof. If people want to further discuss the situation in Tagalog, I'd suggest that we create a separate issue.