UniversalDependencies / docs

Universal Dependencies online documentation
http://universaldependencies.org/
Apache License 2.0

Three questions on Setswana (Bantu) annotations #1008

Closed tanjagaustad closed 6 months ago

tanjagaustad commented 10 months ago

In December, we did a mini-project to translate and annotate the 20 CAIRO sentences for Setswana. Before sharing and uploading the sentences, we have three questions that we hope people with more experience in UD can help answer.

1 AUX

Currently, AUX is used in the Setswana annotations for relations involving auxiliary verbs, for TAME (Tense, Aspect, Modality, Evidentiality) morphemes and for the construction go (SC15) + infinitive. Is this the generic way of annotating these in Bantu languages? What about negative morphemes (ga, a, sa and se), the future tense morpheme (tla), the progressive sa or the potential ka?

2 Copulatives

A copula is the relation of a function word used to link a subject to a nonverbal predicate. In Setswana three types of copulative verbs are distinguished: identifying copulative, describing copulative and associative copulative. These copulative verbs are now tagged as COP with the Noun being the Root of the sentence (parallel to English). Is this the correct interpretation of copulatives? How is this treated in other Bantu languages?

e.g. A Iguazu ke_AUX_cop naga_Noun_Root e kgolo kgotsa ke e nnye? 'Is Iguazu a big or a small country?'
e.g. Ga ba na_VCop_cop kakanyo_Noun_Root epe gore e kwadilwe ke mang. 'They have no idea who wrote it.'

3 Interrogative Particle

In Setswana there is an interrogative particle 'a' added at the beginning of a sentence to change an indicative sentence to an interrogative one. We are currently not sure what relation type should be used to tag the relation between the root of the sentence and the interrogative particle.

e.g. A o batla go tsamaya? 'Do you want to go?'
e.g. A Iguazu ke naga e kgolo kgotsa ke e nnye? 'Is Iguazu a big or a small country?'

sylvainkahane commented 10 months ago

1 AUX Currently, AUX is used in the Setswana annotations for relations involving auxiliary verbs, for TAME (Tense Aspect, Modality, Evidentiality) morphemes and for the construction go (SC15) + infinitive. Is this the generic way of annotating these in Bantu languages?

Yes for TAMEs. Not for go + infinitive a priori. But it depends on how far this construction is from TAME constructions.

What about negative morphemes (ga, a, sa and se),

Again, it depends on their distribution. They can be ADV, AUX, or PART. For instance, for Naija, a pidgincreole of English, we decided to annotate no and never as AUXs, because they behave like TAMEs and not like adverbs in this language.

the future tense morpheme (tla), the progressive sa or the potential ka?

Why don't you include them in TAMEs? Do they have a different syntactic behavior?

dan-zeman commented 10 months ago

I cannot comment on how anything is done in other Bantu languages because, unfortunately, we still do not have a single Bantu language in the official UD releases (although several people expressed interest in adding such languages and asked for corresponding repositories to be created).

Question particles have been the topic of several previous issues (#178 is probably the oldest of them) and I am afraid the discussion is still not properly reflected in the guidelines. I would probably tag them PART and attach them to the main predicate using the advmod relation, possibly subtyped advmod:que, as is currently done in Hungarian.

Copula(s) and auxiliaries have to be registered for each language so that the official validator accepts them. You can try it here – if you cannot identify a suitable function of the auxiliary in the registration form, it might be a signal that it probably should not be an AUX in UD. On the other hand, being able to identify the function does not necessarily mean it is an auxiliary, as there might be additional tests specific to the syntax of the language (especially modals are not auxiliaries in all languages).

As for copula, the rule of thumb is that normally we expect at most one lemma to serve as copula in the given language; but there are several legitimate reasons to deviate from this rule, so if documented properly under "Deficient paradigm", the registration form will allow multiple copulas. Besides the documentation of the cop relation, see also this page.

tanjagaustad commented 10 months ago

Thank you for the comments and pointers so far. This is definitely helpful as we have little experience in UD (yet). I will discuss it with my collaborators and will post an update soon.

Stormur commented 10 months ago

Hi! It is nice to see more Bantu languages worked on.

2 Copulatives A copula is the relation of a function word used to link a subject to a nonverbal predicate. In Setswana three types of copulative verbs are distinguished: identifying copulative, describing copulative and associative copulative. These copulative verbs are now tagged as COP with the Noun being the Root of the sentence (parallel to English). Is this the correct interpretation of copulatives? How is this treated in other Bantu languages? e.g. A Iguazu ke_AUX_cop naga_Noun_Root e kgolo kgotsa ke e nnye? Is Iguazu a big or a small country? e.g. Ga ba na_VCop_cop kakanyo_Noun_Root epe gore e kwadilwe ke mang. They have no idea who wrote it.

Looks good to me.

3 Interrogative Particle In Setswana there is an interrogative particle 'a' added at the beginning of a sentence to change an indicative sentence to an interrogative one. We are currently not sure what relation type should be used to tag the relation between the root of the sentence and the interrogative particle. e.g. A o batla go tsamaya? Do you want to go? e.g. A Iguazu ke naga e kgolo kgotsa ke e nnye? Is Iguazu a big or a small country?

I would also go for PART for these elements, but I would rather link them to the root as discourse (as it is currently done in Latin with ne, for example). I think it would also be useful to tag them with PartType=Int. The reasoning behind this is that such particles do not modify the predication, at least not in the way a manner adverb does; their function belongs more to a pragmatic level, and therefore discourse. In general, I think it is better not to further overload advmod.

1 AUX Currently, AUX is used in the Setswana annotations for relations involving auxiliary verbs, for TAME (Tense Aspect, Modality, Evidentiality) morphemes and for the construction go (SC15) + infinitive. Is this the generic way of annotating these in Bantu languages? What about negative morphemes (ga, a, sa and se), the future tense morpheme (tla), the progressive sa or the potential ka?

Since AUX is (at least originally, I do not find guidelines particularly enlightening on the extension to non-verbal elements) meant as a functional counterpart to VERB, if these particles do not show some kind of verbal morphosyntax, then I would veer towards PART (and surely not ADV if they are merely functional; also see point above). This is already canonically the case for negativisers in UD (e.g. Latin non). Then, in the case of TAME carriers, I think there is no problem in making them depend via the aux relation (...right, @dan-zeman ? I am not used to this).

I also agree that if go + infinitive is not a TAME construction, a different treatment than AUX/aux is needed (acknowledging that UD seems to have grey zones with regard to verb serialisation).

dan-zeman commented 10 months ago

Since AUX is (at least originally, I do not find guidelines particularly enlightening on the extension to non-verbal elements) meant as a functional counterpart to VERB, if these particles do not show some kind of verbal morphosyntax, then I would veer towards PART (and surely not ADV if they are merely functional; also see point above). This is already canonically the case for negativisers in UD (e.g. Latin non). Then, in the case of TAME carriers, I think there is no problem in making them depend via the aux relation (...right, @dan-zeman ? I am not used to this).

No. aux (relation) implies AUX (UPOS tag). It is a one-way implication.

The extension of AUX from auxiliary verbs to non-verbal particles occurred as part of the transition from UD v1 to v2 guidelines in 2016. The relevant section says:

  1. The use of AUX is extended from auxiliary verbs in a narrow sense to also include copula verbs and nonverbal TAME particles (tense, aspect, mood, evidentiality, and, sometimes, voice or polarity particles).
  2. The use of PART is restricted to a small set of words that must be listed in the language-specific documentation.

The borderline of auxiliaries remains fuzzy in the case of negation. In English, the auxiliary do may be needed to negate a clause (as in I do not speak Tswana), but not is not considered an auxiliary; it is tagged PART and attached via advmod. However, since negation auxiliaries as such are permitted, it may not be clear in other languages whether the word is closer to do or to not in English.
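The one-way implication above (a word attached via aux must be tagged AUX, while an AUX word may carry other relations such as cop) can be sketched as a small check over CoNLL-U rows. This is only an illustration, not the official validator: the helper names `parse_rows` and `aux_violations` are hypothetical, and the annotation of the English clause follows the description in this comment but is written by hand, not copied from a treebank.

```python
# Sketch of the one-way implication: deprel "aux" requires UPOS "AUX".
# The converse is deliberately NOT enforced (an AUX word may be attached as cop).
# The annotation below is hand-written for illustration, not taken from a treebank.

CONLLU = """\
1\tI\tI\tPRON\t_\t_\t4\tnsubj\t_\t_
2\tdo\tdo\tAUX\t_\t_\t4\taux\t_\t_
3\tnot\tnot\tPART\t_\tPolarity=Neg\t4\tadvmod\t_\t_
4\tspeak\tspeak\tVERB\t_\t_\t0\troot\t_\t_
5\tTswana\tTswana\tPROPN\t_\t_\t4\tobj\t_\t_
"""


def parse_rows(text):
    """Read 10-column CoNLL-U token lines into small dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        rows.append({"id": cols[0], "form": cols[1],
                     "upos": cols[3], "deprel": cols[7]})
    return rows


def aux_violations(rows):
    """Return the rows attached via aux (or a subtype) whose UPOS is not AUX."""
    bad = []
    for row in rows:
        base_relation = row["deprel"].split(":")[0]  # "aux:pass" -> "aux"
        if base_relation == "aux" and row["upos"] != "AUX":
            bad.append(row)
    return bad


if __name__ == "__main__":
    for row in aux_violations(parse_rows(CONLLU)):
        print(f"token {row['id']} ({row['form']}): aux requires AUX, found {row['upos']}")
```

With the annotation shown, do is AUX/aux and not is PART/advmod, so nothing is reported. Retagging do as PART while keeping the aux relation would be flagged.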

Stormur commented 10 months ago

All right. Thanks for brushing me up again on the subject!

At the same time, though, cop can be used with DET or PRON, beyond AUX. I feel this creates some confusion (at least to me).

With what is said above, then there should be no discussion that those Setswana particles have to be tagged as AUX? Even if, to be honest, I am not a fan of these constrained POS-deprel couples, as they seem to force non-optimal choices.

However, since negation auxiliaries as such are permitted, it may not be clear in other languages whether the word is closer to do or to not in English.

Does the verbality of AUX not come into the picture again, here? Maybe I am simplifying, but if one sees that the negativiser has verbal morphosyntax, it is AUX (with Polarity=Neg); else it is regularly PART.

Then, if I understand well, the problem is that if we do have non-verbal auxiliaries, then why shouldn't a non-verbal negativiser be one, too? But would this not be solved simply by reverting to AUX as only verbal (restoring symmetry), and allowing for the aux relation to also target (TAME) particles? And then yes, aux could be sensibly extended to negative particles, too (possibly as aux:neg).

I mean, as always the important thing here is the relation. Then it is interesting to see by which word classes it is realised, but this extension of AUX obliterates this variation.

sylvainkahane commented 10 months ago

We do syntactic annotation. I think it is important that the POS correspond more or less to distributional classes. That is why modal verbs (can, must …) are annotated AUX in English: they behave like the TAME auxiliaries (will, have …) and the copula be, with the same position in interrogative sentences and the same position relative to negation or adverbs. In Romance languages, it wouldn't make sense to put modal verbs in AUX (or it would be a purely semantic choice, made to be parallel to English), because they behave as plain verbs.

For the POS of negative items, we must rely on their distributional properties. For instance in French, the negation pas more or less behaves as other adverbs (plus 'no longer', jamais 'never', but also encore 'again', etc.) and we put it in ADV. In English, not has a very special behavior because it needs to cliticize to modals, which justifies not to put it in ADV but in PART. In other languages, the negation behaves like English don't and can be put in AUX. This is the case in Naija, where no and never (borrowed from English) behave as TAME particles and have been put in AUX.

Stormur commented 10 months ago

Annotation is syntactic, but annotation of parts of speech cannot be fully and only syntactic/distributional: else we would have no need for part-of-speech tags and we could keep only dependency relations.

I do not get how modal verbs behave differently in Romance languages than in English, apart from the structural differences between the two language families: to me they seem the same, or better, fully comparable. In Romance languages they also distribute like have or be, from the behaviour of clitics you see that they form a greater unit with the lexical verb, they inflect while the lexical verb does not... so in this sense I would see no problem in annotating them as AUX in Romance languages, too (and some treebanks already do this), and for purely (morpho)syntactic reasons. But semantics also has its importance in determining what we would prefer as an annotation.

Anyway, in both cases these words behave verbally, so there is no issue in assigning them to the VERB/AUX class. The problem arises with particles like these in Setswana. In my opinion, the extension of AUX to non-verbal elements is a backprojection at the POS-level of the aux relation that they surely entertain, but it is not helpful at all.

In English, not has a very special behavior because it needs to cliticize to modals, which justifies not to put it in ADV but in PART.

I do not see the implication. Cliticisation is mostly determined by phonology, and in the case of English it just happens to be often marked in the orthography, while for example in Italian it is not, but non is by all means clitic. This does not make it more or less adverbial.

In other languages, the negation behaves like English don't and can be put in AUX. This is the case in Naija, where no and never (borrowed from English) behave as TAME particles and have been put in AUX.

For the specific case of English, I would say that don't has to be analysed as a multi-word token, exactly for the reason that not is clitic. Then for sure not (non-verbal) is PART and do (verbal) is AUX. But there are indeed negative verbs used as auxiliaries around the world, for example ei in Finnish. (And there are also lexical ones like Latin ignoro 'I don't know'.)

For instance in French, the negation pas more or less behaves as other adverbs (plus 'no longer', jamais 'never', but also encore 'again', etc.) and we put it in ADV.

But does it also behave like merveilleusement?

sylvainkahane commented 10 months ago

We should stay focused on Setswana. I just said that negation can belong to different POS according to its distributional properties. There are languages where the negation is AUX, ADV, or PART (or is an inflectional morpheme). Therefore, to answer @tanjagaustad about the negative morphemes ga, a, sa and se in Setswana, we need to know how they behave syntactically.

Stormur commented 10 months ago

I think that a bit of semantics is needed, though, else we risk circularity: how do we define ADV/PART/AUX in Setswana in the first place?

tanjagaustad commented 9 months ago

Thanks everyone for your input - much appreciated.

For the interrogative particle we have decided to go with the approach used in e.g. Latin, annotating the a as a PART with a discourse relation with the root of the sentence.
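As a rough illustration of this decision, the sketch below builds a partial CoNLL-U fragment for the second example sentence from the original question. Only the analyses stated in this thread are filled in: a as PART attached to the root via discourse, ke as AUX with the cop relation, and naga as the NOUN root. The PartType=Int feature comes from Stormur's suggestion and is shown as optional, not as part of the stated decision. All other fields are left as "_" because the thread does not specify them, so the fragment is not meant to pass validation as it stands. The names `TOKENS`, `KNOWN`, `row` and `build_fragment` are hypothetical.

```python
# Partial CoNLL-U sketch for "A Iguazu ke naga e kgolo kgotsa ke e nnye?"
# Only analyses stated in the thread are filled in; "_" marks fields left open.

TOKENS = ["A", "Iguazu", "ke", "naga", "e", "kgolo", "kgotsa", "ke", "e", "nnye", "?"]
ROOT_ID = 4  # naga, the nominal predicate, is the root of the sentence

# token id -> (UPOS, FEATS, HEAD, DEPREL)
KNOWN = {
    # interrogative particle: PART, linked to the root as discourse;
    # PartType=Int is an optional feature suggested earlier in the thread
    1: ("PART", "PartType=Int", ROOT_ID, "discourse"),
    # copula: tagged AUX and attached to the nominal predicate via cop
    3: ("AUX", "_", ROOT_ID, "cop"),
    # nominal predicate as root
    4: ("NOUN", "_", 0, "root"),
}


def row(token_id, form):
    """Format one token as a 10-column CoNLL-U line."""
    upos, feats, head, deprel = KNOWN.get(token_id, ("_", "_", "_", "_"))
    # columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return "\t".join([str(token_id), form, "_", upos, "_",
                      feats, str(head), deprel, "_", "_"])


def build_fragment(tokens):
    """Join all token rows into one CoNLL-U sentence body."""
    return "\n".join(row(i, form) for i, form in enumerate(tokens, start=1))


if __name__ == "__main__":
    print(build_fragment(TOKENS))
```

Under the alternative proposed earlier in the thread, the first row would instead carry advmod (possibly subtyped advmod:que) in the DEPREL column, with the same head.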

@dan-zeman Thanks for the pointers re COP and AUX:

Copula(s) and auxiliaries have to be registered for each language so that the official validator accepts them.

This definitely helped in the discussion on how to annotate them.

Re the rest of the discussion and for some background: Historically SA Bantu languages are either written conjunctively (e.g. isiZulu, isiXhosa) or disjunctively (e.g. Setswana, Sepedi). As a consequence, what would be one orthographic token in e.g. isiZulu will be several orthographic tokens in Setswana. Especially for the verbs, this means that in Setswana we find a lot of orthographic tokens preceding the verb that in traditional linguistics are seen more as morphemes than "proper" words. This includes a lot of TAME morphemes/words. In those cases, it is less straightforward what POS and/or relation to attribute. We are currently writing an article discussing these issues in more detail. Also, I expect that once we run our current annotations through the validators, a few issues might pop up.

All that said, it definitely helped to hear how different languages treat e.g. negation and modals in UD. Judging from the discussion, it also seems that not all issues have yet been decided. So I guess we will try and add our 5 cents to it as we go along.