MW Tokenization Issues in Sindhi

UniversalDependencies / docs

Universal Dependencies online documentation

http://universaldependencies.org/

Apache License 2.0


MW Tokenization Issues in Sindhi #1007

Closed muteeurahman closed 6 months ago

muteeurahman commented 11 months ago

With reference to "Insight into when to create a class of MWT vs not?" (https://github.com/UniversalDependencies/docs/issues/1006): together with @AngledLuffa, we are collaborating on building a UD data set for Sindhi. We are running into issues such as auxiliaries that carry negation and pronominal affixes, and we are looking at ways to assign POS tags to them. There are other constructions where nouns, adpositions, adverbs, and main verbs have pronominal affixes attached (two in some cases, i.e. subject and object suffixes). We would appreciate advice on how to treat these constructions: as MWTs, or by just keeping this information in morphological features? Three use cases are shown here for reference and discussion.

In sentence 1 above, ناھي clearly has the role of negation; it can be labeled as ADV or PART and may be considered a single-word expression. (That is what I think.)

In sentence 2 above, ناھي clearly has two roles: negation and auxiliary verb. Labeling is a bit confusing here. However, as a verb is required here, AUX can be considered the major role. Can this be considered an MWT with two words (نہ and آھي) with the different sub-labels ADV and AUX?

In sentence 3 above there is a complex pronominal suffix construction, where a main verb has two pronominal suffixes. So the token (ڏنومانس) clearly has three different roles: verb, first person pronoun, and third person pronoun, respectively. Can this be an MWT with VERB, 1P PRON (SUBJ), 3P PRON (OBJ)?

Suggestions are requested.

amir-zeldes commented 11 months ago

Regarding the negated copula, I think I would prefer to tokenize it apart into a negation and a copula, using MWTs, but if you don't want to there are some precedents for that too. Old Church Slavonic has basically the same case, with the Indo-European negation "ne" as a prefix in "нѣсть" (approx. [nʲesʲtʲ] < PIE *ne + esti). In UD Old Church Slavonic, this is treated as one token, with no MWT:

https://universal.grew.fr/?custom=6580b5ad0b772

This is then annotated with verbal features and POS (incl. AUX as necessary), and Polarity=Neg. Similarly, the Latin treebanks use negative lemmas like "nescio", "not know" (from ne+scio). If you care about representing negation separately from the verb though, nothing prevents making two tokens + MWT.

Regarding the second issue, I'm not really familiar with modern Indo-Aryan languages, but is the 1st person marker part of the old inflectional paradigm? And is the object a clitic? If so, I would consider tokenizing the object apart, but leaving the finite inflection as part of the verb token. One diagnostic might be to check if it's normal to also add an independent pronoun like "I": if that works, then the suffix may well be inflectional. If an object pronoun is easy to add on top of the suffix in the example, then this might be a case of object agreement; I know that in some languages this is clearly different from a clitic object pronoun (e.g. Modern Aramaic).

nschneid commented 11 months ago

Maybe @aryamanarora has thoughts

muteeurahman commented 11 months ago

@amir-zeldes Regarding negative auxiliaries/copulas, your opinion seems logical and things are becoming clear. Regarding the second issue, pronominal suffixation: yes, Sindhi, being MIA, has somehow preserved the OIA paradigms of pronominal suffixation on nouns, verbs, adverbs, and even postpositions. In the above case the object is a clitic, but it can also be another type of affix. Note that pronominal affixes on verbs may include subject affixation, object affixation, or both. In the above case, both the object and subject affixes are there. See the following examples, each meaning "I gave him a message".

1) نياپو ڏنومانس. (message give+Perfective+1P-subj+3P-obj)
2) ھن کي نياپو ڏنوم. (he+Obl case-marker message give+1P-subj)
3) مون ھن کي نياپو ڏنو. (I he case-marker message gave)

The first sentence is the same as the example above. The second is an alternative sentence where only the 1P (i.e. subject) pronominal suffix is attached to the verb "gave". The third is a simple sentence with no pronominal suffixation, where the subject, object, and verb stand in their own places (which may vary, as Sindhi has free word order). Interestingly (though not directly related to the above question) the object suffix can jump around from verb to case marker and vice versa. Keeping in mind the syntactic function of these suffixes/clitics and their dependency relations, I think treating these constructions as MWTs looks workable.

Stormur commented 11 months ago

From the snippets you show us I would also favour the analysis as negative-polarity auxiliary. This in any case keeps this feature in the functional part of the phrase. Comparing this with the Latin mentioned by @amir-zeldes...

This is then annotated with verbal features and POS (incl. AUX as necessary), and Polarity=Neg. Similarly, the Latin treebanks use negative lemmas like "nescio", "not know" (from ne+scio). If you care about representing negation separately from the verb though, nothing prevents making two tokens + MWT.

... the strong reason not to split forms such as nescio, nescis, nesciueris... is that while we can indeed identify a piece ne supplying the negative polarity, this piece (morph) has no independence synchronically in Latin. We see it appearing in many negative elements (nemo 'nobody' = ne + homo 'man', neque 'and not', nimirum 'doubtless' = ne + mirum 'wonder', etc.), and it is most probably the same as the negativising prefix in-. It is not even a clitic, since it attaches directly to roots or similar and nothing ever intervenes. So it belongs to morphology, and as such we signal it only through the morphological feature Polarity.

This might be what is happening here too, appearing on the auxiliary which seems to be the grammatical locus of the predicate. It might also be similar to negative forms e.g. in Czech, and those are also treated by means of Polarity, not by independent syntactic words. By the way, a negativiser should be PART, not ADV.

The same can very probably be said for the argument-agreeing elements on the verb.

I do not know if you are using it in your annotation, but you might consider adding VerbForm, and maybe this could also clarify what is happening here.

Interestingly (though not directly related to the above question) the object suffix can jump around from verb to case marker and vice versa.

Do you have some glossed examples?

muteeurahman commented 11 months ago

@Stormur Here are some examples: Sindhi is a free word order language where case markers are used to mark subjects, objects, possessiveness, etc. In a normal sentence without any pronominal suffixation case markers identify the object. In the following sentence مون (I) is the subject, ھن(he) is the object marked by the accusative case marker کي and there is no pronominal suffixation.

(i) مون ھن کي نياپو ڏنو
I he.Obl Acc-CM message give.perf
I gave him a message.

In sentence (ii) given below, the 1P pronominal suffix is attached to the verb and the free subject pronoun is dropped from the sentence.

(ii) ھن کي نياپو ڏنوم
he.Obl Acc-CM message give.perf.1P
I gave him a message.

In sentence (iii) given below we can see that the 1P subject and the 3P object, along with the case marker, are dropped and encoded on the verb with affixes.

(iii) نياپو ڏنومانس
message give.perf.1P.3P
I gave him a message.

In sentence (iv) below the 3P object affix is moved to the accusative case marker as an affix without affecting the sentence.

(iv) نياپو کيس ڏنوم
message Acc-CM.3P give.perf.1P
I gave him a message.

Finally in sentence (v) below we can see another construction where 1P is moved back in the oblique case pronoun and represents the subject while object is affixed to the accusative case marker.

(v) مون کيس نياپو ڏنو
I Acc-CM.3P message give.perf
I gave him a message.

Stormur commented 11 months ago

Thanks for taking your time to produce these examples, they made things clearer.

1P is moved back in the oblique case

So is مون actually non-nominative, as it were? When is the non-oblique form used? And why does the oblique one appear here?

Of all this analysis I am wondering: is the کي element not more an ADP than a "case marker"? I ask this from seeing the combination کيس, which looks like "person-inflected prepositions" of other languages.

My 2 cents.

If the initial question was if these pronominal affixes need to be split in the syntactic analysis, from what I understand my answer would be no.

Some markers may resemble free pronouns (but ھن != س), but they seem to act quite differently. As long as they appear bound to the verbal forms, I would continue annotating them as part of the morphology, so analysing them as agreements. In the end we will always have the information of, say, a first-person singular subject in the nucleus of the predicate, in one case encoded in a bound way and in the other as a "free" pronoun (which is a functional element all the while). Sindhi just seems to not require a redundant marking as in many European languages, e.g. a possible Italian Io gli ho dato un messaggio 'I gave him a message' (double first person markers, optional pronoun and mandatory verbal suffix).

A more particular case is کيس. Here I would see some reasons to have two elements, an ADP and a PRON, because we also observe the simple occurrence کي. And this would be the perfect example of "fusional" MWT.
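To make the "fusional" MWT idea concrete, here is a minimal Python sketch that renders sentence (v) in CoNLL-U with کيس split into an ADP and a PRON. It is only an illustration: the helper names, the expansion table, and the form chosen for the pronominal word (س) are assumptions, and lemmas, features, and dependency columns are deliberately left as "_".

```python
# Surface form discussed in the thread, defined once so the table and the
# example sentence are guaranteed to use the same string.
KHES = "کيس"

# Hypothetical expansion table: surface form -> (form, UPOS) syntactic words.
# The form of the pronominal word is an assumption made for illustration.
MWT_LEXICON = {KHES: [("کي", "ADP"), ("س", "PRON")]}


def row(idx, form, upos):
    """One CoNLL-U line: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC."""
    return "\t".join([idx, form, "_", upos] + ["_"] * 6)


def to_conllu(tokens, lexicon=MWT_LEXICON):
    """Emit CoNLL-U lines, adding an ID-range line for each multiword token."""
    lines = []
    next_id = 1
    for tok in tokens:
        parts = lexicon.get(tok)
        if parts is None:
            # Ordinary token: one orthographic token, one syntactic word.
            lines.append(row(str(next_id), tok, "_"))
            next_id += 1
        else:
            # Multiword token: range line first, then its syntactic words.
            last = next_id + len(parts) - 1
            lines.append(row(f"{next_id}-{last}", tok, "_"))
            for form, upos in parts:
                lines.append(row(str(next_id), form, upos))
                next_id += 1
    return lines


SENTENCE_V = ["مون", KHES, "نياپو", "ڏنو"]

if __name__ == "__main__":
    print("\n".join(to_conllu(SENTENCE_V)))
```

The surface token keeps its original spelling on the range line (2-3), while the two syntactic words get their own IDs and can later receive separate heads and relations.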

In sentence (iv) below the 3P object affix is moved to the accusative case marker as an affix without affecting the sentence.

I would not speak in dynamic terms of "moving", but rather of a "necessity of marking somewhere", according to how these elements are expressed.

muteeurahman commented 11 months ago

1P is moved back in the oblique case

So is مون actually non-nominative, as it were? When is the non-oblique form used? And why does the oblique one appear here?

Yes, مون is the first person singular oblique form and آئون (I) is the first person singular nominative. The oblique marks the subject here in the perfective aspect. In imperfective sentences, subjects appear in the nominative form, while in perfective sentences subjects appear in the oblique form. See the following two examples.

Imperfective example:
آئون ھن کي نياپو ڏيان ٿو.
I.Nom he.Obl CM-Acc message give.Imperf be.AUX.Sg
I give him a message.

Perfective example:
مون ھن کي نياپو ڏنو.
I.Obl he.Obl CM-Acc message give.Perf.Sg
I gave him a message.

Of all this analysis I am wondering: is the کي element not more an ADP than a "case marker"? I ask this from seeing the combination کيس, which looks like "person-inflected prepositions" of other languages.

Well, yes, کي marks the direct object (accusative case). The syntactic function of most (not all) adpositions is case marking in Sindhi. As case marker is not considered a part-of-speech, we consider it ADP in UD tags (and yes, we can call them person-inflected postpositions in Sindhi).

Some markers may resemble free pronouns (but ھن != س), but they seem to act quite differently.

Right! It seems that more R&D needs to be done here. I think this is more complex in Sindhi than it seems.

As long as they appear bound to the verbal forms, I would continue annotating them as part of the morphology, so analyzing them as agreements.

Well yes. In syntax analysis, this can be dealt with by using morphological features. However, what happens while defining the dependency relations is another question. Let us go through that process first then we may come up with a better understanding of this situation.

Sindhi just seems to not require a redundant marking as in many European languages, e.g. a possible Italian Io gli ho dato un messaggio 'I gave him a message' (double first person markers, optional pronoun and mandatory verbal suffix).

Yes, redundant markers are not there in Sindhi.

I would not speak in dynamic terms of "moving", but rather of a "necessity of marking somewhere", according to how these elements are expressed.

I am not good at linguistic terms :) so yes we can say it is the "necessity of marking" that either selects the verb or case marker (postposition) to attach the (subject or object) affix.

dan-zeman commented 11 months ago

There are two approaches that can be used in UD to deal with the pronominal suffixes:

1) To treat the suffixes as syntactic words. ڏنومانس would be treated as a multiword token, split into three syntactic words (the verb-participle ڏنو and the two pronouns). Evidence in favor of this approach could be that the pronominal suffix appears to be mutually exclusive with the full pronoun or with an argument expressed by a noun. Evidence against it seems to be that the suffixes have a different form from the full pronouns and that they are suffixes, i.e., nothing else can be inserted between the verb and the suffix – or can it?
2) To treat the suffixes as agreement suffixes. The verb would get Person and Number features (perhaps also Gender?) for the pronominal suffix representing the subject. If it also has a suffix that represents the object, layered features would be used, e.g., Person[obj] and Number[obj].

The approach with layered features is commonly used in UD for languages with polypersonal agreement, such as Basque. I would be slightly hesitant to use it in Sindhi because polypersonal agreement is not common in Indo-European languages; but it is definitely an option to consider.
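The two approaches can be compared side by side with a small Python sketch that builds the CoNLL-U lines for ڏنومانس under each analysis. This is a hedged illustration, not a proposed annotation: the helper names are made up, the forms given to the two pronominal words (م and س) are assumptions, only Person is shown (Number or Gender would be added the same way), and lemma, head, and relation columns are left as "_".

```python
def format_feats(feats):
    """Serialize a feature dict as a CoNLL-U FEATS string, sorted by name."""
    if not feats:
        return "_"
    items = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in items)


def word_line(idx, form, upos, feats=None):
    """Syntactic word: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC."""
    cols = [str(idx), form, "_", upos, "_", format_feats(feats),
            "_", "_", "_", "_"]
    return "\t".join(cols)


def mwt_line(start, end, form):
    """Multiword token range line; every column after FORM stays empty."""
    return "\t".join([f"{start}-{end}", form] + ["_"] * 8)


SURFACE = "ڏنومانس"

# Approach 1: one multiword token split into three syntactic words.
# The forms of words 2 and 3 are assumptions made for illustration.
option_mwt = [
    mwt_line(1, 3, SURFACE),
    word_line(1, "ڏنو", "VERB"),
    word_line(2, "م", "PRON", {"Person": "1"}),
    word_line(3, "س", "PRON", {"Person": "3"}),
]

# Approach 2: a single word whose suffixes are agreement, with the object
# marked by a layered feature.
option_feats = [
    word_line(1, SURFACE, "VERB", {"Person[obj]": "3", "Person": "1"}),
]

if __name__ == "__main__":
    print("\n".join(option_mwt))
    print()
    print("\n".join(option_feats))
```

Under the first option the pronouns become nodes that can carry their own dependency relations; under the second the tree has a single verb node and all argument information lives in FEATS.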

As for نه آهي or ناهي, I think it is AUX (a form of "to be", isn't it?). If it can be written as two words (which Google Translate suggests when I ask it to translate "He is not here"), then maybe we should always treat it as two words, i.e., split it into two syntactic words when we encounter it as one orthographic word (ناهي). In that case, the first word would be نه and it would be PART with Polarity=Neg. If, on the other hand, Google's suggestion is wrong and the negative copula should be spelled as one word, then Polarity=Neg would go with the AUX tag.
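The choice described above can be written down as a small rule. The following Python sketch is only an assumption-laden illustration: the function name and the `split` switch are hypothetical, and the forms are copied from this comment (note that the thread also shows the spelling ناھي, which a real implementation would need to normalize or list separately).

```python
NEG_COPULA = "ناهي"   # one orthographic word, as spelled in this comment
NEG_PARTICLE = "نه"
COPULA = "آهي"


def analyse(token, split=True):
    """Return (form, UPOS, FEATS) triples for one orthographic token.

    split=True  : two syntactic words, PART with Polarity=Neg plus AUX.
    split=False : one AUX word that carries Polarity=Neg itself.
    Tokens other than the negative copula are passed through untagged.
    """
    if token != NEG_COPULA:
        return [(token, "_", "_")]
    if split:
        return [(NEG_PARTICLE, "PART", "Polarity=Neg"), (COPULA, "AUX", "_")]
    return [(token, "AUX", "Polarity=Neg")]


if __name__ == "__main__":
    for mode in (True, False):
        print(mode, analyse(NEG_COPULA, split=mode))
```

Either way the negation is recorded exactly once: on the particle when the form is split, or on the auxiliary when it is kept whole.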

amir-zeldes commented 11 months ago

From the examples above I would have voted for separating the suffixes as pronoun tokens and putting them in MWTs, since they are clearly not part of the obligatory morphological marking of the verbs' paradigms. If the verb can still be 1st person singular without this suffix, then it is not part of a classic synthetic verb inflection, where I would expect morphological features like Person on the verb.

The fact that it is not identical to the pronoun in other positions does not seem important to me, this is also true of German dialects (and spoken standard German), where, for example, "we" is 'wir' preverbally, but "mə" post-verbally as an enclitic - both are viewed as pronouns, not part of the verb.

Stormur commented 10 months ago

The fact that it is not identical to the pronoun in other positions does not seem important to me, this is also true of German dialects (and spoken standard German), where, for example, "we" is 'wir' preverbally, but "mə" post-verbally as an enclitic - both are viewed as pronouns, not part of the verb.

On this I would just like to comment that, in those German variants where we observe something like hammer for standard haben wir,

the plural first-person pronoun is usually already mer or similar in all positions, e.g. Kölsch mer sinn dat = standard wir sehen das 'we see that';
we are in the presence of fully predictable additive morphophonology, in that a possible change w (/v/) > m is predictable from the ending -n of plural first-person verbs. For contrast, we have met mer = mit mir 'with me' and not *memmer.

This is to say that, fundamentally, a post-verbal mer is in fact identical to a pre-verbal possible wir (when this happens at all, but then you also have forms like haamwa in Berlin dialect) and they alternate in predictable ways constrained by syntax; also, the verb keeps a separate person marker. This does not seem to be the case between ھن and س- here and this goes very much in favour of treating them as part of morphology.

not part of the obligatory morphological marking of the verbs' paradigms. If the verb can still be 1st person singular without this suffix, then it is not part of a classic synthetic verb inflection, where I would expect morphological features like Person on the verb.

Again 2 cents.

I do not see this as too relevant. It is not obligatory and conditioned by other factors, but then we observe different ways of marking it. But then, I would agree slightly more on splitting if it comes out that the "base form" of the verb is a nominal form like a participle, and we are adding possessive suffixes to it. But then again, I would not split things like possessive suffixes in Hungarian (e.g. könyv 'book' > könyvem 'my book'). But then still, the treatment of "inflected adpositions" is tricky... just treat them as part of a pronoun's paradigm with some Case assigned? Does this happen with all adpositions?