UniversalDependencies / UD_Bambara-CRB

Bambara data.

Lemmatization of auxiliaries in Bambara #5

Open dan-zeman opened 3 years ago

dan-zeman commented 3 years ago

There seem to be several forms of each auxiliary in Bambara, which are not normalized in the lemma field. (I don't understand Bambara but some of the sentences have glosses and the clusters of similar forms usually have the same gloss, so I believe they are forms of the same auxiliary and should have the same lemma.)

Could their lemmas be normalized? Also, should the last one (ka) be treated as an auxiliary at all? If it is comparable to the infinitive markers in other languages, such as English to, then it should probably be tagged PART and attached as mark.

ftyers commented 3 years ago

I think normalising the lemmas shouldn't be a problem... but it could be that the texts are in different orthographies. If so, each lemma may simply be the one that would be used in that particular orthography, and then I'm not sure we'd want to normalise them. @KatyaAplonova do you have any thoughts on this?

KatyaAplonova commented 3 years ago

Fran is right, this is due to the different orthographies. Here are my suggestions for normalization:

ye, ma, bɛ, tɛ, ka

As far as the POS of ka is concerned, I would keep it as it is. For me, particles are words which modify the whole clause. This is not the case for ka. Moreover, it is connected with polarity, which is a property of auxiliaries.

dan-zeman commented 3 years ago

OK, thanks, Katya. I am in favor of doing the normalization even though the variation comes from different orthographies. It will help identify an auxiliary across the corpus. (It would be cleaner to do the same with all words but it would probably be much more work. The reason why I am now interested specifically in auxiliaries is that they are listed for each language so that the validator can verify them.)
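For illustration, the normalization discussed here could be sketched roughly as below. This is only a minimal sketch over CoNLL-U text, not the script actually used for the treebank. The function name, the toy rows, and the variant-to-lemma mapping (`bè` to `bɛ`, `tè` to `tɛ`) are assumptions made up for the example; only the target lemmas come from the list above.

```python
def normalize_aux_lemmas(conllu_text, lemma_map):
    """Rewrite the LEMMA column of AUX tokens according to lemma_map."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        # Token lines have exactly 10 tab-separated columns; comment and
        # blank lines do not, so they pass through unchanged.
        if len(cols) == 10 and cols[3] == "AUX":
            cols[2] = lemma_map.get(cols[2], cols[2])
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)


def toy_row(idx, form, lemma, upos):
    """Build a 10-column CoNLL-U token line with the other fields left empty."""
    return "\t".join([str(idx), form, lemma, upos] + ["_"] * 6)


# Hypothetical mapping from orthographic variants to normalized lemmas.
LEMMA_MAP = {"bè": "bɛ", "tè": "tɛ"}

# Toy input invented for this example, not taken from the treebank.
TOY = "\n".join([
    "# text = (toy example)",
    toy_row(1, "a", "a", "PRON"),
    toy_row(2, "bè", "bè", "AUX"),
    toy_row(3, "taa", "taa", "VERB"),
])

NORMALIZED = normalize_aux_lemmas(TOY, LEMMA_MAP)
```

Only the lemma changes; the word form keeps the spelling of its source text, so the original orthography stays visible in the FORM column.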

As for ka, I would say that auxiliaries apply to the whole clause, too, so this alone would not distinguish them from particles. (I'm not a big fan of particles but I'm trying to figure out what would lead to better cross-linguistic parallelism.)

In syèdennin yèlènna ka lenburusun yuguyugu (glossed "poussin monter INF agrume.en.général secouer"), there are two clauses connected with xcomp:

xcomp(yèlènna, yuguyugu)

Looking at the current annotation, the second clause is infinitival and its first word is ka "INF", so I think that other treebanks would do

mark(yuguyugu, ka)

here.

dan-zeman commented 3 years ago

Note for myself: query that lists all auxiliary forms in UD Bambara 2.7.
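The query itself is not reproduced here. As a rough offline equivalent, a listing of auxiliary forms per lemma could be sketched as follows; this is a hypothetical helper working on plain CoNLL-U text, and the sample rows are invented for the example rather than taken from UD Bambara.

```python
from collections import defaultdict


def aux_forms_by_lemma(conllu_text):
    """Map each AUX lemma to the set of word forms it occurs with."""
    forms = defaultdict(set)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if not cols[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        form, lemma, upos = cols[1], cols[2], cols[3]
        if upos == "AUX":
            forms[lemma].add(form)
    return dict(forms)


def sample_row(idx, form, lemma, upos):
    """Build a 10-column CoNLL-U token line with the other fields left empty."""
    return "\t".join([str(idx), form, lemma, upos] + ["_"] * 6)


# Invented sample: the same auxiliary spelled in two orthographies,
# with the lemma not yet normalized.
SAMPLE = "\n".join([
    "# text = (toy sentence 1)",
    sample_row(1, "a", "a", "PRON"),
    sample_row(2, "bè", "bè", "AUX"),
    sample_row(3, "taa", "taa", "VERB"),
    "",
    "# text = (toy sentence 2)",
    sample_row(1, "a", "a", "PRON"),
    sample_row(2, "bɛ", "bɛ", "AUX"),
    sample_row(3, "taa", "taa", "VERB"),
])

AUX_FORMS = aux_forms_by_lemma(SAMPLE)
```

On un-normalized data like this sample, each spelling shows up under its own lemma, which is the symptom the issue describes; after normalization both forms would be grouped under a single lemma.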

KatyaAplonova commented 3 years ago

OK! Should I normalize them, or will you do it?

Infinitive ka has many different functions. In the case of syèdennin yèlènna ka lenburusun yuguyugu it does look like English to, but it can also control negation in the second clause, which, to my mind, is too weird for a particle.

Sorry for the late reply!

dan-zeman commented 3 years ago

OK, thanks for the explanation of ka. I have taken care of the normalization. The validation report contains a few more unknown auxiliaries though. Some of them may simply be annotation errors but at least the following two occur repeatedly:

I don't know what the glosses mean.