UniversalDependencies / UD_Amharic-Inku


lemmatization with Amharic #2

Closed: yosiasz closed 2 weeks ago

yosiasz commented 2 weeks ago

greetings,

question about the following in Amharic

Amharic: እውነቱን ቢናገርስ። (iwnetun binageris), which word for word means "the truth he spoke what if."

እውነት (iwnet) = truth
እውነቱን (iwnetun) = the final ቱን makes it "the truth"
ቢናገርስ (binageris) = he spoke what if

The root verb is መናገር (menager) = to speak
ቢናገር (binager) = if he spoke
ቢናገርስ (binageris) = if he spoke, what if

The final ስ makes it the conditional "what if".
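A rough sketch of the segmentation described above, in Python. The function name `split_final_particle` is made up for illustration, and the only rule it encodes is the one from the glosses (a final ስ on ቢናገርስ). It is not a real Amharic analyzer.

```python
# Toy segmentation of the word-final particle, following the glosses above.
FINAL_PARTICLE = "ስ"  # the "what if" marker described in the post


def split_final_particle(word: str) -> tuple[str, str]:
    """Return (base, particle); particle is "" when the word does not end in it."""
    if word.endswith(FINAL_PARTICLE) and len(word) > len(FINAL_PARTICLE):
        return word[: -len(FINAL_PARTICLE)], FINAL_PARTICLE
    return word, ""


base, particle = split_final_particle("ቢናገርስ")
print(base, particle)

# Plain character stripping does not recover the noun: removing the two
# characters ቱን from እውነቱን leaves እውነ, not እውነት, because the final ት
# of the bare noun appears as ቱ in the suffixed form.
naive = "እውነቱን"[: -len("ቱን")]
print(naive == "እውነት")  # False
```

The second half hints at why stemming Amharic takes more than cutting characters off the end: the suffix changes the shape of the last character of the base. The sketch would also wrongly split any word that simply happens to end in ስ.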

Stemming is going to require a lot of work because Amharic is such a morphologically complex language, but I am ready for the challenge. So let's take the verb መናገር (menager), "to speak", which will have many flavors/permutations (male/female, younger/older, past, present, future) when used in sentences.

In English, getting the stem is pretty straightforward (I think).

"to change" becomes "I change", "you change", "he changes", etc.

Then you have "changer", "changing", "changed", etc., but the root stays pretty much intact. So for the "change" group of words (changer, changing, changed), is the lemma "chang" or "change"?

An answer to this might help me sort it out in Amharic. Is this a technical UD thing or a linguistic thing?
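To make the "chang" versus "change" question concrete, here is a small Python sketch contrasting a crude stemmer with a lemma lookup. The suffix list and the lookup table are hypothetical and hand-built for this one word family. They are not taken from any UD tool.

```python
# Toy contrast between stemming and lemmatization for the English "change" family.
SUFFIXES = ("ing", "ed", "er", "es", "s", "e")  # illustrative only, checked in order


def crude_stem(word: str) -> str:
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup from inflected form to dictionary (citation) form.
LEMMA_TABLE = {
    "change": "change",
    "changes": "change",
    "changed": "change",
    "changing": "change",
}


def lemma(word: str) -> str:
    """Return the dictionary form if known, else the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ("change", "changes", "changed", "changing", "changer"):
    print(w, crude_stem(w), lemma(w))
```

The stemmer yields the truncated string "chang" for every form, while the lookup yields the dictionary word "change" for the inflected forms. "changer" is deliberately left out of the table, on the assumption that a derived noun keeps its own lemma.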

Thanks!!

dseddah commented 2 weeks ago

Hi, I would really suggest following the strategy that was implemented for other Semitic treebanks (Hebrew, Arabic). From memory (I could be wrong, it's been a while), what they did was to be provided full morphological analysis, pick the right one in case of ambiguities, and use the root for the lemma. They were lucky to be able to rely on pre-existing morphological analyzers.

Another strategy, which we followed for a North African Arabic dialect with no resources whatsoever, was in the first phase not to try to decompose anything beyond obvious tokenization, and to provide French glosses instead of lemmas, with attached suffixes or prefixes. It was only in a second phase that we added more morphological analysis for nominal constructions. We'll do the verbs in the future (fingers crossed).

Good luck, Djamé

On 24 June 2024 at 17:22, ዮስያስ @.***> wrote:


yosiasz commented 2 weeks ago

Thanks so much! Crawl before walking is what you are saying, and I agree 100%.

Could you expand on the part below? I did not fully understand it.

what they did was to be provided full morphological analysis, pick the right one in case of ambiguities, and use the root for the lemma.

Were they provided with the morphological analysis by someone else, or did they produce it themselves? Could you please give an example (Hebrew or Arabic) of "use the root for the lemma"?

Thanks so much

nschneid commented 2 weeks ago

Lemmatization conventions differ by language and sometimes even from treebank to treebank within a language.

For verbs in Hebrew, I am aware of IAHLTwiki using the 3rd person masculine singular past form—not the consonantal root—as the lemma. This is the citation form that would normally serve as the head word in dictionaries. @amir-zeldes could elaborate in more detail on policies for Hebrew.

yosiasz commented 2 weeks ago

@nschneid just for the sake of exploration, the word for "dream" is similar in Hebrew, Arabic, and Amharic:

am - ህልም (hilm)
heb - חולם
ar - حلم

to dream (dreaming):
am - ማለም - unconjugated verb

What would the 3rd person masculine singular form be in Hebrew? I am trying to see if I can find some correlation, though I understand it can differ by language and treebank.

thanks

nschneid commented 2 weeks ago

Heb: the lemma of the verb חולם would be חלם ħalam
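For orientation, this is roughly where such a lemma ends up in a CoNLL-U file, sketched in Python. The ten tab-separated columns are the standard CoNLL-U fields. The token line itself is a hypothetical one-word example: only FORM and LEMMA come from the comment above, and UPOS, HEAD, and DEPREL are assumed here for illustration. It is not copied from IAHLTwiki or any other treebank.

```python
# The ten tab-separated columns of a CoNLL-U token line.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_token_line(line: str) -> dict[str, str]:
    """Split one CoNLL-U token line into a field-name -> value mapping."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


# Hypothetical single-token example: "_" marks columns left unspecified.
line = "\t".join(["1", "חולם", "חלם", "VERB", "_", "_", "0", "root", "_", "_"])
token = parse_token_line(line)
print(token["form"], token["lemma"])
```

Whatever convention a treebank picks (citation form, root, unconjugated verb), it is this third column that records it, so the choice needs to be applied consistently across every token.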

yosiasz commented 2 weeks ago

OK, so not לַחְלוֹם. In Amharic, for starters, I will go ahead and use the unconjugated verb as the root and adjust as needed per context, etc.

appreciate it!

dan-zeman commented 2 weeks ago

Since there is already one small Amharic treebank in UD, you should definitely have a look at it. This is not to say that everything they did must be correct just because they were first, but it is desirable that the annotations are compatible in the end. There is also Amharic UD documentation, which you should either adhere to or propose modifications to.

yosiasz commented 2 weeks ago

Thanks @dan-zeman, I have taken a look at it. It is a good treebank, and I will use it as a point of reference only. I am not sure I want to be tied down correcting issues in a treebank someone else created and delay my own work, if you understand what I mean. For example, in verbal features alone, that treebank mixes Amharic and Tigrinya a lot. So here is what I propose, if you agree: I create my treebank while referencing the official Amharic treebank, with the corrections implemented in the treebank I submit. The creators of the other treebank can then choose to fix the issues I point out in mine. How does that sound?

I will also propose modifications to the Amharic UD documentation.

yosiasz commented 2 weeks ago

@nschneid I just spent a few hours doing some research with my sister and, as in Hebrew, we ended up concluding that the 3rd person masculine form is indeed the lemma we will use, since it shows up in 80% of the verb conjugations we iterated over 😱

thanks!

megasser commented 1 week ago

HornMorpho can do stemming of Amharic: https://github.com/hltdi/HornMorpho Let me know if you need help with this (how to stem may not be obvious from the documentation).