fgaim / HornMorpho

Morphological analysis and generation of Amharic, Oromo, and Tigrinya
GNU General Public License v3.0

No FST Available for Tigrinya #1

Open conopt opened 6 years ago

conopt commented 6 years ago

I'm using the package to parse Tigrinya. I tried l3.anal("ti", word) but encountered this error, whereas l3.anal("am", word) works.

fgaim commented 6 years ago

Yes, the original project did not build a Tigrinya FST for analysis. I hope to implement it soon.

oyd11 commented 5 years ago

Hi, any update on that? I'm still getting this message: No analysis FST available for ትግርኛ!

fgaim commented 5 years ago

@oyd11, unfortunately, there are no updates on the Tigrinya FST yet. Despite my initial plan, building it has become less urgent because unsupervised/pseudo-morphological analysis has, in recent years, proven to be just as effective for downstream NLP applications such as MT. That said, I still anticipate some progress in this direction.

oyd11 commented 5 years ago

Thanks @fgaim ! Could I possibly try to do it? What's the effort estimation on that?

I wanted to run some word-rooting and possibly write a verb-conjugator web-app (sounds suitable, right?) and recalled some working examples from the longer manual.

Tell me - are there instructions somewhere on how to build the Tigrinya-FST? (Is it a technical task derived from the files present there?)

Also - if you know - is there some summary of Tigrinya conjugation patterns in the documentation? As a learner of Tigrinya and a hacker, I'll be glad to do something useful in this regard.

fgaim commented 5 years ago

@oyd11 it would be much appreciated if you can work on it. I might also be able to give you a hand at some point.

To begin with, I believe the existing Amharic FSTs could be a good reference. A bit of knowledge of Amharic might be needed to benefit from that work though. The FSTs are basically rules, so the task, in my opinion, is more about knowledge of the language itself.

As for documentation, except for the bits and pieces here and there, I have yet to come across a ratified and comprehensive reference for Tigrinya morphology (especially with a usable list of verb templates and patterns). Once again, Michael Gasser's papers on the topic are a good starting point in this direction.

Thanks for your efforts!

Meanwhile, I will reopen this issue.

oyd11 commented 5 years ago

Sounds good. I have some background working with FSTs (as part of Kaldi speech-recognition setups) and a basic understanding of Amharic grammar (I think my Tigrinya is better). What I don't get is what exists and what doesn't, since the Tigrinya-FST folder and some FST files are there (if you don't know either, we should just debug the issue).

Tell me - I see there's the other repository: https://github.com/hltdi/HornMorpho - and the two are not (technically) forks of each other, though I haven't compared them. Is one of them more up to date? (I've tried running both; the only difference I've noticed is that the error message says "Tigrinya" here and "ትግርኛ" on the hltdi copy.)

Indeed building a Corpus-Based (semi-automatic) morphological analysis - seems more robust, interesting, promising and scalable - especially since there's no real morphological description of Tigrinya anywhere (the closest are Mason and Melles's Grammars, and as you mention, some notes in Gasser's papers), but this has not been done for any language yet afaik.

fgaim commented 5 years ago

I agree, there isn't much literature on Tigrinya grammar. In addition to your list, there is a book by Amanuel Sahle that could be worth checking out if you haven't already.

You are right, for a standalone application the data-driven morphological analyzers are not yet replacements for their rule-based counterparts in terms of accuracy. However, when the target application is another task and the analyzer is used for segmentation or generation, unsupervised algorithms such as morfessor, or even simple sub-word operations such as BPE, perform as effectively or even better. I have observed this on various tasks such as text representation, classification (including POS), and machine translation.
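For readers unfamiliar with BPE, here is a minimal sketch of how the sub-word merges are learned and then applied to segment a word. It is a toy illustration, not the setup used in the experiments mentioned above; the function names (learn_bpe, merge_pair, segment) are made up for this example, and real toolkits add end-of-word markers and other refinements.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from an iterable of words (hypothetical helper)."""
    # Every word starts out as a sequence of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere in the vocabulary.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Trained on a corpus word list, segment(word, merges) splits unseen words into the frequent character sequences found during training. These often line up with affixes but carry no linguistic labels, which is the trade-off against a rule-based FST.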

Based on your experience it seems you are in a position to implement this missing piece towards a high quality morphological analysis of Tigrinya.

Thank you for bringing the hltdi repo to my attention. From a quick look, it appears to be more up to date in overall features, but that may not be the case for Tigrinya in particular. I will look into it in detail and maybe we can join forces and consolidate to just one repo.

oyd11 commented 5 years ago

Ah, Amanuel Sahle's "Sewasiw Tigrinya B'sefihu" might be the best resource. I haven't checked it out yet, since it's in Tigrinya. Browsing the presented tables seems useful, but I would need a native speaker to guide me through it and explain the nuances. Actually, producing a bilingual edition of it might be a cool, worthwhile project!

I guess practically I would probably have to ask Gasser. I see there's a requirement for parallel .fst and .cas files; I'll read the manual section about that to make sure I can ask the right question beforehand.

oyd11 commented 5 years ago

Yes! https://github.com/aalto-speech/morfessor - is exactly the direction I was thinking about; maybe integrating that is already a more worthwhile effort. Even well-resourced languages with non-trivial morphology (such as Finnish, Hebrew, Turkish..) do not have spell-checkers working properly because of morphological changes and affixes. An example from my (native) Hebrew: /raiti/ (I saw), /kšeraiti/ (when I saw), /kšereitixa/ (when I saw you), /lixšereitixa/ (having seen you) - Tigrinya is very similar to that. [ ראיתי כשראיתי כשראיתיך לכשראיתיך]

Spell checkers will mark some of them as 'red' (for example, the Google-Keyboard spell-checker), and 'worse': for some (more common) verbs, some of the forms will be in the dictionary, but since there's almost no morphology support right now, for other verbs most of these forms will be rejected. Obviously the way is to deduce these from a corpus, at least semi-unsupervised.

fgaim commented 5 years ago

These are great practical examples that motivate the need for a quality morphological analysis. In this regard, I think spell/grammar checkers perhaps benefit most from such a service among other applications. We always come across similar issues for languages with non-concatenative morphology.

I had worked on a morfessor-based solution for Tigrinya. Trained on a fairly large monolingual corpus, it worked pretty well for affix-based word inflections, comparable to human performance. However, that is not the case for the template-pattern structure of stems, which remains problematic, as shown in your example. Down the road, it would be nice to build a benchmark for comparing the data-driven vs. knowledge-driven (as in HornMorpho) approaches.

In any case, good luck with your pursuit and let me know if there are things I could help with.

oyd11 commented 5 years ago

Cool, is your morfessor Tigrinya setup/results public now? Maybe it is a better investment of effort.

Note that most of my example is concatenative (beyond the a/o vowel change), I think this is the main needed part for Spellchecking / language-aware-search, etc

(and I'm not sure Semitic languages really need a different framework, eg, given Bat-El's claims [ https://telaviv.academia.edu/OutiBatEl ] ), ie, the non-concatenative part changes - are more "high-level" components - for forming new vocabulary, it's more regular in Semitic, but does exist actually everywhere.

fgaim commented 5 years ago

The experiments on morfessor are at least from 2 years back, so I will have to do some digging and publish them in an independent repo, when I get time.

And thanks for sharing the link to Outi Bat-El's page, the claim makes sense. I believe the impact of "the degree of ubiquity" of vowel templates and non-concatenative morphology in Semitic languages varies depending on the target task. I think the root-template analysis may not make much of a difference to a large corpus-driven spellchecker but would be significantly beneficial to a POS model that has to be trained on small labeled data.

Is Tigrinya spell-checking one of your main targets now?

oyd11 commented 5 years ago

That would be cool and interesting to see, I might adapt that to Hebrew as well once I see what you did there...

My main target for now (this is a hobby side-project anyhow) would firstly be just a rough, morphology-agnostic "Frequency List", i.e., find the 200 most frequent words after rough stemming. I'm thinking of it as a language-learner-oriented task. Then maybe do a second-level frequency list for words that appear as predicates of the most frequent words (there's probably a name for that), or just for words that appear in the same sentence.
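Roughly what I have in mind, as a minimal sketch: strip at most one prefix and one suffix from each token, then count the resulting stems. The affix lists are placeholders that the caller has to supply (I'm not claiming any particular Tigrinya affix inventory here), and the function names rough_stem and frequency_list are just made up for illustration.

```python
from collections import Counter

# Latin and Ethiopic punctuation to trim from token edges.
PUNCTUATION = ".,!?;:\"'()[]።፡፣፤"


def rough_stem(word, prefixes, suffixes, min_stem=3):
    """Strip at most one prefix and one suffix, trying the longest affixes first."""
    for p in sorted(prefixes, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= min_stem:
            word = word[len(p):]
            break
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            word = word[:-len(s)]
            break
    return word


def frequency_list(text, prefixes, suffixes, top_n=200):
    """Return the `top_n` most frequent rough stems as (stem, count) pairs."""
    counts = Counter()
    for token in text.split():
        token = token.strip(PUNCTUATION)
        if token:
            counts[rough_stem(token, prefixes, suffixes)] += 1
    return counts.most_common(top_n)
```

The min_stem guard is there so that short words are not stripped down to nothing; the affix lists could later be replaced by the output of an unsupervised segmenter.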

More generally - I am thinking of a spell checker - as I was wondering about two issues: 1. I see people texting Tigrinya in Latin script a lot, even though they have OK keyboards (such as your keyboard; I was using it when I had just started learning, actually, it's great!). I'm wondering whether that might be related to not having a spell-checker. 2. The Tigrinya-language Wikipedia is not growing, even though there's a non-trivial, internet-connected, educated core of users. I'm also wondering whether that's related.

However - as the example of Hebrew shows - the Hebrew speech community do text in Hebrew script now, and did grow their own Wikipedia, without any morphology aware spell-checker.

Interested to hear your input on these; I'm trying to be pragmatic, so I think a robust Spell-Checker - is not a "low hanging fruit" .

oyd11 commented 5 years ago

btw: I've checked what's up with Hebrew - as I'm actually disappointed by the spellchecker I have on my Android phone (I mostly text in Hebrew; I currently don't have any other context in which to use written Hebrew) -

So - it turns out - there's a 'famous' software package that's apparently integrated "everywhere" (for example in Google-Docs) - it's called "hspell" and "hebstem" (part of the engine), it's open-source, "hand-written" state-machines, in C and Perl [ http://hspell.sourceforge.net , https://code.google.com/archive/p/hebstem/ ].

Apparently - in SMS/WhatsApp text fields in Chrome/mobile - it's not integrated yet; they seem to use some generic HMM-based spell-checker with little or no morphology in Hebrew (and a funny word predictor where you can just type nonsense pseudo-sentences). Very curious to see whether corpus-based spellcheckers for morphologically rich languages surpass these manual ones, so that writing such a state-machine for Tigrinya becomes redundant.

Hopefully these tools (like morfessor) can eventually aid "computer-aided-Descriptive-Grammar-writers" - as morphemes can be tagged and referenced.

fgaim commented 5 years ago

You have some nice insights and ideas here, which can serve both Hebrew and Tigrinya among other Semitic languages.

Indeed, many Tigrinya speakers tend to use Latin script when typing (especially on social media). This is a serious issue, in my opinion, as it affects the study of the language by making the text difficult for NLP/ML practitioners to process. For example, imagine trying to acquire data and develop a simple application such as sentiment analysis for the language. I have thought about this issue throughout the years and have also discussed it with other people, including users. As you suspected, the lack of a robust spell-checker is one of the contributing factors.

Other reasons that have a lot to do with it include: 1) users don't [mostly] get a Tigrinya IME configured out-of-the-box on their devices, 2) the [small] learning curve of the keyboard mapping still intimidates many people, and 3) the overhead of switching between Geez and the unavoidable Latin keyboard. We attempt to address these with the design of GeezIME, in particular in an upcoming (albeit delayed) release that will support word suggestions and auto-completion.

In relation to the above issue, a fun project would be to harvest a large amount of the Latin-typed texts, analyze spelling patterns, and learn to automatically convert them to proper text, making it useful for many applications.
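As a starting point, the non-learned baseline for such a converter is a fixed lookup table applied with greedy longest-match, sketched below. The three-entry table is a toy made up for illustration; a real system would need the full syllabary and would have to learn the many spelling variants people actually type, which is the interesting part of the project.

```python
# Toy table for illustration only, not a complete or authoritative mapping.
TOY_TABLE = {"se": "ሰ", "la": "ላ", "m": "ም"}


def transliterate(text, table):
    """Convert Latin-typed text to Ge'ez script by greedy longest-match lookup."""
    keys = sorted(table, key=len, reverse=True)  # try longer keys first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No table entry matches here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(transliterate("selam", TOY_TABLE))  # ሰላም
```

A learned model would replace the fixed table with mappings estimated from the harvested texts, ideally scoring alternative conversions in context rather than always taking the first match.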

In any case, with the growing amount of text in all languages, the corpus-driven approaches seem the most appealing and pragmatic solutions, be it for spell-checking or non-exact linguistic analysis such as morphology and dependency.

I am interested to see the outcome of your current effort along these lines. We can also discuss a potentially larger open-source project on the topic. Please email me at fitsum at geezlab.com as this discussion is drifting from the OP's original issue.

yosiasz commented 3 years ago

Willing to lend a hand with Tigrinya