tenzin3 opened 1 year ago
In the reference-level explanation, please add a diagram to show how the RDR rules will be integrated into botok. We need one diagram for the creation of the rules, and another diagram illustrating the integration.
Please add an example where 3 or more syllables need to be adjusted. Would you tag them as NNN, CNC?
When describing the integration in botok, please describe how you will add token attributes such as lemma and POS tags.
In the "parts of the system affected" section, please clarify the directory in botok where you will add the new set of rules. Please include details of where you will put the RDR tagger and how it will be made accessible from pybo. Ideally the rule generator should be its own Python module which gets imported in pybo.
Please include a description of how human annotators will be able to check the adjustment rules generated by your script, edit/improve them, and include the edited version in the botok resources.
@kaldan007, @10zinten and @tenzin3, what do you think about adding a token attribute for the syllable structure tags in botok? I think it will help avoid conflicts with POS tags and will allow this attribute to be used in CQL queries. This type of word attribute is usually called "syllable count" or "syllable structure". In our case I think that syllable_structure is more appropriate, since we describe affixed particles too, so it's not just the count.
Work Planning
Details
Table of Contents
- [Housekeeping](#housekeeping)
- [Named Concepts](#named-concepts)
- [Summary](#summary)
- [Reference-Level Explanation](#reference-level-explanation)
- [Alternatives](#alternatives)
  * [Rationale](#rationale)
- [Drawbacks](#drawbacks)
- [Useful References](#useful-references)
- [Unresolved questions](#unresolved-questions)
- [Parts of the system affected](#parts-of-the-system-affected)
- [Future possibilities](#future-possibilities)
- [Infrastructure](#infrastructure)
- [Testing](#testing)
- [Documentation](#documentation)
- [Version History](#version-history)
- [Recordings](#recordings)
- [Work Phases](#work-phases)

Housekeeping
Named Concepts
- **Botok:** rule-based Tibetan tokenizer
- **Pybo:** library of NLP tools for Tibetan, including Botok
- **Affixed particles:** syllables merged with the previous syllable (e.g. [ངའི]་[ཨ་མས]་བཤད་བྱུང་།)
- **POS:** Part of Speech
- **RDRPOSTagger:** Ripple Down Rules tagger
- **CQL:** Corpus Query Language (https://www.sketchengine.eu/documentation/corpus-querying/)
- **CQLR:** token adjustment rules using the CQL syntax (see pybo)
- **HFR:** CQLR rules expressed in Tibetan script and in a user-friendly format (see pybo)
- **Gold Corpus:** a carefully annotated or manually labeled collection of texts or language data that serves as a high-quality reference or benchmark for evaluating and training computational models or studying linguistic phenomena

Summary
*There are currently some words that are not segmented properly by the max-match algorithm implemented in botok. Training an RDR model on that max-match output will therefore give us additional adjustment rules (.RDR) that can be converted to CQL and added to botok.*

Reference-Level Explanation
**Fig: Training the RDR model on the output of botok (max match)**

![image](https://github.com/OpenPecha/Requests/assets/52460417/56fdd5f1-4b8a-4fe6-a1a9-3e79b2d2afee)

**Some important notes**

*1. The number of syllables in the Gold Corpus and in the max-match output must be the same before Tagger.py is run.*

*2. The dictionary generated from the RDR model (word and tag label) should be included in the word lexicon with its attributes (lemma, pos, ...), because some RDR rules compare against a tag value in their condition, i.e.*

*if object.previousTag == "N" and object.Word == "ལ་ལ་" : object.conclusion = "P"*

**There were conditions with tag-value comparisons in the syllable-based RDR, but if the word-based RDR does not produce any rules with tag-value comparisons, I don't see the need to add a new RDR_Label column to the word lexicon.**

**Fig: Working of botok after integration**

![image](https://github.com/OpenPecha/Requests/assets/52460417/1f09c038-2287-4502-a698-81fa10fe962d)

*uncompound_lexicon.tsv is just an example; the particular word can come from ancient.tsv as well.*

**Tag Explanation**

1. P: Perfect (the current word is correctly segmented).
2. N: Start of a new word.
3. C: Continuation of a word.

- *If a word is not tagged P, each of its syllables is marked N or C (i.e. if a word has 3 syllables, it gets three N or C tags).*
- *Using the tags above, the following holds:*
- *Number of words in the Gold Corpus = number of N tags + number of P tags (in the tagged file made from the max-match output).*

**Some examples**

1. String: ལ་ལ་ལ་ལ་ལ་
   Max match output: [ལ་ལ་ལ་][ལ་ལ་]
   Gold Corpus: [ལ་ལ་] [ལ་] [ལ་ལ་]
   Tagged output: ལ་ལ་ལ་/NCN ལ་ལ་/P
   Possible RDR rules:
   if object.word == "ལ་ལ་ལ་" and object.nextWord == "ལ་ལ་" : object.conclusion = "NCN"
   if object.previousWord == "ལ་ལ་ལ་" and object.word == "ལ་ལ་" : object.conclusion = "P"
   CQL rule: if [text_cleaned='ལ་ལ་ལ་'][text_cleaned='ལ་ལ་'] then [ལ་ལ་] [ལ་] [ལ་ལ་]

2. String: ལ་ལ་ལ་ལ་ལ
   Max match output: [ལ་ལ་][ལ་ལ] [ལ་]
   Gold Corpus: [ལ་ལ་] [ལ་] [ལ་ལ]
   Tagged output: ལ་ལ་/P ལ་ལ/NN ལ་/C

**Steps Explanation**

Gold Corpus: [ལ་] [ལ་ལ་] [ལ་] [ལ་བ་] [ཡོད] [།] (here the words are actually separated by a single space).
Preprocessing output: ལ་ལ་ལ་ལ་ལ་བ་ཡོད། (done on the gold corpus before sending it to max match).
Max match output: [ལ་ལ་] [ལ་ལ་] [ལ་བ་] [ཡོད] [།]
Tagged output: ལ་ལ་/NN ལ་ལ་/CN ལ་བ་/P ཡོད/P །/P (generated on the max-match output according to the gold corpus).
Adjustment rule: if [ལ་ལ་] [ལ་ལ་] then [ལ་] [ལ་ལ་] [ལ་]
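The tagging step above can be sketched in a few lines. This is not existing botok/pybo code: the function names and the naive tsek-based syllable splitter are assumptions, and note 1 above (equal syllable counts) is taken as a precondition.

```python
# Minimal sketch of the Tagger.py step: assign P / N / C adjustment labels to the
# max-match output by comparing it against the gold segmentation.
# Illustrative only; helper names are not existing botok/pybo APIs.

def syllables(word):
    """Split a word on the tsek, keeping the tsek with each syllable."""
    parts = word.split("་")
    syls = [p + "་" for p in parts[:-1]]
    if parts[-1]:
        syls.append(parts[-1])
    return syls

def tag_adjustments(max_match_words, gold_words):
    """Return (word, label) pairs: 'P' if the max-match word spans exactly one
    gold word, otherwise one letter per syllable (N = starts a gold word,
    C = continues one). Assumes both inputs cover the same syllables (note 1)."""
    # Syllable positions where a gold word starts, plus the end of the text.
    boundaries, pos = set(), 0
    for w in gold_words:
        boundaries.add(pos)
        pos += len(syllables(w))
    boundaries.add(pos)

    tagged, pos = [], 0
    for w in max_match_words:
        start, end = pos, pos + len(syllables(w))
        correctly_segmented = (
            start in boundaries and end in boundaries
            and not any(start < b < end for b in boundaries)
        )
        if correctly_segmented:
            tagged.append((w, "P"))
        else:
            label = "".join(
                "N" if start + i in boundaries else "C"
                for i in range(end - start)
            )
            tagged.append((w, label))
        pos = end
    return tagged

# Reproduces the Steps Explanation above:
print(tag_adjustments(
    ["ལ་ལ་", "ལ་ལ་", "ལ་བ་", "ཡོད", "།"],
    ["ལ་", "ལ་ལ་", "ལ་", "ལ་བ་", "ཡོད", "།"],
))
# -> [('ལ་ལ་', 'NN'), ('ལ་ལ་', 'CN'), ('ལ་བ་', 'P'), ('ཡོད', 'P'), ('།', 'P')]
```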
**Improving the RDR model rules with human annotators' assistance**

1. Improving with the Gold Corpus
   i) Combine all the gold corpus files into one.
   ii) Preprocess it and get the max-match output from botok.
   iii) Train the RDR model and get the dictionary (.DICT) and the model (.RDR).
   iv) Tag the same max-match output with the trained RDR model and produce the output without tags.
   v) In the output file, the changes made by RDR after max match are highlighted or otherwise marked.
   vi) Human annotators check only the highlighted words to see whether RDR adjusted them correctly; if not, they correct them by splitting or merging syllables so that words end up separated by spaces.
   vii) Tag the output based on the latest annotator-corrected gold corpus, train the RDR model again and update it.

2. Improving with ordinary Tibetan-language text files
   i) Preprocess the files and get the max-match output from botok.
   ii) Tag the files with the already trained RDR model and produce the output without tags.
   iii) In the output file, the changes made by RDR after max match are highlighted or otherwise marked.
   iv) Human annotators check only the highlighted words to see whether RDR adjusted them correctly; if not, they correct them by splitting or merging syllables so that words end up separated by spaces.
   v) Tag the output based on the latest annotator-corrected gold corpus and update the RDR model again.

*After the last step of either workflow, the RDR rules are converted to CQL and added to botok.*
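As a rough illustration of that RDR-to-CQL conversion, the sketch below handles only the simple rule shape used in the examples above (word + nextWord + conclusion) and emits the illustrative "if PATTERN then RESULT" notation from those examples; the exact CQLR/HFR syntax expected by pybo (rdr_basis.tsv) still has to be confirmed, and rules keyed on previous words or tag values are not covered here.

```python
# Sketch of RDR_2_CQL for the simple rule shape used in the examples above:
#   if object.word == W and object.nextWord == X : object.conclusion = LABEL
# The output notation is illustrative, not the final pybo CQLR/HFR syntax.
import re

RULE_RE = re.compile(
    r'object\.word\s*==\s*"(?P<word>[^"]+)"'
    r'(?:\s*and\s*object\.nextWord\s*==\s*"(?P<next>[^"]+)")?'
    r'\s*:\s*object\.conclusion\s*=\s*"(?P<label>[A-Z]+)"'
)

def syllables(word):
    parts = word.split("་")
    return [p + "་" for p in parts[:-1]] + ([parts[-1]] if parts[-1] else [])

def split_by_label(word, label):
    """Regroup the syllables of `word` according to its N/C label string."""
    groups = []
    for syl, tag in zip(syllables(word), label):
        if tag == "N" or not groups:
            groups.append(syl)        # N starts a new word
        else:
            groups[-1] += syl         # C continues the current word
    return groups

def rdr_rule_to_cql(rule):
    m = RULE_RE.search(rule)
    if not m or m.group("label") == "P":
        return None                   # P needs no adjustment; other shapes unsupported
    word, nxt, label = m.group("word"), m.group("next"), m.group("label")
    pattern = f"[text_cleaned='{word}']"
    result = " ".join(f"[{w}]" for w in split_by_label(word, label))
    if nxt:
        pattern += f"[text_cleaned='{nxt}']"
        result += f" [{nxt}]"
    return f"if {pattern} then {result}"

print(rdr_rule_to_cql(
    'if object.word == "ལ་ལ་ལ་" and object.nextWord == "ལ་ལ་" : object.conclusion = "NCN"'
))
# -> if [text_cleaned='ལ་ལ་ལ་'][text_cleaned='ལ་ལ་'] then [ལ་ལ་] [ལ་] [ལ་ལ་]
```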
Alternatives

*Confirm that alternative approaches have been evaluated and explain those alternatives briefly.*

1. ***Train syllable-based RDR rules** (rules would have to be created for every possibility, including what is already covered by max match; the majority of the rules are handled by max match, e.g. རིན་པོ་ཆེ་).*
2. ***Expand the context window of RDR** to more than 5 tokens (this seems too time-costly, as a complete rewrite of RDR would be required, and the gain in context might not be much bigger than using max match).*

Rationale
- Why was the currently proposed design selected over the alternatives? *Training and implementing a **word-based RDR** was selected because max match already covers the majority of cases, and the word-based RDR will generate, from the max-match output, the adjustment rules that are not covered.*
- What would be the impact of going with one of the alternative approaches? *Already covered in Alternatives.*
- Is the evaluation tentative, or is it recommended to use more time to evaluate different approaches? *TBD.*

Drawbacks
1. *Word adjustment rules won't include token attributes such as lemma, POS, etc.*
2. *Word segmentation adjustment and POS tag adjustment might need to be done separately, so word adjustment rules won't be able to use other token attributes.*

Useful References
- *RDRsegmenter and POS tagger* by datquocnguyen. The segmenter code is written in Java and the POS tagger in Python 2.x: https://github.com/datquocnguyen/RDRsegmenter
- *RDRPOSTagger* by datquocnguyen in Python 3.x: https://github.com/datquocnguyen/RDRPOSTagger

Unresolved Questions
- When making rules, should there be different dictionaries and rules for directories such as 'generals' and 'script prayers'? I believe not, since the word-based RDR will only generate exception rules for adjustment; but if we do add an RDR_Label to words, it should be placed next to the words in the proper directory (as already laid out in botok).
- Does the current pybo/rdr/rdr_2_replace_matcher.py work for a five-token window? I only saw variables up to prevWord (no prevWord2, nextWord2, ...). If not, a new script is needed (see the illustrative rule below).
- I have decided to train the RDR model on the token attribute 'text_cleaned' rather than 'text'; the result is to be determined.
- Is it necessary to add a new 'RDR_label' column, taken from the .RDR model's dictionary, to \Documents\pybo\dialect_packs\general\dictionary\words for the word-based RDR? The answer will be known after training and tagging with RDR.

*The above two unresolved questions affect how RDR_2_CQL.py will be written.*

- When updating the RDR rules, which would be better? i) training with all the stored gold corpus files from the beginning, or ii) appending the new rules after the previous respective tags in the old rules? (The choice depends on i) how many files will be used for training, ii) how many human annotators there will be, and iii) how many changes RDR makes after max match on a single page, which is the most important factor.)
- Are the current rules in rdr_basis.tsv still needed? TBD
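To make the five-token-window question concrete, here is a purely hypothetical word-level rule that also conditions on the second token on each side (attribute names follow the RDRPOSTagger convention mentioned above); the rules actually produced by training may or may not reach that far.

```python
# Hypothetical example only (the words and label are made up): a rule that
# conditions on two tokens to the left and right of the current word.
rdr_rule = (
    'if object.prevWord2 == "ལ་བ་" and object.prevWord1 == "ལ་" '
    'and object.word == "ལ་ལ་" '
    'and object.nextWord1 == "ལ་" and object.nextWord2 == "ཡོད" '
    ': object.conclusion = "NN"'
)
# rdr_2_replace_matcher.py (or a new RDR_2_CQL.py) would then have to emit a
# five-token CQL pattern such as:
#   [text_cleaned='ལ་བ་'][text_cleaned='ལ་'][text_cleaned='ལ་ལ་'][text_cleaned='ལ་'][text_cleaned='ཡོད']
```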
Parts of the System Affected

- Which parts of the current system are affected by this request?
  More rules will be added to the existing CQL file (pybo\dialect_packs\general\dictionary\rules\rdr_basis.tsv). A new 'RDR_label' column may be added to pybo\dialect_packs\general\dictionary\words\...tsv, with the corresponding attribute added to the Token class (a possibility; see the sketch below).
- Does this request depend on the fulfillment of any other request?
  1. This request depends on the output of the botok (max match) algorithm.
  2. It depends on pybo/corpus/word_cleanup.py for cleaning the Gold Corpus (so that it has the same syllables as the max-match output).
- Does any other request depend on the fulfillment of this request?
  No; this request only adds more rules and improves botok's word segmentation. But if botok adds more dictionaries, running this package again and adding the resulting rules to the CQL would be beneficial.

**This package will be added in pybo/word_based_rdr (a new folder).**
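If the RDR_label column does turn out to be needed, a sketch like the one below could merge the .DICT output into the word lexicon. It assumes the lexicon TSV has the word form in its first column and that an extra column can simply be appended; both assumptions need to be checked against the actual files in pybo\dialect_packs\general\dictionary\words.

```python
# Hedged sketch: append an 'RDR_label' column to a word-lexicon TSV.
# Assumes the word form is in the first column and that the .DICT file has been
# loaded into `rdr_dict` as {word: label}; both are assumptions to verify.
import csv

def add_rdr_label_column(lexicon_path, rdr_dict, out_path, default=""):
    with open(lexicon_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        for row in reader:
            if not row:
                continue
            writer.writerow(row + [rdr_dict.get(row[0], default)])
```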
Future possibilities

*How do you see the particular system or part of the system affected by this request being altered or extended in the future?*

The RDR model can be used to add rules on top of, and improve, any word segmentation algorithm we modify or switch to in the future. The RDR model can also be used for POS tagging.
Testing
Documentation
Version History
Recordings
Work Phases
1. Preprocessing script for the Gold Corpus (removing signs and cleaning it up so that it has the same text_cleaned as the max-match output; a sketch follows this list).
2. Script for comparing the gold-standard segmentation with the max-match output.
3. Tagger.py for tagging the max-match output with "word adjustment labels" describing the adjustment operation required.
4. Training the RDR model on a corpus tagged with "word adjustment labels" (Tagged_file).
5. Tagging and analysing the RDR results on the max-match output and checking whether the model performs as expected.
6. RDR_to_CQL Convertor.py: converting the RDR rules with word adjustment tags to CQL.
7. Testing and integrating the adjustments into botok, including checking whether adding a new label (RDR label) is justified.
8. Script for human annotators and for updating the RDR model.
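As a starting point for work phase 1, here is a minimal sketch. It assumes the gold corpus stores one segment per line with words separated by single spaces (as in the Steps Explanation above) and that dropping those spaces is enough to rebuild the string fed to max match; whatever normalisation botok actually applies to produce text_cleaned would still have to be mirrored here.

```python
# Minimal sketch of the gold-corpus preprocessing (work phase 1).
# Assumes one space-separated segment per line; the real script must reproduce
# the normalisation botok applies to build `text_cleaned`.

def preprocess_gold_line(line):
    """Return (gold_words, raw_text): the reference segmentation and the
    unsegmented string to be fed to botok's max-match tokenizer."""
    gold_words = [w for w in line.strip().split(" ") if w]
    raw_text = "".join(gold_words)   # drop only the word-separating spaces
    return gold_words, raw_text

gold_words, raw = preprocess_gold_line("ལ་ ལ་ལ་ ལ་ ལ་བ་ ཡོད །")
# gold_words -> ['ལ་', 'ལ་ལ་', 'ལ་', 'ལ་བ་', 'ཡོད', '།']
# raw        -> 'ལ་ལ་ལ་ལ་ལ་བ་ཡོད།'  (the preprocessing output in the Steps Explanation)
```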
Planning
Keep original naming and structure, and keep it as the first section in the Work Phases section
Implementation
A list of checkboxes, one per PR. Each PR should have a descriptive name that clearly illustrates what the work phase is about.
Completion