OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories

[RFC0053] Normalising the particle issues in TMs #185 #193

Open tenzin3 opened 1 year ago

tenzin3 commented 1 year ago

Work Planning

## Table of Contents

- [Housekeeping](#housekeeping)
- [Named Concepts](#named-concepts)
- [Summary](#summary)
- [Reference-Level Explanation](#reference-level-explanation)
- [Alternatives](#alternatives)
  * [Rationale](#rationale)
- [Drawbacks](#drawbacks)
- [Useful References](#useful-references)
- [Unresolved Questions](#unresolved-questions)
- [Parts of the System Affected](#parts-of-the-system-affected)
- [Future Possibilities](#future-possibilities)
- [Infrastructure](#infrastructure)
- [Testing](#testing)
- [Documentation](#documentation)
- [Version History](#version-history)
- [Recordings](#recordings)
- [Work Phases](#work-phases)

## Housekeeping

*Please add the ref in the specified format to the `RFC` title, e.g. `[RFC9999]` if the corresponding RFW is `[RFW9999]`.*

*Please add `[RFC_id]`, e.g. `[RFC_9999]`, to the titles of this `RFC` and its related `PR`s.*

ALL FIELDS BELOW ARE REQUIRED

## Named Concepts

1. Botok: a Tibetan text tokenizer; in this RFC we use its sentence-tokenizer feature.
2. antx (Annotation Transfer): transfers annotations from a source text to a target text using diff-match-patch.

## Summary

We have found issues in our machine-translation training data. While creating the training data, we used Botok to clean the texts and segment them into sentences. Due to a bug in Botok, our segments contain བལྟ་བ་འི་ instead of བལྟ་བའི་. We face similar issues with the ཨི་ལྡན་ particle ས and the ལ་དོན་ particle ར་. We therefore want to resolve these issues in our TMs (a sketch of the fix appears at the end of this plan).

## Reference-Level Explanation

![flowchart_for_normalization_of_particle_issues_in_TMs](https://github.com/OpenPecha/Requests/assets/52460417/7a652c8d-922a-4e19-9e8c-72719b1821a8)

Explanation:

1. Downloader script: downloads both the Tibetan text files (say, BO.txt) and the same Tibetan texts already sentence-segmented and aligned with their corresponding English texts (say, TM.txt).
2. Sentence-tokenizer pipeline script: segments the BO.txt files into sentences using the Botok-based tokenizer bo_sent_tokenizer.
3. Annotation-transfer script: takes the newline positions present in the TM.txt files (source) and introduces those newlines into the BO.txt files (target); see the antx sketch at the end of this plan.
4. Uploader script: after the steps above have been completed and the affix problems that occurred before have been solved, uploads the new, improved files to the GitHub repo.

## Alternatives

No alternative.

### Rationale

## Drawbacks

No drawbacks.

## Useful References

- https://pypi.org/project/antx/0.1.2/
- https://github.com/MonlamAI/MonlamAI_TMs

## Unresolved Questions

Previously there were affix-merging issues in the already sentence-segmented files (the TM.txt files); after this request, the translation model will no longer learn these mistakes as a rule.

## Parts of the System Affected

New files with no affix issues would be uploaded to https://github.com/MonlamAI/MonlamAI_TMs/tree/main/data.

## Future Possibilities

## Infrastructure

## Testing

Unit tests.

## Documentation

## Version History

## Recordings
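To make the Summary concrete, here is a minimal, hypothetical sketch of the affix fix for the genitive case shown above (བལྟ་བ་འི་ → བལྟ་བའི་). The `AFFIXED_PARTICLES` list and function name are illustrative assumptions, not part of the actual pipeline; a real fix would rely on Botok's token information rather than raw string replacement, since particles like ས and ར also begin ordinary syllables and cannot be merged blindly.

```python
# Illustrative assumption: affixed particles that must attach directly to
# their host syllable with no intervening tsek. ས and ར are deliberately
# excluded because they also start ordinary syllables, so merging them
# safely needs token-level context from Botok.
AFFIXED_PARTICLES = ["འི", "འོ", "འམ", "འང"]

def merge_affixed_particles(text: str) -> str:
    """Drop the spurious tsek the buggy segmentation left before an affix."""
    for particle in AFFIXED_PARTICLES:
        # e.g. བལྟ་བ་འི་ -> བལྟ་བའི་
        text = text.replace("་" + particle, particle)
    return text

assert merge_affixed_particles("བལྟ་བ་འི་") == "བལྟ་བའི་"
```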
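And a sketch of the annotation-transfer step (step 3 above), assuming antx's `transfer(source, annotations, target)` API as shown in its README; the file names follow the BO.txt/TM.txt examples above, and the output path is hypothetical.

```python
from antx import transfer

# TM.txt (source) carries the sentence-level newlines we want to keep;
# BO.txt (target) is the re-tokenized, affix-fixed text.
with open("TM.txt", encoding="utf-8") as f:
    tm_text = f.read()
with open("BO.txt", encoding="utf-8") as f:
    bo_text = f.read()

# Each entry is [annotation_name, regex-with-one-capture-group]; the
# captured text is transferred from source to target via diff-match-patch.
patterns = [["linebreak", r"(\n)"]]

segmented = transfer(tm_text, patterns, bo_text, output="txt")

with open("BO_segmented.txt", "w", encoding="utf-8") as f:
    f.write(segmented)
```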

Work Phases

Planning

Keep original naming and structure, and keep it as the first section in the Work Phases section

Implementation

A list of checkboxes, one per PR. Each PR should have a descriptive name that clearly illustrates what the work phase is about.

Completion

tenzin3 commented 1 year ago

Hello @ngawangtrinley, while cleaning the TMs, the sentence tokenizer from the OpenPecha mt-tools is tokenizing as follows in a few cases. If there are two tseks before a shad, one disappears:

ཧམ་་། ཁྱེད་ལ་སུན་པོ་བཟོས་པར་དགོངས་དག་ཞུ་་། -> ཧམ་། ཁྱེད་ལ་སུན་པོ་བཟོས་པར་དགོངས་དག་ཞུ་།

If there are three tseks before a shad, one disappears and two remain:

སྤྱིར་བཏང་ནང་ལ་གནས་ཚུལ་ཨུམ་ཨུམ་་་། -> སྤྱིར་བཏང་ནང་ལ་གནས་ཚུལ་ཨུམ་ཨུམ་་།

I wanted to ask whether one or multiple tseks occurring before a shad is a grammatical error or not, and how to handle these cases:

1. keeping the tsek or multiple tseks before the shad as they are
2. keeping one tsek or none before the shad

ngawangtrinley commented 1 year ago

When cleaning the TMs we should normalize punctuation, meaning that all instances of multiple tseks and other illegal punctuation sequences should be normalized.

Normalization should be done with botok and should be integrated into pybo. Please make sure pybo can be easily called for any NLP task like this. If this task goes beyond the scope of the current RFC, please create an RFW with the tag Utils in the Wishlist board.
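For illustration, a minimal sketch of that normalization as a plain-regex pre-pass. This is an assumed stopgap only; per the comment above, the real fix belongs in botok/pybo, and the function name is hypothetical.

```python
import re

def normalize_tsek_before_shad(text: str) -> str:
    """Collapse runs of two or more tseks before a shad (།) down to one."""
    return re.sub("་{2,}(?=།)", "་", text)

# Examples from the thread above; note this keeps exactly one tsek,
# following the normalization policy, not botok's current n-1 behavior.
assert normalize_tsek_before_shad("ཧམ་་། ཁྱེད་ལ་སུན་པོ་བཟོས་པར་དགོངས་དག་ཞུ་་།") == \
    "ཧམ་། ཁྱེད་ལ་སུན་པོ་བཟོས་པར་དགོངས་དག་ཞུ་།"
assert normalize_tsek_before_shad("སྤྱིར་བཏང་ནང་ལ་གནས་ཚུལ་ཨུམ་ཨུམ་་་།") == \
    "སྤྱིར་བཏང་ནང་ལ་གནས་ཚུལ་ཨུམ་ཨུམ་།"
```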