OPT20005: Chonjuk Alignment parser (4 days)

tenzin3 commented 2 months ago

Description

The task is to parse the Chonjuk sources (both root text and commentary) and generate an alignment object that can be saved in the STAM data format. STAM is a low-level annotation model implemented in Rust. Our goal is to write our data directly into STAM, without any intermediate layers or abstractions that could slow down data exchange.

Requirement

Proper data validation (this refers to annotation category and annotation type).
Each annotation file (root text and commentary) should be stored as a separate AnnotationStore.
The AnnotationDataset for all files (translation segments and the translation alignment file) should have the same id.
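
As an illustration of the validation requirement, here is a minimal sketch that checks annotation category and type against fixed sets before anything is written. The enum classes, their members, and `validate_annotation` are hypothetical names made up for this example; they are not taken from the toolkit or from STAM.

```python
from enum import Enum


class AnnotationCategory(Enum):
    # Hypothetical category names, not taken from the toolkit.
    STRUCTURE = "structure"
    ALIGNMENT = "alignment"


class AnnotationType(Enum):
    # Hypothetical type names; only "root" and "commentary" echo this issue.
    ROOT = "root"
    COMMENTARY = "commentary"


def validate_annotation(category: str, ann_type: str):
    """Return the matching enum members, or raise ValueError for unknown values."""
    try:
        cat = AnnotationCategory(category)
    except ValueError:
        raise ValueError(f"unknown annotation category: {category!r}") from None
    try:
        typ = AnnotationType(ann_type)
    except ValueError:
        raise ValueError(f"unknown annotation type: {ann_type!r}") from None
    return cat, typ
```

Validating up front means a misspelled type fails at parse time instead of ending up in a saved store.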

Expected Output

The expected output is the parsed Chonjuk source (root text and commentary) mapped accurately into the STAM format.

Implementation Plan

Implementation Steps

[x] parse and write translation segment to stam
[x] read segment with pecha class
[x] write alignment to stam
[x] read alignment from stam
[x] write metadata file to stam
[x] read metadata file from stam
[x] modifications (listed here)

kaldan007 commented 2 months ago

Done with the parser. Will discuss whether a wrapper class is needed.

kaldan007 commented 2 months ago

Able to save the OPF in STAM, but not able to read it back from STAM. Reported the issue to the STAM maintainer.

ta4tsering commented 2 months ago

working on the metadataselector in stam

tenzin3 commented 2 months ago

Had a demo meeting with NT, kaldan and tashi tsering on 25th July 24. The following changes need to be implemented in the package:

[x] - include layer in alignment mapping
[x] - basefile naming be set to 4 digits
[x] - change name comment -> commentary
[x] - make metadata as a different ann store
[x] - exclude meta data ann from alignment anns

Included an additional item as a modification.

tenzin3 commented 2 months ago

@ngawangtrinley, @kaldan007, @ta4tsering, @10zinten: if you have any suggestions, please comment below.

From the Chonjuk alignment, the following have been parsed and uploaded to PechaData.

The folder names and their corresponding alignment IDs are:

D3872-final: AFA7368EE
D3874-final: A4E5EBEAE
D3875-final: A2A8D60A0
D3876-final: A28F2E712
D3877-final: AA3373D83
D3878-final: A0374036F
D3879-final: AF7E290F1
D3880-final: A3AB0E101
Kunpal-final: AC669A10E
Thubcho-final: ADB1AE4AB
Thokme-final: A275C7080

Important Notes

Comparing the Tsawa/root files (base files) of the folders above, only Kunpal, Thubcho and Thokme were identical (same Tsawa Pecha); all other Tsawa files were different.
The folder names (e.g. D3872-final) are recorded in metadata.json.
Metadata for the Pechas has been filled in with dummy values due to lack of metadata.
Files with encoding UTF-16 E have been converted to UTF-8.
Newlines in the files have been normalized (see the pictures below).

Before

After
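
A minimal sketch of the two clean-up steps noted above (re-encoding to UTF-8 and newline normalization). The function name is hypothetical, and since the thread does not spell out what "normalized" means, this assumes it is a conversion of `\r\n` and bare `\r` line endings to `\n`. Python's `utf-16` codec reads the byte order from the BOM when one is present.

```python
def to_utf8_normalized(raw: bytes, src_encoding: str = "utf-16") -> bytes:
    """Decode raw bytes, unify line endings to "\n", and re-encode as UTF-8."""
    text = raw.decode(src_encoding)
    # Assumption: "normalized" means CRLF and bare CR both become LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")
```

Working on bytes keeps the step independent of how the files are opened, so it could be applied with `Path.read_bytes()` and `Path.write_bytes()` on each source file.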

ngawangtrinley commented 2 months ago

Can you add some translations? By the way, did @10zinten get started on the base update? You will need to transfer the translation segments to the root text. https://github.com/OpenPecha/nalanda-mt

tenzin3 commented 2 months ago

@ngawangtrinley Yes, I talked with @10zinten and he has already started working on the base update mechanism. The link you provided has 3 Tibetan-English translation pairs. I will convert those.

Question: Should I keep the emojis (i.e. 🔽) in the base file, or remove them when making the Pecha? Thank you in advance.

ngawangtrinley commented 2 months ago

Please keep these in a different base version and we will then transfer the bo-en mapping on the clean base. This will help to illustrate the task

tenzin3 commented 2 months ago

@ngawangtrinley

The translations from nalanda-mt have been converted; their alignment IDs are below.

TM0876: AA0868822
TM2380: A7CF01CC0
TM3841: A49E97481

Important note: The translation mapping has been done on the clean file, and the file with emojis (or qc) has been saved with the suffix basefile_name-qc.txt.
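
A minimal sketch of how the clean and qc versions of a base file could be derived. Both helper names are hypothetical, and it assumes the clean text is the qc text with every 🔽 marker removed and nothing else changed, which the thread does not state.

```python
QC_MARKER = "🔽"  # the emoji mentioned in the question above


def strip_qc_markers(text: str) -> str:
    """Return the clean base text with every QC marker removed."""
    return text.replace(QC_MARKER, "")


def qc_filename(basefile_name: str) -> str:
    """Build the qc file name using the -qc.txt suffix from the note above."""
    return f"{basefile_name}-qc.txt"
```

Because removing the markers shifts character offsets, any span computed on the qc file would need to be re-mapped before being applied to the clean base.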

OpenPecha / toolkit-v2