OpenPecha / toolkit-v2

OpenPecha toolkit version 2
MIT License

OPT20005: Chonjuk Alignment parser (4 days) #6

Closed tenzin3 closed 2 months ago

tenzin3 commented 2 months ago

Description

The task involves parsing the Chonjuk sources (both root text and commentary) and generating an alignment object that can be saved in the STAM data format. STAM (Stand-off Text Annotation Model) is a low-level annotation model with a reference implementation in Rust. Our goal is to map our data source directly onto STAM, without any intermediate layers or abstractions that could slow down data exchange.

Requirement

Expected Output

The expected output is the parsed Chonjuk source (root text and commentary) mapped accurately into the STAM format.
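
As a rough illustration of what an alignment object can look like before it is handed to STAM, the sketch below stores the alignment as stand-off character offsets into two base texts. It is plain Python and does not call the STAM library. The `Span` and `AlignmentPair` names, the one-segment-per-line layout, and the positional one-to-one pairing are all assumptions made for illustration, not the toolkit's actual data model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Half-open character range [start, end) into a base text (hypothetical type)."""
    start: int
    end: int


@dataclass(frozen=True)
class AlignmentPair:
    """Links one root-text span to one commentary span (hypothetical type)."""
    root: Span
    commentary: Span


def segment_spans(base: str) -> list[Span]:
    """Return one span per non-empty line of `base`.

    Assumes one segment per line; the real sources may be segmented differently.
    """
    spans = []
    cursor = 0
    for line in base.split("\n"):
        if line.strip():
            spans.append(Span(cursor, cursor + len(line)))
        cursor += len(line) + 1  # +1 for the newline removed by split
    return spans


def align(root_base: str, commentary_base: str) -> list[AlignmentPair]:
    """Pair segments by position; assumes both texts have the same segment count."""
    root_spans = segment_spans(root_base)
    commentary_spans = segment_spans(commentary_base)
    if len(root_spans) != len(commentary_spans):
        raise ValueError("root and commentary have different segment counts")
    return [AlignmentPair(r, c) for r, c in zip(root_spans, commentary_spans)]
```

Because the pairs only hold offsets, the base texts stay untouched, which is the same stand-off principle STAM is built on.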

Implementation Plan

Image

Implementation Steps

kaldan007 commented 2 months ago

Done with the parser. Will discuss whether a wrapper class is needed.

kaldan007 commented 2 months ago

I am able to save the OPF in STAM, but not able to read the STAM file back. I have reported the issue to the STAM maintainer.

ta4tsering commented 2 months ago

Working on the metadataselector in STAM.

tenzin3 commented 2 months ago

Had a demo meeting with NT, Kaldan, and Tashi Tsering on 25 July 2024. The following changes need to be implemented in the package:

Included an additional item as a modification.

tenzin3 commented 2 months ago

@ngawangtrinley, @kaldan007, @ta4tsering, @10zinten: if you have any suggestions, please comment below.

From the Chonjuk alignment, the following have been parsed and uploaded to PechaData:

The following are the folder names and their corresponding alignment IDs:

Important Notes

Before

Image

After

Image

ngawangtrinley commented 2 months ago

Can you add some translations? By the way, did @10zinten get started on the base update? You will need to transfer the translation segments to the root text. https://github.com/OpenPecha/nalanda-mt

tenzin3 commented 2 months ago

@ngawangtrinley Yes, I talked with @10zinten, and he has already started working on the base update mechanism. The link you provided has 3 Tibetan-English translation pairs. I will add those.

Question: Should I keep the emojis (i.e. 🔽) in the base file, or remove them when making the Pecha? Thank you in advance.

Image

ngawangtrinley commented 2 months ago

Please keep these in a different base version, and we will then transfer the bo-en mapping onto the clean base. This will help to illustrate the task.
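
One way to picture the transfer from the marked base to the clean base is an offset map built while the markers are stripped. The sketch below is only an illustration under assumptions: it is not the base update mechanism being developed for the toolkit, the function names are hypothetical, and it assumes the only difference between the two base versions is the 🔽 marker.

```python
from __future__ import annotations

MARKER = "🔽"  # marker character mentioned in this thread


def strip_markers(marked: str, marker: str = MARKER) -> tuple[str, list[int]]:
    """Remove `marker` from `marked` and return (clean_text, offset_map).

    offset_map[i] is the position in the clean text that corresponds to
    position i in the marked text. It has len(marked) + 1 entries so that
    exclusive end offsets can be mapped as well.
    """
    clean_chars: list[str] = []
    offset_map: list[int] = []
    i = 0
    while i < len(marked):
        if marked.startswith(marker, i):
            # Every position inside a marker collapses to the same clean position.
            offset_map.extend([len(clean_chars)] * len(marker))
            i += len(marker)
        else:
            offset_map.append(len(clean_chars))
            clean_chars.append(marked[i])
            i += 1
    offset_map.append(len(clean_chars))
    return "".join(clean_chars), offset_map


def transfer_span(span: tuple[int, int], offset_map: list[int]) -> tuple[int, int]:
    """Map a half-open (start, end) span from the marked base to the clean base."""
    start, end = span
    return offset_map[start], offset_map[end]
```

A span that covers a marker simply shrinks by the marker's length on the clean base, so segment text stays identical apart from the removed emoji.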

tenzin3 commented 2 months ago

@ngawangtrinley

The translations from nalanda-mt have been converted. Their alignment IDs are listed below.

TM0876: AA0868822
TM2380: A7CF01CC0
TM3841: A49E97481

Important note: The translation mapping has been done on the clean file, and the file with emojis (the QC version) has been saved with a -qc suffix, as basefile_name-qc.txt.
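
The naming convention above can be sketched as follows. This is a minimal example, not the toolkit's code: the helper names and the `B001.txt` file name in the usage line are hypothetical, and it assumes the base is a single UTF-8 text file whose QC version differs only by the 🔽 marker.

```python
from __future__ import annotations

from pathlib import Path

QC_MARKER = "🔽"  # marker character mentioned in this thread


def qc_path(base_path: Path) -> Path:
    """Return the sibling path with the -qc suffix, e.g. name.txt -> name-qc.txt."""
    return base_path.with_name(f"{base_path.stem}-qc{base_path.suffix}")


def save_base_versions(marked_text: str, base_path: Path) -> tuple[Path, Path]:
    """Write the clean base to `base_path` and the marked base next to it.

    Returns (clean_path, qc_path).
    """
    clean_text = marked_text.replace(QC_MARKER, "")
    base_path.write_text(clean_text, encoding="utf-8")
    marked_path = qc_path(base_path)
    marked_path.write_text(marked_text, encoding="utf-8")
    return base_path, marked_path
```

For example, `save_base_versions(text, Path("B001.txt"))` would write the clean text to `B001.txt` and the version with emojis to `B001-qc.txt`.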

Image