OpenPecha / toolkit-v2

OpenPecha toolkit version 2
MIT License

OPT20014: Create Structure Annotations for Chonjuk Data #30

Open tenzin3 opened 2 months ago

tenzin3 commented 2 months ago

Description

We plan to implement the DTS (Distributed Text Services) specification to build a text API. One of its key endpoints is the Navigation endpoint, which lets users move through the structure of a text and retrieve specific passages. To test this feature properly, we are preparing structural annotations for the Chonjuk dataset.

Requirement

Chonjuk Data and Annotation Illustration

[Image: Chonjuk data and annotation illustration]

Expected Output

An OPF/Pecha for Chonjuk with three levels of structural annotations.

Implementation Steps

annotation on annotation

tenzin3 commented 2 months ago

Chonjuk data chosen.

tenzin3 commented 2 months ago

The Pecha Parser is designed with the following key principles in mind:

The parser operates with the following logic:

  1. Input Text: Accepts the text to be processed.
  2. Segmenter: Segments the text using one of the following methods: Space Segmenter, New Line Segmenter, or Regex Segmenter.
  3. Annotation Name: Assigns a name to the segmented text. The annotation name must be selected from a predefined list of enums.
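
A minimal sketch of the three steps above, using only the standard library. All names here (`AnnotationName`, `regex_segmenter`, `space_segmenter`, `newline_segmenter`, `parse`) are hypothetical and are not the toolkit's actual API; the enum members are borrowed from the structure types mentioned elsewhere in this issue.

```python
import re
from enum import Enum
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the input text


class AnnotationName(Enum):
    # Hypothetical predefined list of annotation names.
    CHAPTER = "Chapter"
    TSAWA = "Tsawa"
    MEANING_SEGMENT = "Meaning_Segment"


def regex_segmenter(pattern: str) -> Callable[[str], List[Span]]:
    """Build a segmenter that treats every match of `pattern` as a separator."""
    compiled = re.compile(pattern)

    def segment(text: str) -> List[Span]:
        spans, start = [], 0
        for match in compiled.finditer(text):
            if match.start() > start:  # skip empty segments
                spans.append((start, match.start()))
            start = match.end()
        if start < len(text):  # trailing segment after the last separator
            spans.append((start, len(text)))
        return spans

    return segment


# The space and new-line segmenters are special cases of the regex segmenter.
space_segmenter = regex_segmenter(r" +")
newline_segmenter = regex_segmenter(r"\n+")


def parse(text: str, segmenter: Callable[[str], List[Span]],
          name: AnnotationName) -> List[Dict]:
    """Segment `text` and label every segment with `name`."""
    if not isinstance(name, AnnotationName):
        raise ValueError("annotation name must be an AnnotationName member")
    return [
        {"name": name.value, "start": s, "end": e, "text": text[s:e]}
        for s, e in segmenter(text)
    ]
```

Keeping character offsets rather than only the segment strings means each segment could later be turned into a text selection on the base file.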

tenzin3 commented 2 months ago

Reading annotations and their annotation data in STAM

from stam import AnnotationStore

# Load an existing annotation store from its STAM JSON file.
stam_obj = AnnotationStore(file="annotation_store_path_str.json")

ann_data = []
for ann in stam_obj.annotations():
    # str(ann) gives the text that the annotation targets.
    curr_data = {"content": str(ann)}
    # Iterating over an annotation yields its annotation data items.
    for data in ann:
        curr_data[data.key().id()] = str(data.value())
    ann_data.append(curr_data)

print(ann_data)

tenzin3 commented 2 months ago

Issues with the new AnnotationSubStore

  1. Location of annotation data:
    When we create annotations in the AnnotationStore that target annotations in the AnnotationSubStore (using an annotation selector), the annotation data is stored in the AnnotationSubStore instead of the AnnotationStore.

  2. AnnotationStore annotations function:
    The annotations function of AnnotationStore returns the store's own annotations as expected, but it also returns the annotations that belong to the AnnotationSubStore.

  3. @include path:
    Our data folder structure requires the base file and the annotation store files to be kept in separate folders. Instead of saving the AnnotationStore directly with set_filename and save, we serialize it with to_json_string and then rewrite the @include path to a relative one such as ../../base/7906.txt. This lets us keep the files in different folders and improves usability. With an AnnotationSubStore, however, calling to_json_string on the AnnotationStore automatically converts the relative @include paths that we set in the AnnotationSubStore dependency back to absolute paths.

  4. Unable to load AnnotationStore with multi-layer AnnotationSubStore:
    When attempting to load an AnnotationStore with multiple layers of AnnotationSubStore, it fails with the error stam.PyStamError: [StamError] DeserializationError: Deserialization failed: Expected string or array for @include in AnnotationStore.
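
For reference, the path rewriting described in point 3 can be sketched as a post-processing step on the JSON string. This is only an illustration under assumptions: `relativize_includes` is a hypothetical helper (not part of stam or the toolkit), the folder names in the usage note are made up, and it rewrites the serialized output only, so it does not address the AnnotationSubStore behaviour itself.

```python
import json
import os
from typing import Any


def relativize_includes(store_json: str, store_dir: str) -> str:
    """Rewrite absolute "@include" paths so they are relative to store_dir.

    Hypothetical helper: it walks the whole JSON tree and handles both a
    single string and a list of strings as the "@include" value.
    """

    def to_relative(path: Any) -> Any:
        if isinstance(path, str) and os.path.isabs(path):
            # Use forward slashes so the stored path looks the same on any OS.
            return os.path.relpath(path, start=store_dir).replace(os.sep, "/")
        return path  # already relative, or not a string: leave untouched

    def walk(node: Any) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "@include":
                    if isinstance(value, list):
                        node[key] = [to_relative(item) for item in value]
                    else:
                        node[key] = to_relative(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    data = json.loads(store_json)
    walk(data)
    return json.dumps(data, ensure_ascii=False, indent=2)
```

With a store file located in a hypothetical /data/P0001/layers/7906 folder and a base file at /data/P0001/base/7906.txt, calling `relativize_includes(json_string, "/data/P0001/layers/7906")` would turn the absolute @include into ../../base/7906.txt.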

tenzin3 commented 2 months ago

I have created an issue regarding STAM here.

tenzin3 commented 2 months ago

Framework Design

1. Text Processing:

2. Condition Check:

3. Modular Design:

4. Custom Pipeline:

[Image: framework design diagram]

tenzin3 commented 2 months ago

For metadata

ann_store: id = Pecha ID
ann_data_set: id = Meta_Data

For Translation

ann_store: id = Pecha ID
ann_data_set: id = Translation

ann_data:
    key: Translation_Segment
    value: Tibetan_Segment

    key: Translation_Segment
    value: English_Segment

For Root and Commentary

ann_store: id = Pecha ID
ann_data_set: id = Root_Commentary

ann_data:
    key: Associated_Alignment    
    value: Root_Segment

    key: Associated_Alignment
    value: Commentary_Segment

For OPF

ann_store: id = Pecha ID
ann_data_set: id = Structure_Annotation

ann_data:
    key: Structure_Type
    value: Chapter

    key: Structure_Type
    value: Tsawa

    key: Structure_Type
    value: Meaning_Segment
    value: Meaning_Segment
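
The layout above can also be written down as a small lookup table, which makes it easy to reject keys or values that are not part of the convention. This is a rough sketch, not toolkit code: `DataSetSpec`, `LAYOUT`, and `make_ann_data` are hypothetical names, the spellings `English_Segment` and `Structure_Type` are assumed to be the intended ones, and Meta_Data is left out because no keys are listed for it.

```python
from typing import Dict, List, NamedTuple


class DataSetSpec(NamedTuple):
    key: str            # the single key used inside this annotation data set
    values: List[str]   # the values allowed for that key


# One entry per ann_data_set id listed above (Meta_Data omitted: no keys given).
LAYOUT: Dict[str, DataSetSpec] = {
    "Translation": DataSetSpec(
        "Translation_Segment", ["Tibetan_Segment", "English_Segment"]
    ),
    "Root_Commentary": DataSetSpec(
        "Associated_Alignment", ["Root_Segment", "Commentary_Segment"]
    ),
    "Structure_Annotation": DataSetSpec(
        "Structure_Type", ["Chapter", "Tsawa", "Meaning_Segment"]
    ),
}


def make_ann_data(data_set_id: str, value: str) -> Dict[str, str]:
    """Return a set/key/value record, or raise if it breaks the layout."""
    if data_set_id not in LAYOUT:
        raise KeyError(f"unknown annotation data set: {data_set_id}")
    spec = LAYOUT[data_set_id]
    if value not in spec.values:
        raise ValueError(f"{value!r} is not allowed for key {spec.key!r}")
    return {"set": data_set_id, "key": spec.key, "value": value}
```

For example, `make_ann_data("Structure_Annotation", "Tsawa")` yields a record with the set, key, and value filled in, while an unlisted value such as "Verse" raises a ValueError.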