OPT20014: Create Structure Annotations for Chonjuk Data

tenzin3 commented 2 months ago

Description

We are planning to implement the DTS specifications to develop a text API. One of the key endpoints is the Navigation endpoint, which lets users retrieve a specific passage by navigating through the structure of the content. To build and test this feature properly, we are preparing structural annotations for the Chonjuk dataset.

Requirement

The data should be Chonjuk data.
Each structural annotation must include necessary metadata, such as ID, title, etc.
Higher-level structural annotations should have the capability to reference or call lower-level structural annotations.
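
A minimal sketch of how a higher-level annotation could reference lower-level ones, using plain Python rather than STAM. The class name `StructureAnnotation`, the helper `resolve_text`, the ids, and the placeholder base text are all hypothetical and only illustrate the requirement; the level names Chapter, Tsawa and Meaning_Segment are the ones used later in this thread.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class StructureAnnotation:
    id: str
    title: str
    structure_type: str                      # "Chapter", "Tsawa" or "Meaning_Segment"
    span: Optional[Tuple[int, int]] = None   # character offsets, leaf level only
    children: List[str] = field(default_factory=list)  # ids of lower-level annotations


def resolve_text(ann_id: str, annotations: Dict[str, StructureAnnotation], base_text: str) -> str:
    """Return the text an annotation covers, following references to lower levels."""
    ann = annotations[ann_id]
    if not ann.children:
        start, end = ann.span
        return base_text[start:end]
    return "".join(resolve_text(child, annotations, base_text) for child in ann.children)


# Placeholder text standing in for the Chonjuk base file.
base_text = "first line\nsecond line\nthird line\n"

annotations = {
    a.id: a
    for a in [
        StructureAnnotation("seg-1", "Segment 1", "Meaning_Segment", span=(0, 11)),
        StructureAnnotation("seg-2", "Segment 2", "Meaning_Segment", span=(11, 23)),
        StructureAnnotation("seg-3", "Segment 3", "Meaning_Segment", span=(23, 34)),
        StructureAnnotation("tsawa-1", "Tsawa 1", "Tsawa", children=["seg-1", "seg-2"]),
        StructureAnnotation("tsawa-2", "Tsawa 2", "Tsawa", children=["seg-3"]),
        StructureAnnotation("chapter-1", "Chapter 1", "Chapter", children=["tsawa-1", "tsawa-2"]),
    ]
}

print(repr(resolve_text("tsawa-1", annotations, base_text)))
```

Only the lowest level stores offsets into the base text; the two higher levels store ids, so changing a segment boundary does not require touching the chapter annotation.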

Chonjuk Data and Annotation Illustration

Expected Output

An OPF/Pecha for Chonjuk with three levels of structural annotations.

Implementation Steps

[x] text segmenter
[x] annotate in stam
[x] save annotation
[ ] metadata annotation

annotation on annotation

[x] load higher level annotation
[ ] annotate on higher level annotation

tenzin3 commented 2 months ago

Chonjuk data chosen.

tenzin3 commented 2 months ago

The Pecha Parser is designed with the following key principles in mind:

High Abstraction: The parser provides a high level of abstraction to simplify its use.
Custom Pipeline Flexibility: Users can create custom pipelines to suit their specific needs.

The parser operates with the following logic:

Input Text: Accepts the text to be processed.
Segmenter: Segments the text using one of the following methods: Space Segmenter, New Line Segmenter, or Regex Segmenter.
Annotation Name: Assigns a name to the segmented text. The annotation name must be selected from a predefined list of enums.
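
The three steps above could be sketched as follows. This is not the actual Pecha Parser code: the function names, the `parse` signature, and the members of the `AnnotationName` enum are assumptions for illustration (the real predefined list is not shown in this thread).

```python
import re
from enum import Enum


class AnnotationName(Enum):
    # Hypothetical members; the real enum may contain other names.
    CHAPTER = "Chapter"
    TSAWA = "Tsawa"
    MEANING_SEGMENT = "Meaning_Segment"


def space_segmenter(text):
    """Split on runs of whitespace."""
    return text.split()


def new_line_segmenter(text):
    """Split on newlines, dropping empty lines."""
    return [line for line in text.split("\n") if line]


def regex_segmenter(text, pattern):
    """Split wherever the pattern matches, dropping empty pieces."""
    return [seg for seg in re.split(pattern, text) if seg]


def parse(text, segmenter, annotation_name):
    """Segment the text and label every segment with the given annotation name."""
    if not isinstance(annotation_name, AnnotationName):
        raise ValueError("annotation_name must be a member of AnnotationName")
    return [{"annotation": annotation_name.value, "text": seg} for seg in segmenter(text)]


segments = parse("a b\nc d", new_line_segmenter, AnnotationName.MEANING_SEGMENT)
print(segments)
```

A regex segmenter needs its pattern bound first, for example `parse(text, lambda t: regex_segmenter(t, r"\d+\.\s*"), AnnotationName.CHAPTER)`.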

tenzin3 commented 2 months ago

Read annotations and their annotation data in STAM

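from stam import AnnotationStore

# Load the annotation store from its STAM JSON file.
stam_obj = AnnotationStore(file="annotation_store_path_str.json")

anns = list(stam_obj.annotations())

# Collect one dict per annotation: its string form plus every key/value pair.
ann_data = []
for ann in anns:
    curr_data = {}
    curr_data["content"] = str(ann)
    # Iterating over an annotation yields its annotation data items.
    for data in ann:
        curr_data[data.key().id()] = str(data.value())
    ann_data.append(curr_data)

print(ann_data)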

tenzin3 commented 2 months ago

Issues with the new `AnnotationSubStore`

Location of annotation data:
When creating annotations in the AnnotationStore based on those contained in the AnnotationSubStore using an annotation selector, the annotation data is being stored in the AnnotationSubStore instead of the AnnotationStore.
AnnotationStore annotations function:
The annotations function in AnnotationStore is working correctly, but it is also retrieving annotations from the AnnotationSubStore.
@include path:
For our project, due to our defined data folder structure, we need to store the base file and annotation store files in separate folders. Instead of saving the AnnotationStore directly using set_filename and save, we convert the annotations to a JSON string using to_json_string and then modify the path to a relative format like ../../base/7906.txt. This allows us to keep files in different folders and improves usability. However, in AnnotationSubStore, when we use to_json_string from AnnotationStore, the dependency in AnnotationSubStore, where we have modified the @include paths to relative ones, is automatically being converted back to absolute paths.
Unable to load AnnotationStore with multi-layer AnnotationSubStore:
When attempting to load an AnnotationStore with multiple layers of AnnotationSubStore, it fails with the error stam.PyStamError: [StamError] DeserializationError: Deserialization failed: Expected string or array for @include in AnnotationStore.
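
For the @include point, the manual path rewrite can be sketched as a post-processing step on the JSON string. This does not fix the sub-store behaviour in the library; it only shows the rewrite itself. The helper name `relativize_includes`, the simplified JSON layout, and the directory names in the example are assumptions, and POSIX-style paths are assumed.

```python
import json
import os


def _relative(path, store_dir):
    """Make an absolute path relative to store_dir; leave relative paths alone."""
    return os.path.relpath(path, start=store_dir) if os.path.isabs(path) else path


def relativize_includes(store_json, store_dir):
    """Rewrite every "@include" value in a JSON string relative to store_dir."""

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "@include" and isinstance(value, str):
                    node[key] = _relative(value, store_dir)
                elif key == "@include" and isinstance(value, list):
                    node[key] = [
                        _relative(p, store_dir) if isinstance(p, str) else p for p in value
                    ]
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    data = json.loads(store_json)
    walk(data)
    return json.dumps(data, ensure_ascii=False, indent=2)


# Simplified stand-in for the output of to_json_string (hypothetical paths).
example = json.dumps(
    {"@type": "AnnotationStore", "resources": [{"@include": "/data/pecha/base/7906.txt"}]}
)
result = json.loads(relativize_includes(example, "/data/pecha/layers/7906"))
print(result["resources"][0]["@include"])
```

Because the walk is recursive, it also reaches @include entries nested inside a sub-store's serialisation, assuming they appear in the same JSON string.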

tenzin3 commented 2 months ago

I have created an issue regarding STAM here.

tenzin3 commented 2 months ago

Framework Design

1. Text Processing:

Split the text into atomic units.
An atomic unit is one line of the text, obtained by splitting on newlines.

2. Condition Check:

Verify whether the atomic units contain specific annotations.
A single regex sometimes cannot extract all annotations.

3. Modular Design:

Each function should perform only one task to ensure high reusability.

4. Custom Pipeline:

Users should be able to create their own custom processing pipelines.
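
A rough sketch of how these four points could fit together: single-purpose functions composed into a user-defined pipeline. All function names, the `build_pipeline` helper, and the `^ch\d+` marker pattern are hypothetical and not taken from the toolkit.

```python
import re


def split_atomic_units(text):
    """Step 1: an atomic unit is one line of the text."""
    return text.split("\n")


def drop_empty(units):
    """Remove units that contain only whitespace."""
    return [u for u in units if u.strip()]


def has_annotation(unit, pattern):
    """Step 2: check whether a unit contains the annotation marker."""
    return re.search(pattern, unit) is not None


def tag_units(pattern, label):
    """Build a step that labels matching units and leaves the others unlabelled."""

    def step(units):
        return [
            {"text": u, "label": label if has_annotation(u, pattern) else None}
            for u in units
        ]

    return step


def build_pipeline(*steps):
    """Step 4: chain steps so each one receives the previous step's output."""

    def run(value):
        for step in steps:
            value = step(value)
        return value

    return run


pipeline = build_pipeline(split_atomic_units, drop_empty, tag_units(r"^ch\d+", "Chapter"))
tagged = pipeline("ch1 opening\n\nbody text\n")
print(tagged)
```

Since one regex may miss some annotations, several `tag_units` style steps with different patterns could be written and chained; that would need a variant that accepts already-tagged units, which is left out here.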

tenzin3 commented 2 months ago

For metadata

ann_store: id = Pecha ID
ann_data_set: id = Meta_Data

For Translation

ann_store: id = Pecha ID
ann_data_set: id = Translation 

ann_data:
    key: Translation_Segment
    value: Tibetan_Segment

    key: Translation_Segment
    value: English_Segment

For Root and Commentary

ann_store: id = Pecha ID
ann_data_set: id = Root_Commentary

ann_data:
    key: Associated_Alignment    
    value: Root_Segment

    key: Associated_Alignment
    value: Commentary_Segment

For OPF

ann_store: id = Pecha ID
ann_data_set: id = Structure_Annotation

ann_data:
    key: Structure_Type
    value: Chapter

    key: Structure_Type
    value: Tsawa

    key: Structure_Type
    value: Meaning_Segment
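
One way to keep these key/value conventions consistent is a small lookup table with a validating constructor. This is a plain-Python sketch, not STAM code; `DATA_SETS` and `make_ann_data` are hypothetical names, and it assumes the key is spelled `Structure_Type` and the value `English_Segment` throughout.

```python
# Allowed ann_data key and values for each ann_data_set listed above.
DATA_SETS = {
    "Translation": {
        "key": "Translation_Segment",
        "values": {"Tibetan_Segment", "English_Segment"},
    },
    "Root_Commentary": {
        "key": "Associated_Alignment",
        "values": {"Root_Segment", "Commentary_Segment"},
    },
    "Structure_Annotation": {
        "key": "Structure_Type",
        "values": {"Chapter", "Tsawa", "Meaning_Segment"},
    },
}


def make_ann_data(data_set, value):
    """Return a set/key/value dict, rejecting unknown data sets or values."""
    spec = DATA_SETS.get(data_set)
    if spec is None:
        raise KeyError(f"unknown ann_data_set: {data_set}")
    if value not in spec["values"]:
        raise ValueError(f"{value!r} is not allowed in {data_set}")
    return {"set": data_set, "key": spec["key"], "value": value}


chapter_data = make_ann_data("Structure_Annotation", "Chapter")
print(chapter_data)
```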

OpenPecha / toolkit-v2