Open tenzin3 opened 2 months ago
The Pecha Parser is designed with the following key principles in mind:
The parser operates with the following logic:
from stam import AnnotationStore

# Load the annotation store from its JSON file.
stam_obj = AnnotationStore(file="annotation_store_path_str.json")
anns = list(stam_obj.annotations())

ann_data = []
for ann in anns:
    # Collect the annotation's string form plus its key/value data.
    curr_data = {}
    curr_data["content"] = str(ann)
    for data in ann:
        curr_data[data.key().id()] = str(data.value())
    ann_data.append(curr_data)
print(ann_data)
AnnotationSubStore location of annotation data:
When creating annotations in the AnnotationStore based on those contained in the AnnotationSubStore using an annotation selector, the annotation data is being stored in the AnnotationSubStore instead of the AnnotationStore.
AnnotationStore annotations function:
The annotations function in AnnotationStore is working correctly, but it is also retrieving annotations from the AnnotationSubStore.
@include path:
For our project, due to our defined data folder structure, we need to store the base file and the annotation store files in separate folders. Instead of saving the AnnotationStore directly using set_filename and save, we convert the annotations to a JSON string using to_json_string and then rewrite the path to a relative form such as ../../base/7906.txt. This lets us keep the files in different folders and improves usability. However, when we call to_json_string on an AnnotationStore that depends on an AnnotationSubStore, the @include paths of that sub-store dependency, which we had changed to relative ones, are automatically converted back to absolute paths.
Unable to load AnnotationStore with multi-layer AnnotationSubStore:
When attempting to load an AnnotationStore with multiple layers of AnnotationSubStore, it fails with the error stam.PyStamError: [StamError] DeserializationError: Deserialization failed: Expected string or array for @include in AnnotationStore.
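One way to narrow this down is to inspect the JSON file before handing it to STAM. The sketch below assumes the error message means what it says: some "@include" value is neither a string nor an array of strings. The helper find_invalid_includes and the sample document are hypothetical and only illustrate the check; the real file layout may differ.

```python
import json


def find_invalid_includes(node, path="$"):
    """Return (json_path, type_name) for each "@include" that is not a str or list of str."""
    problems = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "@include":
                is_valid = isinstance(value, str) or (
                    isinstance(value, list)
                    and all(isinstance(item, str) for item in value)
                )
                if not is_valid:
                    problems.append((child, type(value).__name__))
            else:
                problems.extend(find_invalid_includes(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            problems.extend(find_invalid_includes(item, f"{path}[{index}]"))
    return problems


# Made-up example: the top-level "@include" holds objects instead of file names.
sample = json.loads(
    '{"@type": "AnnotationStore", "@include": [{"@include": "layer2.json"}]}'
)
for location, type_name in find_invalid_includes(sample):
    print(f"{location}: unexpected {type_name}")
```

Running the same function on the file that fails to load would show whether any "@include" entry has an unexpected shape.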
1. Text Processing:
2. Condition Check:
3. Modular Design:
4. Custom Pipeline:
For metadata
ann_store: id = Pecha ID
ann_data_set: id = Meta_Data
For Translation
ann_store: id = Pecha ID
ann_data_set: id = Translation
ann_data:
key: Translation_Segment
value: Tibetan_Segment
key: Translation_Segment
value: English_Segment
For Root and Commentary
ann_store: id = Pecha ID
ann_data_set: id = Root_Commentary
ann_data:
key: Associated_Alignment
value: Root_Segment
key: Associated_Alignment
value: Commentary_Segment
For OPF
ann_store: id = Pecha ID
ann_data_set: id = Structure_Annotation
ann_data:
key: Structure_Type
value: Chapter
key: Structure_Type
value: Tsawa
key: Structure_Type
value: Meaning_Segment
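The scheme above can also be written down as plain data, which makes it easy to check a key/value pair before creating an annotation. This is only a sketch: ANNOTATION_SCHEME and is_valid are hypothetical names, not part of the parser or of STAM, and Meta_Data is left empty because no keys are listed for it.

```python
# Dataset id -> key -> allowed values, following the lists above.
ANNOTATION_SCHEME = {
    "Meta_Data": {},
    "Translation": {
        "Translation_Segment": {"Tibetan_Segment", "English_Segment"},
    },
    "Root_Commentary": {
        "Associated_Alignment": {"Root_Segment", "Commentary_Segment"},
    },
    "Structure_Annotation": {
        "Structure_Type": {"Chapter", "Tsawa", "Meaning_Segment"},
    },
}


def is_valid(dataset_id, key, value):
    """Check that a key/value pair is allowed in the given annotation data set."""
    allowed_keys = ANNOTATION_SCHEME.get(dataset_id, {})
    return value in allowed_keys.get(key, set())


print(is_valid("Structure_Annotation", "Structure_Type", "Tsawa"))
print(is_valid("Translation", "Structure_Type", "Chapter"))
```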
Description
We are planning to implement the DTS (Distributed Text Services) specification to develop a text API. One of its key endpoints is the Navigation endpoint, which lets users move through the structure of a text and retrieve specific passages. To test this feature properly, we are preparing structured annotations for the Chonjuk dataset.
Requirement
Chonjuk Data and Annotation Illustration
Expected Output
An OPF/Pecha for Chonjuk with three levels of structural annotations.
Implementation Steps
annotation on annotation