annotation / stam

Stand-off Text Annotation Model (STAM) is a data model for stand-off-text annotation where any information on a text is represented as an annotation. This repository contains the model's full specification, extensions, schemas, examples and documentation.
https://annotation.github.io/stam/
Creative Commons Attribution Share Alike 4.0 International

Question: How can we efficiently retrieve existing annotation data by searching based on key and value? #32

Open tenzin3 opened 1 month ago

tenzin3 commented 1 month ago
# If the annotation data already exists, reuse it; otherwise create a new one with a new ID.
prepared_ann_data = []
for k, v in ann_data.items():
    try:
        ann_datas = list(ann_store.data(set=ann_dataset.id(), key=k, value=v))
        prepared_ann_data.append(ann_datas[0])
    except Exception:  # a missing key raises an error; an empty result raises IndexError
        prepared_ann_data.append(
            {"id": get_uuid(), "set": ann_dataset.id(), "key": k, "value": v}
        )

ann_store.annotate(target=text_selector, data=prepared_ann_data, id=get_uuid())

In ann_data, we have annotation data that we want to associate with an annotation. We aim to avoid creating a new annotation data entry with a new ID if it already exists. If annotation data with the same key and value is already present, we want to link it to the incoming annotation instead of duplicating it. The current code works, but I wanted to know if there's a better solution using the STAM API.

Apparently, if the key doesn't exist in the annotation data set, it throws an error.

proycon commented 1 month ago

STAM will already do something similar internally, assigning a new random ID for the annotation data if it is new, and reusing the existing one if not, so you can just pass something like:

ann_store.annotate(
    target=text_selector,
    data=[
        {"set": ann_dataset.id(), "key": k, "value": v},
        {"set": ann_dataset.id(), "key": k2, "value": v2},
    ],
    id=get_uuid(),
)

Note that I omitted the AnnotationData ID here, which means an ID will be assigned automatically. STAM assigns a random 21-character nanoid rather than a UUID, as that takes less space; see https://crates.io/crates/nanoid.
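To make the reuse-or-create behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not STAM's implementation and uses none of its API: `ToyDataStore`, `get_or_create` and `random_id` are hypothetical names, and the 64-symbol alphabet is only assumed to resemble nanoid's default.

```python
import secrets
import string

# URL-safe 64-symbol alphabet (assumption: similar in spirit to nanoid's default).
ALPHABET = string.ascii_letters + string.digits + "_-"


def random_id(size: int = 21) -> str:
    """Return a random ID of `size` characters drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(size))


class ToyDataStore:
    """Hypothetical stand-in for an annotation store; not part of the STAM API."""

    def __init__(self) -> None:
        # Maps a (set, key, value) triple to the ID assigned to it.
        self._ids = {}

    def get_or_create(self, set_id, key, value):
        """Reuse the ID of an existing triple, or assign a fresh random one."""
        triple = (set_id, key, value)
        if triple not in self._ids:
            self._ids[triple] = random_id()
        return self._ids[triple]


store = ToyDataStore()
first = store.get_or_create("myset", "pos", "noun")
second = store.get_or_create("myset", "pos", "noun")  # same triple: ID is reused
other = store.get_or_create("myset", "pos", "verb")   # new value: new ID
```

The point of the sketch is only that the caller never has to check for existence: passing the same set, key and value twice yields the same ID.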

If you really do want to assign the AnnotationData ID explicitly, then the method you used is okay, but the lookup inside the try block can be made slightly faster by fetching only the first match:

prepared_ann_data.append(next(ann_store.data(set=ann_dataset.id(), key=k, value=v, limit=1)))
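Why `next(...)` can be cheaper than `list(...)[0]` is plain Python iterator behaviour, sketched below with a toy generator. The `matches` function is hypothetical and only assumes that the real query is evaluated lazily, as the snippet above suggests; nothing here calls STAM.

```python
def matches(items, wanted, inspected):
    """Lazily yield items equal to `wanted`, recording every item inspected."""
    for item in items:
        inspected.append(item)
        if item == wanted:
            yield item


items = ["a", "b", "a", "c", "a"]

# list(...)[0] exhausts the iterator before taking the first element.
inspected_by_list = []
first_via_list = list(matches(items, "a", inspected_by_list))[0]

# next(...) stops as soon as the first match is produced.
inspected_by_next = []
first_via_next = next(matches(items, "a", inspected_by_next))

# A default avoids StopIteration when nothing matches.
missing = next(matches(items, "z", []), None)
```

Passing a default to `next` only covers the empty-result case. An error raised by the query itself, such as the missing-key error mentioned earlier in this thread, would still need the try block.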