Improve annotation schema infrastructure

dsj976 commented 1 year ago

Summary

The original annotation schema infrastructure defined the possible annotation labels as enumerations. This approach is not sustainable: it effectively hard-codes the annotation schema, which makes the schema hard to update. Additionally, Alembic does not automatically detect changes in Enum values, so it cannot auto-generate migration scripts for them (see issue #23).

A better approach is to store the annotation schema across several tables of a relational database. The user specifies the annotation schema in a JSON file, which is then parsed into the database. Because the schema is specified in a JSON file, the depth of the annotations (i.e. the number of annotation levels) can be chosen flexibly.
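To make the idea concrete, here is a minimal sketch of parsing a nested JSON schema into a self-referential table. It is not the project's actual model: the JSON layout, the `label` table, its columns and the `import_schema` helper are all assumptions made for illustration, and it uses the standard-library `sqlite3` module rather than SQLAlchemy so that it runs on its own.

```python
import json
import sqlite3

# Hypothetical schema file content: each key is a label, each value holds
# its child labels, so the nesting depth is not fixed in advance.
SCHEMA_JSON = """
{
  "Emotion": {
    "Positive": {"Joy": {}, "Relief": {}},
    "Negative": {"Anger": {}}
  },
  "Topic": {}
}
"""


def create_tables(conn):
    # One row per label; parent_id points at the parent label's row
    # (NULL for top-level labels).
    conn.execute(
        "CREATE TABLE label ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " parent_id INTEGER REFERENCES label(id))"
    )


def import_schema(conn, tree, parent_id=None):
    """Recursively insert each label, linking it to its parent row."""
    for name, children in tree.items():
        cur = conn.execute(
            "INSERT INTO label (name, parent_id) VALUES (?, ?)",
            (name, parent_id),
        )
        import_schema(conn, children, cur.lastrowid)


conn = sqlite3.connect(":memory:")
create_tables(conn)
import_schema(conn, json.loads(SCHEMA_JSON))
conn.commit()

rows = conn.execute("SELECT name, parent_id FROM label ORDER BY id").fetchall()
for name, parent_id in rows:
    print(name, parent_id)
```

Adding another annotation level in this layout only means nesting one level deeper in the JSON file; the table definition stays the same.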

What needs to be done?

[x] Design the SQL schema to store the annotation schema
- [ ] How to ensure data consistency? e.g. addition/removal of labels, renaming of labels, etc.
[x] Create Python classes to manage the annotation schema
- [ ] These classes should have custom methods to add/remove/rename labels safely, ensuring data consistency in the database
[x] Tests
[ ] How to design a flexible front-end that can accommodate a flexible annotation schema (e.g. a variable number of annotation levels)?

Updates

30/10/2023. The database models have been created and the manager classes are a work in progress. Testing is underway.

dsj976 commented 1 year ago

The annotation schema managers currently support:

Importing an annotation schema from a JSON file for the client, therapist or dyad, and parsing it into the database.
Deleting the (whole) annotation schema for the client, therapist or dyad.

The managers should also have methods for updating the annotation schemas without clearing the whole database table and recreating it from scratch. Consider developing the following methods:

A method to rename a specific label (e.g. by providing the label name and the parent ID)
A method to update the schema. This could be useful when adding new labels or removing old labels. This could be done, for instance, by updating the JSON file, creating a temporary database table that can be compared to our main table, and updating only the rows that have changed. These operations can be performed using SQLAlchemy.
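The second idea (load the updated schema into a temporary table, compare it with the main table, and touch only the rows that differ) could look roughly like the sketch below. Everything here is an assumption for illustration: the `label` table keyed on a slash-separated `path`, the `label_new` temporary table, and the `flatten` and `sync_schema` helpers. It uses the standard-library `sqlite3` module to stay self-contained; the same statements can be issued through SQLAlchemy. Renaming is not covered.

```python
import sqlite3


def flatten(tree, prefix=""):
    """Yield one 'A/B/C' path per label in a nested dict of labels."""
    for name, children in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        yield path
        yield from flatten(children, path)


def sync_schema(conn, tree):
    """Apply only the differences between the stored schema and `tree`.

    Returns (number of labels added, number of labels removed).
    """
    # Load the new schema into a temporary table.
    conn.execute("CREATE TEMP TABLE label_new (path TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO label_new (path) VALUES (?)",
        [(p,) for p in flatten(tree)],
    )
    # Remove labels that no longer appear in the new schema.
    # NOTE: a real implementation would first check (or rely on foreign
    # keys to check) that no annotation still references these labels.
    removed = conn.execute(
        "DELETE FROM label WHERE path NOT IN (SELECT path FROM label_new)"
    ).rowcount
    # Add labels that are new; unchanged rows keep their existing ids.
    added = conn.execute(
        "INSERT INTO label (path) "
        "SELECT path FROM label_new "
        "WHERE path NOT IN (SELECT path FROM label)"
    ).rowcount
    conn.execute("DROP TABLE label_new")
    conn.commit()
    return added, removed


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE label (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL)"
)

# Initial import, then an update that drops 'Anger' and adds 'Relief'.
sync_schema(conn, {"Emotion": {"Joy": {}, "Anger": {}}})
joy_before = conn.execute(
    "SELECT id FROM label WHERE path = 'Emotion/Joy'"
).fetchone()[0]

added, removed = sync_schema(conn, {"Emotion": {"Joy": {}, "Relief": {}}})
joy_after = conn.execute(
    "SELECT id FROM label WHERE path = 'Emotion/Joy'"
).fetchone()[0]
paths = {row[0] for row in conn.execute("SELECT path FROM label")}
```

Because unchanged labels are never deleted and re-inserted, their primary keys stay stable, so any annotation rows pointing at them remain valid after the update.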

dsj976 commented 1 year ago

Work to update the annotation schema for the client is underway. Instead of storing all the data associated with the annotations in one big table with many columns, the original annotation table has been broken up into smaller relational tables. See commit 67e6983. This provides greater flexibility: the number of columns is fixed when the SQL table is created, whereas the number of rows is not constrained.
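As an illustration of the rows-instead-of-columns idea, the sketch below links annotations to labels through a separate association table, so attaching another label adds a row rather than a column. It does not reproduce the tables in the commit: the `label`, `annotation` and `annotation_label` tables and the `segment_id` column are hypothetical names, and the standard-library `sqlite3` module is used so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE label (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE annotation (
        id INTEGER PRIMARY KEY,
        segment_id INTEGER NOT NULL  -- hypothetical: the annotated item
    );
    -- One row per (annotation, label) pair instead of one column per label.
    CREATE TABLE annotation_label (
        annotation_id INTEGER NOT NULL REFERENCES annotation(id),
        label_id INTEGER NOT NULL REFERENCES label(id),
        PRIMARY KEY (annotation_id, label_id)
    );
    """
)

conn.executemany(
    "INSERT INTO label (id, name) VALUES (?, ?)",
    [(1, "Joy"), (2, "Anger"), (3, "Relief")],
)
conn.execute("INSERT INTO annotation (id, segment_id) VALUES (1, 42)")
# Attach two labels to the same annotation: two rows, no schema change.
conn.executemany(
    "INSERT INTO annotation_label (annotation_id, label_id) VALUES (?, ?)",
    [(1, 1), (1, 3)],
)

names = [
    row[0]
    for row in conn.execute(
        "SELECT label.name FROM annotation_label "
        "JOIN label ON label.id = annotation_label.label_id "
        "WHERE annotation_label.annotation_id = 1 "
        "ORDER BY label.id"
    )
]

# The foreign key rejects a link to a label that does not exist.
try:
    conn.execute(
        "INSERT INTO annotation_label (annotation_id, label_id) VALUES (1, 99)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The foreign keys also give a first line of defence for data consistency: a link to a label that is not in the schema is refused by the database itself.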

Maria-Liakata-NLP-Group / annotations-interface