forTEXT / catma

Computer Assisted Text Markup and Analysis
https://www.catma.de
GNU General Public License v3.0

Optimized directory and file layout for Collections #304

Closed mpetris closed 1 year ago

mpetris commented 2 years ago

Currently, a Collection is a directory with a header.json containing metadata and a subdirectory containing the Annotations of that Collection. Each Annotation sits in its own file. For Collections (or, even worse, for CATMA Projects) with tens or hundreds of thousands of Annotations, this is a performance bottleneck when loading the Annotations.

The reason for choosing this one-Annotation-per-file layout over an all/many-Annotations-in-one-file layout was to avoid git conflicts when creating new Annotations.

The goal is therefore to reach good read and write performance without git conflicts on Annotation creation:

Example file and folder layout for a Collection with two users, A and B, where B has created more Annotations than fit in one page:

a_collection/
├─ header.json
└─ annotations/
   ├─ A_1.json
   ├─ B_1.json
   └─ B_2.json

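
To make the layout above concrete, here is a minimal sketch of how page files could be named and how a user's latest page could be found. The `<user>_<page>.json` pattern is inferred from the example only; the class and method names (`AnnotationPages`, `pageFileName`, `latestPage`) are hypothetical and not taken from the CATMA code base.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnnotationPages {
    // Assumed naming scheme, inferred from the example layout: <user>_<page>.json
    private static final Pattern PAGE_FILE = Pattern.compile("^(.+)_(\\d+)\\.json$");

    // Builds the file name of a given page for a given user.
    static String pageFileName(String user, int page) {
        return user + "_" + page + ".json";
    }

    // Returns the highest page number found for the user, or 0 if the user has no page yet.
    static int latestPage(List<String> fileNames, String user) {
        int latest = 0;
        for (String name : fileNames) {
            Matcher m = PAGE_FILE.matcher(name);
            if (m.matches() && m.group(1).equals(user)) {
                latest = Math.max(latest, Integer.parseInt(m.group(2)));
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String> files = List.of("A_1.json", "B_1.json", "B_2.json");
        System.out.println(pageFileName("B", latestPage(files, "B"))); // B_2.json
        System.out.println(pageFileName("A", latestPage(files, "A"))); // A_1.json
    }
}
```

Because every user only ever writes to files carrying their own prefix, two users creating Annotations at the same time touch different files, which is what keeps git from reporting conflicts.
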
maltem-za commented 1 year ago

The changes outlined above have now been released with 7.0.0. The page "Upcoming Changes to the Backend Storage Mechanisms and Data Structures" also includes a short summary of these changes.

See MAX_ANNOTATION_PAGE_FILE_SIZE_BYTES in /src/main/java/de/catma/properties/CATMAPropertyKey.java for more on the annotation page file size.
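
For readers who want a rough picture of how a page size limit can drive the rollover to a new page, here is a minimal sketch. It is not CATMA's actual logic: the class `PageRollover`, the method `targetPage`, and the rule "start a new page when the next Annotation would exceed the limit" are assumptions for illustration, and the real limit comes from the property named above.

```java
final class PageRollover {
    // Decides which page a new Annotation should be written to.
    // currentPage < 1 means the user has no page yet.
    // Assumption: a page is full once adding the Annotation would exceed maxPageBytes.
    static int targetPage(int currentPage, long currentPageBytes,
                          long newAnnotationBytes, long maxPageBytes) {
        if (currentPage < 1) {
            return 1; // first Annotation of this user: start with page 1
        }
        boolean fits = currentPageBytes + newAnnotationBytes <= maxPageBytes;
        return fits ? currentPage : currentPage + 1;
    }

    public static void main(String[] args) {
        long max = 1_000L; // hypothetical limit, chosen only for this example
        System.out.println(targetPage(0, 0, 500, max));   // 1
        System.out.println(targetPage(1, 600, 400, max)); // 1
        System.out.println(targetPage(1, 600, 401, max)); // 2
    }
}
```
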