Closed: mzur closed this 5 months ago
Adding annotation feature vectors poses a problem with the current approach of a separate vector database (for MAIA). If the vector database gets an added `image_annotation_feature_vectors` table, this table must be kept in sync with the main database. This means that, whenever an annotation is added, modified or deleted, the feature vector in the vector database has to be updated, too. What's more, for Largo and later LabelBOT, we also need the `label_id` for the annotation. However, each annotation can have multiple label IDs. Here we can either add the `label_id` to the table and duplicate the entries for the same annotation but with different label IDs, or add a pivot table for annotation IDs and label IDs. But each of these must be kept in sync with any label change now, too. This is quite risky, as we might miss some locations where annotation labels are modified. Also, it might duplicate data unnecessarily.

So I'm now thinking about putting all the feature vectors back into the regular database. This way we could use joins and foreign key constraints to get label IDs without duplication, and also automatically delete items when they are no longer needed. Annotation feature vectors can be added/modified/deleted with the same logic that handles the annotation thumbnails.
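A minimal sketch of what this could look like in the regular database. The table name, the `annotation_id`/`image_annotation_labels` columns and the vector dimension are assumptions for illustration; the actual schema may differ:

```shell
# Sketch: keep feature vectors in the regular database so foreign keys
# and joins do the bookkeeping. Names and dimension are assumptions.
psql biigle <<'SQL'
CREATE TABLE image_annotation_feature_vectors (
    id BIGSERIAL PRIMARY KEY,
    annotation_id BIGINT NOT NULL
        REFERENCES image_annotations (id) ON DELETE CASCADE,
    vector vector(384) NOT NULL  -- pgvector column
);

-- Label IDs come from a join instead of duplicated rows or a
-- manually synchronized pivot table:
SELECT v.id, v.vector, al.label_id
FROM image_annotation_feature_vectors v
JOIN image_annotation_labels al ON al.annotation_id = v.annotation_id;
SQL
```

The `ON DELETE CASCADE` is what makes the cleanup automatic: deleting an annotation drops its feature vectors with no extra application logic.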
The main reason to separate the vector database from the main database was that the backups could be separated, too. I don't want to frequently back up a 100 GB database every 10 minutes. So I'm now experimenting with `pg_dump --exclude-table` to create a backup that does not include feature vectors (this makes it necessary to create separate tables for all feature vectors and, e.g., not just add another column to `image_annotations`). I also tried `pg_dump -Fc --table` to create a dump that only contains the feature vector tables. The `-Fc` custom archive format is necessary so the individual tables can be selected during the import.

Dumping with `-Fc` takes a very long time (maybe because it compresses at the same time).
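As a sketch, the custom-format dump and the selective restore could look like this (the database name and table name are placeholders):

```shell
# Dump only the feature vector table, in the custom archive format.
pg_dump -Fc --table 'image_annotation_feature_vectors' \
    -f vectors.dump biigle

# The custom format allows selecting individual tables at import time:
pg_restore --table 'image_annotation_feature_vectors' \
    -d biigle_restored vectors.dump
```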
I'll now try a combination of `pg_dump --exclude-table-data "*_feature_vectors"` and `pg_dump --table "*_feature_vectors" --data-only` with the (faster) plain format and see if this can be successfully restored.
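A sketch of this two-dump scheme and its restore order (database names are placeholders). Note that `--exclude-table-data` still dumps the schema of the excluded tables, so the data-only dump has tables to load into:

```shell
# Frequent backup: full schema, but no rows for the feature vector tables.
pg_dump --exclude-table-data '*_feature_vectors' -f base.sql biigle

# Infrequent backup: only the feature vector rows.
pg_dump --table '*_feature_vectors' --data-only -f vectors.sql biigle

# Restore order matters: schema and regular data first, then the vectors.
createdb biigle_restored
psql biigle_restored -f base.sql
psql biigle_restored -f vectors.sql
```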
Here is a possible strategy to migrate the existing setup: run `pg_dump --data-only` on the vector database and then restore the dump to the regular database.

Here is what I found:
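Assuming the vector database is a separate Postgres database (names are placeholders), the migration step could be sketched as:

```shell
# Dump only the data from the old, separate vector database ...
pg_dump --data-only -f vector_data.sql biigle_vectors

# ... and load it into the regular database, which already contains
# the (empty) feature vector tables after the schema migration.
psql biigle -f vector_data.sql
```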
We need `label_id` and `label_tree_id` directly in the feature vector table, too, and this is only possible by manually managing add/modify/delete operations for annotation labels, which is what prompted me to investigate this in the first place. Performance is crucial here, so this may really be necessary.
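If this denormalization is necessary, the table could carry the label columns directly (table/column names and vector dimension are assumptions), at the price that every add/modify/delete of an annotation label must update these rows:

```shell
psql biigle <<'SQL'
-- Denormalized variant: label_id and label_tree_id live in the feature
-- vector table itself, so similarity queries can filter without a join.
-- Names and dimension are assumptions for illustration.
CREATE TABLE image_annotation_label_feature_vectors (
    id BIGINT PRIMARY KEY,       -- e.g. the annotation label ID
    annotation_id BIGINT NOT NULL,
    label_id BIGINT NOT NULL,
    label_tree_id BIGINT NOT NULL,
    vector vector(384) NOT NULL
);
SQL
```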
Resolves #88
Notes:

- Update the feature vector Python script to enable a "single-file" mode where it reads a single file and outputs the feature vector to stdout. This can be used in GenerateAnnotationPatch to avoid reading and writing additional files.
- Maybe leave the implementation with the CSV file exclusive to MAIA after all? Remove the Trait again.
- `sim-sort-thumbs` branch: this method uses the approximated bounding box of the annotation instead of the whole thumbnail to generate FV. Maybe this can be used to generate FV for remote volumes, whereas we generate from original files for locally stored data. It's hard to determine how well the sorting works with this, as it looks OK but is not identical to the sorting based on original files.
- `generate-missing` command that submits one job per file. The command should be made more intelligent so it checks for missing data file by file and groups submitted jobs (with `$only` annotations) by file.
- `sim-sort-thumbs` seems to work quite well compared to the "real thing" (I was finally able to compare the sorting on real data). So we can use this to initialize all remote volumes.
- `generate-missing` with the new "ProcessAnnotatedFile" jobs.
- `sim-sort-thumbs`.
- Implement synchronization between the regular database and the vector database: if an annotation changes, update all feature vectors of the annotation; if a label changes, add/remove a feature vector (there can be several places where this happens).
- Enable the index from the beginning so it doesn't take long to compute for LabelBOT.
- Unclear how the index should work (with partitioned tables etc.).
- Make the call to CopyFeatureVector in the Largo save controller more efficient (copy in batches with insert?).
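For the index note, enabling it from the beginning could be sketched with pgvector's HNSW index as follows. The index type, operator class and table name are assumptions; how this interacts with partitioned tables is exactly the open question above:

```shell
psql biigle <<'SQL'
-- Approximate nearest neighbor index (pgvector HNSW). Creating it while
-- the table is still small avoids a long index build later for LabelBOT.
CREATE INDEX ON image_annotation_feature_vectors
    USING hnsw (vector vector_cosine_ops);
SQL
```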