biigle / largo

BIIGLE module to review image annotations in a regular grid
GNU General Public License v3.0

Outlier detection #120

Closed mzur closed 5 months ago

mzur commented 7 months ago

Resolves #88

Notes:

mzur commented 6 months ago

Adding annotation feature vectors poses a problem with the current approach of a separate vector database (for MAIA). If the vector database gets an added image_annotation_feature_vectors table, this table must be kept in sync with the main database: whenever an annotation is added, modified or deleted, the feature vector in the vector database has to be updated, too.

What's more, for Largo and later LabelBOT, we also need the label_id for each annotation. However, an annotation can have multiple label IDs. We could either add the label_id to the table and duplicate the entries for the same annotation (once per label ID) or add a pivot table for annotation IDs and label IDs. But each of these must now be kept in sync with any label change as well. This is quite risky, as we might miss some locations where annotation labels are modified. It might also duplicate data unnecessarily.

So I'm now thinking about putting all the feature vectors back into the regular database. This way we could use joins and foreign key constraints to get label IDs without duplication and also automatically delete items when they are no longer needed. Annotation feature vectors can be added/modified/deleted with the same logic that handles the annotation thumbnails.
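A minimal sketch of how this could look in the main database (table and column names are assumptions, not a final schema), assuming the pgvector extension and an existing image_annotation_labels pivot table. The foreign key with ON DELETE CASCADE removes feature vectors automatically with their annotation, and label IDs come from a join instead of being duplicated:

```shell
# Assumptions: database "biigle", pgvector installed, 384-dimensional vectors.
psql -d biigle <<'SQL'
CREATE EXTENSION IF NOT EXISTS vector;

-- A separate table (not an extra column on image_annotations) so it can
-- be excluded from the frequent backups discussed below.
CREATE TABLE image_annotation_feature_vectors (
    id BIGSERIAL PRIMARY KEY,
    annotation_id BIGINT NOT NULL
        REFERENCES image_annotations (id) ON DELETE CASCADE,
    vector vector(384) NOT NULL
);

-- Label IDs are obtained via a join with the existing pivot table,
-- so they are never duplicated in the vector table:
SELECT v.vector, al.label_id
FROM image_annotation_feature_vectors v
JOIN image_annotation_labels al ON al.annotation_id = v.annotation_id;
SQL
```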

The main reason to separate the vector database from the main database was that the backups could be separated, too. I don't want to frequently back up a 100 GB database every 10 minutes. So I'm now experimenting with pg_dump --exclude-table to create a backup that does not include feature vectors (this makes it necessary to create separate tables for all feature vectors and, e.g., not just add another column to image_annotations). I'm also trying pg_dump -Fc --table to create a dump that only contains the feature vector tables. The -Fc custom archive format is necessary so the individual tables can be selected during the import.
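Assuming a database named biigle and feature vector tables matching *_feature_vectors (both illustrative), the two experiments could look like this:

```shell
# Main backup: everything except the feature vector tables (plain SQL).
pg_dump --exclude-table "*_feature_vectors" -f main_backup.sql biigle

# Vector backup: only the feature vector tables, in the custom archive
# format so that individual tables can be selected during import.
pg_dump -Fc --table "*_feature_vectors" -f vectors.dump biigle

# Restore a single table from the custom archive:
pg_restore --table image_annotation_feature_vectors -d biigle vectors.dump
```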

mzur commented 6 months ago

Dumping with -Fc takes a very long time (maybe because it's compressing at the same time).

I'll now try a combination of pg_dump --exclude-table-data "*_feature_vectors" and pg_dump --table "*_feature_vectors" --data-only with the (faster) plain format and see if this can be successfully restored.
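A sketch of that combination (database and file names are assumptions). Unlike --exclude-table, --exclude-table-data keeps the table definitions in the main dump, which is what makes the data-only restore of the vectors possible afterwards:

```shell
# Main backup: full schema, all data except the feature vectors.
pg_dump --exclude-table-data "*_feature_vectors" -f main_backup.sql biigle

# Vector backup: data only, plain format (no compression overhead).
pg_dump --table "*_feature_vectors" --data-only -f vectors_backup.sql biigle

# Restore order matters: schema and regular data first, then vector data.
psql -d biigle_restored -f main_backup.sql
psql -d biigle_restored -f vectors_backup.sql
```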

mzur commented 6 months ago

Here is a possible strategy to migrate the existing setup:

mzur commented 6 months ago

Here is what I found: