Closed: mzur closed this 5 months ago
Adding annotation feature vectors poses a problem with the current approach of a separate vector database (for MAIA). If the vector database gets an added `image_annotation_feature_vectors` table, this table must be kept in sync with the main database. This means that, whenever an annotation is added, modified or deleted, the feature vector in the vector database has to be updated, too. What's more, for Largo and later LabelBOT, we also need the `label_id` for the annotation. However, each annotation can have multiple label IDs. Here we can either add the `label_id` to the table and duplicate the entries for the same annotation but with different label IDs, or add a pivot table for annotation IDs and label IDs. But each of these must be kept in sync with any label change now, too. This is quite risky, as we might miss some locations where annotation labels are modified. Also, it might duplicate data unnecessarily.

So I'm now thinking about putting all the feature vectors back into the regular database. This way we could use joins and foreign key constraints to get label IDs without duplication, and also automatically delete items when they are no longer needed. Annotation feature vectors can be added/modified/deleted with the same logic that handles the annotation thumbnails.
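A minimal sketch of what this could look like in the regular database. The table name, the `annotation_id`/`image_annotation_labels` columns and the vector dimension are assumptions for illustration; the actual schema may differ:

```shell
# Sketch: keep feature vectors in the regular database so foreign keys
# and joins do the bookkeeping. Names and dimension are assumptions.
psql biigle <<'SQL'
CREATE TABLE image_annotation_feature_vectors (
    id BIGSERIAL PRIMARY KEY,
    annotation_id BIGINT NOT NULL
        REFERENCES image_annotations (id) ON DELETE CASCADE,
    vector vector(384) NOT NULL  -- pgvector column
);

-- Label IDs come from a join instead of duplicated rows or a
-- manually synchronized pivot table:
SELECT v.id, v.vector, al.label_id
FROM image_annotation_feature_vectors v
JOIN image_annotation_labels al ON al.annotation_id = v.annotation_id;
SQL
```

The `ON DELETE CASCADE` is what makes the cleanup automatic: deleting an annotation drops its feature vectors with no extra application logic.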
The main reason to separate the vector database from the main database was that the backups could be separated, too. I don't want to frequently back up a 100 GB database every 10 minutes. So I'm now experimenting with `pg_dump --exclude-table` to create a backup that does not include feature vectors (this makes it necessary to create separate tables for all feature vectors and, e.g., not just add another column to `image_annotations`). I also tried `pg_dump -Fc --table` to create a dump that only contains the feature vector tables. The `-Fc` custom archive format is necessary so the individual tables can be selected during the import.

Dumping with `-Fc` takes a very long time (maybe because it compresses at the same time).
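As a sketch, the custom-format dump and the selective restore could look like this (the database name and table name are placeholders):

```shell
# Dump only the feature vector table, in the custom archive format.
pg_dump -Fc --table 'image_annotation_feature_vectors' \
    -f vectors.dump biigle

# The custom format allows selecting individual tables at import time:
pg_restore --table 'image_annotation_feature_vectors' \
    -d biigle_restored vectors.dump
```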
I'll now try a combination of `pg_dump --exclude-table-data "*_feature_vectors"` and `pg_dump --table "*_feature_vectors" --data-only` with the (faster) plain format and see if this can be successfully restored.
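A sketch of this two-dump scheme and its restore order (database names are placeholders). Note that `--exclude-table-data` still dumps the schema of the excluded tables, so the data-only dump has tables to load into:

```shell
# Frequent backup: full schema, but no rows for the feature vector tables.
pg_dump --exclude-table-data '*_feature_vectors' -f base.sql biigle

# Infrequent backup: only the feature vector rows.
pg_dump --table '*_feature_vectors' --data-only -f vectors.sql biigle

# Restore order matters: schema and regular data first, then the vectors.
createdb biigle_restored
psql biigle_restored -f base.sql
psql biigle_restored -f vectors.sql
```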
Here is a possible strategy to migrate the existing setup: run `pg_dump --data-only` on the vector database and then restore the dump to the regular database.

Here is what I found:
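Assuming the vector database is a separate Postgres database (names are placeholders), the migration step could be sketched as:

```shell
# Dump only the data from the old, separate vector database ...
pg_dump --data-only -f vector_data.sql biigle_vectors

# ... and load it into the regular database, which already contains
# the (empty) feature vector tables after the schema migration.
psql biigle -f vector_data.sql
```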
We need `label_id` and `label_tree_id` directly in the feature vector table, too, and this is only possible by manually managing add/modify/delete operations for annotation labels, which is what prompted me to investigate this in the first place. Performance is crucial here, so this may really be necessary.
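If this denormalization is necessary, the table could carry the label columns directly (table/column names and vector dimension are assumptions), at the price that every add/modify/delete of an annotation label must update these rows:

```shell
psql biigle <<'SQL'
-- Denormalized variant: label_id and label_tree_id live in the feature
-- vector table itself, so similarity queries can filter without a join.
-- Names and dimension are assumptions for illustration.
CREATE TABLE image_annotation_label_feature_vectors (
    id BIGINT PRIMARY KEY,       -- e.g. the annotation label ID
    annotation_id BIGINT NOT NULL,
    label_id BIGINT NOT NULL,
    label_tree_id BIGINT NOT NULL,
    vector vector(384) NOT NULL
);
SQL
```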
Resolves #88
Notes:

- Update the feature vector Python script to enable a "single-file" mode where it reads a single file and outputs the feature vector to stdout. This can be used in GenerateAnnotationPatch to avoid reading and writing additional files.
- Maybe leave the implementation with the CSV file exclusive to MAIA after all? Remove the Trait again.
- `sim-sort-thumbs` branch: this method uses the approximated bounding box of the annotation instead of the whole thumbnail to generate FV. Maybe this can be used to generate FV for remote volumes, whereas we generate from original files for locally stored data. It's hard to determine how well the sorting works with this, as it looks OK but is not identical to the sorting based on original files.
- `generate-missing` command that submits one job per file. The command should be made more intelligent so it checks for missing data file by file and groups submitted jobs (with `$only` annotations) by file.
- `sim-sort-thumbs` seems to work quite well compared to the "real thing" (I was finally able to compare the sorting on real data). So we can use this to initialize all remote volumes.
- `generate-missing` with the new "ProcessAnnotatedFile" jobs.
- `sim-sort-thumbs`.
- Implement synchronization between the regular database and the vector database: if an annotation changes, update all feature vectors of the annotation; if a label changes, add/remove a feature vector (there can be several places where this happens).
- Enable the index from the beginning so it doesn't take long to compute for LabelBOT.
- Unclear how the index should work (with partitioned tables etc.).
- Make the call to CopyFeatureVector in the Largo save controller more efficient (copy in batches with insert?).
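For the index note, enabling it from the beginning could be sketched with pgvector's HNSW index as follows. The index type, operator class and table name are assumptions; how this interacts with partitioned tables is exactly the open question above:

```shell
psql biigle <<'SQL'
-- Approximate nearest neighbor index (pgvector HNSW). Creating it while
-- the table is still small avoids a long index build later for LabelBOT.
CREATE INDEX ON image_annotation_feature_vectors
    USING hnsw (vector vector_cosine_ops);
SQL
```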