biigle / maia

BIIGLE module for the Machine Learning Assisted Image Annotation method

Use image retrieval techniques to find similar images #27

Closed: dlangenk closed this issue 9 months ago

dlangenk commented 5 years ago

More of a nice-to-have.

I just browsed through the results of novelty detection. Unfortunately, the classes are quite scattered, so selection takes some time. In addition, some classes are much more abundant than others, so the rare classes might get "lost" in the downstream steps. It would be nice to have a "show me more thumbnails that look like this one" mechanism. Algorithms for that are available in image retrieval. We could, for example, use MPEG-7 features or something similar to create a tree structure over the data that makes it easier to browse. Creating that structure shouldn't take much time or resources.
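For illustration, here is a minimal sketch of such a retrieval mechanism, assuming per-patch feature vectors (MPEG-7 descriptors or anything comparable) have already been extracted. The file names and the `similar_patches` helper are placeholders, not part of BIIGLE:

```python
# Minimal sketch: answer "show me more thumbnails that look like this one"
# with a nearest-neighbour index over per-patch feature vectors.
# The feature extraction itself is assumed to have happened already;
# "patch_features.npy" and "patch_ids.npy" are placeholder files.
import numpy as np
from sklearn.neighbors import NearestNeighbors

features = np.load("patch_features.npy")  # shape: (n_patches, n_dims)
patch_ids = np.load("patch_ids.npy")      # shape: (n_patches,)

# Building the tree is cheap compared to extracting the features.
index = NearestNeighbors(algorithm="ball_tree").fit(features)

def similar_patches(query_idx, k=20):
    """Return the IDs of the k patches most similar to the patch at query_idx."""
    _, neighbors = index.kneighbors(features[[query_idx]], n_neighbors=k + 1)
    # The first result is the query patch itself (distance 0), so drop it.
    return patch_ids[neighbors[0][1:]]
```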

mzur commented 3 years ago

#66 should be implemented first.

mzur commented 3 years ago

Idea for the UI: If this feature is active (it is optional, and disabled if not enough training data is available), the grid of image patches in MAIA is split vertically (e.g. 80% of the rows showing the regular patches, 20% showing patches suggested by this method). This way the original MAIA workflow is still possible even if this method performs poorly for a given use case.

mzur commented 2 years ago

This can be done with the image features and similarity search implemented for biigle/core#336. The function should be available for training proposals and annotation candidates.
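Purely as illustration, a nearest-neighbour query against a pgvector column could look roughly like the sketch below. The table and column names (`patch_features`, `patch_id`, `vector`) and the connection string are assumptions, not the actual schema of biigle/core#336:

```python
# Sketch of a similarity query with pgvector. The "<->" operator is
# pgvector's L2 distance; table/column names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=biigle user=biigle")

def most_similar_patches(query_vector, limit=100):
    """Return patch IDs ordered by L2 distance to the query vector."""
    vector_literal = "[" + ",".join(str(x) for x in query_vector) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT patch_id
            FROM patch_features
            ORDER BY vector <-> %s::vector
            LIMIT %s
            """,
            (vector_literal, limit),
        )
        return [row[0] for row in cur.fetchall()]
```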

mzur commented 2 years ago

Next idea for the UI: The selected proposal/candidate is shown fixed and highlighted at the first position in the grid. The remaining grid items are sorted by their similarity to that patch. They scroll and can be interacted with as usual. The filtering can be enabled with a hover button on each patch and disabled with a button on the highlighted, fixed patch.

mzur commented 2 years ago

Updated the title to make clear that this should be implemented both for training proposals and annotation candidates.

mzur commented 1 year ago

With the student experiments based on DINO features and #96 done, this can move forward now.

mzur commented 9 months ago

I want to pick this up again. New thoughts:

Here is a notebook with a minimal feature-extraction example using DINOv2: https://colab.research.google.com/drive/1LbtYkzdOezl2SadyxCRJFYhLd_aQNjlq?usp=sharing
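For reference, a condensed sketch of such a feature-extraction step, assuming the torch.hub release of DINOv2 and standard ImageNet preprocessing (the image path is a placeholder; the notebook may differ in details):

```python
# Minimal DINOv2 feature extraction: one embedding per image patch.
import torch
from PIL import Image
from torchvision import transforms

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),  # 224 is divisible by the ViT-S/14 patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("patch.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)

with torch.no_grad():
    embedding = model(batch)  # shape: (1, 384) for the ViT-S/14 model
```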

mzur commented 9 months ago

Thinking about it, maybe I prefer decoupling the vector database from our main database. With MAIA and Largo it's easy to implement cleanup of vector database rows, since the annotation/candidate/proposal patch files are also cleaned. Cleanup can be asynchronous as well.

This has the advantage that the vector DB does not have an impact on the regular DB backups. It can have its own (less frequent) backups and be run on a different host.

Laravel can work with different database connections (also for migrations). We only need to sync (and index) the model IDs from the regular DB to the vector DB, but this shouldn't be a problem.

I'll still stick with pgvector, as I don't want to introduce a new technology to the stack.
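To illustrate the decoupling idea, here is a rough sketch of what the asynchronous cleanup/sync between the regular DB and a separate vector DB could look like. Connection settings, table and column names are placeholders, not the BIIGLE schema:

```python
# Sketch: remove vector rows whose corresponding patches no longer exist
# in the regular DB. Could run asynchronously, e.g. as a queued job.
import psycopg2

main_db = psycopg2.connect("dbname=biigle host=main-db")
vector_db = psycopg2.connect("dbname=biigle_vectors host=vector-db")

with main_db.cursor() as src, vector_db.cursor() as dst:
    # IDs of annotation candidates that still exist in the regular DB.
    src.execute("SELECT id FROM maia_annotation_candidates")
    valid_ids = {row[0] for row in src.fetchall()}

    # IDs referenced by the vector DB that have become stale.
    dst.execute("SELECT candidate_id FROM candidate_features")
    stale_ids = {row[0] for row in dst.fetchall()} - valid_ids

    for candidate_id in stale_ids:
        dst.execute(
            "DELETE FROM candidate_features WHERE candidate_id = %s",
            (candidate_id,),
        )

vector_db.commit()
```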