IntelLabs / vdms

VDMS: Your Favorite Visual Data Management System
MIT License

Bug: Deletion of Descriptors isn't fully supported #201

Open ifadams opened 2 months ago

ifadams commented 2 months ago

Describe the bug

As stated in Wiki: Deletion Capabilities, the _deletion query allows a user to delete the content within VDMS that is associated with a find query (FindImage, FindEntity, FindDescriptor). Currently, descriptor deletion is NOT fully supported.

  1. Metadata is deleted and the descriptor is no longer returned in similarity search, but its entry still appears to be present in the index, because the number of returned results is affected after deletion.
  2. This is visible in the "Filtering on metadata" section, which starts at the bottom of page 26 of vdms_latest.pdf. In that section, ID=2 is extracted and displayed. A search for the K=3 nearest neighbors using the ID=2 vector as the query is performed and the distances are displayed. ID=2 is then deleted and the search is repeated with K=5. You'll see that only 4 results are returned and ID=2 is excluded.

To Reproduce

Steps to reproduce the behavior (as shown in the attached document):

  1. Add descriptors with a unique property such as ID
  2. Run a similarity search for K neighbors
  3. Find a descriptor within those K results using the unique property and delete it
  4. Re-run the similarity search with the same K (notice that K-1 results are returned and the deleted descriptor isn't present)
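
For anyone who wants to script the steps above, here is a rough sketch of the query JSON involved. It only builds the queries and does not talk to a server. The set name `repro_set`, the helper names, and the `FaissFlat` engine choice are placeholders of mine. The `_deletion` constraint form is an assumption that should be checked against the Wiki: Deletion Capabilities page.

```python
import struct

SET_NAME = "repro_set"  # hypothetical descriptor set name


def float32_blob(vector):
    """Pack floats into raw little-endian float32 bytes (the blob for one descriptor)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def add_set_query(dimensions):
    """Create the descriptor set (engine choice is just an example)."""
    return [{"AddDescriptorSet": {"name": SET_NAME,
                                  "dimensions": dimensions,
                                  "engine": "FaissFlat"}}]


def add_descriptor_queries(vectors):
    """Step 1: one AddDescriptor per vector, each with a unique ID property."""
    queries = [{"AddDescriptor": {"set": SET_NAME, "properties": {"ID": i}}}
               for i in range(len(vectors))]
    blobs = [float32_blob(v) for v in vectors]
    return queries, blobs


def search_query(k):
    """Steps 2 and 4: K-nearest-neighbor search (the query vector goes in as a blob)."""
    return [{"FindDescriptor": {"set": SET_NAME,
                                "k_neighbors": k,
                                "results": {"list": ["ID", "_distance"]}}}]


def delete_query(unique_id):
    """Step 3: find by the unique property and request deletion.

    ASSUMPTION: deletion is requested via a `_deletion` constraint;
    verify the exact syntax against the Wiki before relying on it.
    """
    return [{"FindDescriptor": {"set": SET_NAME,
                                "constraints": {"ID": ["==", unique_id],
                                                "_deletion": ["==", 1]},
                                "results": {"list": ["ID"]}}}]
```

Each query list, plus its blobs where applicable, would then be sent through the VDMS Python client. Comparing the number of entities returned by `search_query(K)` before and after `delete_query(...)` should show the K versus K-1 difference described in step 4.
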
ifadams commented 2 months ago

Migrated from an internal thread, from @s-gobriel:

I think it is important to explain the delete functionality from the VCL side.

The basic delete-by-ID functionality works with the following in mind.

The different index engines handle the delete functionality differently, as follows:

• IndexIVF: stores the descriptor IDs explicitly with the index. As a result, the IDs of the other descriptors do not change after a delete operation.

• IndexFlat (other FAISS indices that we do not support in VDMS, such as IndexPQ, behave the same way): supports the remove_ids function, which deletes the descriptor in question. However, it is important to understand that this index does not store the IDs explicitly, so a delete operation shifts the IDs of all vectors with an ID larger than the deleted one down by 1.

• IndexFLINNG: no delete operation is supported, because deletion is not supported for its hash tables.
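
To make the difference between the first two cases concrete, here is a toy model in plain Python. It does not use FAISS. `PositionalIndex` and `ExplicitIdIndex` are hypothetical stand-ins that only mimic the ID behavior described above (implicit positional IDs versus IDs stored with the index).

```python
class PositionalIndex:
    """Toy stand-in for an index whose IDs are implicit positions."""

    def __init__(self):
        self._vectors = []

    def add(self, vector):
        self._vectors.append(vector)
        return len(self._vectors) - 1  # the ID is simply the position

    def remove(self, idx):
        # Removing an entry shifts every later entry down by one position.
        del self._vectors[idx]

    def get(self, idx):
        return self._vectors[idx]


class ExplicitIdIndex:
    """Toy stand-in for an index that stores IDs explicitly."""

    def __init__(self):
        self._vectors = {}
        self._next_id = 0

    def add(self, vector):
        idx = self._next_id
        self._vectors[idx] = vector
        self._next_id += 1
        return idx

    def remove(self, idx):
        # Other IDs are untouched by the removal.
        del self._vectors[idx]

    def get(self, idx):
        return self._vectors[idx]


positional, explicit = PositionalIndex(), ExplicitIdIndex()
for name in ["v0", "v1", "v2", "v3", "v4"]:
    positional.add(name)
    explicit.add(name)

positional.remove(2)
explicit.remove(2)

print(positional.get(2))  # v3: the old ID 3 has shifted down to ID 2
print(explicit.get(3))    # v3: still reachable under its original ID
```

After the delete, any ID the application remembered for `v3` or `v4` is off by one against the positional index, while it is still valid against the explicit-ID index.
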

The logic of the VDMS client or the user application needs to be modified to account for the behavior explained above, so that the correct vectors are presented to the application after a deletion operation.
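
One way such a mapping could look on the application side is sketched below. This is an illustration only, not code from VDMS or the VDMS client. `StableIdMap` is a hypothetical helper that keeps application-level IDs stable while the positions in a flat-style index shift after each delete.

```python
class StableIdMap:
    """Tracks which application ID currently sits at each index position."""

    def __init__(self):
        # _app_ids[p] is the application ID of the vector at position p.
        self._app_ids = []

    def add(self, app_id):
        """Record a newly appended vector and return its index position."""
        self._app_ids.append(app_id)
        return len(self._app_ids) - 1

    def delete(self, app_id):
        """Drop app_id and return the position to remove from the index.

        Deleting from the list mirrors the shift-by-one that happens
        inside the index, so the two stay aligned.
        """
        position = self._app_ids.index(app_id)
        del self._app_ids[position]
        return position

    def to_app_ids(self, positions):
        """Translate positions returned by a search into application IDs."""
        return [self._app_ids[p] for p in positions]


id_map = StableIdMap()
for app_id in [10, 11, 12, 13]:
    id_map.add(app_id)

removed_position = id_map.delete(11)   # position 1 must be removed from the index
print(removed_position)                # 1
print(id_map.to_app_ids([1, 2]))       # [12, 13]
```

The linear `index()` lookup keeps the sketch short. A real mapping over a large set would need something faster, and it would also have to be persisted alongside the index.
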

Hope this is clear.

BTW, related to the delete functionality, duplicate detection is a trickier issue that can only be handled by the application.

ifadams commented 2 weeks ago

Active discussions underway, updates on diagnosis here:

What's going on is a mismatch between the behavior of the KNN, PMGD, and client expectations.

Currently, we allow an "_expiration" field to be included as part of a descriptor. This field sets a timer for automatic deletion (if turned on), which will automatically delete the PMGD graph nodes affiliated with a particular descriptor.

A KNN search returns the nearest neighbors, and the IDs are used internally to increase the specificity of the query.

However, the index the KNN runs over does not always support deletion, and currently the entry is not deleted from the index internally. So it's possible that a KNN search returns a "deleted" ID, and since it does not match an existing ID in the graph database, we return nothing for that ID.
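
A toy model of this mismatch, for illustration only: the dict `index`, the set `graph_nodes`, and both functions are hypothetical stand-ins for the KNN index and the PMGD graph, not VDMS internals. It reproduces the symptom from the original report, where a K=5 search returns 4 results after ID=2 is deleted.

```python
import math


def knn(index, query, k):
    """Return the IDs of the k vectors in `index` closest to `query` (Euclidean)."""
    ranked = sorted(index, key=lambda i: math.dist(index[i], query))
    return ranked[:k]


def find_descriptors(index, graph_nodes, query, k):
    """KNN first, then keep only IDs that still exist in the graph."""
    candidate_ids = knn(index, query, k)
    return [i for i in candidate_ids if i in graph_nodes]


# Six descriptors with IDs 1..6 placed along a line.
index = {i: (float(i), 0.0) for i in range(1, 7)}
graph_nodes = set(index)
query = index[2]

before = find_descriptors(index, graph_nodes, query, 5)

# "Delete" ID=2: the graph node goes away, the index entry stays.
graph_nodes.discard(2)
after = find_descriptors(index, graph_nodes, query, 5)

print(before)  # [2, 1, 3, 4, 5]
print(after)   # [1, 3, 4, 5]  (only 4 results for K=5)
```

The stale index entry still takes one of the K slots, so the closest remaining neighbor that should have filled it (ID=6 here) never makes it into the result.
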