
Efficient Update Strategy for Chunked Documents in Vector Databases #1533

Open Pipboyguy opened 5 months ago

Pipboyguy commented 5 months ago

Feature description

We need a system to manage chunked documents and their embeddings efficiently across vector databases and SQL stores with vector extensions. The goal is to update and replace documents split into chunks without orphaning data or requiring full re-embedding.

Use case

When working with large documents in vector databases or SQL stores with vector capabilities, it's common to split these documents into smaller chunks for embedding. This allows for more granular similarity searches and better performance. However, this creates challenges when updating or replacing documents using dlt's merge operation:

Proposed Solution

Assuming the same provider and embedding model are used:

rudolfix commented 4 months ago

@Pipboyguy the biggest challenge is updating the document in the database effectively (merge write disposition). if we chunk a document, should we yield many objects (documents), one per chunk, or should we yield a single compound object with all the chunks inside?

for the latter we already solved this problem for sql databases (child table merge), so the same mechanism could be reused if the vector database supports SQL (e.g. pgvector).

for other vector databases we'd need to drop all the previous chunks and then re-insert all the new chunks (or something more optimized), but I have some doubts about whether that is scalable.

any thoughts on that?
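
to make the two options concrete, a rough sketch of the two yield shapes (illustration only, not real dlt resources, embeddings omitted):

import hashlib

def _chunks(text: str, size: int = 10):
    return [text[i : i + size] for i in range(0, len(text), size)]

def _hash(chunk: str) -> str:
    return hashlib.md5(chunk.encode("utf-8")).hexdigest()

# option 1: one object per chunk -> a single flat table keyed by (doc_id, chunk_hash)
def per_chunk(doc):
    for chunk in _chunks(doc["text"]):
        yield {"doc_id": doc["doc_id"], "chunk_hash": _hash(chunk), "chunk_text": chunk}

# option 2: one compound object per document -> chunks land in a nested child table
def compound(doc):
    yield {
        "doc_id": doc["doc_id"],
        "doc_text": doc["text"],
        "chunks": [{"chunk_hash": _hash(c), "chunk_text": c} for c in _chunks(doc["text"])],
    }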

Pipboyguy commented 4 months ago

@rudolfix Thank you for providing some specificity and color to the issue.

I came up with a method (see the updated issue description) where only the chunks are hashed according to their contents, i.e. the merge key is (doc_id, chunk_hash).

What has worked in my experimentation is yielding a single compound object for each document, with all chunks inside. This specific combination, at least with SQL databases, groups the embeddings neatly in a child table. (I'd imagine there are other creative ways to do this as well)

Vector DBs With SQL Support

It works on any SQL store with a delete-insert strategy:

import hashlib
import random
from typing import List, Generator

import dlt
from dlt.common.typing import DictStrAny

def get_md5_hash(data: str) -> str:
    return hashlib.md5(data.encode("utf-8")).hexdigest()

def mock_embed(dim: int = 10) -> str:
    # For illustration, return a string instead of a compound (list) type to prevent
    # further normalization. In the target db this will usually be a vector/numpy
    # indivisible type, so it won't be normalized there.
    return str([random.uniform(0, 1) for _ in range(dim)])

def chunk_document(doc: str, chunk_size: int = 10) -> List[str]:
    return [doc[i : i + chunk_size] for i in range(0, len(doc), chunk_size)]

@dlt.resource(
    standalone=True,
    write_disposition="merge",
    merge_key=["doc_id", "chunk_hash"],
    table_name="document",
)
def documents(docs: List[DictStrAny]) -> Generator[DictStrAny, None, None]:
    for doc in docs:
        doc_id = doc["doc_id"]
        chunks = chunk_document(doc["text"])
        embeddings = [
            {
                "chunk_hash": get_md5_hash(chunk),
                "chunk_text": chunk,
                "embedding": mock_embed(),
            }
            for chunk in chunks
        ]
        yield {"doc_id": doc_id, "doc_text": doc["text"], "embeddings": embeddings}

pipeline = dlt.pipeline(
    pipeline_name="chunked_docs",
    destination="postgres",
    dataset_name="chunked_documents",  # dev_mode=True,
)

initial_docs = [
    {
        "text": "This is the first document. It contains some text that will be chunked and embedded. (I don't want "
        "to be seen in updated run's embedding chunk texts btw)",
        "doc_id": 1,
    },
    {
        "text": "Here's another document. It's a bit different from the first one.",
        "doc_id": 2,
    },
]

load_info = pipeline.run(documents(initial_docs))
print(f"Initial load: {load_info}")

updated_docs = [
    {
        "text": "This is the first document, but it has been updated with new content.",
        "doc_id": 1,
    },
    {
        "text": "This is a completely new document that wasn't in the initial set.",
        "doc_id": 3,
    },
]

update_info = pipeline.run(documents(updated_docs))
print(f"Update load: {update_info}")


So when a document is updated, the old embeddings are deleted as well like you mentioned. Neat!
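
For the record, here's a quick sanity check against the loaded dataset (assuming dlt's default child table name document__embeddings for the nested embeddings list, linked via _dlt_parent_id):

with pipeline.sql_client() as client:
    # parent rows: doc 1 should carry the updated text, doc 2 stays untouched, doc 3 is new
    docs = client.execute_sql("SELECT doc_id, doc_text FROM document ORDER BY doc_id")
    # child rows: no chunks from doc 1's first version should remain
    chunks = client.execute_sql(
        "SELECT d.doc_id, count(*) "
        "FROM document d JOIN document__embeddings e ON d._dlt_id = e._dlt_parent_id "
        "GROUP BY d.doc_id ORDER BY d.doc_id"
    )
    print(docs)
    print(chunks)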

Vector DBs Without SQL Support

I'm still figuring out how the above could be altered for vector DBs without transactional semantics or SQL facilities. I'd imagine this depends on the vector DB's capabilities, and some would handle it better than others.

It goes without saying that vector DBs that fully support NoSQL merge operations, or SQL-esque DBs with merge operations (like LanceDB), are the easiest. Otherwise we will probably have to rely on full loads of chunks, but with judicious calling of the embedding callback to keep costs minimal, perhaps by caching the embeddings of the chunk_hashes in the old doc_id first, before truncating the destination.
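
As a strawman, that flow could look something like this (pseudocode against a hypothetical vector DB client; the query/delete/insert method names are made up for illustration):

import hashlib
from typing import Callable, List

def md5(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def replace_document(client, collection: str, doc_id: int, new_chunks: List[str], embed_fn: Callable[[str], list]) -> None:
    # 1) fetch the existing chunk hashes + embeddings for this doc_id so that
    #    unchanged chunks don't get re-embedded (and re-paid for)
    existing = {
        rec["chunk_hash"]: rec["embedding"]
        for rec in client.query(collection, filter={"doc_id": doc_id})
    }
    # 2) build the new records, reusing cached embeddings where the hash matches
    records = []
    for chunk in new_chunks:
        chunk_hash = md5(chunk)
        embedding = existing.get(chunk_hash) or embed_fn(chunk)
        records.append(
            {"doc_id": doc_id, "chunk_hash": chunk_hash, "chunk_text": chunk, "embedding": embedding}
        )
    # 3) delete all old chunks for the doc, then insert the new set
    #    (no transactional guarantee here, so a native upsert/merge API is preferable where available)
    client.delete(collection, filter={"doc_id": doc_id})
    client.insert(collection, records)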

@rudolfix Some thoughts:

1) Can we provide the user with a cache so they don't need to re-embed identical chunks? If their documents change only slightly, there's no need to send 95% of the same content to the provider again and pay for it. Referencing my example, is there a decorator in dlt like @lru_cache that we could wrap mock_embed in? Perhaps using dlt state as the KV store? (Rough sketch below.)
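
Something like this is what I have in mind (reusing chunk_document and get_md5_hash from the example above; pipeline/resource state is probably not the right place for a very large cache, but it illustrates the idea):

@dlt.resource(
    standalone=True,
    write_disposition="merge",
    merge_key=["doc_id", "chunk_hash"],
    table_name="document",
)
def documents_cached(docs: List[DictStrAny], embed_fn) -> Generator[DictStrAny, None, None]:
    # cache keyed by chunk hash, persisted in the resource's state between runs
    cache = dlt.current.resource_state().setdefault("embedding_cache", {})
    for doc in docs:
        embeddings = []
        for chunk in chunk_document(doc["text"]):
            chunk_hash = get_md5_hash(chunk)
            if chunk_hash not in cache:
                cache[chunk_hash] = embed_fn(chunk)  # only pay for genuinely new chunks
            embeddings.append(
                {"chunk_hash": chunk_hash, "chunk_text": chunk, "embedding": cache[chunk_hash]}
            )
        yield {"doc_id": doc["doc_id"], "doc_text": doc["text"], "embeddings": embeddings}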