dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

Efficient Update Strategy for Chunked Documents in Vector Databases #1533

Open Pipboyguy opened 4 days ago

Pipboyguy commented 4 days ago

Feature description

We need to implement a mechanism to efficiently manage chunked documents in both vector databases and SQL stores with vector extensions. This feature should allow for efficient updating and replacing of documents that have been split into multiple chunks for embedding, without leaving orphaned chunks or requiring a full dataset re-embedding.

Use case

When working with large documents in vector databases or SQL stores with vector capabilities, it's common to split these documents into smaller chunks for embedding. This allows for more granular similarity searches and better performance. However, it creates challenges when updating or replacing documents using dlt's merge operation: replacing a document can leave orphaned chunks behind, and naive updates may force re-embedding the whole dataset.

Proposed Solution

Assuming the same provider and embedding model is used:

1. Maintain a doc_id as part of a compound merge key. If not explicitly provided, it could default to a hash of the whole document text.
2. If the user splits the document, additionally hash each chunk's text and use it as the second key, chunk_id.

dlt would then maintain a (doc_id, chunk_id) merge key by default.
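A minimal sketch of how such keys could be derived. The fixed-size splitter, the key names, and the `make_chunk_rows` helper are all illustrative assumptions, not dlt API; a real pipeline would plug in its own chunker and embedding step:

```python
import hashlib


def make_chunk_rows(text: str, chunk_size: int = 200):
    """Split a document into chunks and derive stable, content-based keys.

    doc_id is the hash of the whole text; chunk_id is the hash of each
    chunk. Re-running on unchanged text yields identical keys, so a
    merge would be a no-op; a changed document gets a new doc_id.
    """
    doc_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
    for i in range(0, len(text), chunk_size):
        chunk = text[i : i + chunk_size]
        chunk_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        yield {"doc_id": doc_id, "chunk_id": chunk_id, "text": chunk}


# With dlt, such rows might then be declared roughly like:
# @dlt.resource(primary_key=("doc_id", "chunk_id"), write_disposition="merge")
# def documents(): ...
```

One caveat with pure content hashes: two identical chunks in the same document would collide on chunk_id, so a positional component may be needed in practice.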

rudolfix commented 3 days ago

@Pipboyguy the biggest challenge is updating the document in the database efficiently (merge write disposition). If we chunk a document, should we yield many objects (one per chunk), or should we yield a single compound object with all the chunks inside?

For the latter we already solve this problem for SQL databases (child table merge), so the same mechanism could be reused if the vector database supports SQL (i.e. pgvector).
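To illustrate the "single compound object" shape: dlt unnests a list of dicts into a child table linked to the parent row, which is what makes child-table merge applicable. The helper and field names below are illustrative assumptions:

```python
import hashlib


def as_compound_document(text: str, chunk_size: int = 200) -> dict:
    """Build one object per document; the nested `chunks` list is the
    part dlt would unnest into a child table keyed to the parent row,
    so a merge on doc_id replaces the whole chunk set atomically."""
    doc_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {
        "doc_id": doc_id,
        "chunks": [
            {
                "chunk_id": hashlib.sha256(chunk.encode("utf-8")).hexdigest(),
                "text": chunk,
            }
            for chunk in (
                text[i : i + chunk_size] for i in range(0, len(text), chunk_size)
            )
        ],
    }
```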

For other vector databases we'd need to drop all the previous chunks and then re-insert all the new chunks (or something more optimized), but I have doubts about whether that is scalable.
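The drop-and-reinsert idea can be sketched against an in-memory stand-in for a vector collection. The `replace_document` helper and the dict-based store are assumptions for illustration; a real client would issue a delete-by-filter on doc_id followed by a batch upsert:

```python
def replace_document(store: dict, doc_id: str, new_chunks: list) -> None:
    """Replace every chunk of one document without touching others.

    `store` maps (doc_id, chunk_id) -> chunk payload, simulating a
    vector collection. Deleting by doc_id first guarantees no orphaned
    chunks survive, even when the new version has fewer chunks.
    """
    # 1) delete all existing chunks of this document
    for key in [k for k in store if k[0] == doc_id]:
        del store[key]
    # 2) insert the new chunk set
    for chunk in new_chunks:
        store[(doc_id, chunk["chunk_id"])] = chunk
```

The scalability concern raised above shows up here as step 1: a delete-by-filter per updated document, which some vector databases handle far less efficiently than a keyed upsert.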

any thoughts on that?