Open Pipboyguy opened 4 days ago
@Pipboyguy the biggest challenge is updating the document in the database efficiently (the merge write disposition). If we chunk a document, should we yield many objects (documents), one per chunk, or should we yield a single compound object with all the chunks inside?
For the latter we have already solved this problem for SQL databases (child table merge), so the same mechanism could be reused if the vector database supports SQL (i.e. pgvector).
For other vector databases we would need to drop all the previous chunks and then re-insert all the new chunks (or something more optimized), but I have some doubts about whether that scales.
Any thoughts on that?
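The drop-and-reinsert strategy for non-SQL vector stores could be sketched roughly as follows. This is a minimal illustration using an in-memory dict as a stand-in for the vector store; `store` and `replace_document_chunks` are hypothetical names, and a real backend would replace the dict operations with the store's own delete-by-filter and upsert calls.

```python
import hashlib

# In-memory stand-in for a vector store keyed by (doc_id, chunk_id);
# a real vector database would replace these dict operations.
store = {}

def replace_document_chunks(doc_id, chunks):
    """Drop-and-reinsert: delete every existing chunk for doc_id, then
    insert the new chunk set. Cost is O(old + new) per updated document,
    which is the scalability concern raised above."""
    # delete all previous chunks belonging to this document
    for key in [k for k in store if k[0] == doc_id]:
        del store[key]
    # insert the new chunks, keyed by the hash of each chunk's text
    for chunk in chunks:
        chunk_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        store[(doc_id, chunk_id)] = chunk
```

Note that this leaves no orphaned chunks even when the new version of a document has fewer chunks than the old one.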
Feature description
We need to implement a mechanism to efficiently manage chunked documents in both vector databases and SQL stores with vector extensions. This feature should allow for efficient updating and replacing of documents that have been split into multiple chunks for embedding, without leaving orphaned chunks or requiring a full dataset re-embedding.
Use case
When working with large documents in vector databases or SQL stores with vector capabilities, it's common to split these documents into smaller chunks for embedding. This allows for more granular similarity searches and better performance. However, it creates challenges when updating or replacing documents using dlt's merge operation.
Proposed Solution
Assuming the same provider and embedding model is used:
1) Maintain a `doc_id` key as part of a compound merge key. If not explicitly provided, it could even be set to the hash of the whole document text.
2) In addition to the `doc_id`, if the user wants to split documents, hash each chunk's text and use it as the second key, `chunk_id`. dlt would then maintain the `(doc_id, chunk_id)` merge key by default.
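The key derivation described above could be sketched like this. This is a minimal, stdlib-only sketch; `chunked_records` and the fixed-size chunking are illustrative assumptions, not dlt's actual chunking logic. In dlt, such a generator would presumably be wrapped in a resource declaring `write_disposition="merge"` with the compound key.

```python
import hashlib

def sha(text):
    """Hex digest of a text's SHA-256 hash, used as a stable key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunked_records(document, chunk_size=1000):
    """Yield one record per chunk carrying the compound (doc_id, chunk_id)
    key. doc_id defaults to the hash of the whole document text; chunk_id
    is the hash of each chunk's text, so unchanged chunks keep stable keys
    across reloads."""
    doc_id = sha(document)
    for start in range(0, len(document), chunk_size):
        chunk = document[start:start + chunk_size]
        yield {"doc_id": doc_id, "chunk_id": sha(chunk), "text": chunk}
```

Because both keys are content hashes, re-loading an unchanged document produces identical `(doc_id, chunk_id)` pairs, so a merge can detect and skip (or cheaply overwrite) chunks that did not change.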