Pipboyguy opened this issue 5 months ago
@Pipboyguy the biggest challenge is to update the document in the database effectively (merge write disposition). if we chunk a document, should we yield many objects (documents) per chunk, or should we yield a single compound object with all the chunks inside?
for the latter we already solved this problem for sql databases (child table merge), so the same mechanism could be reused if the vector database supports SQL (i.e. pgvector).
for other vector databases we'd need to drop all the previous chunks and then re-insert all the new chunks (or something more optimized), but I have some doubts about whether that is scalable.
any thoughts on that?
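To make the two yield shapes concrete, here is a minimal sketch (plain generators with illustrative field names, not the final resource definitions):

from typing import Any, Dict, Iterator

def split_text(text: str, size: int = 10) -> list:
    return [text[i : i + size] for i in range(0, len(text), size)]

# option 1: one object per chunk -> everything lands in one flat table,
# so the merge key has to identify the individual chunk
def chunks_flat(doc: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    for chunk in split_text(doc["text"]):
        yield {"doc_id": doc["doc_id"], "chunk_text": chunk}

# option 2: one compound object per document with the chunks nested in a list,
# which the normalizer unpacks into a child table
def chunks_nested(doc: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    yield {
        "doc_id": doc["doc_id"],
        "chunks": [{"chunk_text": chunk} for chunk in split_text(doc["text"])],
    }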
@rudolfix Thank you for providing some specificity and color to the issue.
I came up with a method (see the updated issue description) where only the chunks are hashed according to their contents, i.e. the merge key is (doc_id, chunk_hash).
What has worked in my experimentation is yielding a single compound object for each document, with all chunks inside. This specific combination, at least with SQL databases, groups the embeddings neatly in a child table. (I'd imagine there are other creative ways to do this as well)
It works on any SQL store with the delete-insert strategy:
import hashlib
import random
from typing import List, Generator

import dlt
from dlt.common.typing import DictStrAny


def get_md5_hash(data: str) -> str:
    return hashlib.md5(data.encode("utf-8")).hexdigest()


def mock_embed(dim: int = 10) -> str:
    # For illustration this returns a string instead of a compound (list) type, to prevent
    # further normalization. In the target db this will usually be a vector/numpy indivisible
    # type, so it won't get normalized there either.
    return str([random.uniform(0, 1) for _ in range(dim)])


def chunk_document(doc: str, chunk_size: int = 10) -> List[str]:
    return [doc[i : i + chunk_size] for i in range(0, len(doc), chunk_size)]


@dlt.resource(
    standalone=True,
    write_disposition="merge",
    merge_key=["doc_id", "chunk_hash"],
    table_name="document",
)
def documents(docs: List[DictStrAny]) -> Generator[DictStrAny, None, None]:
    for doc in docs:
        doc_id = doc["doc_id"]
        chunks = chunk_document(doc["text"])
        embeddings = [
            {
                "chunk_hash": get_md5_hash(chunk),
                "chunk_text": chunk,
                "embedding": mock_embed(),
            }
            for chunk in chunks
        ]
        yield {"doc_id": doc_id, "doc_text": doc["text"], "embeddings": embeddings}


pipeline = dlt.pipeline(
    pipeline_name="chunked_docs",
    destination="postgres",
    dataset_name="chunked_documents",  # dev_mode=True,
)

initial_docs = [
    {
        "text": "This is the first document. It contains some text that will be chunked and embedded. (I don't want "
        "to be seen in updated run's embedding chunk texts btw)",
        "doc_id": 1,
    },
    {
        "text": "Here's another document. It's a bit different from the first one.",
        "doc_id": 2,
    },
]

load_info = pipeline.run(documents(initial_docs))
print(f"Initial load: {load_info}")

updated_docs = [
    {
        "text": "This is the first document, but it has been updated with new content.",
        "doc_id": 1,
    },
    {
        "text": "This is a completely new document that wasn't in the initial set.",
        "doc_id": 3,
    },
]

update_info = pipeline.run(documents(updated_docs))
print(f"Update load: {update_info}")
So when a document is updated, the old embeddings are deleted as well, like you mentioned. Neat!
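You can see this at the destination: dlt unpacks the nested embeddings list into a child table (document__embeddings under the default naming) linked to its parent row via _dlt_parent_id. A quick check with the pipeline's sql client, as a sketch and assuming the postgres credentials are configured:

with pipeline.sql_client() as client:
    doc_table = client.make_qualified_table_name("document")
    emb_table = client.make_qualified_table_name("document__embeddings")
    rows = client.execute_sql(
        f"SELECT d.doc_id, e.chunk_text FROM {doc_table} d "
        f"JOIN {emb_table} e ON e._dlt_parent_id = d._dlt_id "
        f"ORDER BY d.doc_id"
    )
    for doc_id, chunk_text in rows:
        # after the second run, doc_id 1 should only contain chunks of the updated text
        print(doc_id, chunk_text)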
I'm still figuring out how the above could be altered for vector dbs without transactional semantics or sql facilities. I'd imagine this depends on the vector db's capabilities and some would do it better than others.
It goes without saying that vector dbs that fully support nosql merge operations, or SQL-esque dbs with merge operations (lancedb), are the easiest. Otherwise we will probably have to rely on full loads of chunks, but with judicious calling of the embedding callback to keep costs minimal, perhaps by caching the embeddings of the chunk_hashes already stored for a doc_id before truncating them at the destination.
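One rough shape for such a destination, strictly as a sketch: the client and its fetch/delete/insert methods are placeholders for whatever the concrete vector store offers (filtered query, delete-by-filter, batch insert), and get_md5_hash/mock_embed are the helpers from the example above.

from typing import Any, Dict, List

def replace_document_chunks(client: Any, doc_id: int, chunks: List[str]) -> None:
    # reuse vectors for chunks whose hash already exists for this doc; embed only new ones
    old: Dict[str, str] = {
        row["chunk_hash"]: row["embedding"]
        for row in client.fetch(filter={"doc_id": doc_id})
    }
    rows = []
    for chunk in chunks:
        chunk_hash = get_md5_hash(chunk)
        embedding = old.get(chunk_hash) or mock_embed()
        rows.append(
            {"doc_id": doc_id, "chunk_hash": chunk_hash, "chunk_text": chunk, "embedding": embedding}
        )
    client.delete(filter={"doc_id": doc_id})  # drop all previous chunks for this doc
    client.insert(rows)  # then re-insert the full new set

Without transactions there is a short window between the delete and the insert where queries see no chunks for that doc; writing the new rows under a version tag first and deleting the old version afterwards avoids that, at the cost of a version filter on reads.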
@rudolfix Some thoughts:
1) Can we provide the user with a cache so they don't need to re-embed similar chunks? If their documents change only slightly, there's no need to send 95% of the same content to the provider again and pay for it. Referring to my example, is there a decorator in dlt, like @lru_cache, that we can wrap mock_embed in? Perhaps using dlt state as the kv store?
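A minimal sketch of that idea, assuming dlt.current.resource_state() is accessible while the resource runs and that keeping raw vectors in pipeline state is acceptable size-wise (unlike @lru_cache, which only lives for a single run, state persists between runs):

import dlt

def cached_embed(chunk_hash: str) -> str:
    # per-resource state dict, persisted between runs; used here as a kv cache keyed on chunk hash
    cache = dlt.current.resource_state().setdefault("embedding_cache", {})
    if chunk_hash not in cache:
        cache[chunk_hash] = mock_embed()  # call the embedding provider only for unseen chunks
    return cache[chunk_hash]

In the example above, "embedding": mock_embed() would then become "embedding": cached_embed(get_md5_hash(chunk)). A leaner variant would store only the set of known chunk hashes in state and fetch the vectors from the destination when needed.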
Feature description
We need a system to manage chunked documents and their embeddings efficiently across vector databases and SQL stores with vector extensions. The goal is to update and replace documents split into chunks without orphaning data or requiring full re-embedding.
Use case
When working with large documents in vector databases or SQL stores with vector capabilities, it's common to split these documents into smaller chunks for embedding. This allows for more granular similarity searches and better performance. However, it creates challenges when updating or replacing documents using dlt's merge operation.
Proposed Solution
Assuming the same provider and embedding model are used: hash each chunk by its contents and merge on the composite key (doc_id, chunk_hash).