
Efficient Update Strategy for Chunked Documents in Vector Databases #1533

Open Pipboyguy opened 5 months ago

Pipboyguy commented 5 months ago

Feature description

We need a system to manage chunked documents and their embeddings efficiently across vector databases and SQL stores with vector extensions. The goal is to update and replace documents split into chunks without orphaning data or requiring full re-embedding.

Use case

When working with large documents in vector databases or SQL stores with vector capabilities, it's common to split these documents into smaller chunks for embedding. This allows for more granular similarity searches and better performance. However, this creates challenges when updating or replacing documents using dlt's merge operation:

Proposed Solution

Assuming the same provider and embedding model are used:

rudolfix commented 4 months ago

@Pipboyguy the biggest challenge is updating the document in the database effectively (merge write disposition). if we chunk a document, should we yield many objects (documents), one per chunk, or should we yield a single compound object with all the chunks inside?

for the latter we already solved this problem for sql databases (child table merge), so the same mechanism could be reused if the vector database supports SQL (e.g. pgvector).

for other vector databases we'd need to drop all the previous chunks and then re-insert all the new chunks (or something more optimized), but I have some doubts about whether that is scalable.

any thoughts on that?
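
to make the two options concrete, a rough sketch of the two yield shapes (illustration only, not real dlt resources, embeddings omitted):

import hashlib

def _chunks(text: str, size: int = 10):
    return [text[i : i + size] for i in range(0, len(text), size)]

def _hash(chunk: str) -> str:
    return hashlib.md5(chunk.encode("utf-8")).hexdigest()

# option 1: one object per chunk -> a single flat table keyed by (doc_id, chunk_hash)
def per_chunk(doc):
    for chunk in _chunks(doc["text"]):
        yield {"doc_id": doc["doc_id"], "chunk_hash": _hash(chunk), "chunk_text": chunk}

# option 2: one compound object per document -> chunks land in a nested child table
def compound(doc):
    yield {
        "doc_id": doc["doc_id"],
        "doc_text": doc["text"],
        "chunks": [{"chunk_hash": _hash(c), "chunk_text": c} for c in _chunks(doc["text"])],
    }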

Pipboyguy commented 4 months ago

@rudolfix Thank you for providing some specificity and color to the issue.

I came up with a method (see the updated issue description) where only the chunks are hashed according to their contents, i.e. the merge key is (doc_id, chunk_hash).

What has worked in my experimentation is yielding a single compound object for each document, with all chunks inside. This specific combination, at least with SQL databases, groups the embeddings neatly in a child table. (I'd imagine there are other creative ways to do this as well)

Vector DBs With SQL Support

It works on any SQL store with a delete-insert strategy:

import hashlib
import random
from typing import List, Generator

import dlt
from dlt.common.typing import DictStrAny

def get_md5_hash(data: str) -> str:
    return hashlib.md5(data.encode("utf-8")).hexdigest()

def mock_embed(dim: int = 10) -> str:
    # For illustration, return a string instead of a compound (list) type to prevent
    # further normalization. In the target db this will usually be a vector/numpy
    # indivisible type, so it won't be normalized there.
    return str([random.uniform(0, 1) for _ in range(dim)])

def chunk_document(doc: str, chunk_size: int = 10) -> List[str]:
    return [doc[i : i + chunk_size] for i in range(0, len(doc), chunk_size)]

@dlt.resource(
    standalone=True,
    write_disposition="merge",
    merge_key=["doc_id", "chunk_hash"],
    table_name="document",
)
def documents(docs: List[DictStrAny]) -> Generator[DictStrAny, None, None]:
    for doc in docs:
        doc_id = doc["doc_id"]
        chunks = chunk_document(doc["text"])
        embeddings = [
            {
                "chunk_hash": get_md5_hash(chunk),
                "chunk_text": chunk,
                "embedding": mock_embed(),
            }
            for chunk in chunks
        ]
        yield {"doc_id": doc_id, "doc_text": doc["text"], "embeddings": embeddings}

pipeline = dlt.pipeline(
    pipeline_name="chunked_docs",
    destination="postgres",
    dataset_name="chunked_documents",  # dev_mode=True,
)

initial_docs = [
    {
        "text": "This is the first document. It contains some text that will be chunked and embedded. (I don't want "
        "to be seen in updated run's embedding chunk texts btw)",
        "doc_id": 1,
    },
    {
        "text": "Here's another document. It's a bit different from the first one.",
        "doc_id": 2,
    },
]

load_info = pipeline.run(documents(initial_docs))
print(f"Initial load: {load_info}")

updated_docs = [
    {
        "text": "This is the first document, but it has been updated with new content.",
        "doc_id": 1,
    },
    {
        "text": "This is a completely new document that wasn't in the initial set.",
        "doc_id": 3,
    },
]

update_info = pipeline.run(documents(updated_docs))
print(f"Update load: {update_info}")


So when a document is updated, the old embeddings are deleted as well like you mentioned. Neat!
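
For the record, here's a quick sanity check against the loaded dataset (assuming dlt's default child table name document__embeddings for the nested embeddings list, linked via _dlt_parent_id):

with pipeline.sql_client() as client:
    # parent rows: doc 1 should carry the updated text, doc 2 stays untouched, doc 3 is new
    docs = client.execute_sql("SELECT doc_id, doc_text FROM document ORDER BY doc_id")
    # child rows: no chunks from doc 1's first version should remain
    chunks = client.execute_sql(
        "SELECT d.doc_id, count(*) "
        "FROM document d JOIN document__embeddings e ON d._dlt_id = e._dlt_parent_id "
        "GROUP BY d.doc_id ORDER BY d.doc_id"
    )
    print(docs)
    print(chunks)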

Vector DBs Without SQL Support

I'm still figuring out how the above could be altered for vector DBs without transactional semantics or SQL facilities. I'd imagine this depends on the vector DB's capabilities, and some would handle it better than others.

It goes without saying that vector DBs that fully support NoSQL merge operations, or SQL-esque DBs with merge operations (like LanceDB), are the easiest. Otherwise we will probably have to rely on full loads of chunks, but with judicious calling of the embedding callback to keep costs minimal, perhaps by caching the embeddings of the chunk_hashes in the old doc_id first, before truncating the destination.
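
As a strawman, that flow could look something like this (pseudocode against a hypothetical vector DB client; the query/delete/insert method names are made up for illustration):

import hashlib
from typing import Callable, List

def md5(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def replace_document(client, collection: str, doc_id: int, new_chunks: List[str], embed_fn: Callable[[str], list]) -> None:
    # 1) fetch the existing chunk hashes + embeddings for this doc_id so that
    #    unchanged chunks don't get re-embedded (and re-paid for)
    existing = {
        rec["chunk_hash"]: rec["embedding"]
        for rec in client.query(collection, filter={"doc_id": doc_id})
    }
    # 2) build the new records, reusing cached embeddings where the hash matches
    records = []
    for chunk in new_chunks:
        chunk_hash = md5(chunk)
        embedding = existing.get(chunk_hash) or embed_fn(chunk)
        records.append(
            {"doc_id": doc_id, "chunk_hash": chunk_hash, "chunk_text": chunk, "embedding": embedding}
        )
    # 3) delete all old chunks for the doc, then insert the new set
    #    (no transactional guarantee here, so a native upsert/merge API is preferable where available)
    client.delete(collection, filter={"doc_id": doc_id})
    client.insert(collection, records)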

@rudolfix Some thoughts:

1) Can we provide the user with a cache so they don't need to re-embed identical chunks? If their documents change only slightly, there's no need to send 95% of the same content to the provider again and pay for it. Referencing my example, is there a decorator in dlt like @lru_cache that we could wrap mock_embed in? Perhaps using dlt state as the KV store? (Rough sketch below.)
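
Something like this is what I have in mind (reusing chunk_document and get_md5_hash from the example above; pipeline/resource state is probably not the right place for a very large cache, but it illustrates the idea):

@dlt.resource(
    standalone=True,
    write_disposition="merge",
    merge_key=["doc_id", "chunk_hash"],
    table_name="document",
)
def documents_cached(docs: List[DictStrAny], embed_fn) -> Generator[DictStrAny, None, None]:
    # cache keyed by chunk hash, persisted in the resource's state between runs
    cache = dlt.current.resource_state().setdefault("embedding_cache", {})
    for doc in docs:
        embeddings = []
        for chunk in chunk_document(doc["text"]):
            chunk_hash = get_md5_hash(chunk)
            if chunk_hash not in cache:
                cache[chunk_hash] = embed_fn(chunk)  # only pay for genuinely new chunks
            embeddings.append(
                {"chunk_hash": chunk_hash, "chunk_text": chunk, "embedding": cache[chunk_hash]}
            )
        yield {"doc_id": doc["doc_id"], "doc_text": doc["text"], "embeddings": embeddings}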