Open · zilto opened 4 weeks ago
@zilto thanks for this idea. a few random (probably I miss some background) comments:

1. why not to chunk with a rolling window from the very start, so chunks are already "overlapping"? then you do not need to merge them in a transformer.
2. what is the role of the "chunks", "contexts" and "chunks-to-contexts keys" tables/collections when doing RAG? are you using both chunks and contexts to identify documents via vector search?
3. what you could try is to use dlt's ability to create child tables and yield chunks within contexts:

   ```python
   for context_id, contextualized in enumerate(_contextualize([chunk["text"] for chunk in items])):
       # the nested "chunks" list becomes a child table
       yield {"context_id": context_id, "text": contextualized, "chunks": [...]}
   ```

   that would create `contexts` and `contexts__chunks` tables. they won't be as nice as yours though (dlt would add its own table linking).

btw. with @Pipboyguy we are trying to support chunking in some more or less unified way: #1587
I agree with the motivation of the cited issue! But to add more context:

- document -> chunks -> contexts is more performant than document -> overlapping contexts -> chunks because less total text is parsed into the smaller chunks and the joining operation is cheap.
- overlapping contexts -> chunks would produce duplicated chunks, and the chunks might not be exact partitions that can recreate the original document, as opposed to documents -> chunks (e.g., when using a tokenizer instead of the naive string manipulation used in my example).
- chunks are the smallest meaningful unit; storing each chunk once can enable multiple different use cases / pipelines that need bigger text chunks for downstream users.
- chunks should make it easier to manage state for incremental loading, since they enable resuming a rolling window operation.

> 1. why not to chunk with a rolling window from the very start, so chunks are already "overlapping"? then you do not need to merge them in a transformer.
This suggests doing document -> contexts instead of documents -> chunks -> contexts. Unlike the approach I suggested, one can't know what two "contexts" have in common: e.g., which chunks they share, how many chunks they share, or how far apart the shared chunks are (with ordered chunk ids).
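A minimal sketch of what that lineage buys you, with illustrative names: once a key table maps each context to its ordered chunk ids, context overlap becomes a cheap set operation.

```python
# Illustrative: per-context ordered chunk ids, as a "chunks-to-contexts"
# key table would store them after document -> chunks -> contexts.
context_chunks: dict[str, list[int]] = {
    "ctx_0": [0, 1],  # built from chunks 0 and 1
    "ctx_1": [1, 2],  # shares chunk 1 with ctx_0
}

def shared_chunks(a: str, b: str) -> set[int]:
    """Which chunk ids two contexts have in common."""
    return set(context_chunks[a]) & set(context_chunks[b])

def start_distance(a: str, b: str) -> int:
    """How far apart two contexts start, in ordered chunk ids."""
    return abs(context_chunks[a][0] - context_chunks[b][0])

print(shared_chunks("ctx_0", "ctx_1"))   # {1}
print(start_distance("ctx_0", "ctx_1"))  # 1
```

With document -> contexts alone, none of this is recoverable.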
> 2. what is the role of the "chunks", "contexts" and "chunks-to-contexts keys" tables/collections when doing RAG? are you using both chunks and contexts to identify documents via vector search?
For RAG, I intend to use "contexts" for the first-pass vector search, then use the "context-chunk" lineage to filter out "contexts" that have too much in common and so increase the information content of the text passed to the LLM. Over time, it's also valuable to log which "contexts" and underlying "chunks" are high signal for downstream uses.

More concretely: a user asks a question about dlt. You want documentation to be embedded in large "contexts" to get good recall, and the LLM should then be able to extract the right info from the "context" and generate an answer. However, it's still fuzzy "what" was useful to the LLM or the user. The above lineage could show that retrieving "contexts" containing the chunk "dlt is an open source Python library" is high signal for answering questions around the topic of pricing.
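A hedged sketch of that two-pass retrieval; `search_contexts()` is a hypothetical vector-search call, and each hit is assumed to carry its `chunk_ids` from the key table:

```python
def filter_redundant(hits: list[dict], max_shared: int = 1) -> list[dict]:
    """Greedily drop contexts that share more than `max_shared` chunks
    with an already-kept, higher-scoring context."""
    kept: list[dict] = []
    for hit in hits:  # assumed sorted by vector-search score, best first
        chunk_ids = set(hit["chunk_ids"])
        if all(len(chunk_ids & set(k["chunk_ids"])) <= max_shared for k in kept):
            kept.append(hit)
    return kept

# hits = search_contexts("how is dlt priced?", top_k=20)  # hypothetical call
# llm_contexts = filter_redundant(hits)
```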
> 3. what you could try is to use dlt's ability to create child tables and yield chunks within contexts:
Didn't think of that! While it handles relationships, I would have duplicated "chunks" stored, no?
@zilto it seems we will be picking your brain a lot :) our goal is to support chunked documents with the "merge" write disposition (where only a subset of documents gets updated). I'll get back to this topic tomorrow. we need to move forward...
@zilto Thanks for the detailed use case and explanation!
@rudolfix I think this table can be created as part of a job as well, to run after the main table chain just like the current orphan removal process, and have its orphans removed in a similar fashion. WDYT?
Feature description
Allow the LanceDB and other vector DB adapters to specify a "contextualize" or rolling window operation to join partitioned text chunks before applying the embedding function.
Are you a dlt user?
Yes, I'm already a dlt user.
Use case
context
The constructs of @dlt.resource and @dlt.transformer are very convenient for document ingestion in NLP/LLM use cases. The @dlt.resource returns the full text and the @dlt.transformer can chunk it (into paragraphs, for example). The LanceDB and other vector DB adapters make it easy to embed the full-text and the chunked-text columns. We get something like this: a "Full-text" table with one row per document and a "Chunks (3 words)" table with one row per three-word chunk.
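A minimal sketch of that pattern, assuming a placeholder document source and a naive three-word splitter (`@dlt.resource`, `@dlt.transformer`, and `lancedb_adapter` are existing dlt APIs):

```python
import dlt
from dlt.destinations.adapters import lancedb_adapter

@dlt.resource
def documents():
    # placeholder source: one full-text row per document
    yield {"document_id": "doc_0", "text": "dlt is an open source Python library"}

@dlt.transformer(data_from=documents)
def chunks(document):
    # naive splitter: every 3 words become one chunk
    words = document["text"].split()
    for chunk_id, start in enumerate(range(0, len(words), 3)):
        yield {
            "document_id": document["document_id"],
            "chunk_id": chunk_id,
            "text": " ".join(words[start : start + 3]),
        }

# embed the chunked text column on load
lancedb_adapter(chunks, embed="text")
```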
limitations
However, embedding these "partitioned" chunks is often of low value for RAG. A common operation is "contextualizing" chunks, which consists of a rolling window operation (with window size and stride/overlap parameters). For instance, LanceDB has contextualize(), but it requires converting the data to a pandas dataframe. Let's illustrate a "2-chunk window" based on the previous table; the resulting "Contexts" table holds the joined windows.
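A pure-Python stand-in for that rolling window (the window/stride semantics are assumed to mirror LanceDB's contextualize(); the sample chunks reuse the three-word example above):

```python
def contextualize(chunks: list[str], window: int = 2, stride: int = 1) -> list[str]:
    """Join `window` consecutive chunks into one context, advancing by `stride`."""
    return [
        " ".join(chunks[i : i + window])
        for i in range(0, max(len(chunks) - window + 1, 1), stride)
    ]

# 3-word chunks from "dlt is an open source Python library"
print(contextualize(["dlt is an", "open source Python", "library"]))
# -> ['dlt is an open source Python', 'open source Python library']
```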
AFAIK, dlt doesn't provide a clear API for normalizing the chunk_id and context_id columns. The "contextualize" operation could be implemented directly in a single @dlt.transformer, but it would only capture document_id -> context_id and miss the fact that "contextualized chunks" aren't independent; they share underlying chunks.

Proposed solution
adding a "reducer" step
I was able to hack around this: receive a batch of "chunks" and use dlt.mark.with_table_name to dispatch both a "context" table and a "relation" table from the same @dlt.transformer. Mock code below; it yields a "Contexts" table and a "Chunks-to-contexts keys" table.
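Sketch only: table and column names are illustrative, the key-set hashing is simplified to an MD5 over the covered chunk ids, and the `chunks` transformer is assumed to yield one list of chunk rows per document.

```python
import hashlib

import dlt

@dlt.transformer(data_from=chunks)
def contexts(items, window: int = 2, stride: int = 1):
    # items: a batch of chunk rows for a single document, in order
    for i in range(0, max(len(items) - window + 1, 1), stride):
        window_chunks = items[i : i + window]
        # deterministic context id derived from the chunk keys it covers
        key = ",".join(str(c["chunk_id"]) for c in window_chunks)
        context_id = hashlib.md5(key.encode()).hexdigest()
        # "Contexts" row: the joined text that gets embedded downstream
        yield dlt.mark.with_table_name(
            {"context_id": context_id,
             "text": " ".join(c["text"] for c in window_chunks)},
            "contexts",
        )
        # "Chunks-to-contexts keys" rows: one relation row per underlying chunk
        for chunk in window_chunks:
            yield dlt.mark.with_table_name(
                {"context_id": context_id, "chunk_id": chunk["chunk_id"]},
                "chunks_to_contexts_keys",
            )
```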
There's probably room for a generic @dlt.reducer that automatically manages the primary/foreign keys based on the other resources' metadata, handles the key-set hashing, and dispatches results to tables. Given that this could be a can of worms, it could be tested and refined while hidden behind the lancedb_adapter. The API could be expanded to something like this:
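(The `contextualize` parameter here is purely hypothetical and does not exist in `lancedb_adapter` today.)

```python
lancedb_adapter(
    chunks,
    embed="text",
    # hypothetical: also derive a "contexts" table via a rolling window over
    # the chunks, create the chunks-to-contexts key table, and embed the
    # joined context text
    contextualize={"window": 2, "stride": 1, "embed": True},
)
```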
This would reproduce the above logic by creating the chunks table as defined by the user (the chunks resource) and creating the second table automatically.

Related issues
No response