benbrandt / text-splitter

Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.
MIT License

MarkdownSplitter: Return preceding headers #116

Open benbrandt opened 7 months ago

benbrandt commented 7 months ago

One benefit of having the extra Markdown structure, beyond better split points, is that we can give each chunk extra context from the headings that apply to it.

It would be great to have an alternate chunking method that returns not only the chunk but also any relevant context. Something like:

pub fn chunks_with_context<'splitter, 'text: 'splitter>(
    &'splitter self,
    text: &'text str,
    chunk_capacity: impl ChunkCapacity + 'splitter,
) -> impl Iterator<Item = (&'text str, Context)> + 'splitter;

Where Context is something like:

HashMap<HeadingLevel, &'text str>

with the corresponding header text of the most recent heading at each level.

This would traverse the document until it reaches the offset of a given chunk, keeping a reference to the most recent heading at each level it encounters. When it encounters a level it has already seen, it replaces that entry with the new heading and also removes any references to lower (more deeply nested) heading levels.
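
To make the traversal concrete, here is a rough, standalone sketch of the idea using only the standard library. It is not the crate's implementation: `context_at` and `parse_atx_heading` are hypothetical names, the level is a plain `u8` rather than a `HeadingLevel`, a `BTreeMap` stands in for the `HashMap` so levels stay ordered, and the heading detection is deliberately simplified (ATX `#` headings only, no setext headings, no awareness of fenced code blocks).

```rust
use std::collections::BTreeMap;

/// Most recent heading text per level (1 = `#`, 6 = `######`).
/// Assumption: a plain `u8` level instead of a real `HeadingLevel` type.
pub type Context<'text> = BTreeMap<u8, &'text str>;

/// Parses an ATX heading line such as `## Title` into (level, title).
/// Simplified on purpose: a real splitter would rely on a Markdown parser.
fn parse_atx_heading(line: &str) -> Option<(u8, &str)> {
    let trimmed = line.trim_start();
    let level = trimmed.bytes().take_while(|&b| b == b'#').count();
    if level == 0 || level > 6 {
        return None;
    }
    let rest = &trimmed[level..];
    // The `#` run must be followed by whitespace or the end of the line.
    if !rest.is_empty() && !rest.starts_with(|c: char| c.is_whitespace()) {
        return None;
    }
    Some((level as u8, rest.trim()))
}

/// Walks `text` up to the byte offset `chunk_offset` and returns the
/// headings that are in scope for a chunk starting there.
pub fn context_at<'text>(text: &'text str, chunk_offset: usize) -> Context<'text> {
    let mut context = Context::new();
    let mut line_start = 0;
    for line in text.split_inclusive('\n') {
        // Only headings that start before the chunk count as preceding.
        if line_start >= chunk_offset {
            break;
        }
        if let Some((level, title)) = parse_atx_heading(line) {
            // A new heading replaces the entry at its own level and
            // invalidates every more deeply nested level seen so far.
            context.retain(|&seen, _| seen < level);
            context.insert(level, title);
        }
        line_start += line.len();
    }
    context
}
```

Calling `context_at(text, offset)` with the byte offset of a chunk would return the headings in scope for it. Rescanning from the start for every chunk is quadratic in the worst case, so an iterator-based version would more likely carry the map forward from one chunk to the next.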

Todo:

jackbravo commented 3 months ago

This sounds very interesting, and it is in line with this article, which says this approach should improve the relevance of chunks and the accuracy of results:

https://d-star.ai/solving-the-out-of-context-chunk-problem-for-rag

The example is very illustrative:

We’ll use Nike’s 2023 10-K to illustrate this. Here are the first 10 sections we identified:

image

Add contextual chunk headers

image

The purpose of the chunk header is to add context to the chunk text. Rather than using the chunk text by itself when embedding and reranking the chunk, we use the concatenation of the chunk header and the chunk text, as shown in the image above. This helps the ranking models (embeddings and rerankers) retrieve the correct chunks.
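
Applied to the heading context proposed in this issue, the concatenation step could look roughly like the sketch below. The function names, the `" > "` separator, and the blank line between header and chunk are all assumptions made for illustration; the article's exact header format is only shown in its images, and nothing here is an existing text-splitter API.

```rust
use std::collections::BTreeMap;

/// Joins heading titles from outermost to innermost level,
/// e.g. "Guide > Install". The separator is an arbitrary choice.
pub fn chunk_header(context: &BTreeMap<u8, &str>) -> String {
    // BTreeMap iterates in key order, so level 1 comes first.
    context.values().copied().collect::<Vec<_>>().join(" > ")
}

/// Builds the text to hand to an embedding or reranking model:
/// the chunk header followed by the chunk itself.
pub fn text_for_embedding(context: &BTreeMap<u8, &str>, chunk: &str) -> String {
    let header = chunk_header(context);
    if header.is_empty() {
        // No headings in scope: embed the chunk text unchanged.
        chunk.to_string()
    } else {
        format!("{}\n\n{}", header, chunk)
    }
}
```

Only the string sent to the ranking model changes here; the original chunk text can still be stored and returned unmodified.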