benbrandt / text-splitter

Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.
MIT License

Balanced Chunks #117

Open · benbrandt opened this issue 5 months ago

benbrandt commented 5 months ago

Motivation:

Right now, the chunking uses a greedy algorithm. For example, the following input would produce these chunks:

let text = "Sentence 1. Sentence 2. Sentence 3. Sentence 4.";
splitter.chunks(text, 36) == ["Sentence 1. Sentence 2. Sentence 3.", "Sentence 4."]

This may not always be desirable, since it can leave "orphaned" elements at the end.

In some cases it would be better to recognize that, at this semantic level, there is a more even split:

["Sentence 1. Sentence 2.", "Sentence 3. Sentence 4."]

Finding this is not straightforward in all cases. I attempted it in the past, but that attempt led to the algorithm generating several more chunks rather than finding the best split points. Because tokenization isn't always predictable, there may need to be some allowance for extra chunks being generated, but ideally we can find good split points within the current number of chunks.
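For illustration, here is a minimal sketch of the balancing idea. This is not the crate's actual algorithm: estimate how many chunks are needed up front, then aim each chunk at the average size instead of greedily filling it to the maximum.

// Rough sketch, not text-splitter's implementation. `pieces` holds the sizes
// of the semantic units at the current level (e.g. sentences, separators
// included), in whatever unit the sizer uses (characters or tokens).
fn balanced_chunks(pieces: &[usize], max_size: usize) -> Vec<Vec<usize>> {
    let total: usize = pieces.iter().sum();
    // Minimum number of chunks that can hold everything, then the average
    // ("target") size each chunk should aim for.
    let num_chunks = total.div_ceil(max_size).max(1);
    let target = total.div_ceil(num_chunks);

    let mut chunks = vec![Vec::new()];
    let mut current = 0;
    for (i, &len) in pieces.iter().enumerate() {
        // Close the current chunk once it would pass the target, unless it is
        // empty or we already opened the final chunk.
        if current + len > target
            && !chunks.last().unwrap().is_empty()
            && chunks.len() < num_chunks
        {
            chunks.push(Vec::new());
            current = 0;
        }
        chunks.last_mut().unwrap().push(i);
        current += len;
    }
    // A real implementation would still have to enforce `max_size` (tokenizers
    // are not perfectly additive) and fall back to extra chunks if needed.
    chunks
}

fn main() {
    // "Sentence 1. Sentence 2. Sentence 3. Sentence 4." as four ~12-char pieces.
    let pieces = [12, 12, 12, 12];
    // Greedy packing with max 36 gives 3 + 1 sentences; aiming at the average
    // of 24 gives 2 + 2.
    println!("{:?}", balanced_chunks(&pieces, 36)); // [[0, 1], [2, 3]]
}

With the example above (four roughly 12-character sentences, max size 36), the greedy pass closes the first chunk after three sentences, while aiming at the average of about 24 characters yields two chunks of two sentences each.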

Todo:

do-me commented 4 months ago

Hi Ben! First of all, a big thank you for this package. It takes away the pain of poorly performing Python-based frameworks and is exactly what I was looking for.

I think what you describe here with "balanced chunks" is actually something that nowadays is called "semantic chunking" (not to be confused with the purely punctuation- or grammar-based semantics you already implemented), with strategies like MMR, RAPTOR, or combinations of those. I wrote about this in more detail here (https://github.com/do-me/SemanticFinder/discussions/45), but the core idea boils down to this:

In a nutshell, you use sliding chunk windows, calculate their embeddings, and measure their distance to each other. Using this score you try to identify semantic "break points", i.e. where you have the largest delta.
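For concreteness, here is a minimal Rust sketch of that break-point idea, separate from text-splitter. The embed function is a stand-in for whatever embedding model is used, and the windowing and "keep the top deltas" choices are assumptions for illustration:

// Sketch of embedding-based break-point detection; not part of text-splitter.
// `embed` is a placeholder for an external embedding model.
fn embed(_text: &str) -> Vec<f32> {
    unimplemented!("call an embedding model here")
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm(a) * norm(b))
}

// Returns the sentence indices where the similarity between neighbouring
// sliding windows drops the most, i.e. candidate semantic break points.
fn break_points(sentences: &[&str], window: usize, num_breaks: usize) -> Vec<usize> {
    // Embed a window of up to `window` sentences ending at each position.
    let embeddings: Vec<Vec<f32>> = (0..sentences.len())
        .map(|i| {
            let start = (i + 1).saturating_sub(window);
            embed(&sentences[start..=i].join(" "))
        })
        .collect();

    // Delta between each window and the next; a large delta suggests a topic shift.
    let mut deltas: Vec<(usize, f32)> = embeddings
        .windows(2)
        .enumerate()
        .map(|(i, pair)| (i + 1, 1.0 - cosine_similarity(&pair[0], &pair[1])))
        .collect();

    // Keep the positions with the largest deltas as break points.
    deltas.sort_by(|a, b| b.1.total_cmp(&a.1));
    let mut breaks: Vec<usize> =
        deltas.into_iter().take(num_breaks).map(|(i, _)| i).collect();
    breaks.sort_unstable();
    breaks
}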

Imho, if you take your idea of splitting text more "human-like" further, you would necessarily end up with a similar strategy at some point. This is of course very computation-heavy and would eat into the great performance gains that you achieve here. Maybe it's also out of scope, as Langchain has already implemented it. What is your opinion about this?

benbrandt commented 4 months ago

Hi @do-me thanks for this. I had this more scoped here: https://github.com/users/benbrandt/projects/2?pane=issue&itemId=57419637

Basically, this issue is more about the fact that the current algorithm is "greedy": if we know we are within a certain semantic level, I think there might be a way to spread the content out more evenly between the chunks of that level.

Does that make sense?

I do agree that all of the semantic embedding approaches are interesting. I am hopeful my crate can be helpful in terms of making sure that all of the chunks being embedded are within the context window of the embedding model, and allowing for a two-pass approach... but it all needs to be explored :)

do-me commented 4 months ago

The URL leads me to an empty description somehow :D

However, thanks for clarifying; that makes a lot of sense. So if I understand you correctly, you would say all these fancy heuristics are out of scope for text-splitter, as they'd introduce a lot of overhead.

benbrandt commented 4 months ago

That is because I haven't really scoped it out yet 😆 I don't think it is out of scope, but it will likely be opt-in. I think I would provide both options so people can make the trade-off between performance and potential quality improvement themselves.

That being said, I want to get a few other features done first, namely chunk overlap and Code splitting, and then I would probably reassess priority from there. Hopefully that makes sense?

do-me commented 4 months ago

Sure, sounds really exciting! :) Looking forward to your future developments!

FullMetalMeowchemist commented 3 months ago

This would be a great feature.

Maybe food for thought... Previously, I had a bunch of documents: I'd pre-calculate the total (token) size of each document, find the outliers, then chunk those down using the median of the non-outlier distribution. This way, the outliers get merged back into the bell curve.

Obviously you wouldn't want to chunk twice and "waste" the compute of the first cycle. So I think you can take the total size of the text and make an educated guess on how to produce balanced chunks. I think it's pretty much implicit that if you turn on "balanced chunking", the chunk boundary would no longer fall exactly at the user-provided token or character size.
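A rough sketch of that pre-calculation idea (the helper and the IQR-based outlier test are assumptions for illustration, not part of text-splitter):

// Given per-document token counts, returns the median size of the "normal"
// documents (to use as the chunk size) plus the indices of the outliers that
// need to be chunked down.
fn outlier_chunk_target(token_counts: &[usize]) -> Option<(usize, Vec<usize>)> {
    if token_counts.is_empty() {
        return None;
    }
    let mut sorted = token_counts.to_vec();
    sorted.sort_unstable();

    // Tukey's fences: anything above q3 + 1.5 * IQR counts as an outlier.
    let quantile = |q: f64| sorted[((sorted.len() - 1) as f64 * q).round() as usize];
    let (q1, q3) = (quantile(0.25), quantile(0.75));
    let upper_fence = q3 as f64 + 1.5 * (q3 - q1) as f64;

    let (outliers, inliers): (Vec<usize>, Vec<usize>) =
        (0..token_counts.len()).partition(|&i| token_counts[i] as f64 > upper_fence);
    if inliers.is_empty() {
        return None;
    }

    // Median of the non-outlier distribution becomes the target chunk size.
    let mut inlier_counts: Vec<usize> = inliers.iter().map(|&i| token_counts[i]).collect();
    inlier_counts.sort_unstable();
    let target = inlier_counts[inlier_counts.len() / 2];
    Some((target, outliers))
}

The single-pass "educated guess" would then simply divide a document's total token count by the number of chunks it needs, as in the balancing sketch earlier in the thread.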

ShelbyJenkins commented 2 months ago

If anyone is curious, I implemented this in my crate. I had previously done this in Python last year and submitted it as a PR to LangChain, but it went nowhere.

I wrote a comparison test here.

Balance comparison for content: The ACM A. M. Turing A...
 content token_count: 284
absolute_length_max: 128
overlap_percent: 0.166666

TextChunker result without overlap:
chunk_count: 3
avg_token_size: 95
largest_token_size: 97
smallest_token_size: 92

TextChunker result with overlap:
chunk_count: 3
avg_token_size: 112
largest_token_size: 114
smallest_token_size: 112

TextSplitter result without overlap:
chunk_count: 3
avg_token_size: 94
largest_token_size: 123
smallest_token_size: 45

TextSplitter result with overlap:
chunk_count: 3
avg_token_size: 95
largest_token_size: 103
smallest_token_size: 87

Balance comparison for content: Doom (stylized as DOOM...
 content token_count: 1244
absolute_length_max: 256
overlap_percent: 0.166666

TextChunker result without overlap:
chunk_count: 5
avg_token_size: 251
largest_token_size: 255
smallest_token_size: 249

TextChunker result with overlap:
chunk_count: 6
avg_token_size: 233
largest_token_size: 251
smallest_token_size: 225

TextSplitter result without overlap:
chunk_count: 7
avg_token_size: 177
largest_token_size: 247
smallest_token_size: 135

TextSplitter result with overlap:
chunk_count: 9
avg_token_size: 142
largest_token_size: 178
smallest_token_size: 94

Balance comparison for content: Who Controls OpenAI? M...
 content token_count: 4828
absolute_length_max: 512
overlap_percent: 0.166666

TextChunker result without overlap:
chunk_count: 11
avg_token_size: 435
largest_token_size: 498
smallest_token_size: 389

TextChunker result with overlap:
chunk_count: 12
avg_token_size: 457
largest_token_size: 489
smallest_token_size: 446

TextSplitter result without overlap:
chunk_count: 12
avg_token_size: 401
largest_token_size: 509
smallest_token_size: 152

TextSplitter result with overlap:
chunk_count: 16
avg_token_size: 321
largest_token_size: 420
smallest_token_size: 111

Balance comparison for content: The Call of the Wild b...
 content token_count: 39823
absolute_length_max: 1024
overlap_percent: 0.166666

TextChunker result without overlap:
chunk_count: 46
avg_token_size: 865
largest_token_size: 1008
smallest_token_size: 766

TextChunker result with overlap:
chunk_count: 51
avg_token_size: 912
largest_token_size: 962
smallest_token_size: 894

TextSplitter result without overlap:
chunk_count: 43
avg_token_size: 926
largest_token_size: 1024
smallest_token_size: 607

TextSplitter result with overlap:
chunk_count: 60
avg_token_size: 757
largest_token_size: 852
smallest_token_size: 440

Balance comparison for content: MEDITATIONS By Marcus ...
 content token_count: 78479
absolute_length_max: 2048
overlap_percent: 0.166666

TextChunker result without overlap:
chunk_count: 46
avg_token_size: 1703
largest_token_size: 1957
smallest_token_size: 1536

TextChunker result with overlap:
chunk_count: 50
avg_token_size: 1819
largest_token_size: 2020
smallest_token_size: 1782

TextSplitter result without overlap:
chunk_count: 43
avg_token_size: 1825
largest_token_size: 2047
smallest_token_size: 774

TextSplitter result with overlap:
chunk_count: 60
avg_token_size: 1518
largest_token_size: 1705
smallest_token_size: 712

The smallest_token_size tells the story.

As far as computational cost goes, yes, I assumed it would be much slower. I fought hard to get it anywhere near the benchmarks for this library, and it was still 10x slower than the posted benchmarks in some cases. (Although there currently seems to be an issue with text-splitter and tiktoken where my implementation is faster.) Once that is resolved, I don't think it would be possible to get mine into the same ballpark as this library speed-wise. However, the orphaned-chunk issue is a deal breaker for me for text. That said, this library is much more polished, and the code and markdown splitters are things I will be using.

Happy to discuss.