benbrandt / text-splitter

Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.

Heuristics for very large documents #184

Closed: do-me closed this issue 2 months ago

do-me commented 3 months ago

Problem

I've been working on legal documents lately and indexing 300k documents. Everything goes perfectly fine with normal-sized docs (dozens of pages). However, when a document becomes very large, like the example below with 86,000,000 characters, it takes an eternity. I actually quit the process after 1 h of processing and have no clue how long it might even take. I will let it run overnight and see whether it works eventually. The one CPU core in use is at 100%, so I take this as a sign that the code is not unexpectedly failing or similar.

Possible solutions

  1. Is there maybe a bottleneck somewhere in the code where complexity explodes? Considering how fast text-splitter works for normal-sized docs, it feels like this should be doable in a couple of seconds.
  2. Could we maybe apply some additional heuristics for dealing with very large docs?
  3. Maybe point 2 in combination with multiprocessing #165 might be an idea? (A rough sketch of that combination follows this list.)
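
Something along these lines is what I have in mind for points 2 and 3 together (just a sketch; coarse_pieces and chunk_piece are made-up helpers, not part of text-splitter, and cutting at double newlines is only one possible heuristic):

from multiprocessing import Pool
from semantic_text_splitter import TextSplitter

# Heuristic: pre-split a huge document at a coarse delimiter so each piece is
# small enough to chunk quickly, then chunk the pieces in parallel.
# Chunks near the cut points may be slightly less "semantic".
def coarse_pieces(text, max_chars=1_000_000):
    piece, size = [], 0
    for paragraph in text.split("\n\n"):
        piece.append(paragraph)
        size += len(paragraph)
        if size >= max_chars:
            yield "\n\n".join(piece)
            piece, size = [], 0
    if piece:
        yield "\n\n".join(piece)

def chunk_piece(piece):
    splitter = TextSplitter((1500, 2000))
    return splitter.chunks(piece)

def chunk_large_document(text):
    with Pool() as pool:
        results = pool.map(chunk_piece, list(coarse_pieces(text)))
    return [chunk for piece_chunks in results for chunk in piece_chunks]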

Here is my example (37 MB parquet file):

from semantic_text_splitter import TextSplitter
import pandas as pd 
df = pd.read_parquet("free_trade_agreement.parquet")
splitter = TextSplitter((1500,2000)) 
test_chunks = splitter.chunks(df.text.item())

I couldn't upload the file in the issue itself, so I'm hosting it on my Drive here.

In any case, do you have any other ideas for how to make it work?

do-me commented 3 months ago

Update: it ran all night without success. It still seemed to be running this morning, so I aborted.

benbrandt commented 3 months ago

Hmm ok I am traveling today but I will try and take a look this weekend. It does seem strange, and I'm not sure if there is something about loading it from Parquet that is causing an issue... But I assume not since the others are working fine.

benbrandt commented 3 months ago

Hmm, also the Rust crate assumes the entire string is in memory. The Rust side at least only ever uses references into the single string, but on the Python side it eventually has to allocate a string for every chunk so that Python owns them... That may be another issue.
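
As a back-of-envelope check on that concern (illustration only; the numbers assume the ~86 million character document and the (1500, 2000) range from the snippet above):

text_chars = 86_000_000   # characters in the source document
avg_chunk = 1_750         # midpoint of the (1500, 2000) capacity range
n_chunks = text_chars // avg_chunk

# The original string plus one owned Python copy of every chunk coexist,
# so peak usage is roughly 2x the document size (~170 MB of text here),
# which should still fit comfortably in memory.
peak_chars = text_chars + n_chunks * avg_chunk
print(n_chunks, peak_chars)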

Just trying to think about what could be different since I assume it is making some progress... I'll also try and see if it runs any differently straight from Rust rather than through the Python bindings.

do-me commented 3 months ago

It's not about the parquet file; I just exported it as parquet for convenience. Yes, the other texts are working fine too. It really seems to be the size of the text that somehow causes something to fail, or rather blows up the processing time. Also, it fits nicely in memory as it's only 37 MB.

benbrandt commented 3 months ago

Ok good to know, just wasn't something I had tried before so I wasn't sure. Thanks!

benbrandt commented 3 months ago

@do-me So I have been doing some tests. If I bump the chunk size up enough, it does finish, and it does seem to progress even at smaller sizes, including the one you posted. I ran it in Rust so I could output incremental progress with each chunk, and it is moving, just slowly.

The issue I am seeing is that there are no newlines anywhere in this file. So for each chunk, it is binary searching across the entire file at sentence granularity, which at this size is indeed unfortunate...

I will need to look into this some more. In some areas I assume that once I have found a granularity level that is too big to fit, I can binary search over sections of the next lower granularity, but only up to the byte offset where that too-big level ends.

But I am now thinking it might be better to only do a binary search if I can reasonably limit the search space. If I can't, I end up searching across the entire document, and because tokenization is expensive, the binary search itself becomes quite costly and doesn't deliver the speedup it does when the search space is bounded.
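
Roughly the pattern being described, as a toy sketch (Python pseudocode of the idea, not the crate's actual Rust implementation; measure stands in for a potentially expensive token count):

# With no coarser boundary (e.g. newlines) above sentences, every chunk's
# search space is the entire remaining document, and every probe of the
# binary search has to measure (tokenize) a potentially huge candidate.
def largest_fitting_prefix(sentences, start, upper_bound, capacity, measure):
    lo, hi = start, upper_bound
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if measure("".join(sentences[start:mid])) <= capacity:
            lo = mid
        else:
            hi = mid - 1
    return lo

def chunk_by_sentences(sentences, capacity, measure):
    chunks, start = [], 0
    while start < len(sentences):
        end = largest_fitting_prefix(sentences, start, len(sentences), capacity, measure)
        end = max(end, start + 1)  # always make progress
        chunks.append("".join(sentences[start:end]))
        start = end
    return chunks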

I could also change the binary search to not be a strict binary search... I will have to play around with it... I will give this a try, and run my benchmarks again in the happy path where I often do have an upper bound to the search space, and make sure there aren't any regressions.

do-me commented 3 months ago

I see, thanks for looking into this! So from what you describe, this could happen on other semantic levels as well.

The logic to stream already-identified chunks would be awesome, especially for a potential wasm version, as I could already start calculating embeddings!

Indeed, we will work on the data side as well and try to include line breaks.
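
For example, something as crude as this could reintroduce coarser boundaries on the data side (a rough sketch; the sentence regex and the paragraph size are just assumptions about where breaks are acceptable):

import re

def add_paragraph_breaks(text, sentences_per_paragraph=20):
    # Insert a blank line after every N sentence-ending punctuation marks so
    # documents that arrive as one enormous line get boundaries coarser than
    # single sentences for the splitter to work with.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    paragraphs = [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]
    return "\n\n".join(paragraphs)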

benbrandt commented 2 months ago

@do-me just want to give you a heads up so you know I didn't forget. I've been trying dozens of implementations, all with different tradeoffs, but I think I found one I am happy with.

I was able to split this file in less than a second with my latest branch 🎉 I have to do some cleanup/release preparation, but I hope to get this out soon!

do-me commented 2 months ago

Awesome @benbrandt, thank you very much for your efforts! I'll need to run some bigger processing tasks in the next couple of weeks so I'll come back here by then and let you know about my experiences :)

Performance: with the latest release, semantic-text-splitter 0.14.0, processing this doc now takes 2.8 s on my M3.

CPU times: user 2.79 s, sys: 28.9 ms, total: 2.82 s
Wall time: 2.83 s
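
The timing above looks like Jupyter %%time output; a minimal way to reproduce the measurement outside a notebook (assuming the same file and API as in the snippet earlier in the thread):

import time
import pandas as pd
from semantic_text_splitter import TextSplitter

df = pd.read_parquet("free_trade_agreement.parquet")
splitter = TextSplitter((1500, 2000))

start = time.perf_counter()
chunks = splitter.chunks(df.text.item())
print(f"{len(chunks)} chunks in {time.perf_counter() - start:.2f}s")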