benbrandt / text-splitter

Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.
MIT License

Slower performance compared to Langchain's RecursiveCharacterTextSplitter #223

Open shoang22 opened 3 weeks ago

shoang22 commented 3 weeks ago

Thank you for the amazing work. I'm currently using Langchain's RecursiveCharacterTextSplitter to generate chunks for completion. It is currently the bottleneck of my application, so I've been looking for a faster alternative. I tried testing it against TextSplitter, but RecursiveCharacterTextSplitter somehow performs better.

I'm testing with pytest using the code below:

import time

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
from semantic_text_splitter import TextSplitter

def count_tokens(text):
    # tiktoken caches encodings internally, so the repeated lookup is cheap
    counter = tiktoken.get_encoding("cl100k_base")
    return len(counter.encode(text))

def test_splitter():
    chunk_size = 10_000
    # overlap is 10% of the chunk size, with a floor of 200 tokens
    chunk_overlap = int(max(chunk_size * 0.10, 200))

    with open("tests/files/world192.txt", "r") as f:
        text = f.read()[:10_000_000]

    n_tokens = count_tokens(text)
    print(f"N_tokens: {n_tokens}")

    splitters = [
        TextSplitter.from_callback(
            lambda text: count_tokens(text),
            capacity=chunk_size,
            overlap=chunk_overlap,
        ).chunks,
        RecursiveCharacterTextSplitter(
            length_function=count_tokens,
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
        ).split_text,
    ]

    for splitter in splitters:
        b = time.perf_counter()
        text_chunks = splitter(text)
        e = time.perf_counter()
        print(f"Time for {splitter.__name__} to complete: {e - b} seconds yielding {len(text_chunks)} splits.")

        # every chunk must respect the token budget
        for chunk in text_chunks:
            assert count_tokens(chunk) <= chunk_size

I obtained the following results:

N_tokens: 658023
Time for chunks to complete: 20.12398294599916 seconds yielding 76 splits.
Time for split_text to complete: 0.6139735539982212 seconds yielding 74 splits.

Is this expected behavior? Is there anything I can do to speed things up?

Test file: world192.txt

benbrandt commented 3 weeks ago

Hi @shoang22, thanks for reporting. There is indeed a performance issue, which I think I triaged here: https://github.com/benbrandt/text-splitter/issues/184

I think it's likely related. I didn't manage to finish the fix the other night, but I'm hoping I can wrap it up this weekend.

Another thing: can you try changing the splitter to use the from_tiktoken_model constructor?

TextSplitter.from_tiktoken_model(
    "gpt-4",
    capacity=chunk_size,
    overlap=chunk_overlap,
).chunks

This will use the same encoding, but it allows all of the computation to stay on the Rust side, without crossing the FFI boundary on every encoding call. I don't think it will close the gap entirely, but it will likely help.
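
One way to see the boundary cost is to count how often the callback actually fires during a split. A quick sketch, reusing the names from your test (the counting wrapper is just for illustration):

calls = 0

def counting_tokens(text):
    global calls
    calls += 1  # every invocation here is a Rust -> Python crossing
    return count_tokens(text)

splitter = TextSplitter.from_callback(
    counting_tokens,
    capacity=chunk_size,
    overlap=chunk_overlap,
)
chunks = splitter.chunks(text)
print(f"callback fired {calls} times for {len(chunks)} chunks")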

I will also open an issue to add a from_tiktoken_encoding constructor, so that you can use the encoding name directly as well.
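
For illustration, it would presumably mirror from_tiktoken_model; this is a hypothetical sketch of a method that does not exist yet:

TextSplitter.from_tiktoken_encoding(
    "cl100k_base",  # hypothetical: pass the encoding name instead of a model
    capacity=chunk_size,
    overlap=chunk_overlap,
).chunks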

Let me know if that helps, and I will also let you know once I have a new release. Thanks!

shoang22 commented 3 weeks ago

@benbrandt There was definitely a performance improvement, although it still took significantly longer than RecursiveCharacterTextSplitter (10 seconds), as you predicted.

The main reason I wanted to stick with .from_callback is that we use Anthropic models as well, so we have to use a different token counter in those cases.
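
For context, all we need is to swap the callback per provider. A rough sketch, where count_anthropic_tokens is a hypothetical stand-in for whatever counter we use on the Anthropic side:

def count_anthropic_tokens(text):
    # hypothetical stand-in, not a real Anthropic API call
    raise NotImplementedError

def make_splitter(count_fn, chunk_size, chunk_overlap):
    # same constructor as above, just parameterized by the token counter
    return TextSplitter.from_callback(
        count_fn,
        capacity=chunk_size,
        overlap=chunk_overlap,
    )

openai_splitter = make_splitter(count_tokens, 10_000, 1_000)
anthropic_splitter = make_splitter(count_anthropic_tokens, 10_000, 1_000)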

benbrandt commented 3 weeks ago

Fair enough, @shoang22. I'll try to hurry up the fix. Basically, binary search helped in most cases, but I think it regressed quite a bit in the worst case, since tokenization is slower on longer strings and some of the optimizations didn't help in all cases.
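
To illustrate the trade-off, here is a simplified sketch of the idea (not the actual implementation): binary-searching for the largest run of sections that fits means each probe re-tokenizes a candidate string, and the candidates near the top of the search are long, so each probe gets expensive.

def largest_fitting_run(sections, count_tokens, capacity):
    # Binary search for how many leading sections fit the token budget.
    lo, hi = 0, len(sections)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        # Each probe tokenizes the whole candidate chunk; long candidates
        # make this expensive, which is where the worst case regresses.
        if count_tokens("".join(sections[:mid])) <= capacity:
            lo = mid
        else:
            hi = mid - 1
    return lo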

I also need to inspect some of the Python bindings, because while I am doing more work than the recursive Langchain splitter (hopefully with better results), I am still a little suspicious that the bindings might also be doing something suboptimal.

benbrandt commented 2 weeks ago

Hi @shoang22, I just released v0.14.0 with a potential performance improvement, especially for large documents.

I am not sure it will solve it for your document (my local test showed it might still be roughly the same performance), but let me know.

I am definitely doing a lot more steps and checks to produce what I feel are the most "correct" results, but this might come at a significant performance cost in some cases. I definitely want to dig into this particular document and see where the time is being spent.

I will leave this open as a reminder to do some more digging, as it is always helpful to have a specific document, so thanks!

shoang22 commented 2 weeks ago

Thanks for the update! Running the tests again with .from_callback, I get the following results:

Time for langchain_text_splitters.character.split_text to complete: 0.6163136860000122 seconds yielding 74 splits.
Time for None.chunks to complete: 9.998086693999994 seconds yielding 76 splits.  <----- TextSplitter

Awesome to see that there was a performance improvement, but it is still slower. We ended up using a more naive splitter that just splits the document into fixed-size token chunks. Would love to see where you take this project though, please keep us posted!
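
For reference, the naive approach is roughly the following sketch: encode the document once, then decode fixed-size token windows. Chunk edges ignore semantics, which is the trade-off:

def naive_token_split(text, chunk_size, chunk_overlap):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)  # tokenize the whole document once
    step = chunk_size - chunk_overlap  # assumes overlap < chunk_size
    return [
        # windows may cut mid-word or mid-sentence, but no re-tokenizing
        enc.decode(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), step)
    ]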

ShelbyJenkins commented 1 week ago

Howdy, also reporting slow speeds. I've actually just finished up a Rust chunker of my own, and I'm doing final testing.

TextChunker duration without overlap: 1.216322342s
TextSplitter duration without overlap: 13.211401934s
TextChunker duration with overlap: 3.045341955s
TextSplitter duration with overlap: 14.993658968s

This is splitting Jack London's The Call of the Wild.

// Configure the splitter: tiktoken-based sizer, trimming, and overlap.
let config = ChunkConfig::new(max_chunk_token_size as usize)
    .with_trim(true)
    .with_sizer(tiktoken_tokenizer)
    .with_overlap(overlap as usize)
    .unwrap();
let splitter = TextSplitter::new(config);
// Collect the borrowed chunks into owned Strings.
splitter
    .chunks(incoming_text)
    .map(|s| s.to_string())
    .collect::<Vec<String>>()

I assumed this library was super well optimized, and from reviewing the code I think it is. My approach attempts to balance the chunks so they're all the same size, which has a severe performance impact, so I'm very surprised to see such a difference. I plan on writing a blog post about my project, and I want to make sure I'm using this project correctly, because I think it's still a very useful tool and I planned to recommend it as a faster alternative to mine, with support for code and Markdown.
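
The balancing itself is simple arithmetic; a rough sketch of the idea (in Python for brevity, not my actual implementation):

import math

def balanced_target_size(total_tokens, capacity):
    # Aim every chunk at the same size instead of filling to capacity and
    # leaving a small remainder: e.g. 1,100 tokens at capacity 512 becomes
    # 3 chunks of ~367 tokens rather than 512 / 512 / 76.
    n_chunks = math.ceil(total_tokens / capacity)
    return math.ceil(total_tokens / n_chunks)

The expensive part is not this arithmetic but repeatedly measuring candidate chunks against the target, which is where the performance impact comes from.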

I planned on releasing my project tomorrow once I finish the documentation updates, but I can hold off on the blog post until we get this sorted out here.

benbrandt commented 1 week ago

Hi @ShelbyJenkins, thanks for reporting. I think there is indeed something going on here that I haven't been able to pinpoint yet. I would be very happy to look over your implementation if you like, and I will take another look at this over the weekend to see if there is something I can do to speed things up.

ShelbyJenkins commented 1 week ago

@benbrandt I just published my implementation here: https://github.com/ShelbyJenkins/llm_utils/blob/main/src/text_utils/chunking/mod.rs

I wrote a comparison test here: https://github.com/ShelbyJenkins/llm_utils/blob/main/src/text_utils/chunking/external_text_chunker.rs

Speed comparison for content: The ACM A. M. Turing A...
content token_count: 284
absolute_length_max: 128
overlap_percent: 0.166666

TextChunker Results:
duration without overlap: 25.915969ms
duration with overlap: 31.307499ms

TextSplitter Results:
duration without overlap: 506.592011ms
duration with overlap: 511.676238ms

Speed comparison for content: Doom (stylized as DOOM...
content token_count: 1244
absolute_length_max: 256
overlap_percent: 0.166666

TextChunker Results:
duration without overlap: 71.252411ms
duration with overlap: 100.032202ms

TextSplitter Results:
duration without overlap: 549.703887ms
duration with overlap: 594.522629ms

Speed comparison for content: Who Controls OpenAI? M...
content token_count: 4828
absolute_length_max: 512
overlap_percent: 0.166666

TextChunker Results:
duration without overlap: 153.630351ms
duration with overlap: 592.726511ms

TextSplitter Results:
duration without overlap: 985.372199ms
duration with overlap: 1.163097766s

Speed comparison for content: The Call of the Wild b...
content token_count: 39823
absolute_length_max: 1024
overlap_percent: 0.166666

TextChunker Results:
duration without overlap: 1.173179426s
duration with overlap: 3.041184723s

TextSplitter Results:
duration without overlap: 2.956028608s
duration with overlap: 3.861146983s

Speed comparison for content: MEDITATIONS By Marcus ...
content token_count: 78479
absolute_length_max: 2048
overlap_percent: 0.166666

TextChunker Results:
duration without overlap: 2.317496042s
duration with overlap: 5.916404579s

TextSplitter Results:
duration without overlap: 5.616274529s
duration with overlap: 7.606386683s

I did see a large speed-up after updating to the latest version today. BTW, I was originally using the Romeo and Juliet text you have, but I started thinking that the newlines at the end of every line from the ebook formatting weren't going to accurately test chunking. I might be wrong, just a hunch I had.
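
To put a number on that hunch, a trivial check (just a sketch):

def newline_density(text):
    # Hard-wrapped ebook text has roughly one newline per line (~every 70
    # chars), while flowed prose only has them between paragraphs, so a
    # splitter that treats newlines as boundaries sees far more candidates.
    return text.count("\n") / max(len(text), 1)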

Interested in seeing if these results are useful for you.