bhavnicksm / chonkie

🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library
https://pypi.org/project/chonkie/
MIT License

`TokenChunker` does not support multiple inputs #18

Closed · not-lain closed 1 week ago

not-lain commented 1 week ago

Issue

I ran the following example provided in the README file:

# First import the chunker you want from Chonkie 
from chonkie import TokenChunker

# Import your favorite tokenizer library
# Also supports AutoTokenizers, TikToken and AutoTikTokenizer
from tokenizers import Tokenizer 
tokenizer = Tokenizer.from_pretrained("gpt2")

# Initialize the chunker
chunker = TokenChunker(tokenizer)

# Chunk some text
chunks = chunker("Woah! Chonkie, the chunking library is so cool!",
                  "I love the tiny hippo hehe.")

# Access chunks
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")

and ran into the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-2-bb5f7fdb45bb> in <cell line: 13>()
     11 
     12 # Chunk some text
---> 13 chunks = chunker("Woah! Chonkie, the chunking library is so cool!","I love the tiny hippo hehe.")
     14 
     15 # Access chunks

TypeError: BaseChunker.__call__() takes 2 positional arguments but 3 were given
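
(For reference, the traceback shows that __call__ accepts only self plus a single text, so the example runs if the two strings are joined into one. An illustrative workaround only, not the README's intended fix:)

# Workaround sketch: pass one string instead of two positional arguments
chunks = chunker("Woah! Chonkie, the chunking library is so cool! "
                 "I love the tiny hippo hehe.")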

Extra information

I would suggest either updating the example in the README file or updating BaseChunker to support multiple inputs at the same time. The latter is my go-to suggestion, since it would let us process multiple samples at once. We could support either lists or *args here, preferably lists, since the tokenizers library already supports them.
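
As a rough sketch, list support could look something like this (chunk_many is a hypothetical helper for illustration, not part of Chonkie's API):

from typing import List, Union

# Hypothetical helper to illustrate list support; not part of Chonkie.
def chunk_many(chunker, texts: Union[str, List[str]]):
    """Chunk a single text or a list of texts with an existing chunker."""
    if isinstance(texts, str):
        texts = [texts]
    # One chunker call per input; a native implementation could instead
    # batch-encode all the texts at once via the tokenizers library.
    return [chunker(text) for text in texts]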

bhavnicksm commented 1 week ago

Hey @not-lain,

WOAH 😳 that's a bit embarrassing, haha! I'll fix the example in the README.md right now!

Regarding adding batching/list support: I plan to add multiprocessing support (via MPIRE) soon, so we can run these in parallel! Multiprocessing, because I want Chonkie to be the fastest even with batching.

Would really appreciate PRs if you're willing to work on this.

not-lain commented 1 week ago

On it 🫡

bhavnicksm commented 1 week ago

Hey @not-lain!

We can probably add a method to the BaseChunker class, named chunk_batch, that runs chunk via multiprocessing. That way, whenever we add new chunkers in the future, we won't need to re-implement chunk_batch.

And we can expose num_proc as an optional parameter on chunk_batch.
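
Something along these lines, as a minimal sketch (assuming chunk(text) is the single-text method; the real implementation might use MPIRE rather than multiprocessing):

from multiprocessing import Pool
from typing import List, Optional

class BaseChunker:
    def chunk(self, text: str):
        # Implemented by each concrete chunker (e.g. TokenChunker).
        raise NotImplementedError

    def chunk_batch(self, texts: List[str], num_proc: Optional[int] = None):
        """Chunk a batch of texts in parallel across num_proc processes."""
        # num_proc=None lets multiprocessing default to the CPU count.
        with Pool(processes=num_proc) as pool:
            return pool.map(self.chunk, texts)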

How does that sound?

bhavnicksm commented 1 week ago

Hey @not-lain,

Just added initial support for batching in the BaseChunker via Python's multiprocessing library in #28. This is definitely not the most optimal way to go about chunking, but it serves as a placeholder until we build more optimal batching approaches.

I'd be happy to accept PRs for "native" batching approaches in TokenChunker and the other chunkers that work without multiprocessing.
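
For illustration, a "native" batch path for TokenChunker could lean on the tokenizers library's encode_batch (a sketch under assumptions: chunk_batch_native, the fixed chunk_size, and the decode-based splitting are illustrative, not Chonkie's implementation):

from tokenizers import Tokenizer

# Hypothetical sketch: encode every text in one encode_batch call,
# then split each token-id sequence into fixed-size chunks.
def chunk_batch_native(tokenizer: Tokenizer, texts, chunk_size=512):
    encodings = tokenizer.encode_batch(texts)
    results = []
    for encoding in encodings:
        ids = encoding.ids
        results.append([
            tokenizer.decode(ids[i:i + chunk_size])
            for i in range(0, len(ids), chunk_size)
        ])
    return results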

For now, I think we can close this issue and open separate issues for "native" batching support on the various chunkers.

Thanks 😊

not-lain commented 1 week ago

Awesome, I was thinking of doing this over the weekend, but I'm glad it was already implemented. Nice work 🙌