Issue

I ran the following example provided in the README

and I was running into the following error

I would suggest either updating the example in the README or updating the `BaseChunker` to support multiple inputs at the same time. The latter is my go-to suggestion, since it can process multiple samples at once. We could support either lists or `*args` here, preferably lists, since the tokenizers library already supports them.
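For reference, here's the kind of list support the tokenizers library already offers (a minimal sketch; "gpt2" is just an example model):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("gpt2")

# encode_batch takes a list of texts in one call and returns a list of
# Encodings, so a chunker built on top of it could accept lists natively.
encodings = tokenizer.encode_batch(["first document", "second document"])
print([enc.ids for enc in encodings])
```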
Hey @not-lain,
WOAH 😳 that's a bit embarrassing, haha, I'll fix the example in the README.md right now!
Regarding adding batching/list support: I plan to add multiprocessing support (via MPIRE) soon, so we can run these in parallel~! Multiprocessing, because I want Chonkie to be the fastest even with batching.
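The MPIRE pattern I have in mind looks roughly like this (just a sketch with a stand-in `chunk` function, not Chonkie code):

```python
from mpire import WorkerPool

def chunk(text: str) -> list[str]:
    # stand-in for a real Chunker: naive fixed-size character chunks
    return [text[i : i + 16] for i in range(0, len(text), 16)]

texts = ["some long document ...", "another long document ..."]

with WorkerPool(n_jobs=4) as pool:
    # one chunk() call per text, fanned out across worker processes
    chunks_per_text = pool.map(chunk, texts)
```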
Would really appreciate PRs if you're willing to work on this.
On it 🫡
Hey @not-lain!
We can probably add a method to the `BaseChunker` class, named `chunk_batch`, that can run `chunk` via multiprocessing. So whenever we add new Chunkers in the future, we don't need to re-implement the `chunk_batch` function.
And we can expose `num_proc` for `chunk_batch` as an optional parameter on the method.
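Roughly something like this (just a sketch, untested; the single-text `chunk` signature here is an assumption):

```python
from abc import ABC, abstractmethod
from multiprocessing import Pool
from typing import Any, List, Optional


class BaseChunker(ABC):
    @abstractmethod
    def chunk(self, text: str) -> List[Any]:
        """Chunk a single text; each concrete Chunker implements this."""

    def chunk_batch(
        self, texts: List[str], num_proc: Optional[int] = None
    ) -> List[List[Any]]:
        # num_proc is optional; None lets Pool default to os.cpu_count().
        # The chunker instance must be picklable to reach the workers.
        with Pool(processes=num_proc) as pool:
            return pool.map(self.chunk, texts)
```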
How does that sound?
Hey @not-lain,
Just added initial support for batching in the `BaseChunker` via Python's multiprocessing library in #28. This is definitely not the most optimal way to go about chunking, but it serves as a placeholder until we build more optimal chunking approaches.
I'd be happy to accept PRs for "native" batching approaches in `TokenChunker` and the other chunkers that work without multiprocessing.
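To illustrate what "native" could mean here: let the tokenizer do the batch work itself (a hypothetical sketch, not the actual `TokenChunker` API; overlap handling is omitted):

```python
from typing import List

from tokenizers import Tokenizer


class TokenChunker:
    def __init__(self, tokenizer: Tokenizer, chunk_size: int = 512):
        self.tokenizer = tokenizer
        self.chunk_size = chunk_size

    def chunk_batch(self, texts: List[str]) -> List[List[str]]:
        # encode_batch tokenizes every text in a single call (parallelized
        # in Rust internally), so no Python-level multiprocessing is needed.
        encodings = self.tokenizer.encode_batch(texts)
        batches = []
        for enc in encodings:
            ids = enc.ids
            batches.append(
                [
                    self.tokenizer.decode(ids[i : i + self.chunk_size])
                    for i in range(0, len(ids), self.chunk_size)
                ]
            )
        return batches
```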
For now, I think we can close this issue and open separate issues for "native" batching support on the various chunkers.
Thanks 😊
Awesome, I was thinking of doing this over the weekend, but glad it was already implemented. Nice work 🙌