do-me opened this issue 6 months ago
Hi @do-me, I have been thinking about this, and what you are doing is actually exactly what I would do as well.
The issue is that, because the chunking is greedy and depends on how tokenization works, it isn't easy to split the work efficiently in a way that still produces exactly the same chunks.
It is possible there is a way to do this that preserves the behavior, but all of my initial sketches this weekend have potential issues: they either still don't utilize enough cores, produce different chunks, or do significantly more work to "fix" the behavior.
If I can find a way to properly split the chunking itself and use all the cores, that would definitely be nice. But splitting the work by chunking each document as a separate task, rather than parallelizing lower down inside the chunking, is what I would do in most cases anyway, as each document is a nice unit of work to parallelize.
As for whether this should happen on the Python or the Rust side, I don't think it matters much in this case, because at some point the allocated strings have to come across the FFI boundary either way. Unless I find a more efficient way to handle that, it won't make a large difference, and for now I think it is better to leave the flexibility up to the user in how they want to compose this.
For example, I wouldn't want to tie the parallelization to pandas or something like that, and it doesn't seem like too much work to manage this yourself (you might even be able to write your wrapping function as a lambda so it would all be on one line).
I'll put this in the backlog so I don't lose track of it, but for now I think you are doing the right and most efficient thing 👍🏻
Thanks for sharing your ideas. I see; that's what I expected with tokenization, and it's totally understandable. I think it's a good decision.
By the way, here is the (more or less) one-liner if you want to use multiprocessing without pandas/swifter: https://gist.github.com/do-me/bab94723b35827763789cdd4ff1dc698
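In essence it just maps a chunking function over the documents with a standard multiprocessing pool; roughly along these lines (not the exact gist code; the splitter class and capacity are placeholders following the semantic_text_splitter Python API, so adjust to your installed version):

```python
from multiprocessing import Pool

from semantic_text_splitter import TextSplitter  # adjust import/API to your version

MAX_CHARS = 1000  # placeholder chunk capacity


def chunk_document(text: str) -> list[str]:
    # Build the splitter inside the worker so nothing needs to be pickled
    # across the process boundary.
    splitter = TextSplitter(MAX_CHARS)
    return splitter.chunks(text)


if __name__ == "__main__":
    documents = ["first long document ...", "second long document ..."]
    with Pool() as pool:  # one worker per core by default
        chunks_per_document = pool.map(chunk_document, documents)
```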
Anyway, I was thinking a little ahead about my use case with the text-splitter wasm version, where I always have just one document, but that document might be extremely large, e.g. containing the text of one or several books.
What would you recommend in this case?
I have two heuristics in mind:
1. If the user has 4 cores available, try to identify the 3 biggest semantic breaks in the text, for example multiple line breaks, and then run 4 instances of text-splitter in parallel.
2. Simply count characters, divide by 4, and snap each cut to the closest nearby semantic break, like a sentence end (rough sketch below). Of course we run the risk that the results differ slightly from what the greedy chunking would output, but at least for my use case I don't need perfect accuracy.
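For the second heuristic, I mean something like this (deliberately naive; the `". "` sentence-end detection and the worker count are just placeholders):

```python
import math


def split_for_workers(text: str, n_workers: int = 4) -> list[str]:
    """Cut text into roughly equal pieces, snapping each cut forward to the
    next sentence end so no sentence is split across two workers."""
    target = math.ceil(len(text) / n_workers)
    pieces, start = [], 0
    while start < len(text) and len(pieces) < n_workers - 1:
        cut = start + target
        if cut >= len(text):
            break
        snap = text.find(". ", cut)  # crude "sentence end" detection
        cut = snap + 2 if snap != -1 else len(text)
        pieces.append(text[start:cut])
        start = cut
    if start < len(text):
        pieces.append(text[start:])
    return pieces
```

Each piece would then go to its own text-splitter instance in parallel, and the per-piece chunks get concatenated at the end; as noted, the chunks near the cut points may differ slightly from a single greedy pass.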
I am using the Python bindings and noticed that text-splitter is running on one core only. I think it would be great to allow for an option to use all available CPU cores (at least for character-based splitting if using tokenizers would add too much complexity).
As a workaround I am currently using pandarallel for my pandas dataframe:
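Roughly like this (not my exact code; the column name, capacity, and splitter class are placeholders following the semantic_text_splitter API, so adjust as needed):

```python
import pandas as pd
from pandarallel import pandarallel

from semantic_text_splitter import TextSplitter  # adjust import/API to your version

pandarallel.initialize()  # uses the available cores by default

MAX_CHARS = 1000  # placeholder chunk capacity


def chunk_text(text: str) -> list[str]:
    # Named (picklable) wrapping function; creating one splitter per call
    # keeps the example simple.
    return TextSplitter(MAX_CHARS).chunks(text)


df = pd.DataFrame({"text": ["long document ...", "another long document ..."]})
df["chunks"] = df["text"].parallel_apply(chunk_text)
```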
It requires a wrapping function that pandarallel can hash. On my 16-core machine it makes a big difference: 7 minutes of single-core processing vs. 45 seconds multi-core!
I don't know what's more efficient: handling the parallelization on the Python side like this, or doing it on the Rust side.
Happy to test either way!