choccccy closed this 3 months ago
Yeah, as you reason through it, I agree. Default to 0.
The issue specifically is that the chunks are overlapping by about two "sections"/words when tested with " " as the separator, so when I put in the classic lorem ipsum:
and assign a chunk size of 100 and overlap of 0, I would expect that my chunks would look like this:
but instead, when tested, they overlap by about 2 words (after the first and presumably before the last chunk):
I'm nudged by an article I'm working on to look at this. Here is my plan, in brief:
- `text_helper.text_split_fuzzy` function, which is basically the same as `text_helper.text_splitter`, and indeed keep the latter as an alias
- `text_helper.text_split` function, which basically brooks no overlap
- `text_helper.token_splitter` class. We want this one to be a class to encapsulate details of e.g. embedding, and allow us to implement some memoization, etc.

Actually, after another look, I'll create `text_split_fuzzy` and `text_split` as generators, then leave `text_splitter` as before, though deprecated.
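To make the "brooks no overlap" idea concrete, here is a minimal sketch of what a generator-style splitter could look like. It is not the library's `text_split`: the name `text_split_sketch`, its signature, and the choice to measure chunk size in characters are all assumptions made for illustration.

```python
from typing import Iterator


def text_split_sketch(text: str, chunk_size: int, separator: str = " ") -> Iterator[str]:
    """Yield consecutive chunks of at most chunk_size characters, with no overlap.

    Hypothetical stand-in for illustration only; not the library's text_split.
    A single piece longer than chunk_size is yielded as its own oversized chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    current = ""
    for piece in text.split(separator):
        # Tentatively append the next piece to the chunk being built
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Chunk is full: emit it and start a fresh one with this piece,
            # so no piece ever appears in two chunks
            yield current
            current = piece
        else:
            current = candidate
    if current:
        yield current


# Usage: each word lands in exactly one chunk
chunks = list(text_split_sketch("one two three four five six", chunk_size=10))
```

Because it is a generator, callers can stream chunks from a large text without building the whole list first, which seems to be the motivation for making `text_split_fuzzy` and `text_split` generators.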
There are some situations where having lots of chunk overlap isn't super useful; being able to set it as 0 (or, perhaps, just not set it, and have it assume it should be 0) would be nice.
I've also thought that maybe the chunk overlap should default to a percentage of the chunk size.
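One way to reconcile a default of 0 with the percentage idea is sketched below. The helper `resolve_overlap` and its `overlap_ratio` parameter are hypothetical, not part of the library: an explicit `chunk_overlap` always wins, and otherwise the overlap is a fraction of the chunk size, where a fraction of 0.0 gives no overlap.

```python
from typing import Optional


def resolve_overlap(
    chunk_size: int,
    chunk_overlap: Optional[int] = None,
    overlap_ratio: float = 0.0,
) -> int:
    """Pick the overlap to use for a given chunk size.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if chunk_overlap is not None:
        # An explicit overlap (including 0) takes precedence over the ratio
        if not 0 <= chunk_overlap < chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")
        return chunk_overlap
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0.0, 1.0)")
    # Derive the overlap from the chunk size, rounding down
    return int(chunk_size * overlap_ratio)
```

With this shape, leaving both arguments unset yields an overlap of 0, while a caller who wants proportional overlap can pass a ratio instead of recomputing a count every time the chunk size changes.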